	▲	c7b 3 hours ago
		You could fit a Q4 quant of GLM5.2 in 512GB and still have some space for context (372-475GB for the model): https://unsloth.ai/docs/models/glm-5.2 But yeah, there's a bit of a dearth of models that can fully utilize memory in the 128-256GB bracket at the moment. Things move so fast in this space, though, that I wouldn't base my decision on a generation of models that's just a few months old.
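For anyone who wants the headroom arithmetic spelled out, here is a minimal sketch using only the figures quoted in the comment above (512GB total, 372-475GB for the Q4 model). The function name `headroom_gb` is made up for illustration, and the result ignores OS, runtime, and activation overhead, so treat it as an upper bound on what is left for context.

```python
def headroom_gb(total_gb: float, model_gb: float) -> float:
    """Memory left over for KV cache / context after the weights are loaded.

    Rough upper bound: ignores runtime, OS, and activation overhead.
    """
    return total_gb - model_gb


# Figures quoted in the thread: a 512GB machine, 372-475GB for the Q4 model.
TOTAL_GB = 512
for model_gb in (372, 475):
    left = headroom_gb(TOTAL_GB, model_gb)
    print(f"model {model_gb}GB -> about {left}GB left for context")
```

So the smaller Q4 variant leaves roughly 140GB for context, while the largest leaves under 40GB.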
	▲	rnxrx 3 hours ago
		It depends on what's meant by "fully utilized", but fp8 quants of Nemotron 3 Super, the latest Minimax, Cohere A+, and the Mistral small and (especially) medium variants all sit in that 128-256GB category, especially with full context or even moderate concurrency. In fact, in a 192GB environment I work with (Hopper GPUs, fwiw), I was pushed into using 4-bit quants with a couple of those to get the model working with a reasonable context window (...but 256GB would have rocked out).
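A back-of-the-envelope sketch of why dropping from fp8 to 4-bit frees up room in a 192GB budget. Weight memory is roughly parameters times bits per weight divided by 8. The 120B parameter count below is a hypothetical placeholder, not the size of any model named in this thread, and `weights_gb` is an illustrative helper that ignores quantization scales, KV cache, and runtime overhead.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB: params * bits / 8.

    Ignores per-block quantization scales and any runtime overhead.
    """
    return params_billion * bits_per_weight / 8


BUDGET_GB = 192        # environment size mentioned in the comment above
PARAMS_BILLION = 120   # hypothetical model size, for illustration only

for label, bits in (("fp8", 8), ("4-bit", 4)):
    size = weights_gb(PARAMS_BILLION, bits)
    print(f"{label}: ~{size:.0f}GB weights, ~{BUDGET_GB - size:.0f}GB left for context")
```

Under those assumptions, the 4-bit quant leaves about 132GB for context and concurrency versus about 72GB at fp8, which is the kind of trade-off described above.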