ekojs | 2 hours ago

As this is a dense model and it's pretty sizable, 4-bit quantization can be nearly lossless. With that, you can run this on a 3090/4090/5090. You can probably even go FP8 with a 5090 (though there will be tradeoffs). Probably ~70 tok/s on a 5090 and roughly half that on a 4090/3090. With speculative decoding, you can go even faster (2-3x, I'd say). Pretty amazing what you can get locally.
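The claims above are easy to sanity-check with back-of-envelope math: weight memory is parameter count × bits per weight, and the speculative-decoding gain follows the standard expected-accepted-tokens formula. A sketch in Python, where the ~30B parameter count, the 0.8 acceptance rate, and the draft length of 4 are illustrative assumptions, not this model's measured numbers:

```python
# Back-of-envelope sizing for a dense model; all concrete numbers are assumptions.

def weight_vram_gib(params_billions: float, bits_per_weight: float) -> float:
    """GiB needed for the weights alone, ignoring activations and KV cache."""
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30

# Hypothetical ~30B dense model:
print(f"Q4: {weight_vram_gib(30, 4):.1f} GiB")  # ~14 GiB: fits a 24GB 3090/4090
print(f"Q8: {weight_vram_gib(30, 8):.1f} GiB")  # ~28 GiB: tight even on a 32GB 5090

def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens accepted per target-model forward pass with speculative
    decoding (geometric-series formula; assumes i.i.d. per-token acceptance)."""
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)

# With an 80% acceptance rate and 4 drafted tokens per pass:
print(f"{expected_tokens_per_pass(0.8, 4):.2f} tokens/pass")  # ~3.4, before draft-model overhead
```

So a 2-3x end-to-end speedup is plausible once the draft model's own cost is subtracted; the real number depends heavily on how well the draft matches the target.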
|
Aurornis | 2 hours ago

> As this is a dense model and it's pretty sizable, 4-bit quantization can be nearly lossless

The 4-bit quants are far from lossless. The effects show up more on longer-context problems.

> You can probably even go FP8 with 5090 (though there will be tradeoffs)

You cannot run these models at 8-bit on a 32GB card because you need space for context. Typically it would be Q5 on a 32GB card to fit the context lengths needed for anything other than short answers.
alex7o | 9 minutes ago

Turboquant at 4-bit helps a lot with keeping context in VRAM too, but int4 is definitely not lossless. It all depends, though; for some people it's sufficient.
ekojs | 2 hours ago

> You cannot run these models at 8-bit on a 32GB card because you need space for context

You probably can, actually. Not saying it would be ideal, but it can fit entirely in VRAM (if you make sure to quantize the attention layers). KV cache quantization and not loading the vision tower would help quite a bit. Not ideal for long context, but it should be very much possible.

I addressed the lossless claim in another reply, but I guess it really depends on what the model is used for. For my use cases, it's nearly lossless, I'd say.
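The context-memory tradeoff being argued here can be quantified: per-token KV cache is 2 (keys and values) × layers × KV heads × head dim × bytes per element. A sketch with assumed architecture numbers (48 layers, 8 KV heads via GQA, head dim 128 — illustrative, not this model's actual config):

```python
# KV cache sizing sketch; the architecture numbers below are assumptions.

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_len: int, bytes_per_elem: float) -> float:
    """GiB of KV cache for a given context length (2x for keys and values)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 2**30

# Hypothetical 48-layer model with GQA (8 KV heads, head_dim 128), 32k context:
print(f"fp16 KV: {kv_cache_gib(48, 8, 128, 32768, 2):.1f} GiB")  # 6.0 GiB
print(f"fp8  KV: {kv_cache_gib(48, 8, 128, 32768, 1):.1f} GiB")  # 3.0 GiB
```

Quantizing the KV cache to 8-bit halves the context footprint, which is why it matters on a 32GB card where Q8 weights for a ~30B model would already consume most of the budget.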
|
|
zozbot234 | 2 hours ago

4-bit quantization is almost never lossless, especially for agentic work; it's the lowest end of what's reasonable. It's advocated as preferable to a model with fewer parameters that's been quantized at higher precision.
ekojs | 2 hours ago

Yeah, I figured the 'nearly lossless' claim would be the most controversial part. But in my defense, ~97% recovery on benchmarks is what I consider 'nearly lossless'. When quantized with calibration data for a specialized domain, the difference on my internal benchmark is pretty much indistinguishable. But for agentic work, 4-bit quants can indeed fall a bit short in long-context use cases, especially if you also quantize the attention layers.
|
|
binary132 | 2 hours ago
| That seems awfully speculative without at least some anecdata to back it up. |
arcanemachiner | 2 hours ago

Sure, go get some. This isn't the first open-weight LLM to be released; people tend to get a feel for this stuff over time.

Let me give you some more baseless speculation: based on the quality of the 3.5 27B and 3.6 35B models, this model is going to absolutely crush it.
ekojs | 2 hours ago

Not at all. I actually run ~30B dense models in production and have tested the 5090/3090 for that. There are gotchas, of course, but the speed/quality claims should be roughly there.