zaat 2 days ago
Thank you for your work. You have an answer on your page regarding "Should I pick 26B-A4B or 31B?", but can you please clarify: assuming 24GB VRAM, should I pick the full-precision smaller model or the 4-bit larger model?
petu 2 days ago | parent | next
Try 26B first. 31B seems to have a very heavy KV cache (maybe bugged in llama.cpp at the moment; 16K takes up 4.9GB).

edit: the 31B cache is not bugged; there's a static SWA cost of 3.6GB. So IQ4_XS at 15.2GB seems like a reasonable pairing, but even then it's barely enough for 64K context on 24GB VRAM. Maybe 8-bit KV quantization is fine now after https://github.com/ggml-org/llama.cpp/pull/21038 got merged, so 100K+ is possible.

> I should pick a full precision smaller model or 4 bit larger model?

The 4-bit larger model. You have to use a quant either way -- even if by "full precision" you mean 8-bit, that's 26GB + overhead + chat context. Try UD-Q4_K_XL.
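The arithmetic above can be sketched as a quick back-of-the-envelope budget. All figures are taken from this comment (15.2GB IQ4_XS weights, ~3.6GB static SWA cost, 4.9GB total cache at 16K, implying ~1.3GB of growing KV per 16K tokens); the linear KV scaling and the halving under 8-bit KV quantization are rough assumptions, not measurements, and runtime overhead is ignored:

```python
# Rough VRAM budget for the 31B model at IQ4_XS, using figures quoted
# in the thread. Approximations only -- not measured values.

WEIGHTS_GB = 15.2      # IQ4_XS quant of the 31B model
STATIC_SWA_GB = 3.6    # fixed sliding-window attention cache cost
KV_GB_PER_16K = 1.3    # growing KV cache per 16K tokens at fp16 KV

def vram_estimate_gb(context_tokens: int, kv_bits: int = 16) -> float:
    """Estimate total VRAM (GB) for a context length and KV precision."""
    kv = KV_GB_PER_16K * (context_tokens / 16_384) * (kv_bits / 16)
    return WEIGHTS_GB + STATIC_SWA_GB + kv

if __name__ == "__main__":
    for ctx in (16_384, 65_536, 131_072):
        fp16 = vram_estimate_gb(ctx)
        q8 = vram_estimate_gb(ctx, kv_bits=8)
        print(f"{ctx // 1024:>4}K: {fp16:5.1f} GB (fp16 KV) | {q8:5.1f} GB (q8 KV)")
```

Under these assumptions, 64K at fp16 KV lands at ~24.0GB (exactly the budget, with nothing left for overhead), while q8 KV keeps even 128K at ~24GB -- consistent with "barely enough for 64K" and "100K+ is possible".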
danielhanchen 2 days ago | parent | prev
Thank you! I presume 26B-A4B is somewhat faster since only 4B parameters are activated per token -- 31B is quite a large dense model, so it's more accurate!