mixtureoftakes 2 days ago
what is the main difference between "normal" quants and the UD ones?

car 2 days ago
They explain it here: https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs

For the best-quality reply, I used the Gemma-4 31B UD-Q8_K_XL quant with Unsloth Studio to summarize the URL with web search. It produced 4.9 tok/s (including web search) on a MacBook Pro M1 Max with 64GB. Here is an excerpt, in its own words:

Unsloth Dynamic 2.0 Quantization

Dynamic 2.0 is not just a "bit-reduction" but an intelligent, per-layer optimization strategy.

- Selective Layer Quantization: Instead of making every layer 4-bit, Dynamic 2.0 analyzes every single layer and selectively adjusts the quantization type. Some critical layers may be kept at higher precision, while less critical layers are compressed more.
- Model-Specific Tailoring: The quantization scheme is custom-built for each model. For example, the layers selected for quantization in Gemma 3 are completely different from those in Llama 4.
- High-Quality Calibration: They use a hand-curated calibration dataset of >1.5M tokens specifically designed to enhance conversational chat performance, rather than just optimizing for Wikipedia-style text.
- Architecture Agnostic: While previous versions were mostly effective for MoE (Mixture of Experts) models, Dynamic 2.0 works for all architectures (both MoE and non-MoE).
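
To make the "selective layer quantization" point a bit more concrete, here is a toy sketch of the general idea: quantize everything at a low bit width, measure how much each layer loses, and keep the worst-hit layers at higher precision. This is not Unsloth's actual procedure (the excerpt does not describe their selection criterion). The layer names, the relative-error metric, the round-to-nearest quantizer, and the `keep_fraction` parameter are all assumptions made up for illustration.

```python
import random


def quantize(values, bits):
    """Symmetric round-to-nearest quantization of a list of floats."""
    levels = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit
    scale = max(abs(v) for v in values) / levels or 1.0
    return [round(v / scale) * scale for v in values]


def relative_error(original, approx):
    """Squared error of the approximation, normalised by signal energy."""
    noise = sum((x - y) ** 2 for x, y in zip(original, approx))
    energy = sum(x * x for x in original) or 1.0
    return noise / energy


def assign_bits(layers, low_bits=4, high_bits=8, keep_fraction=0.25):
    """Hypothetical policy: the layers hurt most at low_bits get high_bits."""
    errors = {
        name: relative_error(w, quantize(w, low_bits))
        for name, w in layers.items()
    }
    n_high = max(1, round(len(layers) * keep_fraction))
    sensitive = set(sorted(errors, key=errors.get, reverse=True)[:n_high])
    return {
        name: (high_bits if name in sensitive else low_bits)
        for name in layers
    }


# Toy "model": random weights under made-up layer names.
rng = random.Random(0)
layers = {
    name: [rng.gauss(0.0, 0.02) for _ in range(256)]
    for name in ("blk.0.attn_q", "blk.0.ffn_up", "blk.0.ffn_down", "output")
}
# One large outlier stretches the quantization grid for this layer,
# so its many small weights collapse to zero at 4 bits.
layers["blk.0.ffn_down"][0] = 1.0

plan = assign_bits(layers)
print(plan)
```

In this toy setup the layer with the outlier is the one kept at 8 bits, because a single large weight forces a coarse grid on everything else in that layer. A real scheme would presumably judge sensitivity from model outputs on calibration data rather than from raw weight error, which is where the calibration dataset mentioned in the excerpt would come in.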