| ▲ | Maxious 3 hours ago |
| ICYMI, unsloth has had some major breakthroughs today with the Qwen3.5 local models https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks With the Qwen3.5 35B A3B at Q4, I've got a 200k context running at 62.98 tokens per second on a local RTX 5080 16GB. |
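|
| Roughly, the llama.cpp invocation for a setup like this looks something like the sketch below (the filename and exact flags are guesses, so treat them as approximate rather than my literal config): |

    # Illustrative only: keep attention/shared weights on the GPU, park the MoE expert
    # tensors in system RAM (--cpu-moe), and quantize the KV cache so a long context fits.
    llama-server -m Qwen3.5-35B-A3B-Q4_K_XL.gguf -c 200000 -ngl 99 --cpu-moe \
      --cache-type-k q8_0 --cache-type-v q8_0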
|
| ▲ | danielhanchen an hour ago | parent | next [-] |
| Oh, I didn't expect this to be on HN haha - but yes, for our new benchmarks for Qwen3.5 we devised a slightly different approach to quantization, which we plan to roll out to all new models from now on! |
|
| ▲ | roxolotl an hour ago | parent | prev | next [-] |
| What method are you using to do that? I’ve been playing with llama.cpp a lot lately and trying to figure out the cleanest options for getting a solid context window on 32GB VRAM and 64GB system RAM. |
| |
| ▲ | jychang 33 minutes ago | parent [-] |
| 32GB VRAM is more than enough for Qwen3.5 35B. You can just load the Q4_K_XL model like normal and put all tensors on the GPU, without any -ot or --cpu-moe flags. |
|
| If you need a massive context for some reason where model + KV cache won't fit in 32GB, then use -ot to move the FFN MoE experts for 1-2 layers into RAM. You'll get a speed hit (due to loading params from slower RAM instead of fast VRAM), but it'll work. |
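|
| As a concrete sketch, moving the expert tensors of the first two layers into system RAM looks roughly like this (the tensor-name regex is illustrative; check the actual tensor names in the GGUF before relying on it): |

    # Keep everything on the GPU except the MoE expert tensors of layers 0-1,
    # which are overridden to the CPU buffer via --override-tensor / -ot.
    llama-server -m Qwen3.5-35B-A3B-Q4_K_XL.gguf -c 131072 -ngl 99 \
      -ot 'blk\.(0|1)\.ffn_.*_exps\.weight=CPU'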
|
|
| ▲ | Kayou 3 hours ago | parent | prev | next [-] |
| Wait, the Q4 quantization, which is more than 20GB, fits in your 16GB GPU? I didn't know that was possible; I was always restricting myself to models smaller than the VRAM I had. |
| |
| ▲ | segmondy 3 hours ago | parent | next [-] |
| llama.cpp is designed for partial offloading: the most important parts of the model are loaded onto the GPU and the rest stays in system RAM. I run 500B+ models such as DeepSeek/KimiK2.5/GLM-5 without having that much GPU VRAM. |
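|
| For the really big MoE models, a sketch of that kind of split looks like the lines below (placeholder filenames, flags as found in recent llama.cpp builds): |

    # Illustrative: put the layer stack on the GPU but keep every routed-expert
    # tensor in system RAM, which is where most of a big MoE's parameters live.
    llama-server -m some-500B-moe-Q4_K_M.gguf -ngl 99 --cpu-moe

    # Or simply cap how many transformer layers get offloaded to the GPU:
    llama-server -m some-500B-moe-Q4_K_M.gguf -ngl 20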
| ▲ | Koffiepoeder an hour ago | parent | prev | next [-] |
| The A3B part in the name stands for `Active 3B`: at inference time only about 3B parameters are active, with the router picking which subparts of the model to use (MoE, mixture of experts). If you use these models mostly for related/similar tasks, you can make do with a lot less than the full 35B params in fast memory. These models are therefore also sometimes called sparse models. |
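|
| Rough back-of-envelope numbers (assuming ~4.5 bits per weight for a Q4_K-style quant): 35B total params is on the order of 20GB of weights, but with only ~3B active per token each decode step reads roughly 1.7GB of them, which is why throughput can stay reasonable even when some of the experts live in slower system RAM. |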
| ▲ | Maxious 3 hours ago | parent | prev | next [-] |
| Yep. These Mixture of Experts models are well suited to paging in only the weights relevant to a given task: https://huggingface.co/blog/moe There are also some experiments on removing or merging experts post-training to shrink models even further: https://bknyaz.github.io/blog/2026/moe/ |
| ▲ | nurettin an hour ago | parent | prev [-] |
| This is why they say "A3B": only 3B parameters are active at a time, limiting VRAM usage. |
|
|
| ▲ | mirekrusin 2 hours ago | parent | prev | next [-] |
| 2x RTX 4090, Q8, 256k context, 110 t/s |
|
| ▲ | RS-232 an hour ago | parent | prev | next [-] |
| That’s intriguing. I have the same card; maybe I should give it a go. Curious about your CPU/RAM/storage capacity as well. Any resources for configuring the local setup? My entire home media stack is a single compose file in a WSL distro, so it would be cool if a local LLM worked the same way. |
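|
| (For the compose angle: llama.cpp publishes container images, so something like the sketch below should be close. The image tag, paths, and flags here are assumptions to verify against the llama.cpp docs, not a known-good config.) |

    # Sketch of a compose service for llama.cpp's OpenAI-compatible server.
    # Image tag, model path, and flags are assumptions - double-check before use.
    services:
      llama:
        image: ghcr.io/ggml-org/llama.cpp:server-cuda
        command: -m /models/Qwen3.5-35B-A3B-Q4_K_XL.gguf -c 65536 -ngl 99 --host 0.0.0.0 --port 8080
        volumes:
          - ./models:/models
        ports:
          - "8080:8080"
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: all
                  capabilities: [gpu]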
|
| ▲ | jychang 3 hours ago | parent | prev [-] |
| Not really breakthroughs, more like bugfixes for their broken first batch. |
| |
| ▲ | danielhanchen an hour ago | parent [-] |
| No, this is false - unsure if you saw our new blog - https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks - which shows SOTA at nearly all bit widths, and we shared all our research as well. |
| ▲ | jychang an hour ago | parent [-] |
| Yeah, I saw that yesterday. The blog post does not explain why/how the Qwen3.5 quants uploaded on 2/27 are different from the files uploaded on 2/24. |
|
| Old 2/24 Q4_K_XL commit (pre-bugfix files): https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/commit/7... |
|
| Questions for a postmortem that the blog post left unanswered: |
|
| - Why the change? Is it just to improve PPL/KLD? Sure, we can assume PPL and KLD are not perfect benchmarks, but if so, why change the quantization at all? Or was the old 2/24 quant actually much worse in the real world? I presume the Q4_K_XL quant using mxfp4 was the issue? If the 2/24 files having a lower PPL is an actual problem caused by low-quality tensors, then why not just say that? |
|
| - What were the main tensors whose quantization changed from 2/24 to 2/27? Did you quantize the attention tensors differently? Or perhaps the ssm tensors? |
|
| - What was it changed from? Was it changed from mxfp4 or q4_k to q8, or something else? |
|
| A quick sentence in the blog post along the lines of "ok, we've confirmed that using mxfp4 (or q3 or whatever) in the attention/ssm/biases/norms/etc is a bad idea, we had that in our old models on 2/24 and our new models today are better" would make it clear. As it's written, it's trying to say both "PPL/KLD don't actually reflect real world quality" and "we changed our quant to increase PPL/KLD" at the same time, which seems contradictory. |
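|
| (If anyone wants to check what actually changed between the two uploads themselves, the per-tensor quant types can be diffed with the gguf-dump tool from the `gguf` Python package; the filenames below are placeholders.) |

    # Dump tensor metadata (name, shape, quant type) for each upload and diff them.
    pip install gguf
    gguf-dump Qwen3.5-35B-A3B-Q4_K_XL.old.gguf > old.txt
    gguf-dump Qwen3.5-35B-A3B-Q4_K_XL.new.gguf > new.txt
    diff old.txt new.txt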
|
|