| ▲ | gcr 2 hours ago | |
There are two flavors of Qwen 3.6:

- A 27B "dense" model
- A 35B "Mixture of Experts" (MoE) model, which activates only 3B parameters for each token.

For your hardware, I strongly recommend `unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M`. I have a 2021 M1 Max with 32GB of unified memory, and with llama.cpp's default settings it reads (prompt processing) at ~300-500 tokens/sec and writes (generation) at ~30 tokens/sec with that model, which is plenty fast. On the same machine, the 27B model reads at ~70 tokens/sec and writes at ~5 tokens/sec.

The 35B MoE model technically takes slightly more memory, but it is much faster because it activates 3B parameters per token instead of 27B, roughly 1/9th the work. It's not quite as "smart", but it's comparable. | ||
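To make the "1/9th the work" figure concrete, here is a rough back-of-the-envelope sketch in Python. The parameter counts and token rates are the ones quoted in the comment above; the bits-per-weight value for `Q4_K_M` is an assumed ballpark rather than a measured number, and the helper names are made up for illustration.

```python
# Figures quoted in the comment above (billions of parameters, tokens/sec).
DENSE_PARAMS_B = 27
MOE_TOTAL_PARAMS_B = 35
MOE_ACTIVE_PARAMS_B = 3
DENSE_WRITE_TOK_S = 5
MOE_WRITE_TOK_S = 30

# ASSUMPTION: Q4_K_M averages somewhere near 4.8 bits per weight.
# This is a ballpark for illustration, not a value taken from the thread.
ASSUMED_BITS_PER_WEIGHT = 4.8


def active_work_ratio(active_params_b: float, dense_params_b: float) -> float:
    """Fraction of per-token weight reads the MoE does relative to the dense model."""
    return active_params_b / dense_params_b


def approx_weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Very rough size of the weights alone (ignores KV cache and runtime overhead)."""
    return params_b * bits_per_weight / 8


work_ratio = active_work_ratio(MOE_ACTIVE_PARAMS_B, DENSE_PARAMS_B)
observed_speedup = MOE_WRITE_TOK_S / DENSE_WRITE_TOK_S

print(f"per-token work, MoE vs dense: {work_ratio:.3f}")
print(f"observed generation speedup: {observed_speedup:.1f}x")
print(f"approx dense weights: {approx_weights_gb(DENSE_PARAMS_B, ASSUMED_BITS_PER_WEIGHT):.1f} GB")
print(f"approx MoE weights:   {approx_weights_gb(MOE_TOTAL_PARAMS_B, ASSUMED_BITS_PER_WEIGHT):.1f} GB")
```

The quoted generation speeds work out to a 6x speedup rather than the full 9x that the active-parameter ratio alone would suggest, and the memory estimate shows why the MoE still needs more room: all 35B parameters have to be loaded even though only 3B are used per token.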
| ▲ | julianlam an hour ago | parent | next [-] | |
May I ask why the M instead of XL? Obviously bigger != better, but I don't know what the differences are. | ||
| ▲ | pixelesque an hour ago | parent | prev [-] | |
Thank you - I'll give that a go! | ||