pixelesque 2 hours ago
Out of interest, what machine and model are you running it on? I tried the qwen3.6-27b Q6_K GGUF in llama.cpp and LM Studio on my 32GB M2 MacBook Pro last week, and I barely get a token a second with either. What sort of speed should I be expecting? I tried some of the Llama 3 34b (nous-capybara?) models two years ago with llama.cpp, and I seem to remember getting a few tokens a second then, so I'm not sure whether I've got something completely misconfigured or I just have unreasonable expectations. Or maybe qwen 3.x is slower for some reason? (Is it mixture of experts?) I'm not expecting it to be instant, but what I'm currently seeing is not really usable.
booty 2 minutes ago
The fact that it was this slow makes me suspect it's a matter of insufficient free RAM. The entire model needs to fit into RAM (and stay there the entire time) for acceptable performance. (I'm not sure of the exact diagnosis or fix, but definitely look in that direction if you're still having this issue when you give it another shot.)
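As a rough sanity check on the RAM theory, here is a minimal sketch that estimates the size of the quantized weights from the parameter count. The bits-per-weight values are approximate figures for llama.cpp quant types, and `reserved_gb` (memory held back for the OS, other apps, and the KV cache) is a hypothetical parameter for illustration, not a measured number.

```python
# Approximate bits per weight for a few llama.cpp quant types.
# These are rounded assumptions; actual GGUF files vary by model.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q6_K": 6.56, "Q8_0": 8.5}


def weights_gb(params_billion: float, quant: str) -> float:
    """Rough size of the quantized weights in GB (10^9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits(params_billion: float, quant: str, ram_gb: float, reserved_gb: float) -> bool:
    """True if the weights plus an assumed reserve fit in total RAM."""
    return weights_gb(params_billion, quant) + reserved_gb <= ram_gb


if __name__ == "__main__":
    for quant in ("Q6_K", "Q4_K_M"):
        size = weights_gb(27, quant)
        # reserved_gb=12 is an illustrative guess, not a recommendation.
        ok = fits(27, quant, ram_gb=32, reserved_gb=12)
        print(f"27B at {quant}: ~{size:.1f} GB of weights, fits in 32 GB: {ok}")
```

Under these assumptions a 27B model at Q6_K is roughly 22 GB of weights, which leaves little slack on a 32GB machine once anything else is running.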
gcr 2 hours ago
There are two flavors of Qwen 3.6:

- A 27B "dense" model
- A 35B "Mixture of Experts" model, which activates only 3B parameters for each token

For your hardware, I strongly recommend `unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M`. I have an M1 Max from 2021 with 32GB of unified memory that can read at ~300-500 tokens/sec and write at ~30 tokens/sec with llama.cpp's default settings, which is plenty fast. The 27B model can read at ~70 tok/sec and write at ~5 tok/sec. The 35B MoE model technically takes slightly more memory but is much faster because it's doing roughly 1/9th the work per token (3B active parameters versus 27B). It's not quite as "smart", but it's comparable.
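To illustrate the "less work per token" point, here is a back-of-the-envelope sketch. If token generation is mostly limited by how fast the active weights can be read from memory, a crude ceiling is memory bandwidth divided by the bytes of active weights. The 400 GB/s bandwidth and the 4.85 bits per weight are assumptions for illustration, and the estimate ignores attention, KV cache reads, and any always-active layers in the MoE, so real speeds will sit well below these ceilings.

```python
def decode_ceiling_tok_s(active_params_billion: float,
                         bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Crude upper bound on tokens/sec if every active weight is read once per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Assumed values for illustration only.
BANDWIDTH_GB_S = 400.0   # assumed memory bandwidth
BITS_PER_WEIGHT = 4.85   # approximate figure for a 4-bit k-quant

dense = decode_ceiling_tok_s(27, BITS_PER_WEIGHT, BANDWIDTH_GB_S)  # all 27B active
moe = decode_ceiling_tok_s(3, BITS_PER_WEIGHT, BANDWIDTH_GB_S)     # 3B active per token

if __name__ == "__main__":
    print(f"dense 27B ceiling:     {dense:.0f} tok/s")
    print(f"MoE 3B-active ceiling: {moe:.0f} tok/s")
    print(f"ratio:                 {moe / dense:.1f}x")
```

The absolute numbers depend entirely on the assumed bandwidth and quantization, but the ratio between the two is 9x whatever values are plugged in, since it comes only from 27B versus 3B active parameters.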
mft_ 2 hours ago
The 27B model is dense, so it is relatively slow. The 35B-A3B model is marginally weaker, but being MoE it is much faster: roughly 4-8x faster in basic benchmarks on my M1 Max. For comparison, I just ran a couple of quick benchmarks (default settings) with llama-bench: Qwen3.6-35B-A3B at Q6_K_XL gave 858 t/s pp512 (prompt processing) and 43 t/s tg128 (token generation). Qwen3.6-27B at Q4_K_XL gave 103 t/s pp512 and 8 t/s tg128.
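For anyone who wants the ratios spelled out, a small sketch that computes the speedups from the llama-bench figures quoted above. Note that the two runs use different quantizations (Q6_K_XL versus Q4_K_XL), so this is not a like-for-like comparison; the dictionary keys are just labels chosen here.

```python
# Tokens/sec figures copied from the llama-bench results in this comment.
results = {
    "Qwen3.6-35B-A3B Q6_K_XL": {"pp512": 858, "tg128": 43},
    "Qwen3.6-27B Q4_K_XL": {"pp512": 103, "tg128": 8},
}


def speedup(test: str) -> float:
    """How many times faster the MoE run was than the dense run on one test."""
    moe = results["Qwen3.6-35B-A3B Q6_K_XL"][test]
    dense = results["Qwen3.6-27B Q4_K_XL"][test]
    return moe / dense


if __name__ == "__main__":
    for test in ("pp512", "tg128"):
        print(f"{test}: {speedup(test):.1f}x")
```

That works out to about 8.3x for prompt processing and about 5.4x for token generation.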
Figs 2 hours ago
27B is the dense one. Try the Qwen3.6-35B-A3B variants for the MoE release. That's what I'm running on a Framework Desktop, and I get ~50 tok/s plus or minus a few. The dense one is similarly slow for me. I'm not sure what to expect from the MoE on your hardware, but it should probably be much faster.
KronisLV 2 hours ago
> qwen3.6-27b Q6_K

That's the dense model; you probably want a mixture-of-experts (MoE) one instead: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF