dust42 | 8 hours ago
Unfortunately Qwen3-Next is not well supported on Apple silicon; it seems the Qwen team doesn't really care about Apple. On an M1 with 64GB, Q4_K_M on llama.cpp gives only 20 tok/s, while on MLX it is more than twice as fast. However, MLX has problems with KV cache consistency, especially with branching. So while in theory it is twice as fast as llama.cpp, it often redoes prompt processing (PP) from scratch, which completely trashes performance, especially with agentic coding. The agony is deciding whether to endure half the possible speed but get much better KV caching in return, or to have twice the speed but then often sit through prompt processing all over again. But who knows, maybe Qwen gives them a hand? (hint, hint)
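For reference, a minimal sketch of what prompt-cache reuse looks like with mlx-lm, i.e. what you lose when branching invalidates the cache. This assumes a recent mlx-lm that exposes make_prompt_cache and a prompt_cache argument to generate() (check your installed version); the model repo id and file path are just examples:

    # Sketch: reuse a KV (prompt) cache across turns with mlx-lm.
    # Assumes mlx_lm.models.cache.make_prompt_cache exists and that
    # generate() accepts a prompt_cache kwarg (recent releases do).
    from mlx_lm import load, generate
    from mlx_lm.models.cache import make_prompt_cache

    # Example model id -- substitute whichever MLX quant you actually use.
    model, tokenizer = load("mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit")

    cache = make_prompt_cache(model)
    long_context = open("repo_dump.txt").read()  # big shared prefix

    # Turn 1: the whole prefix is prefilled once and stored in `cache`.
    generate(model, tokenizer, prompt=long_context + "\n\nQ: summarize this",
             prompt_cache=cache, max_tokens=256)

    # Turn 2: pass only the new tokens; the prefix is served from `cache`,
    # so there is no second prompt-processing pass.
    generate(model, tokenizer, prompt="\n\nQ: now refactor module X",
             prompt_cache=cache, max_tokens=256)

    # Branching (editing an earlier message) cannot append to this cache:
    # everything after the edit point must be prefilled again, which is
    # exactly the "PP all over again" cost described above.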
ttoinou | 8 hours ago
I can run nightmedia/qwen3-next-80b-a3b-instruct-mlx at 60-74 tps using LM Studio. What did you try? What benefit do you get from KV caching?
cgearhart | 4 hours ago
Any notes on the problems with MLX caching? I've experimented with local models on my MacBook and there's usually a good speedup from MLX, but I wasn't aware there's an issue with prompt caching. Is it from MLX itself or from LM Studio/mlx-lm/etc.?