dust42 | 8 hours ago
Unfortunately Qwen3-Next is not well supported on Apple silicon; it seems the Qwen team doesn't really care about Apple. On an M1 with 64GB, Q4_K_M on llama.cpp gives only 20 tok/s, while on MLX it is more than twice as fast. However, MLX has problems with KV cache consistency, especially with branching. So while in theory it is twice as fast as llama.cpp, it often redoes prompt processing (PP) from scratch, which completely trashes performance, especially with agentic coding. The agony is deciding whether to endure half the possible speed but get much better KV caching in return, or to have twice the speed but then often sit through prompt processing all over again. But who knows, maybe Qwen gives them a hand? (hint, hint)
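For reference, a minimal sketch of what prompt-cache reuse looks like with mlx-lm, i.e. what you lose when branching invalidates the cache. This assumes a recent mlx-lm that exposes make_prompt_cache and a prompt_cache argument to generate() (check your installed version); the model repo id and file path are just examples:

    # Sketch: reuse a KV (prompt) cache across turns with mlx-lm.
    # Assumes mlx_lm.models.cache.make_prompt_cache exists and that
    # generate() accepts a prompt_cache kwarg (recent releases do).
    from mlx_lm import load, generate
    from mlx_lm.models.cache import make_prompt_cache

    # Example model id -- substitute whichever MLX quant you actually use.
    model, tokenizer = load("mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit")

    cache = make_prompt_cache(model)
    long_context = open("repo_dump.txt").read()  # big shared prefix

    # Turn 1: the whole prefix is prefilled once and stored in `cache`.
    generate(model, tokenizer, prompt=long_context + "\n\nQ: summarize this",
             prompt_cache=cache, max_tokens=256)

    # Turn 2: pass only the new tokens; the prefix is served from `cache`,
    # so there is no second prompt-processing pass.
    generate(model, tokenizer, prompt="\n\nQ: now refactor module X",
             prompt_cache=cache, max_tokens=256)

    # Branching (editing an earlier message) cannot append to this cache:
    # everything after the edit point must be prefilled again, which is
    # exactly the "PP all over again" cost described above.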
ttoinou | 8 hours ago
I can run nightmedia/qwen3-next-80b-a3b-instruct-mlx at 60-74 tps using LM Studio. What did you try? What benefit do you get from KV caching?
cgearhart | 4 hours ago
Any notes on the problems with MLX caching? I've experimented with local models on my MacBook and there's usually a good speedup from MLX, but I wasn't aware there's an issue with prompt caching. Is it from MLX itself or from LM Studio/mlx-lm/etc.?