ttoinou 7 hours ago

I can run nightmedia/qwen3-next-80b-a3b-instruct-mlx at 60-74 tps using LM Studio. What did you try? What benefit do you get from KV caching?
dust42 6 hours ago | parent

KV caching means that with a 10k-token prompt, all follow-up questions return immediately - this is standard in all inference engines. But if you are not happy with the last answer, you may want to simply regenerate it, or change your last question - this is branching the conversation. Llama.cpp can reuse the KV cache up to the branch point, while MLX cannot (I am using the MLX server from the MLX community project). I haven't tried it with LM Studio. Maybe worth a try, thanks for the heads-up.
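To make the branch-point reuse concrete, here is a minimal sketch of the idea - my own illustration, not llama.cpp's or MLX's actual code. The function names are made up; the point is that after a branch, only tokens past the longest shared prefix need their KV entries recomputed:

```python
# Illustrative sketch of prefix-based KV-cache reuse on a conversation
# branch (hypothetical helpers, not a real inference engine's API).

def common_prefix_len(a, b):
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def tokens_to_recompute(cached_tokens, new_tokens):
    """Return the tokens whose KV entries must be recomputed.

    KV entries for the shared prefix are reused as-is; everything after
    the divergence point (a regenerated answer, an edited question) is
    processed again from scratch.
    """
    keep = common_prefix_len(cached_tokens, new_tokens)
    return new_tokens[keep:]

# Example: a 10k-token prompt plus an old answer is cached; the user
# edits only the last question, so just the short edited tail is
# re-processed instead of all 10k+ tokens.
cached = list(range(10_000)) + [101, 102, 103]  # prompt + old answer
new = list(range(10_000)) + [201, 202]          # prompt + new question
print(len(tokens_to_recompute(cached, new)))    # prints 2
```

An engine without this prefix matching throws the whole cache away on any edit and re-ingests the full prompt, which is where the long pause on branching comes from.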