Remix clone Hacker News

	▲	ttoinou 7 hours ago
		I can run nightmedia/qwen3-next-80b-a3b-instruct-mlx at 60-74 tokens per second using LM Studio. What did you try? What benefit do you get from KV caching?
	▲	dust42 6 hours ago
		KV caching means that once a 10k-token prompt has been processed, follow-up questions only need the new tokens evaluated, so they come back almost immediately. This is standard in all inference engines. Now, if you are not happy with the last answer, you may want to simply regenerate it or change your last question; this is branching the conversation. llama.cpp can reuse the KV cache up to the branch point, while MLX does not (I am using the MLX server from the MLX community project). I haven't tried with LM Studio. Maybe worth a try, thanks for the heads-up.
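
		To make the branching case concrete, here is a toy Python sketch. It is not llama.cpp or MLX code: `PrefixCache` and `reuse` are made-up names, and the "cache" only tracks which token IDs it covers rather than real key/value tensors. It just illustrates the idea that an engine reusing the shared prefix has to process only the tokens after the branch point.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PrefixCache:
    """Toy stand-in for a KV cache: it only remembers which tokens it covers."""

    tokens: list[int] = field(default_factory=list)

    def reuse(self, prompt: list[int]) -> tuple[int, int]:
        """Return (tokens_reused, tokens_to_process) for a new prompt."""
        shared = 0
        # Count how many leading tokens match what is already cached.
        for cached, new in zip(self.tokens, prompt):
            if cached != new:
                break
            shared += 1
        # Everything past the branch point is discarded and recomputed,
        # so the cache now covers exactly the new prompt.
        self.tokens = list(prompt)
        return shared, len(prompt) - shared


cache = PrefixCache()
context = list(range(10_000))  # pretend this is the 10k-token prompt

# First question: nothing is cached yet, so every token is processed.
first = cache.reuse(context + [1, 2, 3])

# Changed last question: the 10k prefix is reused, only 4 new tokens remain.
branch = cache.reuse(context + [7, 8, 9, 10])

print(first)   # (0, 10003)
print(branch)  # (10000, 4)
```

		Without prefix reuse, the second call would also have to process all 10,004 tokens, which is the slowdown described above when regenerating or editing the last turn.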