ThatPlayer | 3 days ago
I don't believe that's the same thing. That's the generic layer offloading ollama does for any model too big to fit in VRAM, whereas this feature is specific to MoE models: as I understand it, the attention and shared weights stay on the GPU and only the large expert tensors are kept in system RAM, which is relatively cheap because only a few experts are active per token. https://github.com/ollama/ollama/issues/11772 is the feature request for the equivalent in ollama. One comment in that thread mentions getting almost 30 tk/s from gpt-oss-120b on a 3090 with llama.cpp, compared to 8 tk/s with ollama. The feature only applies to MoE models, but those seem to be gaining traction with gpt-oss, GLM-4.5, and Qwen3.
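If you want to try it with llama.cpp directly, the relevant switch on recent builds is --cpu-moe (there's also the older --override-tensor form). Rough sketch of what I've seen people use — the .gguf filename is a placeholder and flag spellings can shift between builds, so check --help on yours:

    # keep all layers on the GPU, but leave the MoE expert tensors in system RAM
    llama-server -m gpt-oss-120b.gguf --n-gpu-layers 999 --cpu-moe

    # roughly equivalent, pinning expert tensors to CPU via a tensor-name regex
    llama-server -m gpt-oss-120b.gguf --n-gpu-layers 999 -ot ".ffn_.*_exps.=CPU"

If I remember right there's also --n-cpu-moe N, which offloads only the experts of the first N layers so you can use up whatever VRAM is left over.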
imiric | 3 days ago
Ah, I was not aware of that, thanks. I'll give it a try.