Remix clone Hacker News

	▲	Kayou 3 hours ago
		Wait, the Q4 quantization, which is more than 20GB, fits in your 16GB GPU? I didn't know that was possible. I was always restricting myself to models smaller than the VRAM I had.
	▲	segmondy 3 hours ago
		llama.cpp is designed for partial offloading: the most important parts of the model are loaded onto the GPU and the rest stays in system RAM. I run 500B+ models such as DeepSeek/KimiK2.5/GLM-5 without having that much GPU VRAM.
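To make the partial-offloading arithmetic concrete, here is a rough back-of-the-envelope sketch. The 20GB model and 16GB GPU come from the question above; the 40-layer count, the assumption that all layers are the same size, and the 2GB reserve for KV cache and buffers are hypothetical. The helper `layers_that_fit` is illustrative and not part of llama.cpp.

```python
def layers_that_fit(model_gb, n_layers, vram_gb, reserve_gb=2.0):
    """Estimate how many equally sized layers fit in VRAM.

    Assumes every layer has the same size (a simplification) and keeps
    reserve_gb free for KV cache and scratch buffers (a guess).
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    # Never report more layers than the model actually has.
    return min(n_layers, int(usable_gb // per_layer_gb))


# 20GB quantized model, hypothetical 40 layers, 16GB GPU.
gpu_layers = layers_that_fit(model_gb=20.0, n_layers=40, vram_gb=16.0)
print(gpu_layers)  # layers to place on the GPU; the rest stay in system RAM
```

The result is the kind of number you would pass to llama.cpp's `--n-gpu-layers` (`-ngl`) option, then adjust up or down depending on what actually fits.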
	▲	Koffiepoeder an hour ago
		The A3B part in the name stands for `Active 3B`: during inference only about 3B parameters are active at a time, a shared core plus a subset of experts picked by a router (MoE, mixture of experts). If you use these models mostly for related/similar tasks, you may be able to make do with a lot less than the 35B params in active RAM. These models are therefore also sometimes called sparse models.
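A minimal sketch of how an MoE router picks a few experts, to show why only a fraction of the parameters is active at once. This is one common variant (top-k selection followed by a softmax over the selected scores), not necessarily what any particular model does; the logits and expert count are made up for illustration.

```python
import math


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Softmax over the selected logits only, so the weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router scores for 4 experts; only 2 of them run.
routes = top_k_route([0.1, 2.0, -1.0, 1.0], k=2)
for expert, weight in routes:
    print(f"expert {expert} weight {weight:.3f}")
```

The experts that are not selected do no work for that input, which is why their weights can sit in slower memory until they are needed.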
	▲	Maxious 3 hours ago
		Yep. These Mixture of Experts models are well suited to paging in only the data relevant to a given task (https://huggingface.co/blog/moe ). There are also some experiments with removing or merging experts post-training to shrink models even more (https://bknyaz.github.io/blog/2026/moe/ ).
	▲	nurettin an hour ago
		This is why they say "A3B", meaning only 3B parameters are active at a time, which limits VRAM usage.