	▲	segmondy 3 hours ago
		llama.cpp is designed for partial offloading: the most important parts of the model are loaded onto the GPU and the rest stays in system RAM. I run 500B+ models such as DeepSeek/KimiK2.5/GLM-5 without having that much GPU VRAM.
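
To make the partial-offloading idea concrete, here is a rough back-of-the-envelope sketch for guessing how many layers might fit on the GPU, i.e. a starting value for llama.cpp's `-ngl` / `--n-gpu-layers` option. The function name, the equal-sized-layers simplification, and the `reserve_gib` headroom for KV cache and scratch buffers are all assumptions for illustration, not how llama.cpp itself decides placement, and the numbers in the usage note are made up rather than measured.

```python
def gpu_layers_that_fit(model_gib, n_layers, vram_gib, reserve_gib=2.0):
    """Estimate how many model layers fit in VRAM.

    Assumptions (hypothetical, for illustration only):
    - every layer takes the same share of the model file size
    - reserve_gib is held back for KV cache and scratch buffers
    """
    if model_gib <= 0 or n_layers <= 0:
        raise ValueError("model_gib and n_layers must be positive")

    per_layer_gib = model_gib / n_layers
    usable_gib = max(vram_gib - reserve_gib, 0.0)

    # Never report more layers than the model actually has.
    return min(n_layers, int(usable_gib // per_layer_gib))


if __name__ == "__main__":
    # Hypothetical 360 GiB quantized model with 60 layers on a 24 GiB card.
    print(gpu_layers_that_fit(model_gib=360, n_layers=60, vram_gib=24))
```

With those made-up numbers the estimate is 3 layers on the GPU and the remaining 57 in system RAM. In practice the result is only a starting point to tune up or down while watching actual VRAM usage.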