xienze 3 hours ago
It's probably a combination of things:

* New models running in llama.cpp (which is what's under the hood of ollama et al.) frequently require bug fixes.

* The GGUF conversions that run in llama.cpp frequently require bug fixes themselves (Unsloth is notorious for this -- they release GGUF models about 10 minutes after the official .safetensors releases).

* You're probably running a below-Q8 quantization of the model, and there's a good chance a below-BF16 quantization for the KV cache too. Those errors compound as the context grows and tool calls multiply.

Local models really are great, but I think a major problem is the people in groups like r/localllama who run models at absurd quantization levels in order to cram them onto their underpowered hardware and convince themselves that they're running SOTA at home. The best way to run these models is, frankly, a lot of VRAM and vLLM (which is what the people developing these models are almost certainly targeting).
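To make the quantization point concrete, here's a toy numpy sketch. Caveat: the symmetric per-tensor round-to-nearest scheme below is a simplification I'm assuming for illustration, not llama.cpp's actual block-wise K-quant formats, so treat the numbers as an upper bound on the trend rather than real GGUF error figures:

    import numpy as np

    def fake_quant(x, bits):
        # Symmetric round-to-nearest with a single scale per tensor.
        levels = 2 ** (bits - 1) - 1
        scale = np.abs(x).max() / levels
        return np.round(x / scale) * scale

    rng = np.random.default_rng(0)
    w = rng.standard_normal(1_000_000).astype(np.float32)  # stand-in weight tensor

    for bits in (8, 6, 4, 3, 2):
        err = fake_quant(w, bits) - w
        rel = np.sqrt(np.mean(err ** 2)) / np.sqrt(np.mean(w ** 2))
        print(f"{bits}-bit: relative RMS error ~ {rel:.4f}")

Real GGUF quants use per-block scales and smarter rounding, so they do quite a bit better than this naive scheme, but the shape of the curve is the same: each bit you drop roughly doubles the error, and that error gets baked into every forward pass -- which is why a Q2 model that technically fits in VRAM can feel so much dumber than the benchmarks suggest.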