Remix clone Hacker News

	▲	kgeist 2 hours ago
		>$40k gets you almost-Opus

		GLM 5.2 is "almost Opus," and it needs at least 8xH200s for comfortable inference (so it's closer to $400k than $40k). They suggest using this modified model:

		>A REAP-pruned (≈22% of experts removed), Int8-mix NVFP4 quantized version of GLM-5.2, ≈594B parameters.

		I wonder how it behaves in practice outside of benchmarks. Qwen3.6, even at 6-bit quantization, often gets stuck in loops while reasoning. And here they've also removed some experts. I mean, sometimes an 8-bit or 16-bit small model can be smarter than a lobotomized large model. I heard the consensus is that you shouldn't go below 8-bit for coding.

		Also, it's not clear how much of the available context is left when you try to fit a lobotomized model into 4 RTX 6000s. Anything below 100k is barely usable, because it often hits compaction before it's able to gather the necessary context.

		P.S. Found in the repos: 240k context.
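The memory question above can be checked with rough arithmetic. The sketch below is a back-of-envelope estimate, not a measurement. Only the ≈594B parameter count and the 4-GPU setup come from the comment. The 96 GB per card and the 4.5 bits per parameter (4-bit NVFP4 values plus one 8-bit scale per 16-value block) are assumptions, and the Int8-mix layers would push the real footprint higher. The function names are made up for illustration.

```python
def weight_footprint_gb(params: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


def headroom_gb(params: float, bits_per_param: float,
                num_gpus: int, gb_per_gpu: float) -> float:
    """Memory left for KV cache, activations and runtime overhead."""
    total = num_gpus * gb_per_gpu
    return total - weight_footprint_gb(params, bits_per_param)


PARAMS = 594e9      # from the quoted model description
BITS = 4.5          # ASSUMPTION: 4-bit values + one 8-bit scale per 16 values
NUM_GPUS = 4        # from the comment ("4 RTX 6000s")
GB_PER_GPU = 96.0   # ASSUMPTION: 96 GB cards; the exact model is not stated

weights = weight_footprint_gb(PARAMS, BITS)
left = headroom_gb(PARAMS, BITS, NUM_GPUS, GB_PER_GPU)
print(f"weights ~{weights:.0f} GB, headroom ~{left:.0f} GB of {NUM_GPUS * GB_PER_GPU:.0f} GB")
```

Under these assumptions the weights alone take most of the combined memory, so whatever context fits has to come out of the remainder. How many tokens that buys depends on the model's per-token KV cache size, which the thread does not give.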
	▲	amelius 2 hours ago
		How does this work with scaling? I assume you can then somehow run several hundred prompts concurrently?