| ▲ | kgeist 2 hours ago | |
>$40k gets you almost-Opus

GLM 5.2 is "almost Opus," and it needs at least 8xH200s for comfortable inference (so it's closer to $400k than $40k). They suggest using this modified model:

>A REAP-pruned (≈22% of experts removed), Int8-mix NVFP4 quantized version of GLM-5.2, ≈594B parameters.

I wonder how it behaves in practice outside of benchmarks. Qwen3.6, even at 6-bit quantization, often gets stuck in loops while reasoning. And here they've also removed some experts. I mean, sometimes an 8-bit or 16-bit small model can be smarter than a lobotomized large model. I heard the consensus is you shouldn't go below 8-bit for coding.

Also, it's not clear what is left of the available context when you try to fit a lobotomized model into 4 RTX 6000s. Anything below 100k is barely usable because it often hits compaction before it's able to gather the necessary context.

P.S. found in the repos: 240k context | ||
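For a rough feel of how tight the fit is, here is a back-of-envelope sketch of the weight footprint versus total VRAM. The 594B figure comes from the quoted model card; everything else is an assumption on my part: 96 GiB per card (i.e. the RTX PRO 6000 Blackwell, since the thread only says "RTX 6000"), and an average of somewhere between 4.5 and 8 bits per weight, since NVFP4 carries per-block scale overhead and the share of Int8 layers isn't stated. It ignores activations, runtime overhead, and whatever the KV cache costs per token.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a parameter count and average bit width."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def headroom_gib(params_billion: float, bits_per_weight: float,
                 gpus: int, gib_per_gpu: float) -> float:
    """VRAM left for KV cache and activations once the weights are loaded."""
    return gpus * gib_per_gpu - weight_gib(params_billion, bits_per_weight)


if __name__ == "__main__":
    PARAMS_B = 594      # from the quoted model card
    GPUS = 4            # "4 RTX 6000s" from the comment
    GIB_PER_GPU = 96    # ASSUMPTION: 96 GiB cards, not stated in the thread

    # ASSUMPTION: the average bits/weight of the Int8-mix NVFP4 checkpoint
    # is unknown, so sweep a plausible range instead of picking one value.
    for bits in (4.5, 5.0, 6.0, 8.0):
        left = headroom_gib(PARAMS_B, bits, GPUS, GIB_PER_GPU)
        print(f"{bits:>4} bits/weight: weights {weight_gib(PARAMS_B, bits):6.1f} GiB, "
              f"headroom {left:7.1f} GiB")
```

At a full 8 bits per weight the weights alone would not fit in 4x96 GiB, which is presumably why the mix leans on 4-bit. Whatever headroom remains has to cover the whole KV cache, so the usable context depends on per-token cache size, which this sketch does not model.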
| ▲ | amelius 2 hours ago | |
How does this scale? I assume you can then somehow run several hundred prompts concurrently? | ||