Remix.run Logo
segmondy 3 hours ago

I run Q4_K_XL. All it takes to run to get about 6tk/sec is 512gb of ram and 2 3090 GPUs with llama.cpp -cmoe. I also have crappy DDR4, 2400mhz, 3200mhz will bring that speed up to about 9tk/sec. I also have ok 32core epyc CPU, a better 64core would bring it up to about 11tk/sec. I did a budget build before the crazy hardware cost and I regret it everyday. Nevertheless, it's fantastic being able to run this model at home. It's great for planning, one shot prompting once you have a plan or all the context you need. This entire hardware cost $2400 when it was built. If you're willing to be resourceful, you can find ways to run these models at home. I often get the silly question of why, and suggestions about how much I can save using cloud API, but the Fable drama has opened up eyes on why it's good for us to be independent. Thanks team unsloth, Q4_K_XL is solid, if you are going to grab a quant, make sure to get the K_XL variant if it can fit.

zozbot234 6 minutes ago | parent | next [-]

AIUI the llama.cpp implementation for this model is still quite half-baked due to missing the support for DSA sparse attention mechanism. This leads to running the model with a different mechanism that it has not been trained for, which has been shown to lead to lower quality and performance.

Anyway, I think GLM 5.2 in many ways is not as interesting as DeepSeek V4 series, which uses an even more advanced attention mechanism and can save a lot of memory capacity for KV cache, especially at larger contexts. Which in turn opens up wide batching especially on consumer platforms. GLM doesn't have that, in some ways it feels broadly similar to Kimi 2.6 wrt. the underlying performance architecture. Both are a bit too heavy to run reasonably at full quality on ordinary hardware.

nextaccountic an hour ago | parent | prev | next [-]

How can you combine CPU cores and multiple GPU? Are you running some layers in cpu, others in gpu #1, and others in gpu #2? What about the bandwidth and latency between them?

Or maybe the model itself only runs at gpus, and the cpu memory only store the weights for experts not corrently activated? If so, then what's the 32 or 64 cpu cores for?

I'm a big fan of fully utilizing one's hardware and it's kinda sad that it's not the norm to run things on either gpu, cpu or both, dynamically choosing at runtime, for everyday software

nodja 9 minutes ago | parent | next [-]

Pipeline parallelism. Instead of splitting layers by row/column. You split at the layer edges. So instead of having this huge bottleneck of bandwidth you only need to transfer about 4KB per token when changing devices on a model like Qwen 3 30BA3.

segmondy 27 minutes ago | parent | prev [-]

checkout llama.cpp, the entire point of the project is for us mere mortals and GPU poor.

fsuts 35 minutes ago | parent | prev | next [-]

6 tokens per second?

Can you put up with that? As seems very slow. I aim for 40t/s on a laptop and choose models that deliver that speed over larger slower ones

segmondy 24 minutes ago | parent [-]

I have been putting up with it forever. We are spoiled by MixtureOfExperts. Folks were delighted to run llama3-70B at such speed. We were happy with 15-20tk/sec with 8b models, and if you could run llama3-405B at 1tk/sec you were a god. To each their own. I can live with 6 high quality tokens. If I could get a Fable equivalent model, I'll gladly take 2tk/sec if that's what it took to run it locally.

manmal 19 minutes ago | parent | next [-]

But what is it doing for you that you couldn’t do yourself at that speed? I‘m really curious and on the fence of partly going local.

all2 9 minutes ago | parent [-]

Is think you would use it more like email and less like text messages, so the domain of communication shifts drastically. The other part is, you don't have to run just that model, you can offload a lot of chores to smaller models.

froh 20 minutes ago | parent | prev [-]

do you use caveman or similar?

edg5000 an hour ago | parent | prev | next [-]

Very cool. So it's not just about GPU VRAM which I incorrectly thought. I though you'd need 512 GB GPU VRAM. I don't think it cost only 2400; 512GB ram would be more expensive though back in the day. But not mortgage-grade 200.000 which I estimated myself (which assumed running in 100% VRAM; overkill for a single user probably).

segmondy 28 minutes ago | parent [-]

you can use system ram with a system like llama.cpp which offloads to system ram. token generation is a function of system bandwidth, the faster the bandwidth the better. so I'm on 8 channel 2400mhz. if I had a 12 ddr channel, I would get 1.5x the speed at 2400mhz. of course ddr5 is much faster, so a 12 ddr at 4800mhz will provide 3x the speed for token generation or roughly 18tk/sec. prompt processing is all about compute, so the more cpu cores you have the faster it can do PP.

redox99 3 hours ago | parent | prev [-]

That's crazy good for $2400.