Remix.run Logo
vntok 3 days ago

$2000 will get you 30~50 tokens/s on perfectly usable quantization levels (Q4-Q5), taken from any one among the top 5 best open weights MoE models. That's not half bad and will only get better!

master_crab 3 days ago | parent | next [-]

If you are running lightweight models like deepseek 32B. But anything more and it’ll drop. Also, costs have risen a lot in the last month for RAM and AI adjacent hardware. It’s definitely not 2k for the rig needed for 50 tokens a second

threeducks 3 days ago | parent | prev | next [-]

Could you explain how? I can't seem to figure it out.

DeepSeek-V3.2-Exp has 37B active parameters, GLM-4.7 and Kimi K2 have 32B active parameters.

Lets say we are dealing with Q4_K_S quantization for roughly half the size, we still need to move 16 GB 30 times per second, which requires a memory bandwidth of 480 GB/s, or maybe half that if speculative decoding works really well.

Anything GPU-based won't work for that speed, because PCIe 5 provides only 64 GB/s and $2000 can not afford enough VRAM (~256GB) for a full model.

That leaves CPU-based systems with high memory bandwidth. DDR5 would work (somewhere around 300 GB/s with 8x 4800MHz modules), but that would cost about twice as much for just the RAM alone, disregarding the rest of the system.

Can you get enough memory bandwidth out of DDR4 somehow?

int_19h 3 days ago | parent | prev [-]

That doesn't sound realistic to me. What is your breakdown on the hardware and the "top 5 best models" for this calculation?

vntok 4 hours ago | parent [-]

Look up AMD's Strix Halo mini-PC such as GMKtec's EVO-X2. I got the one with 128GB of unified RAM (~100GB VRAM) last year for 1900€ excl. VAT; it runs like a beast especially for SOTA/near-SOTA MoE models.