xrd 8 hours ago

So close! My machine with 192GB RAM + RTX 3090 24GB can almost run this. It says it needs 24GB of VRAM and 256GB of RAM for MoE offloading.

https://unsloth.ai/docs/models/glm-5.2#usage-guide

In a prior thread, someone said it would take $500k in hardware:

https://news.ycombinator.com/item?id=48629970

elliotbnvl 7 hours ago | parent | next [-]

$500k is a vast overestimation. For massive concurrency at FP8 or even BF16 maybe.

NVFP4 at reasonable speeds (~120 tok/s) and concurrency is possible for $80-90k at today's prices, maybe even less. That buys you 6 RTX PRO 6000 Blackwells, a decent CPU and motherboard, and a power supply. 576GB of VRAM.

You could do it for under $50k if you're OK with 40 tok/s decode, ~1200 tok/s prefill.

hbbio 6 hours ago | parent | next [-]

Yes, a single GB300 workstation also does it, probably even more than 120tok/s.

Official price 85k...

__m 7 hours ago | parent | prev | next [-]

How fast will the hardware become outdated? Are there big improvements expected in the next 3 years?

easygenes 6 hours ago | parent | next [-]

M5 Ultra will ship before end of year, likely. Though with current RAM shortage, likely max spec will be 256GB and in short supply.

In late 2027 or early 2028, Nvidia will release Vera Rubin DGX Spark, likely with double or better the performance of current Blackwell, though unclear if memory capacity will go up much from current 128GB. Two to four of those will run models like this decently.

In 2028 we should expect Vera Rubin RTX discrete lineup, including the replacement to the RTX PRO 6000. Likely memory spec will be minimum 128GB. Good chance of up to 200GB. Two to four of those will run NVFP4 models in this class very well.

jiqiren 3 hours ago | parent | next [-]

I hope all this speculation comes true. Right now this ram crunch is ridiculous.

Tepix 2 hours ago | parent | prev | next [-]

I think there is a gap right now for running large models such as GLM 5.2 in Q4 or Q8. My hope is on Intel Crescent Island 480GB cards. Let’s see how expensive they’ll be.

digitaltrees 5 hours ago | parent | prev | next [-]

I feel like the models are good enough for a decade of future work. So once you have a working setup, you can keep using it to do work at the same level. There will be better stuff, and it may make that type of work obsolete, but if you can do useful things with it, it won’t be worth less.

segmondy 5 hours ago | parent | prev [-]

The P40 was released in 2016 and is still selling like hotcakes!

mgambati 8 hours ago | parent | prev | next [-]

With 2-bit you wouldn’t get good results. The ideal range for coding is at least Q8.

kibibu 8 hours ago | parent [-]

According to this very article, 4-bit dynamic is essentially lossless

Aurornis 7 hours ago | parent [-]

Watch out. Those claims are often made based on KL-divergence over some arbitrary corpus, not performance in the real world or benchmarks.

I’ve found that I need to go a couple steps past whatever quantizations are good enough in the KL-divergence testing to get good performance in real tasks with long context. So when Q4 is claimed to be lossless I end up with Q5 or Q6 for actual long-context tasks.

ijidak 6 hours ago | parent | prev | next [-]

Crossing my fingers that this boom jumpstarts '90s-like improvements in computing hardware.

I feel like part of the reason for the relative stagnation in hardware over the last twenty years was simply the lack of use cases to justify hardware refreshes by businesses.

Most of the money and energy went to mobile for the last fifteen years.

Affordable local inference might be the gravy train the server, desktop, and laptop manufacturers need to get back in gear.

0xbadcafebee 4 hours ago | parent | next [-]

Definitely the stagnation was due to a lack of use cases, but this isn't a bad thing. We don't need most of the hardware advancement we got.

Business hardware got beefier because businesses demanded more data (or more specifically: the industry told businesses they needed more data), with no idea of what to actually do with it once they got it. To get all that data, bandwidth needed to be increased, with more iops to read/write it, more storage to keep it, and more memory and cpu to process it. But 99% of the data is junk. Companies have "data lakes" so big they need to come up with excuses to use the data, or risk somebody pointing out that they're spending a fortune hoarding bits.

Consumer hardware hasn't had a new use case since like 2012. Faster wifi for broadband & local file transfers, and higher-resolution video, are the only reasons one needed newer hardware. We actually got a resolution so high it makes no perceivable difference. And yeah we got faster CPUs and memory, but as soon as we did it got all eaten up by the most inefficient, wasteful software conceivable. Same use cases as 13 years ago, just more expensive, harder to use, and buggier. We should've gotten a new CPU architecture that was faster and more energy efficient. Finally it was delivered, but with a moat around the golden Apple.

Here we are two and a half decades into the Internet era, and my damn bluetooth earbuds and webcam microphone don't work half the time that I open a video conferencing app. Hardware can stay exactly like it is for the next few decades and I'd be happy. I just want software that works, and doesn't get continuously slower, forcing me to buy bigger hardware; or more draconian, locking me out of being able to use it how I want.

omnimus 38 minutes ago | parent [-]

The natural progression once performance is enough would be price. We were starting to see that, but not anymore. I wonder if somebody is afraid of a future where generally useful computation is cheap.

gruez 5 hours ago | parent | prev | next [-]

>I feel like part of the reason for the relative stagnation in hardware over the last twenty years was simply the lack of use cases to justify hardware refreshes by businesses.

No, we're running into the limits of Moore's law, and it's showing in prices for new nodes, which are getting denser but not cheaper.

horsawlarway 4 hours ago | parent [-]

It's true we hit limits, but I feel like a lot of it was "limits" in the sense that the tradeoff stopped being worth the cost, so we optimized in other areas.

So we hit limits on clock speed in the early 2000s (ex - the 4ghz wall) but it also turned out that mobile as the driver for sales meant no one really cared much about clock speed compared to performance/watt.

Clock speed mattered, but only relative to how many watts it took to get it (and above 4ghz... too many watts).

But we've seen a 15x improvement over the last 20 years. Performance/Watt is WAY up.

My guess is that LLMs are going to drive another "improvement cycle" in areas that we didn't care much about before.

I've built about 10 personal desktop machines (1 every ~4 years) and I can honestly say that I didn't care much about memory bandwidth prior to 2021.

In the same way that I didn't care much about how many watts my pentium 4 was using in 2005.

But now... now I care a lot about memory bandwidth. I care about memory speeds and total system ram in a manner I really, really didn't before.

So I think we're going to see a big shift to machines built on unified ram with a crazy focus on squeezing memory bandwidth and total ram capacity as far as we can.

My bet is that we'll get a similar 10-15x improvement by 2040 in unified system ram designs.

I fully expect to see 2tb unified ram desktops and 200gb unified ram phones be relatively common on a 20 year timeline, assuming we see similar levels of geopolitical stability (ex - world war 3 throws a wrench into things).

BobbyTables2 3 hours ago | parent | prev | next [-]

Yeah, even Windows managed to not drive terribly dramatic upgrades in general computing (besides Windows’ absurd RAM usage and now requiring a TPM).

In the old days, Microsoft Entertainment Pack games were somewhat visibly taxing on some lower end systems.

linzhangrun 5 hours ago | parent | prev [-]

Physical limitations of the manufacturing process may be a more significant factor, starting from TSMC's 10nm ten years ago.

cheema33 8 hours ago | parent | prev | next [-]

I have the RAM, but not the VRAM. What kind of speed/tps could you expect from a 3090 with 24GB of VRAM? I am somewhat tempted to pick up a GPU with 24GB of VRAM.

phamilton 6 hours ago | parent [-]

Generation is basically just memory bandwidth math.

Each token has to read all the active weights. I think that's around 40B active parameters. At a 4-bit quant that's 20GB. With 100GB/s of bandwidth (replace with whatever yours is), you get 5 tokens per second.
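For anyone who wants to plug in their own numbers, here is a rough sketch of that estimate. The function name is made up for illustration, and it assumes decode is purely bandwidth-bound (no compute limits, KV-cache reads, or PCIe transfer overhead), so treat the result as an upper bound rather than a benchmark.

```python
def decode_tokens_per_second(active_params_billion, bits_per_weight, bandwidth_gb_per_s):
    """Upper-bound decode speed if every token must read all active weights once.

    Assumes generation is limited only by memory bandwidth; real numbers
    will be lower once KV-cache reads and other overheads are included.
    """
    # Billions of parameters times bytes per parameter gives GB read per token.
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_per_s / gb_read_per_token


# The figures from the comment above: ~40B active params, 4-bit quant, 100 GB/s.
print(decode_tokens_per_second(40, 4, 100))  # 5.0
```

Swap the last argument for your own memory bandwidth; the estimate scales linearly with it.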

SlavikCA 2 hours ago | parent [-]

And with MTP (or other speculation techniques) you can ~double that.
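A rough way to see where "~double" can come from: in the standard speculative decoding analysis, if each drafted token is accepted independently with probability a and you draft k tokens per pass, the expected number of tokens emitted per verification pass is (1 - a^(k+1)) / (1 - a). The sketch below just evaluates that formula. The acceptance rates are illustrative assumptions, not measurements for this model, and drafting overhead is ignored.

```python
def expected_tokens_per_pass(acceptance_rate, draft_tokens):
    """Expected tokens emitted per verification pass of the full model.

    Assumes each drafted token is accepted independently with the same
    probability, and ignores the cost of producing the drafts.
    """
    if acceptance_rate >= 1.0:
        # Every draft is accepted, plus the one token the verifier adds.
        return draft_tokens + 1.0
    return (1 - acceptance_rate ** (draft_tokens + 1)) / (1 - acceptance_rate)


# Hypothetical acceptance rates, one and two drafted tokens per pass.
for rate in (0.6, 0.7, 0.8):
    for k in (1, 2):
        print(rate, k, round(expected_tokens_per_pass(rate, k), 2))
```

Since each pass still reads the active weights once, multiplying this by the bandwidth-bound tokens per second gives a rough ceiling on the speedup.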

uberex 7 hours ago | parent | prev [-]

Funny I casually asked Gemini and it said 500k for unquantized with decent throughput.

stymaar 6 hours ago | parent | next [-]

This is why you shouldn't uncritically believe an answer from an LLM (nor an answer from a human, for that matter).

andy_ppp 3 hours ago | parent [-]

But I did my research online and the sun cycle is every 11 years and something something global warming is a hoax every single year now.

colinsane 5 hours ago | parent | prev | next [-]

i asked gemini and it replied with "Error: 400 Your prompt was blocked by safety filters. Please revise and try again."

digitaltrees 3 hours ago | parent [-]

I asked and it said “403 forbidden - careful peon attempts to bypass the late stage capitalism api with your monetary offerings in exchange for your daily tokens will get you perma banned right to jail”.

j45 5 hours ago | parent | prev [-]

LLMs aren't discrete calculators or estimators of things unless framed and guided to do so.