bflesch 7 hours ago

In his newsletter Ed Zitron hammered home the point that GPUs depreciate quickly, but these kinds of reliability issues are shocking to read. GPU failures are so common that they hang out in a 24/7 Slack channel with customers like Meta (who apparently can't set up a cluster themselves..).

Ed Zitron also called out the business model of GPU-as-a-service middleman companies like Modal as deeply unsustainable, and I also don't see how they can make a profit if they are only reselling public clouds. Assuming they are VC funded, the VCs need returns for their funds.

Unlike fiber cable during the dot-com boom, the GPUs in use today will eventually end up in the trash bin. These GPUs are treated like toilet paper, you use them and throw them away, nothing you will give to the next generation.

Who will be the one who marks down these "assets"? Who is providing money to buy the next batch of GPUs, now that billions are already spent?

Maybe we'll see a wave of retirements soon.

> It’s underappreciated how unreliable GPUs are. NVIDIA’s hardware is a marvel, the FLOPs are absurd. But the reliability is a drag. A memorable illustration of how AI/ML development is hampered by reliability comes from Meta’s paper detailing the training process for the LLaMA 3 models: “GPU issues are the largest category, accounting for 58.7% of all unexpected issues.”

> Imagine the future we’ll enjoy when GPUs are as reliable as CPUs. The Llama3 team’s CPUs were the problem only 0.5% of the time. In my time at Modal we can’t remember finding a single degraded CPU core.

> For our Enterprise customers we use a shared private Slack channel with tight SLAs. Slack is connected to Pylon, tracking issues from creation to resolution. Because Modal is built on top of the cloud giants and designed for dynamic compute autoscaling, we can replace bad GPUs pretty fast!

charles_irl 5 hours ago | parent | next [-]

> Ed Zitron also called out the business model of GPU-as-a-service middleman companies like Modal as deeply unsustainable, and I also don't see how they can make a profit if they are only reselling public clouds.

You got a link for that? I work on Modal and would be interested in seeing the argument!

We think building a proper software layer for multitenant demand aggregation on top of the public clouds is sufficient value-add to be a sustainable business (cf DBRX and Snowflake).

pphysch an hour ago | parent | next [-]

Snowflake and Databricks provide data storage and pipeline features and therefore have extraordinary lock-in potential, which allows them to have sustainable business models.

GPU compute is essentially fungible. It's quite a stretch to compare those business models. Snowflake and Databricks don't necessarily have the best "value-add" and they don't need to.

bflesch 2 hours ago | parent | prev [-]

It was in his last newsletter, but I can't link it right now.

pixl97 6 hours ago | parent | prev | next [-]

>These GPUs are treated like toilet paper, you use them and throw them away, nothing you will give to the next generation.

I'm guessing this may be highly dependent on what the bathtub curve looks like, and how much the provider wants to spend on cooling.

Of course with Nvidia being a near monopoly here, they might just not give a fuck and will pump out cards/servers with shitty reliability rates simply because people keep buying them and they don't suffer any economic loss or have to sit in front of a judge.

Be interesting to see what the error rate per TFLOP (no /s, we're looking at operations not time) is compared to older generation cards.
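To make the per-operation idea concrete, here is a rough sketch of how the two normalizations differ. Everything in it is an assumption for illustration: the `FleetStats` fields, the function names, and especially the numbers, which are made-up placeholders and not measurements of any real card.

```python
from dataclasses import dataclass


@dataclass
class FleetStats:
    """Hypothetical per-generation fleet summary (all fields are assumptions)."""
    name: str
    failures: int            # hardware faults observed over the window
    gpu_hours: float         # total GPU-hours run in the same window
    sustained_tflops: float  # average sustained throughput per GPU, in TFLOP/s


def failures_per_gpu_hour(s: FleetStats) -> float:
    """Time-normalized rate: failures divided by hours of operation."""
    return s.failures / s.gpu_hours


def failures_per_exaflop(s: FleetStats) -> float:
    """Work-normalized rate: failures divided by total operations executed."""
    total_tflop = s.sustained_tflops * s.gpu_hours * 3600.0  # TFLOP/s * seconds
    total_exaflop = total_tflop / 1e6                        # 1 EFLOP = 1e6 TFLOP
    return s.failures / total_exaflop


if __name__ == "__main__":
    # Placeholder inputs only, chosen to show the two metrics can disagree.
    old = FleetStats("older-gen", failures=10, gpu_hours=100_000.0, sustained_tflops=100.0)
    new = FleetStats("newer-gen", failures=20, gpu_hours=100_000.0, sustained_tflops=400.0)
    for s in (old, new):
        print(s.name, failures_per_gpu_hour(s), failures_per_exaflop(s))
```

With those placeholder inputs the newer card fails twice as often per GPU-hour but about half as often per operation, which is why the choice of denominator matters for the comparison.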

topaz0 5 hours ago | parent [-]

> Of course with Nvidia being a near monopoly here, they [...] will pump out cards/servers with shitty reliability rates simply because people keep buying them and they don't suffer any economic loss or have to sit in front of a judge.

Presumably this can't last that much longer, because the people that are buying/running these are already taking on loads of debt/venture capital to buy the past/current round of hardware without seeing much revenue from it. It's much harder to ask investors for multiples of your annual revenue just to maintain your current capabilities than it was a couple years ago to ask for many multiples of your revenue to expand your capabilities dramatically.

ares623 6 hours ago | parent | prev [-]

I suppose Nvidia could invest in making their GPUs more reliable? But then that'll make everything else even more expensive lol. If only one of the companies in the chain could take one for the team.

touisteur 5 hours ago | parent | next [-]

And NVIDIA supposedly has the exact know-how for reliability, as their Jetson 'industrial' parts are qualified for 10-15 years at maximal temp. Of course Jetson is at another point on the flops and watts curve.

Just wondering whether reliability increases if you slow down your use of GPUs a bit. Like pausing more often and no longer chasing every bubble and nvlink-all-reduce optimization.

dsrtslnd23 3 hours ago | parent [-]

Jetson uses LPDDR though. H100 failures seem driven by HBM heat sensitivity and the 700W+ envelope. That is a completely different thermal density I guess.

zozbot234 2 hours ago | parent [-]

Reliability also depends strongly on current density and applied voltage, even more perhaps than on thermal density itself. So "slowing down" your average GPU use in a long-term sustainable way ought to improve those reliability figures via multiple mechanisms. Jetsons are great for very small-scale self-contained tasks (including on a performance-per-watt basis) but their limits are just as obvious, especially with the recently announced advances wrt. clustering the big server GPUs on a rack- and perhaps multi-rack level.

pqtyw 4 hours ago | parent | prev [-]

Why? Nvidia is already charging as much as they possibly can. Unlike most other components, it's almost unrelated to manufacturing costs.

nradov 3 hours ago | parent | next [-]

Nope. Nvidia has often sold products at below the market price. This has created shortages where scalpers who are able to get some supply immediately resell above list price. It might seem stupid for Nvidia to leave money on the table that way but they don't want to burn relationships with customers by raising list prices (much).

gessha an hour ago | parent [-]

Why is the GB10 so expensive then ;(

ares623 4 hours ago | parent | prev [-]

Why make same money when more money possible?