bluedino 8 hours ago

I help run a fleet of GPU servers, and I might see 1 DIMM or SSD failure for every 50-100 GPU failures.

I realize NVIDIA is just cranking them out as fast as they can, but the quality is terrible. They overheat, disappear after you reboot, fall off the bus, and hit memory failures, and then you mix in all the software crashes your users generate...

Our current server vendor is actually good at replacing them, unlike our previous vendor, but the failure rates are just insane. If any other component failed this much we'd have the vendor buy the servers back.

thundergolfer 7 hours ago | parent | next [-]

Author here. That 1:50-100 ratio looks roughly right based on my research, but my numbers have GPUs faring even worse.

  Component                      Type       MTBF (yrs)  AFR
  ─────────────────────────────────────────────────────────
  SSD                            Hardware   ~100        ~1%
  RAM uncorrectable error        Hardware   ~75         ~1-4%
  NVIDIA A100 critical error†    Hardware   0.18 (65d)  -
  NVIDIA H100 critical error†    Hardware   0.15 (50d)  -

† “Critical error” refers to an NVIDIA Xid or sXid error that is not recoverable, requiring an application and GPU reset.

Only a minority of GPU 'failures' appear to be permanent hardware problems, such as row-remapping errors. A lot seem to be, as another comment says, a consequence of operating too close to the operational limit, tipping over it, and then requiring a power cycle.
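
Rough sanity check on how the MTBF and AFR columns relate, assuming exponentially distributed (memoryless) failures; the MTBF figures are just the ones in the table above:

  import math

  def afr_from_mtbf_years(mtbf_years):
      # Exponential failure model: P(at least 1 failure in a year) = 1 - exp(-1/MTBF)
      return 1.0 - math.exp(-1.0 / mtbf_years)

  for name, mtbf in [("SSD", 100.0),
                     ("RAM uncorrectable error", 75.0),
                     ("A100 critical error", 0.18),
                     ("H100 critical error", 0.15)]:
      print(f"{name:24s} AFR ~{afr_from_mtbf_years(mtbf):.1%}, "
            f"expected events/yr ~{1 / mtbf:.2f}")

The SSD and RAM rows come out near the quoted ~1%; for the GPUs the per-year probability saturates near 100%, which is why that column is blank and the expected ~5-7 critical errors per GPU-year is the more useful figure.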

layoric 6 hours ago | parent | next [-]

I'm quite surprised the A100 isn't much better, since the power levels for the Ampere cards are, I believe, a lot lower.

Does this mean that even a model that fits on a single server and trains for a few weeks will absolutely need a recovery process? Interested in people's experiences around this.

formerly_proven 5 hours ago | parent [-]

GPU servers have always had crap reliability compared to a normal server (and sticking eight GPUs on a baseboard complicates things further). As I understand it (not my domain), this, together with the lack of widespread checkpointing and MPI fault-tolerance support, is one of the motivating factors for why ML toolkits eschew MPI (besides accelerator-to-accelerator communication being an afterthought).
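
To make the checkpointing point concrete for the single-server question upthread: the usual recovery process is just saving training state periodically and resuming from the last checkpoint after a crash. A minimal sketch in PyTorch (the path and save interval are arbitrary placeholders):

  import os
  import torch

  CKPT = "ckpt.pt"  # placeholder path

  def save_checkpoint(step, model, optimizer):
      # Write to a temp file and rename so a crash mid-save can't corrupt the checkpoint.
      tmp = CKPT + ".tmp"
      torch.save({"step": step,
                  "model": model.state_dict(),
                  "optimizer": optimizer.state_dict()}, tmp)
      os.replace(tmp, CKPT)

  def load_checkpoint(model, optimizer):
      if not os.path.exists(CKPT):
          return 0
      state = torch.load(CKPT, map_location="cpu")
      model.load_state_dict(state["model"])
      optimizer.load_state_dict(state["optimizer"])
      return state["step"]

  # In the training loop: resume, then save every N steps, so a GPU falling
  # off the bus only costs the work done since the last checkpoint.
  #   start = load_checkpoint(model, optimizer)
  #   for step in range(start, total_steps):
  #       ...train...
  #       if step % 1000 == 0:
  #           save_checkpoint(step, model, optimizer)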

shrubble 7 hours ago | parent | prev | next [-]

If you rebooted every server after 35 days, would that get rid of many of the problems?

direwolf20 4 hours ago | parent [-]

It's an average time to failure, not a guarantee. Failures occur randomly.
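
Back-of-envelope, assuming exponentially distributed failures at the ~50-day H100 MTBF from the table upthread; memorylessness means surviving to a scheduled reboot at day 35 buys you nothing for the next 35 days:

  import math

  mtbf_days = 50.0  # H100 critical-error MTBF from the table above

  def p_fail_within(days):
      # Exponential model: P(failure before t) = 1 - exp(-t / MTBF)
      return 1.0 - math.exp(-days / mtbf_days)

  print(p_fail_within(35))  # ~0.50: about half the fleet hits an error within 35 days

  # Conditional on surviving 35 days, the chance of failing in the *next* 35 is the same:
  p_next = (p_fail_within(70) - p_fail_within(35)) / (1.0 - p_fail_within(35))
  print(p_next)             # ~0.50 again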

jvalencia 6 hours ago | parent | prev [-]

I'm curious if running them at slightly lower voltage would fix it or if it's a software thing.
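
Data-center cards don't really expose user undervolting, but you can cap board power and let the driver's DVFS pull clocks and voltage down to fit under it. A sketch, assuming root access and a card that accepts software power limits (the 350 W figure is just an example; check the allowed range with nvidia-smi -q -d POWER):

  import subprocess

  # Cap GPU 0 at 350 W; the driver then backs off frequency/voltage to stay under it.
  subprocess.run(["nvidia-smi", "-i", "0", "-pl", "350"], check=True)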

nickysielicki 4 hours ago | parent | prev | next [-]

Totally matches my experience, and it feels bizarre inside-looking-out that nobody else talks about it. Hardware from 2010-2020 was remarkably stable, and CPUs are still as stable as they were, but we've had this large influx of money spent on these chips that fall over if you look at them funny. I think it leads to a lot of people thinking, "we must be doing something wrong", because it's just outside of their mental model that hardware failures can occur at this rate. But that's just the world we live in.

It's a perfect storm: a lot of companies are doing HPC-style distributed computing for the first time, and lack experience in debugging issues that are unique to it. On top of that, the hardware is moving very fast and they're ill equipped to update their software and drivers at the rate required to have a good experience. On top of that, the stakes are higher because your cluster is only as strong as its weakest node, which means a single hardware failure can turn the entire multi-million-dollar cluster into a paperweight, which adds more pressure and stress to get it all fixed.

Updating your software means taking that same multi-million-dollar cluster offline for several hours, which is seen as a cost rather than a good investment of time. And a lot of the experts in HPC-style distributed computing will sell you "supported" software, which is basically just paying for the privilege of using outdated software that lacks the bug fixes your cards might desperately need. That model made sense in the 2010s, when Linux (kernel and userspace) was less stable and you genuinely needed to lock your dependencies and let the bugs work themselves out. But it's the exact opposite of what you want to be doing in 2026.

You put all of this together, and it's difficult to be confident whether the hardware is bad, or going bad, or whether it's only manifesting because they're exposed to bugs, or maybe both. Yikes, it's no fun.

userbinator 12 minutes ago | parent | prev | next [-]

I wonder if GPUs are so dense that SEUs (single-event upsets) are even more common than in CPUs or RAM.

stingrae 10 minutes ago | parent | prev | next [-]

seems like it would be an issue for building datacenters in space/orbit

dlcarrier 8 hours ago | parent | prev | next [-]

They're also run far closer to the edge of their operational limits than CPUs are, so you're far more likely to get one that barely passes manufacturing tests and then degrades just a tiny bit and stops working.

jldugger 5 hours ago | parent | prev | next [-]

It's funny: I've been watching all the NVIDIA GTC keynotes from 2012 to now to better understand the ecosystem, and Jensen pretty clearly states a few times that "it's a miracle it works at all". Clearly he's intending to brag about the defect rate on a 50-billion-transistor chip, but maybe he's more right than he realizes.

bigwheels 8 hours ago | parent | prev | next [-]

FWIW, NVIDIA enterprise hardware does come with a good warranty and prompt RMA service.

A deep dive on why these beastly cards fail so frequently compared to all other common current day hardware would be fascinating!

nickysielicki 2 hours ago | parent | next [-]

> A deep dive on why these beastly cards fail so frequently compared to all other common current day hardware would be fascinating!

P=CV²f
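
Spelled out: dynamic power scales with capacitance, the square of supply voltage, and clock frequency, so pushing V and f a little buys a lot of extra heat. Illustrative numbers only, not measurements from any real card:

  # P ~ C * V^2 * f  (dynamic power)
  def relative_power(v_ratio, f_ratio):
      return v_ratio ** 2 * f_ratio

  # +8% voltage and +20% clock over a baseline:
  print(relative_power(1.08, 1.20))  # ~1.40, i.e. ~40% more dynamic power
  # ...and that's before leakage, which grows with voltage and temperature too.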

indoordin0saur 7 hours ago | parent | prev [-]

I don't know much about the subject, but GPUs were originally meant for gaming: they would run for a few hours a day and then get rest periods, and the power draw would also vary over the time they were being actively used. With constant 24/7 usage at max capacity, is it just possible that they are being pushed beyond what they were originally engineered for?

userbinator 10 minutes ago | parent | next [-]

That already changed once cryptocurrency mining started getting popular.

ls65536 6 hours ago | parent | prev [-]

My intuition would be that constant usage (not exceeding maximum rated capacity/thermals/etc.) should generally result in less wear compared to the more frequent thermal cycling that you might expect from intermittent use, but maybe there's something else going on here too. I suppose this would depend on what exactly the cause of the failure is.

Either way, these are obviously being intentionally sold for non-gaming workloads, so it wouldn't be a good argument to say they're just being (ab)used beyond what they were intended for... unless somehow they really are being pushed beyond design limits, but given the cost of these things I can't imagine anyone doing that willingly with a whole fleet of them.

greenavocado 6 hours ago | parent [-]

Electromigration may be a factor

zozbot234 5 hours ago | parent [-]

Electromigration decays exponentially with inverse temperature. If it's genuinely a factor, you're running that GPU way too hot.
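
For a rough sense of that temperature dependence: electromigration lifetime is commonly modeled with Black's equation, MTTF ∝ J^-n * exp(Ea / kT). A sketch with an assumed activation energy of ~0.9 eV (a typical ballpark, not a figure for these parts):

  import math

  K_B = 8.617e-5  # Boltzmann constant, eV/K
  E_A = 0.9       # assumed activation energy, eV

  def em_lifetime_ratio(t_cool_c, t_hot_c):
      # Ratio of electromigration MTTF at two junction temperatures,
      # holding current density constant (Black's equation).
      t_cool, t_hot = t_cool_c + 273.15, t_hot_c + 273.15
      return math.exp(E_A / K_B * (1.0 / t_hot - 1.0 / t_cool))

  print(em_lifetime_ratio(85, 95))  # ~0.45: roughly half the EM lifetime for +10 °C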

jayd16 6 hours ago | parent | prev | next [-]

For comparison, a GPU has way more memory than a single DIMM, and plenty of other things going on.

ecesena 4 hours ago | parent | prev [-]

Has anyone tried to "turn off some cores" (e.g. using the Multi-Instance GPU feature) and seen if/how that increases reliability?
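
For anyone who wants to try it, a sketch of carving up a card with MIG (assumes an A100/H100 where MIG mode can be toggled and a GPU reset is acceptable; the profile IDs are just an example):

  import subprocess

  def sh(*args):
      subprocess.run(args, check=True)

  # Enable MIG mode on GPU 0 (takes effect after a GPU reset or reboot).
  sh("nvidia-smi", "-i", "0", "-mig", "1")

  # List available GPU-instance profiles, then create instances plus default
  # compute instances from one of them (profile 9 is just an example).
  sh("nvidia-smi", "mig", "-lgip")
  sh("nvidia-smi", "mig", "-cgi", "9,9", "-C")

Whether that actually helps reliability is the open question, since MIG partitions SMs and memory rather than lowering clocks or power.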