touisteur 5 hours ago

And NVIDIA supposedly has the exact know-how for reliability, as their Jetson 'industrial' parts are qualified for 10-15 years at maximum temperature. Of course Jetson sits at another point on the flops-and-watts curve.

Just wondering whether reliability increases if you slow down your use of GPUs a bit - pausing more often, and not chasing every bubble and nvlink-all-reduce optimization.
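
If anyone wants to experiment with that, here's a minimal sketch using the NVML Python bindings (pynvml / nvidia-ml-py) to derate a card's board power limit - the 400 W cap is just a placeholder value, and setting the limit generally needs root:

    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    # Current and default board power limits, reported in milliwatts
    current = pynvml.nvmlDeviceGetPowerManagementLimit(handle)
    default = pynvml.nvmlDeviceGetPowerManagementDefaultLimit(handle)
    print(f"power limit: {current / 1000:.0f} W (default {default / 1000:.0f} W)")

    # Derate to a placeholder 400 W cap (illustrative value only, needs root)
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, 400_000)

    pynvml.nvmlShutdown()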

dsrtslnd23 3 hours ago

Jetson uses LPDDR, though. H100 failures seem driven by HBM heat sensitivity and the 700W+ power envelope. That's a completely different thermal density, I guess.
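
For what it's worth, the HBM temperature is reported separately from the core temperature on the data-center parts. A rough sketch with pynvml's field-value API, assuming the NVML_FI_DEV_MEMORY_TEMP field is supported on the card in question:

    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    # Core temperature, supported on essentially every GPU
    core = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)

    # HBM temperature via the field-value API (may be unsupported on consumer cards)
    field = pynvml.nvmlDeviceGetFieldValues(handle, [pynvml.NVML_FI_DEV_MEMORY_TEMP])[0]
    hbm = field.value.uiVal if field.nvmlReturn == pynvml.NVML_SUCCESS else None

    print(f"core: {core} C, HBM: {hbm} C")
    pynvml.nvmlShutdown()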

zozbot234 2 hours ago

Reliability also depends strongly on current density and applied voltage (electromigration, for one, scales with current density), perhaps even more than on thermal density itself. So "slowing down" your average GPU use in a long-term sustainable way ought to improve those reliability figures via multiple mechanisms. Jetsons are great for very small-scale, self-contained tasks (including on a performance-per-watt basis), but their limits are just as obvious, especially given the recently announced advances w.r.t. clustering the big server GPUs at the rack and perhaps multi-rack level.