| ▲ | touisteur 5 hours ago | |||||||
And NVIDIA supposedly has the exact knowhow for reliablity, as their Jetson 'industrial' parts are qualified for 10-15 years at maximal temp. Of course Jetson is on another point of the flops and watts curve. Just wondering, if reliability increases if you slow down your use of GPUs a bit. Like pausing more often and stopping chasing every bubble and nvlink-all-reduce optimization. | ||||||||
| ▲ | dsrtslnd23 3 hours ago | parent [-] | |||||||
Jetson uses LPDDR though. H100 failures seem driven by HBM heat sensitivity and the 700W+ envelope. That is a completely different thermal density I guess. | ||||||||
| ||||||||