thundergolfer 7 hours ago
Author here. That 1:50-100 ratio looks roughly right based on my research, but my numbers have GPUs faring even worse.
† “Critical error” refers to an NVIDIA Xid or sXid error that is not recoverable and requires an application and GPU reset.
Only a minority of GPU 'failures' appear to be permanent hardware problems, such as row remapping errors. Many seem to be, as another comment says, a consequence of operating too close to the operational limit, tipping over it, and then requiring a power cycle.
layoric 6 hours ago
I'm quite surprised the A100 is not much better, since I believe the power levels for the Ampere cards are a lot lower. Does this mean that even a model that fits on a single server and trains for a few weeks will absolutely need a recovery process? Interested in people's experiences around this.
shrubble 7 hours ago
If you rebooted every server after 35 days, would that get rid of many of the problems?
jvalencia 6 hours ago
I'm curious whether running them at a slightly lower voltage would fix it, or whether it's a software thing.