grubbs 6 days ago

Interesting. I work in higher ed and my team manages thousands of GPUs. I've rarely seen a failure, and mostly only when we put consumer-grade GPUs in servers (Nvidia doesn't like this). True server-grade GPUs never have any problems.

ecshafer 6 days ago

Is this for some kind of HPC cluster? What kind of utilization are you at? For an AI company these GPUs are going to be at near 100% utilization 24/7. These kinds of loads destroy hardware quickly.

bluedino 6 days ago

Every site I've worked at has had plenty of GPU failures. Not consumer-grade either: H100/A100.