bluedino 6 days ago

You want enterprise support. GPU quality is atrocious; you will have your Dell tech in there replacing GPUs and fans all the time.

c0balt 6 days ago

Can second this. The number of GPU failures we have with Lenovo systems on just <50 nodes is significantly higher than we expected. Having a Lenovo support person on premises at least twice a month, in the middle of the bathtub curve, is probably also costing them (and implicitly us) a good chunk of money.

grubbs 6 days ago

Interesting. I work in higher ed and we have thousands of GPUs under my team. I've rarely ever seen a failure, and mostly only when we put consumer-grade GPUs in servers (Nvidia doesn't like this). True server-grade GPUs never have any problems.

ecshafer 6 days ago

Is this for some kind of HPC cluster? What kind of utilization are you at? For an AI company these GPUs are going to be at near 100% utilization 24/7. These kinds of loads destroy hardware quickly.

bluedino 6 days ago

Every site I've worked at has plenty of GPU failures. Not consumer grade either: H100/A100.