nurettin 2 hours ago

> we were hit with this on a 256 gpu b200 cluster -- at day 66 all our jobs started randomly failing

ouch