wincy 4 hours ago

Crazy. So if I understand correctly, something with B200s and NVLink is causing issues where, after 66 days and 12 hours of uptime, nvidia-smi and other jobs start failing and timing out, and once you restart the cluster it starts working again.

They suspect jobs will still work if you only use 1 B200, but one person had already power cycled, so they weren't able to test it. Hopefully they won't have to wait another 66 days for further troubleshooting.

layla5alive 4 hours ago | parent [-]

Maybe some 32-bit counter somewhere that's used with NVLink overflows?

themafia 3 hours ago | parent | next [-]

66 days + 12 hours are 5,745,600,000,000,000 ns. The log2 of this is 52.351...

JavaScript and some other languages only have integer precision up to 52 bits, then they switch to floating point.

Curious.
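For anyone who wants to check the arithmetic above, here is a minimal Python sketch. The variable names are just illustrative, and the last line assumes the usual IEEE 754 double, which represents integers exactly up to 2**53.

```python
import math

NS_PER_SECOND = 1_000_000_000

# 66 days + 12 hours of uptime, expressed in seconds and then nanoseconds
uptime_seconds = 66 * 86_400 + 12 * 3_600
uptime_ns = uptime_seconds * NS_PER_SECOND

# Number of bits needed to count that many nanoseconds
bits = math.log2(uptime_ns)

print(uptime_ns)           # 5745600000000000
print(round(bits, 3))      # 52.351
print(uptime_ns < 2 ** 53) # True: still below the exact-integer limit of a double
```

So the nanosecond count sits between 2**52 and 2**53, which is suggestive but does not by itself line up exactly with a power-of-two boundary.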

loeg 3 hours ago | parent [-]

It's 32 bits of milliseconds, right? Hm, no, that would overflow much sooner (49.7 days).

oasisaimlessly 3 hours ago | parent [-]

It's a uint32_t of 750 Hz "jiffies", which does overflow at ~66 days.

userbinator 26 minutes ago | parent [-]

While that seems like a convincing explanation, 750Hz is a rather odd value to use for a timer, and more importantly the overflow would be at 66d6h43m43s instead of the reported ~66d12h.
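The overflow times quoted in this subthread can be reproduced with a short Python sketch. `overflow_breakdown` is a hypothetical helper written for this comment, not anything from the NVIDIA driver, and it simply assumes an unsigned counter that wraps after 2**bits ticks.

```python
def overflow_breakdown(tick_hz, bits=32):
    """Return (days, hours, minutes, seconds) until an unsigned
    counter of the given width wraps at the given tick rate."""
    total_seconds = (2 ** bits) // tick_hz  # whole seconds until wrap
    days, rem = divmod(total_seconds, 86_400)
    hours, rem = divmod(rem, 3_600)
    minutes, seconds = divmod(rem, 60)
    return days, hours, minutes, seconds

print(overflow_breakdown(1000))  # (49, 17, 2, 47)  32-bit milliseconds
print(overflow_breakdown(750))   # (66, 6, 43, 43)  32-bit 750 Hz ticks
```

The 750 Hz case lands about 5 hours short of the reported ~66d12h, which is the mismatch pointed out above.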

mook 4 hours ago | parent | prev [-]

Isn't a 32-bit counter 49 days? Assuming that one was counting milliseconds, at least.

Only remember that because that's the limit for Windows 95…

repiret 3 hours ago | parent [-]

100ns intervals. My favorite part of that story is how long after Windows 95 was released before anybody discovered the bug.

justsomehnguy an hour ago | parent [-]

That's because people actually powered off their computers after work/leisure sessions. Someone on an unlimited night dial-up could have discovered it well before "anybody", but it's not like there was a built-in function to actually send a crash report to Redmond.

https://i.sstatic.net/p9hUgGfg.png