Crazy. So if I understand correctly, something with B200s and NVLink causes nvidia-smi and other jobs to start failing or timing out after 66 days and 12 hours of uptime, and once you restart the cluster everything works again.

They suspect jobs will work if you only use 1 B200, but one person power cycled so wasn’t able to test it. Hopefully they won’t have to wait another 66 days for further troubleshooting.

layla5alive 4 hours ago | parent [-]

Maybe some 32-bit counter that's only used with NVLink overflows?

themafia 3 hours ago | parent | next [-]

66 days + 12 hours are 5,745,600,000,000,000 ns. The log2 of this is 52.351...

JavaScript and some other languages store every number as a 64-bit float, which has only 52 bits of mantissa, so integers stop being exact past 2^53.

Curious.
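
For anyone who wants to check that arithmetic, here is a minimal sketch. The 2**53 part assumes the timestamp ends up in a 64-bit float somewhere, which is pure speculation and not something anyone here knows about the driver.

```python
import math

NS_PER_SECOND = 1_000_000_000
SECONDS_PER_DAY = 86_400

# 66 days + 12 hours of uptime, expressed in nanoseconds.
uptime_seconds = 66 * SECONDS_PER_DAY + 12 * 3_600
uptime_ns = uptime_seconds * NS_PER_SECOND

# Number of bits needed to hold that nanosecond count.
bits = math.log2(uptime_ns)

# A 64-bit float represents every integer exactly only up to 2**53.
# Assumption: the counter is a nanosecond count stored in such a float.
max_exact_int = 2 ** 53
days_until_inexact = max_exact_int / NS_PER_SECOND / SECONDS_PER_DAY

print(uptime_ns)                      # 5745600000000000
print(round(bits, 3))                 # 52.351
print(round(days_until_inexact, 1))   # 104.2
```

So a nanosecond count crosses 2^52 a bit before the reported uptime, but a float would not actually start losing integer precision until roughly 104 days.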

loeg 3 hours ago | parent [-]

It's 32 bits of milliseconds, right? Hm, no, that would overflow much sooner (49.7 days).
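
Quick sanity check of the 49.7 figure, assuming an unsigned 32-bit counter that ticks once per millisecond:

```python
MS_PER_DAY = 24 * 60 * 60 * 1000

# An unsigned 32-bit counter wraps after 2**32 ticks.
wrap_days = 2 ** 32 / MS_PER_DAY

print(f"{wrap_days:.2f} days")   # 49.71 days
```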

oasisaimlessly 3 hours ago | parent [-]

It's a uint32_t of 750 Hz "jiffies", which does overflow at ~66 days.

userbinator 26 minutes ago | parent [-]

While that seems like a convincing explanation, 750 Hz is a rather odd value to use for a timer, and more importantly the overflow would be at 66d6h43m43s instead of the reported ~66d12h.
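
The numbers are easy to reproduce. A rough sketch, where `wrap_time` is just a helper name made up for illustration and the counter is assumed to be an unsigned 32-bit tick count:

```python
def wrap_time(hz, bits=32):
    """Return (days, hours, minutes, seconds) until a counter of the
    given width, ticking at `hz`, wraps around (whole seconds only)."""
    total_seconds = (2 ** bits) // hz
    days, rem = divmod(total_seconds, 86_400)
    hours, rem = divmod(rem, 3_600)
    minutes, seconds = divmod(rem, 60)
    return days, hours, minutes, seconds

print(wrap_time(750))    # (66, 6, 43, 43)
print(wrap_time(1000))   # (49, 17, 2, 47)

# Tick rate that would make a 32-bit counter wrap at exactly 66d12h.
# The reported uptime is approximate, so treat this as a ballpark.
reported_seconds = 66 * 86_400 + 12 * 3_600
implied_hz = 2 ** 32 / reported_seconds
print(round(implied_hz, 2))   # 747.52
```

Working backwards, an exact 66d12h wrap would need a tick rate of about 747.5 Hz, which is not a round number either, so the reported uptime is probably just rounded.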

mook 4 hours ago | parent | prev [-]

Isn't a 32-bit counter 49 days? Assuming it was counting milliseconds, at least.

Only remember that because that's the limit for Windows 95…

repiret 3 hours ago | parent [-]

100ns intervals. My favorite part of that story is how long after Windows 95 was released before anybody discovered the bug.

justsomehnguy an hour ago | parent [-]

That's because people actually powered off their computers after work/leisure sessions. Someone on unlimited night dial-up could have discovered it well before "anybody", but it's not like there was a built-in function to actually send a crash report to Redmond. https://i.sstatic.net/p9hUgGfg.png