adonovan · 2 days ago
Very interesting. The Go toolchain has an (off by default) telemetry system. For Go 1.23, I added the runtime/debug.SetCrashOutput function and used it to gather field reports containing stack traces for crashes in any running goroutine. Since we enabled it over a year ago in gopls, our LSP server, we have discovered hundreds of bugs. Even with only about 1 in 1000 users enabling telemetry, it has been an invaluable source of information about crashes. In most cases it is easy to reconstruct a test case that reproduces the problem, and the bug is fixed within an hour. We have fixed dozens of bugs this way.

When the cause is not obvious, we "refine" the crash by adding if-statements and assertions so that after the next release we gain one additional bit of information from the stack trace about the state of execution.

However, there was always a stubborn tail of field reports that couldn't be explained: corrupt stack pointers, corrupt g registers (the thread-local pointer to the current goroutine), or panics dereferencing a pointer that had just passed a nil check. All of these point to memory corruption. In theory anything is possible if you abuse unsafe or have a data race, but I audited every use of unsafe in the executable and am convinced they are safe. Proving the absence of data races is harder, but races usually exhibit some kind of locality in which variable gets clobbered, and that wasn't the case here.

In some cases we have even seen crashes in non-memory instructions (e.g. MOV ZR, R1), which implicates misexecution: a fault in the CPU (or a bug in the telemetry bookkeeping, I suppose).

As a programmer I've been burned too many times by prematurely blaming the compiler or runtime for mistakes in my own code, so it took a long time to gain the confidence to suspect the foundations in this case. But I recently did some napkin math (see https://github.com/golang/go/issues/71425#issuecomment-39685...) and came to the conclusion that the surprising number of inexplicable field reports (about 10/week among our users) is well within the realm of faulty hardware, especially since our users are overwhelmingly using laptops, which don't have parity memory.

I would love to get definitive confirmation, though. I wonder what test the Firefox team runs on memory in their crash-reporting software.
aforwardslash · a day ago
> In some cases we have even seen crashes in non-memory instructions (e.g. MOV ZR, R1), which implicates misexecution: a fault in the CPU (or a bug in the telemetry bookkeeping, I suppose).

That's the thing: bit flips affect everything memory-resident, and that includes program code. You have no way of telling what instruction was actually executed on the line your instrumentation says corresponds to the MOV; or it may have been a legitimate memory operation whose offset the instrumentation reported wrongly. There are some ways around it, but generically, if a system runs a program bigger than the processor cache and is subject to bit flips, the output is useless, including whatever telemetry you use (because the telemetry itself is code executed from RAM and will touch RAM).
| ||||||||||||||||||||||||||
nitwit005 · a day ago
You might consider adding the CPU temperature to the report, if there's a reasonable way to get it (I haven't tried inside a VM). Then you could at least filter out reports from extremely hot hardware.
| ||||||||||||||||||||||||||
tczMUFlmoNk · 7 hours ago
> Even with only about 1 in 1000 users enabling telemetry

How do you know the number/proportion of users who run without telemetry enabled, since by definition you're not collecting their data? (Not imputing any malice; genuinely curious.)
| ||||||||||||||||||||||||||
jamesfinlayson · 21 hours ago
Interesting reading. I've occasionally seen some odd crashes in an iOS app that I'm partly responsible for. It's running some ancient version of New Relic that doesn't give stack traces, but it does give line numbers, and the crashes are always on something that should never fail (decoding JSON that had successfully decoded thousands of times per day). I never dug too deeply, but the app is still running on some out-of-support iPads, so maybe it's random bit flips.
sieep · a day ago
I've been trying to push my boss towards more analytics/telemetry in production focused on crashes. Thanks for sharing.
charcircuit · 16 hours ago
> All of these point to memory corruption.

Actually, "dereferencing a pointer that had just passed a nil check" could also result from a control-flow fault, where the branch fails to be taken correctly.