A common case is a pointer that points to unallocated address space triggers a segfault and when you look at the pointer you can see that it's valid except for one bit.

▲ dboreham a day ago | parent [-]

That tells you one bit was changed. It doesn't prove that single bit changed due to a hardware failure. It could have been changed by broken software.

▲ sfink 10 hours ago | parent | next [-]

[I work at Mozilla]

Yes, that's a confounding factor, and in fact the starting assumption when looking at a crash. Sometimes you can be pretty sure it's hardware. For example, if it's a crash on an illegal instruction in non-JITted code, the crash reporter can compare that page of data with the on-disk image that it's supposed to be a read-only copy of. Any mismatches there, especially if they're single bit flips, are much more likely to be hardware.

But I've also seen it several times when the person experiencing the crashes engages on the bug tracker. Often, they'll get weird sporadic but fairly frequent crashes when doing a particular activity, and so they'll initially be absolutely convinced that we have a bug there. But other people aren't reporting the same thing. They'll post a bunch of their crash reports, and when we look at them, they're kind of all over the place (though as they say, almost always while doing some particular thing). Often it'll be something like a crash in the garbage collector while watching a youtube video, and the crashes are mostly the same but scattered in their exact location in the code. That's a good signal to start suspecting bad memory: the GC scans lots of memory and does stuff that is conditional on possibly faulty data. We'll start asking them to run a memory test, at least to rule out hardware problems. When people do it in this situation, it almost always finds a problem. (Many people won't do it, because it's a pain and they're understandably skeptical that we might be sandbagging them and ducking responsibility for a bug. So we don't start proposing it until things start feeling fishy.)

But anyway, that's just anecdata from individual investigations. gsvelto's post is about what he can see at scale.

▲ LeifCarrotson a day ago | parent | prev [-]

Broken software causes null pointer references and similar logic errors. It would be extremely unusual to have an inadvertent

    ptr ^= (1 << rand_between(0,64));

that got inserted in the code by accident. That's just not the way that we write software.

▲

vlovich123 a day ago | parent [-]

Except no one is claiming the bit flip is the pointer vs the data being pointed to or a non pointer value. Given how we write software there’s a lot more bits not in pointer values that still end up “contributing “ to a pointer value. Eg some offset field that’s added to a pointer has a bit flip, the resulting pointer also has a bit flip. But the offset field could have accidentally had a mask applied or a bit set accidentally due to the closeness of & and && or | and ||.

▲

rockdoe 19 hours ago | parent [-]

I think that if you hit the crash in the same line of code many times, you can safely assume it's your own bug and not a memory issue.

If it's only hit once by a random person, memory starts being more likely.

(Unless that LOC is scanning memory or smth)

▲

vlovich123 11 hours ago | parent [-]

Deduplicating and identifying the source of a crash point is surprisingly hard, to the point that “it’s the only crash of its kind” could be a bug in your logic for linking issues.

Also, in an unsafe language all bets are off. A memory clobber, UAF or race condition can generate quite strange and ephemeral crashes. Even if the majority of time it generates the “same” failure mode, it can still sporadically generate a rare execution trace. It’s best to stop thinking of these as deterministic processes and more as a distribution of possible outcomes.

	▲	gcp 10 hours ago \| parent [-]
		Deduplicating and identifying the source of a crash point is surprisingly hard, to the point that “it’s the only crash of its kind” could be a bug in your logic for linking issues. This is a bit vague to really reply to very specifically, but yes, this is hard. Which is why quite some people work in this area. It's rather valuable to do so at Firefox-scale. Even if the majority of time it generates the “same” failure mode, it can still sporadically generate a rare execution trace. This doesn't matter that much because the "same" failure mode already allows you to see the bug and fix it.