thegrim33 2 days ago

A 5 part thread where they say they're "now 100% positive" the crashes are from bitflips, yet not a single word is spent on how they're supposedly detecting bitflips other than just "we analyze memory"?

rincebrain a day ago | parent | next [-]

The simplest way to do this, what I believe memtest86 and friends do, is to write a fixed pattern over a region of memory and then read it back later and see if it changed; then you write patterns that require flipping the bits that you wrote before, and so on.
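
A minimal sketch of that pattern-write/read-back idea (the function name and structure are mine, not memtest86's actual implementation):

```c
#include <stddef.h>
#include <stdint.h>

/* Fill a buffer with a fixed pattern, then read it back and count the
 * words that no longer match. Running it a second time with the
 * complementary pattern (e.g. 0xAAAA... then 0x5555...) forces every
 * bit to transition at least once, which is the "patterns that require
 * flipping the bits you wrote before" part. */
static size_t check_pattern(uint64_t *buf, size_t words, uint64_t pattern)
{
    for (size_t i = 0; i < words; i++)
        buf[i] = pattern;

    /* ... a real tester would wait here, or hammer adjacent memory ... */

    size_t mismatches = 0;
    for (size_t i = 0; i < words; i++)
        if (buf[i] != pattern)
            mismatches++;
    return mismatches;
}
```

On healthy memory both passes report zero mismatches; any nonzero count on the read-back pass is a detected corruption.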

Things like [1] will also tell you that something corrupted your memory, and if you see a nontrivial (e.g. lots of bits high and low) magic number that has only a single bit wrong, it's probably not a random overwrite - see the examples in [2].
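
The "single bit wrong" test boils down to a popcount of the XOR between the expected magic number and what you actually found in memory (a sketch; the real check is in the commit linked as [1]):

```c
#include <stdint.h>

/* Hamming distance between the expected magic number and the observed
 * value: 0 means intact, 1 strongly suggests a hardware bit flip, and
 * a large distance suggests a random overwrite by buggy software. */
static int bits_changed(uint64_t expected, uint64_t observed)
{
    return __builtin_popcountll(expected ^ observed);
}
```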

There's also a fun prior example of experiments in this at [3], where someone camped on single-bit variations of a bunch of popular domains ("bitsquatting") and measured how often people hit them.

edit: Finally, digging through the Mozilla source, I would imagine [4] is what they're using as a tester when it crashes.

[1] - https://github.com/mozilla-firefox/firefox/commit/917c4a6bfa...

[2] - https://bugzilla.mozilla.org/show_bug.cgi?id=1762568

[3] - https://media.defcon.org/DEF%20CON%2019/DEF%20CON%2019%20pre...

[4] - https://github.com/mozilla-firefox/firefox/blob/main/toolkit...

wging a day ago | parent | next [-]

[4] looks like it's only a runner for the actual testing, which is a separate crate: https://github.com/mozilla/memtest

(see: https://github.com/mozilla-firefox/firefox/blob/main/toolkit..., which points to a specific commit in that repo - turns out to be tip of main)

rendaw a day ago | parent | prev [-]

That would tell you if there's a bitflip in your test, but not if a bitflip in normal program data is causing a crash, no? IIUC, GP's question was how they actually tell, after a crash, that that crash was caused by a bitflip.

rincebrain a day ago | parent [-]

The example I gave in there is of adding sentinel values in your data, so you can check the constants in your data structures later and go "oh, this is overwritten with garbage" versus "oh, this is one or two bits off". I would imagine plumbing things like that through most common structures is what was done there, though I haven't done the archaeology to find out, because Firefox is an enormous codebase to try and find one person's commits from several years ago in.
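
Plumbing a sentinel through a data structure might look something like this (entirely hypothetical structure and names — not Firefox's code):

```c
#include <stdint.h>

/* A nontrivial canary: lots of bits both high and low, so a random
 * overwrite is very unlikely to land within a bit or two of it. */
#define NODE_CANARY 0xF00DFACECAFEBEEFull

struct node {
    uint64_t magic;       /* sentinel, set to NODE_CANARY at creation */
    struct node *next;
    int payload;
};

/* Post-crash triage: 0 = sentinel intact, 1 = one or two bits off
 * (suspect hardware), 2 = clobbered with garbage (suspect software). */
static int classify(const struct node *n)
{
    uint64_t diff = n->magic ^ NODE_CANARY;
    if (diff == 0)
        return 0;
    return __builtin_popcountll(diff) <= 2 ? 1 : 2;
}
```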

kevincox 10 hours ago | parent | next [-]

This doesn't always protect against out-of-bounds writes. Although if these sentinel values are in read only memory mappings it probably gets pretty close. (Especially if you consider kernel memory corruption a "bitflip".)

patrulek 20 hours ago | parent | prev [-]

But it would also be possible that the sentinel value used for the comparison changed because of a bitflip, rather than the data structure used by the program.

tredre3 2 days ago | parent | prev | next [-]

> last year we deployed an actual memory tester that runs on user machines after the browser crashes.

He doesn't explain anything indeed but presumably that code is available somewhere.

hedora a day ago | parent | next [-]

That, and 50% of the machines where their heuristics say it is a hardware error fail basic memory tests.

I've seen a lot of confirmed bitflips with ECC systems. The vast majority of machines that are impacted are impacted by single event upsets (not reproducible).

(I worded that precisely but strangely because if one machine has a reproducible problem, it might hit it a billion times a second. That means you can't count by "number of corruptions".)

My take is that their 10% estimate is a lower bound.

hexyl_C_gut a day ago | parent | prev | next [-]

It sounds like they don't know that the crashes are from bitflips, only that the crashes come from people with flaky memory, which probably caused them?

wmf a day ago | parent | prev | next [-]

A common case is a pointer into unallocated address space that triggers a segfault; when you look at the pointer, you can see that it's valid except for one bit.
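
That check is simple to state in code (a sketch, given the faulting address from the crash and the address the pointer was expected to hold):

```c
#include <stdint.h>

/* Returns 1 if the faulting address differs from the expected address
 * in exactly one bit. diff & (diff - 1) clears the lowest set bit, so
 * it is zero exactly when diff is a power of two (one bit set). */
static int is_single_bitflip(uintptr_t faulting, uintptr_t expected)
{
    uintptr_t diff = faulting ^ expected;
    return diff != 0 && (diff & (diff - 1)) == 0;
}
```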

dboreham a day ago | parent [-]

That tells you one bit was changed. It doesn't prove that the bit changed due to a hardware failure; it could have been changed by broken software.

sfink 10 hours ago | parent | next [-]

[I work at Mozilla]

Yes, that's a confounding factor, and in fact the starting assumption when looking at a crash. Sometimes you can be pretty sure it's hardware. For example, if it's a crash on an illegal instruction in non-JITted code, the crash reporter can compare that page of data with the on-disk image that it's supposed to be a read-only copy of. Any mismatches there, especially if they're single bit flips, are much more likely to be hardware.
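
The read-only-page comparison described above reduces to counting differing bits between the in-memory page and the corresponding bytes of the on-disk binary (an assumed helper for illustration, not the actual crash-reporter code):

```c
#include <stddef.h>
#include <stdint.h>

/* Compare the in-memory copy of a supposedly read-only code page with
 * the bytes read from the on-disk image, and count differing bits.
 * Zero means the page is pristine; exactly one differing bit is a
 * strong hint of a hardware flip rather than a software overwrite. */
static unsigned diff_bits(const uint8_t *in_memory, const uint8_t *on_disk,
                          size_t len)
{
    unsigned bits = 0;
    for (size_t i = 0; i < len; i++)
        bits += (unsigned)__builtin_popcount(in_memory[i] ^ on_disk[i]);
    return bits;
}
```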

But I've also seen it several times when the person experiencing the crashes engages on the bug tracker. Often, they'll get weird sporadic but fairly frequent crashes when doing a particular activity, and so they'll initially be absolutely convinced that we have a bug there. But other people aren't reporting the same thing. They'll post a bunch of their crash reports, and when we look at them, they're kind of all over the place (though as they say, almost always while doing some particular thing). Often it'll be something like a crash in the garbage collector while watching a youtube video, and the crashes are mostly the same but scattered in their exact location in the code. That's a good signal to start suspecting bad memory: the GC scans lots of memory and does stuff that is conditional on possibly faulty data. We'll start asking them to run a memory test, at least to rule out hardware problems. When people do it in this situation, it almost always finds a problem. (Many people won't do it, because it's a pain and they're understandably skeptical that we might be sandbagging them and ducking responsibility for a bug. So we don't start proposing it until things start feeling fishy.)

But anyway, that's just anecdata from individual investigations. gsvelto's post is about what he can see at scale.

LeifCarrotson a day ago | parent | prev [-]

Broken software causes null pointer dereferences and similar logic errors. It would be extremely unusual to have an inadvertent

    ptr ^= (1ULL << rand_between(0, 63));

that got inserted in the code by accident. That's just not the way we write software.

vlovich123 a day ago | parent [-]

Except no one is claiming the bit flip is in the pointer itself, versus the data being pointed to or some non-pointer value. Given how we write software, there are a lot more bits outside of pointer values that still end up "contributing" to a pointer value. E.g. some offset field that's added to a pointer has a bit flip, so the resulting pointer also has a bit flip. But the offset field could have accidentally had a mask applied, or a bit set, due to the closeness of & and && or | and ||.
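
A hypothetical illustration of that `&` vs. `&&` slip (names and values made up): the intended code masks the offset, the buggy version collapses it to 0 or 1, and the derived pointer ends up differing from the correct one in ways that can resemble flipped bits.

```c
#include <stdint.h>

/* Intended: keep the low 12 bits of the offset before indexing. */
static const char *intended(const char *base, uint32_t off)
{
    return base + (off & 0xFFF);
}

/* Typo: logical && instead of bitwise &, so the "offset" added to the
 * pointer is 0 or 1 regardless of the real offset value. */
static const char *buggy(const char *base, uint32_t off)
{
    return base + (off && 0xFFF);
}
```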

rockdoe 19 hours ago | parent [-]

I think that if you hit the crash in the same line of code many times, you can safely assume it's your own bug and not a memory issue.

If it's only hit once by a random person, memory starts being more likely.

(Unless that LOC is scanning memory or smth)

vlovich123 11 hours ago | parent [-]

Deduplicating and identifying the source of a crash point is surprisingly hard, to the point that “it’s the only crash of its kind” could be a bug in your logic for linking issues.

Also, in an unsafe language all bets are off. A memory clobber, use-after-free, or race condition can generate quite strange and ephemeral crashes. Even if the majority of the time it generates the "same" failure mode, it can still sporadically generate a rare execution trace. It's best to stop thinking of these as deterministic processes and more as distributions of possible outcomes.

gcp 10 hours ago | parent [-]

> Deduplicating and identifying the source of a crash point is surprisingly hard, to the point that "it's the only crash of its kind" could be a bug in your logic for linking issues.

This is a bit vague to reply to very specifically, but yes, this is hard. Which is why quite a few people work in this area. It's rather valuable to do so at Firefox scale.

> Even if the majority of time it generates the "same" failure mode, it can still sporadically generate a rare execution trace.

This doesn't matter that much because the "same" failure mode already allows you to see the bug and fix it.

hrmtst93837 20 hours ago | parent | prev [-]

I think claiming '100% positive' without explaining how you detect bitflips is a red flag, because credible evidence looks like ECC error counters and machine check events parsed by mcelog or rasdaemon, reproducible memtest86 failures, or software page checksums that mismatch at crash time.

Ask them to publish raw MCE and ECC dumps with timestamps correlated to crashes, or reproduce the failure with controlled fault injection or persistent checksums, because without that this reads like a hypothesis dressed up as a verdict.

gcp 10 hours ago | parent [-]

I don't think Firefox has the access permissions needed to read MCE status, and the vast majority of our users don't have ECC, let alone would they run memtest86(+) after a Firefox crash.

If they did, we wouldn't be having this discussion to begin with!