colechristensen 19 hours ago

Bit flips do not only happen inside RAM

Also, in a game, there is a tremendously large chance that any particular bit flip will have exactly 0 effect on anything. Sure you can detect them, but one pixel being wrong for 1/60th of a second isn't exactly ... concerning.

The chance for a bit flip to affect a critical path that is noticeable by the player is very low, and quite a bit lower if you design your game to react gracefully. There's a whole practice of writing code for radiation-hardened environments that largely consists of strategies for recovering from an impossible-to-reach state.

PunchyHamster 15 hours ago | parent | next [-]

> The chance for a bit flip to affect a critical path that is noticeable by the player is very low, and quite a bit lower if you design your game to react gracefully.

Nobody does

> There's a whole practice of writing code for radiation hardened environments that largely consists of strategies for recovering from an impossible to reach state.

And again, nobody does, except for things that go to space and a few critical machines. The closest a normal user will get to code written like that is probably a car ECU; there are even automotive-targeted MCUs that not only run ECC but also run two cores in lockstep and crash if they disagree.

colechristensen 6 hours ago | parent [-]

Sure they do, you just have to think about it a different way.

It boils down to exception handling: you don't expect all of your bugs or security vulnerabilities to be known, so you write your code to be able to react to unplanned states without crashing. Bugs or security vulnerabilities can look a lot like a cosmic ray... a buffer overflow putting garbage in unexpected memory locations vs a cosmic ray putting garbage in unexpected memory locations... a lot of the mitigations are quite the same.

colinb 17 hours ago | parent | prev | next [-]

> code for radiation hardened environments

I’m aware of code that detects bit flips via unreasonable value detection (“this counter cannot be this high so quickly”). What else is there?
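
For concreteness, the kind of check I mean looks roughly like this (a hypothetical sketch; the counter name and bound are made up):

```python
# "Unreasonable value" detection: a counter that can only legitimately grow
# by a bounded amount per tick is checked against that bound on every read.

MAX_DELTA_PER_TICK = 10  # assumed physical limit for this counter

def check_counter(prev: int, current: int) -> int:
    """Return the accepted value, keeping the last known-good one if the jump is implausible."""
    if current < prev or current - prev > MAX_DELTA_PER_TICK:
        # The counter could not have changed this much in one tick:
        # treat the reading as corrupted.
        return prev
    return current

print(check_counter(100, 105))        # plausible step -> accepted (105)
print(check_counter(100, 9_000_000))  # implausible jump -> rejected (100)
```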

gmueckl 17 hours ago | parent | next [-]

For safety critical systems, one strategy is to store at least two copies of important data and compare them regularly. If they don't match, you either try to recover somehow or go into a safe state, depending on the context.
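
A rough sketch of that idea (not taken from any real safety standard; class and error names are made up):

```python
# Keep two copies of an important value and compare them on every read;
# a mismatch means we can't tell which copy is good, so we fail safe.

class DuplicatedValue:
    def __init__(self, value: int):
        self.copy_a = value
        self.copy_b = value

    def write(self, value: int) -> None:
        self.copy_a = value
        self.copy_b = value

    def read(self) -> int:
        if self.copy_a != self.copy_b:
            # No way to know which copy was corrupted: enter a safe state.
            raise RuntimeError("memory mismatch, entering safe state")
        return self.copy_a

v = DuplicatedValue(42)
print(v.read())   # both copies agree -> 42
v.copy_b ^= 1     # simulate a bit flip in one copy
# v.read() would now raise and force the safe state
```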

d1sxeyes 17 hours ago | parent | next [-]

At least three copies, so you can recover based on consensus.

Dylan16807 17 hours ago | parent | next [-]

If your pieces of important data are very tiny, that's probably your best option.

If they're hundreds of bytes or more, then two copies plus two hashes will do a better job.
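
Something like this (an illustrative sketch; the storage layout is made up, and SHA-256 stands in for whatever checksum you'd actually use):

```python
import hashlib

# Two copies plus a digest per copy: on read, keep whichever copy
# still matches its own digest.

def digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def store(data: bytes):
    return [(bytearray(data), digest(data)),
            (bytearray(data), digest(data))]

def read(copies) -> bytes:
    for copy, h in copies:
        if digest(bytes(copy)) == h:
            return bytes(copy)  # this copy is intact
    raise RuntimeError("both copies corrupted")

copies = store(b"important configuration data")
copies[0][0][3] ^= 0x10   # flip a bit in the first copy
print(read(copies))        # falls back to the intact second copy
```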

d1sxeyes 12 hours ago | parent | next [-]

Ah, true! You just restore the one that matches its hash. Elegant.

rixed 8 hours ago | parent | prev [-]

A single hash should be enough.

Dylan16807 5 hours ago | parent [-]

Yes, but what's easier depends on layout. "Consensus" makes me think of multiple entire nodes, and in that situation you can have a nice symmetry by making each node store one copy and one small hash.

If you're doing something that's more centralized then one hash might be simpler, but if you're centralized then you should probably use your own error correction codes instead of having multiple copies.

qznc 8 hours ago | parent | prev | next [-]

In many cases the system is perfectly safe when it shuts off. Two is enough for that.

pizza 13 hours ago | parent | prev [-]

“never go to sea with two chronometers, take one or three”

DennisP 8 hours ago | parent [-]

Seems like chronometers would be a case where two are better than one, because the mistakes are analog. If they don't exactly agree, just take the average. You'll have more error than if you were lucky enough to take the better chronometer, but less than if you had taken only the worse one. Minimizing the worst case is probably the best way to stay off the rocks.

Helmut10001 15 hours ago | parent | prev [-]

I use ZFS even on consumer devices, these days. Parity checks all the way!

vntok 17 hours ago | parent | prev | next [-]

You can have voting systems in place, where at least 2 out of 3 different code paths have to produce the same output for it to be accepted. This can be done with multiple systems (by multiple teams/vendors) or more simply with multiple tries of the same path, provided you fully reload the input in between.
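
A minimal sketch of the voting part (hypothetical; assumes the three results are comparable values):

```python
from collections import Counter

# 2-out-of-3 majority vote: accept an output only if at least
# two of the three independent runs agree.

def vote(results):
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no 2-out-of-3 agreement")
    return value

print(vote([7, 7, 7]))  # unanimous -> 7
print(vote([7, 9, 7]))  # one run disagrees -> majority value 7
```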

qznc 17 hours ago | parent | prev [-]

The simplest one is a watchdog: If something stops with regular notifications, then restart stuff.
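
In software it can be as small as this (an illustrative sketch; real watchdogs are usually hardware timers that reset the whole chip):

```python
import time

# The monitored task must "kick" the watchdog regularly; if the kicks
# stop, the next check triggers the recovery action.

class Watchdog:
    def __init__(self, timeout: float, on_timeout):
        self.timeout = timeout
        self.on_timeout = on_timeout
        self.last_kick = time.monotonic()

    def kick(self):
        self.last_kick = time.monotonic()

    def check(self):
        if time.monotonic() - self.last_kick > self.timeout:
            self.on_timeout()

wd = Watchdog(timeout=0.1, on_timeout=lambda: print("restarting stuck task"))
wd.kick()
time.sleep(0.2)  # simulate the task hanging past its deadline
wd.check()       # fires the restart action
```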

gmueckl 17 hours ago | parent [-]

A watchdog guards against unresponsive software. It doesn't protect against bad data directly. Not all bad data makes a system freeze.

Helmut10001 19 hours ago | parent | prev [-]

Interesting, I was not aware! Do you have statistics for the percentage of bit flips that happen in RAM? My feeling is that it's the majority, but I could be wrong.

Tomte 17 hours ago | parent | next [-]

IEC 61508 estimates a soft error rate of about 700 to 1200 FIT (Failures in Time, i.e. 1E-9 failures/hour).

That was in the 2000s though, and for embedded memory above 65nm. I would expect smaller feature sizes to be more error-prone.
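
For scale, a rough back-of-the-envelope conversion of that rate (illustrative arithmetic only; the standard's exact normalization, e.g. per device vs per Mbit, matters in practice):

```python
fit = 1000                        # mid-range of the quoted 700-1200 FIT
failures_per_hour = fit * 1e-9    # FIT = failures per 1e9 hours
hours_per_year = 24 * 365
failures_per_year = failures_per_hour * hours_per_year
print(failures_per_year)  # ~0.0088, i.e. roughly one soft error per ~114 years
```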

colechristensen 19 hours ago | parent | prev [-]

It would be quite hard to gather that data and would be highly dependent on hardware and source of bit flip.

But there's volatile and nonvolatile memory all over a computer, and anywhere data is in flight, be it inside the CPU or in any wires, traces, or other chips along the data path, it can be subject to interference, cosmic rays, heat- or voltage-related errors, etc.

ZiiS 18 hours ago | parent [-]

It should be fairly easy to see statistically whether ECC helps; people do run Firefox on it.

The number of bits in registers, buses, and cache layers is very small compared to the number in RAM. Obviously they might be hotter or more likely to flip.

bpye 17 hours ago | parent [-]

I believe caches and maybe registers often have ECC too, though I'm sure there are still gaps.