addaon 5 hours ago

I’d really, really like to know what microcontroller family this was found on. Assuming that this is a safety processor (lockstep, ECC, etc.), it suggests that ECC was insufficient for the level of bit flips they’re seeing, and if the concern is data corruption rather than unintended restart, it means there are enough flips in one word to be undetectable. The environment they’re operating in isn’t that different from everyone else’s, so unless they ate some margin elsewhere (bad voltage corner or something), this can definitely be relevant to others. It would also be interesting to know whether it’s NVM or SRAM that’s affected.

RealityVoid an hour ago | parent | next [-]

See my other comments in the other threads. This does not have EDAC. I was as surprised as you, but it doesn't seem to be an MCU but a composition of several distinct chips. That flight computer was designed in the '90s and updated in 2002 with a new HW variant that does have EDAC. So yes, for this kind of thing, I can buy that a bit flip happened.

You can see much more data in the report:

https://www.atsb.gov.au/sites/default/files/media/3532398/ao...

TehCorwiz 4 hours ago | parent | prev | next [-]

An early revision of the Raspberry Pi 2 would crash if you hit it with a bright light like a camera flash. Specifically a xenon flash.

https://forums.raspberrypi.com/viewtopic.php?t=99167

https://forums.raspberrypi.com/viewtopic.php?f=28&t=99042

https://www.raspberrypi.com/news/xenon-death-flash-a-free-ph...

https://www.youtube.com/watch?v=wyptwlzRqaI

russdill an hour ago | parent | next [-]

Completely unrelated and due to a design failure by the rpi folks.

mlyle 2 hours ago | parent | prev [-]

Yah, but that's a case of the package not being opaque enough.

anonymousiam an hour ago | parent | prev | next [-]

Proper SEU mitigation goes far beyond ECC. Satellites fly higher than the A320, and they (at least the ones I know about) use Triple Modular Redundancy: https://en.wikipedia.org/wiki/Triple_modular_redundancy

https://en.wikipedia.org/wiki/Single-event_upset

For manned spaceflight, NASA ups N from 3 to 5.

Other mitigations include completely disabling all CPU caches (with a big performance hit), and continuously refreshing the ECC RAM in background.

There are also a bunch of hardware mitigations to prevent "latch up" of the digital circuits.
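To illustrate the voting idea behind Triple Modular Redundancy mentioned above, here is a minimal sketch in Python. It is not taken from any flight computer; the function names `vote3` and `vote_n` and the 32-bit word width are assumptions made purely for illustration.

```python
def vote3(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant copies of a word (N = 3)."""
    return (a & b) | (a & c) | (b & c)


def vote_n(copies: list[int], width: int = 32) -> int:
    """Bitwise majority for any odd number of copies (e.g. N = 5).

    `width` is an assumed word size, not something from the thread.
    """
    if len(copies) % 2 == 0:
        raise ValueError("need an odd number of copies")
    result = 0
    for bit in range(width):
        ones = sum((c >> bit) & 1 for c in copies)
        if ones > len(copies) // 2:
            result |= 1 << bit
    return result


good = 0xDEADBEEF
flipped = good ^ (1 << 7)  # simulate a single-event upset in one copy

# One corrupted copy out of three is outvoted by the other two.
assert vote3(good, flipped, good) == good
# With five copies, two corrupted copies are still outvoted.
assert vote_n([good, flipped, good, flipped, good]) == good
```

Real TMR is usually done in hardware, with the voter itself hardened or replicated; this only shows why an odd N lets the majority mask a minority of upset copies.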

jayanmn 4 hours ago | parent | prev [-]

I am worried about a software fix for what looks like a hardware problem.

themerone 3 hours ago | parent | next [-]

Gracefully handling hardware faults is a software problem. The Air France Flight 447 crash was the result of bad software and bad hardware.

vel0city 3 hours ago | parent | next [-]

I'm reminded of the Apollo moon landing, where the computer was rapidly rebooting and getting back to an OK-ish state quickly enough to be useful again almost immediately.

CrossVR an hour ago | parent [-]

It wasn't rebooting; it ran out of memory and started aborting lower-priority tasks. It was an excellent example of robust programming in the face of unexpected usage scenarios.


afavour 3 hours ago | parent | prev | next [-]

It could be as simple as storing multiple copies of the relevant data and adding a checksum, something like that.

A hardware fix is the ultimate solution, but it might be possible to paper over the problem in software.
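As a rough sketch of the "multiple copies plus a checksum" idea, assuming CRC-32 as the checksum and three copies (neither is stated anywhere in the thread, and `protect` / `recover` are hypothetical names):

```python
import zlib


def protect(payload: bytes, copies: int = 3) -> list[bytes]:
    """Store several copies, each with its own CRC-32 appended."""
    record = payload + zlib.crc32(payload).to_bytes(4, "big")
    return [record for _ in range(copies)]


def recover(records: list[bytes]) -> bytes:
    """Return the payload of the first copy whose CRC-32 still matches."""
    for record in records:
        payload, stored_crc = record[:-4], int.from_bytes(record[-4:], "big")
        if zlib.crc32(payload) == stored_crc:
            return payload
    raise ValueError("all copies failed their checksum")


stored = protect(b"\x01\x02\x03\x04")

# Simulate a bit flip in the first copy.
damaged = bytearray(stored[0])
damaged[1] ^= 0x10
stored[0] = bytes(damaged)

# The corrupted copy is rejected and an intact one is used instead.
assert recover(stored) == b"\x01\x02\x03\x04"
```

The checksum catches the corruption and the extra copies provide something to fall back on; a real system would presumably also rewrite the damaged copy and log the event.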

kachapopopow 3 hours ago | parent | prev | next [-]

Software fixes are totally fine, since the chance of two redundant pairs failing within the time it takes to correct these errors has more zeros after the decimal point than there are atoms in the universe. (Each pilot has a redundant computer, and because there are two pilots, there are two redundant pairs.)
