RealityVoid 3 days ago

I thought the same, but after a deeper dive into the postmortem, I don't think it's a cop-out on their side. The report is actually really well done (I was personally impressed). The reason it probably was a bit flip is that the CPU did not have EDAC in this instance, so bit flips are expected. The consensus mechanism failed in this case, and that is what they are updating: even though the module gave wrong data, presumably because of bit flips, the consensus should have prevented the dive.

RachelF 3 days ago | parent | next [-]

I would argue that designing avionics without EDAC is negligent design by Airbus.

Most modern servers at least implement ECC on their RAM. I would expect flight electronics to be designed to a higher standard.
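For readers unfamiliar with how ECC corrects a flipped bit, here is a toy sketch using a Hamming(7,4) code, which protects 4 data bits with 3 parity bits. This is an illustration only: real server ECC typically uses wider SECDED codes over 64-bit words, and the function names `encode` and `decode` are made up for this example.

```python
def encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    bits = [0] * 8  # index 0 unused so indices match codeword positions
    # Data bits live at the non-power-of-two positions 3, 5, 6, 7.
    bits[3], bits[5], bits[6], bits[7] = [(nibble >> i) & 1 for i in range(4)]
    # Each parity bit covers the positions whose index has that bit set.
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    return sum(bits[p] << (p - 1) for p in range(1, 8))


def decode(word: int) -> int:
    """Correct up to one flipped bit and return the original 4 data bits."""
    bits = [0] + [(word >> (p - 1)) & 1 for p in range(1, 8)]
    # XOR of the positions of all set bits is 0 for a valid codeword,
    # and equals the position of the flipped bit after a single error.
    syndrome = 0
    for p in range(1, 8):
        if bits[p]:
            syndrome ^= p
    if syndrome:
        bits[syndrome] ^= 1
    data = (bits[3], bits[5], bits[6], bits[7])
    return sum(b << i for i, b in enumerate(data))


stored = encode(0b1011)
corrupted = stored ^ (1 << 4)  # simulate a single-event upset on one bit
print(decode(corrupted) == 0b1011)
```

A code like this corrects any single flipped bit per word; two flips in the same word would be miscorrected, which is why practical memory ECC adds an extra parity bit to at least detect double errors.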

15155 3 days ago | parent | next [-]

Multi-module consensus is a form of EDAC - it's exceptionally unlikely that multiple units will fail identically simultaneously.
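As a rough illustration of the idea, here is a minimal sketch of mid-value voting across three redundant units. It is not Airbus's actual algorithm; the names `flip_bit` and `vote` and the sample readings are assumptions made up for this example.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit in the IEEE-754 double representation of value."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def vote(a: float, b: float, c: float) -> float:
    """Mid-value select: return the median of three redundant readings."""
    return sorted((a, b, c))[1]


# Three units report roughly the same angle of attack (hypothetical values);
# one of them suffers a bit flip in the exponent and reports 0.0 instead.
healthy_1, healthy_2 = 2.1, 1.9
upset = flip_bit(2.0, 62)

print(vote(healthy_1, upset, healthy_2))  # the single outlier is outvoted
print(vote(upset, upset, healthy_2))      # two identical faults win the vote
```

The second call shows the limit of the approach: voting masks one faulty unit, but it cannot help when two units fail in the same way at the same time.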

skylurk 3 days ago | parent [-]

Sure, until management sells a version with reduced redundancy to Ethiopia and Indonesia. Swiss cheese model and all that.

ahartmetz 3 days ago | parent [-]

>version with reduced redundancy

Not going to happen. The potentially huge cost to their reputation alone makes it not worth it, the modification would cost money and make logistics more difficult, and the plane couldn't be used (or sold) worldwide anymore.

skylurk 2 days ago | parent [-]

I think you are being sarcastic, but just in case:

https://www.nytimes.com/2019/05/05/business/boeing-737-max-w...

https://www.boeing.com/737-max-updates/mcas/

serial_dev 2 days ago | parent [-]

The links are about Boeing, but this article and thread are about Airbus.

Two different companies.

Boeing has had tons of failures recently; flight search services started adding filters for the aircraft type because people didn’t want to fly on them.

Airbus is doing better for now, hopefully it will stay that way.

skylurk 2 days ago | parent [-]

Sorry, I didn't mean to be taking shots at any airplane company. I just disagree that multi-module consensus is a reliable form of EDAC. I gave a human factor example, but there are technical reasons too.

RealityVoid 2 days ago | parent [-]

> I just disagree that multi-module consensus is a reliable form of EDAC.

I wonder why you disagree about this? The only reasons I can think of are:

- The same SW on the same HW with the same lifecycle would probably have the same issue (vendor diversity would fix this).
- The consensus-building unit is still a possible single point of failure.

Any other reasons you might doubt it as a methodology? It seems to have worked pretty well for Airbus and the failure rate is pretty low, so... It obviously is functional.

Modern units, I'm sure, have ECC AND redundancy as well.

skylurk 2 days ago | parent [-]

Yes exactly, birds of a feather fail together... an A380 has three primary flight control computers, but still carries another entirely dissimilar set of three flight control computers as backup.

RealityVoid a day ago | parent [-]

Well, the diversity would cover the issue with random HW failures, not the case where your SW has a bug in it. As to the SW, they _sometimes_ have vendor diversity.

Regardless, there are multiple fronts you need to tackle to have high reliability so you should use all techniques at your disposal.

p_l 2 days ago | parent | prev [-]

Until relatively recently, ECC on server RAM was there because of chip failures and, to a lesser extent, electromagnetic interference.

Good part selection and a different EMI environment meant the calculated risks from not having ECC were considered too low to care about, and the idea that they might have to deal with radiation outside of flying near a nuclear explosion arrived after the specific devices were designed.

thegrim33 3 days ago | parent | prev | next [-]

Isn't a major feature of consensus algorithms for them to be tolerant to failures? Even basic algorithms take error handling into account and shouldn't be taken out by a bit flip in any one component.

RealityVoid 2 days ago | parent [-]

Yes. To clarify, my understanding of _this_ particular incident was wrong because it was based on reading the report of a previous incident.

But for the 2008 incident, whose report I read and linked, that was what happened. The ADIRU probably did get an SEU, and that should have been mitigated by the design of the ELAC unit. The ELAC unit failed to mitigate it, so that's the part that they probably fixed.

N19PEDL2 3 days ago | parent | prev [-]

Do you happen to have a link to that report?

RealityVoid 3 days ago | parent [-]

Sure.

https://www.atsb.gov.au/sites/default/files/media/3532398/ao...

My reaction was initially that it was a cop out, but looking a bit in the report and thinking things through, I think that, yes, it's most likely a bit flip.

meatmanek 3 days ago | parent [-]

This is for a similar incident that happened in 2008, not the JetBlue incident from October of this year.

RealityVoid 3 days ago | parent [-]

Oh my god, you are correct. I read the technical details and did not bother to check it's the same issue. I am mortified. Apologies.