darkwater 3 hours ago

The fact that there is not a single root cause but several makes me instinctively think this is a good report, because that's not what the "bosses" (and even less the politicians) like to hear.

red_admiral an hour ago | parent | next [-]

Yes, a lot of modern engineering is good enough that single-cause failures are very rare indeed. That means that failures themselves are rare, but when they do happen, they're most likely to have multiple causes.

How to explain that to non-engineers is another problem.

drob518 3 hours ago | parent | prev | next [-]

Frequently, when you see these massive failures, the root cause is an alignment of small weaknesses that all come together on a specific day. See, for instance, the space shuttle O-ring incident, Three Mile Island, Fukushima, etc. These are complex systems with lots of moving parts and lots of (sometimes independent) people managing them. In a sense, the complexity is the common root cause.

linuxguy2 2 hours ago | parent | next [-]

It's like the Swiss cheese model, where a system has several layers, every layer has "holes" or vulnerabilities, and a major incident only occurs when the holes line up through all the layers.

https://en.wikipedia.org/wiki/Swiss_cheese_model
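
A rough sketch of the arithmetic behind the model, under the simplifying assumption that the layers fail independently of each other. The function name and the per-layer probabilities are made up for illustration; real layers are often correlated, which makes the true number worse.

```python
from math import prod


def incident_probability(hole_probs):
    """Chance that a hazard passes through every layer.

    Assumes each layer lets the hazard through independently,
    with the given per-layer probability (hypothetical values).
    """
    return prod(hole_probs)


# Four hypothetical defensive layers, each fairly leaky on its own.
layers = [0.1, 0.05, 0.2, 0.1]

p_all = incident_probability(layers)
print(f"Weakest single layer lets a hazard through: {max(layers):.0%}")
print(f"All layers line up: {p_all:.4%}")
```

Each layer on its own looks unreliable, yet the combined chance of a hole lining up through all of them is small. It is not zero, though, so over enough hazards it eventually happens.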

Ringz 2 hours ago | parent [-]

I use this model all the time. It's very helpful for explaining the multifactorial genesis of catastrophes to ordinary people.

anonymars 2 hours ago | parent [-]

Also perhaps worth a read:

https://devblogs.microsoft.com/oldnewthing/20080416-00/?p=22...

"You’ve all experienced the Fundamental Failure-Mode Theorem: You’re investigating a problem and along the way you find some function that never worked. A cache has a bug that results in cache misses when there should be hits. A request for an object that should be there somehow always fails. And yet the system still worked in spite of these errors. Eventually you trace the problem to a recent change that exposed all of the other bugs. Those bugs were always there, but the system kept on working because there was enough redundancy that one component was able to compensate for the failure of another component. Sometimes this chain of errors and compensation continues for several cycles, until finally the last protective layer fails and the underlying errors are exposed."

jacquesm 8 minutes ago | parent [-]

I've had that multiple times, as well as the closely related 'that can't possibly have ever worked', and sure enough it never did. Forensics in old codebases with modern tools is always fun.

roenxi an hour ago | parent | prev | next [-]

> See, for instance, the space shuttle O-ring incident

That wasn't really a result of an alignment of small weaknesses though. One of the reasons that whole thing was of particular interest was Feynman's withering appendix to the report, where he pointed out that the management team wasn't listening to the engineering assessments of the safety of the venture and was making judgement calls like claiming that a component that had failed in testing was safe.

If a situation is being managed by people who can't assess technical risk, the failures aren't the result of many small weaknesses aligning. It wasn't an alignment of small failures so much as a component that was well understood to be a likely point of failure probably failing, driven by poor management.

> Fukushima

This one too. Wasn't the reactor hit by a wave that was outside design tolerance? My memory was that they were hit by an earthquake that was outside design spec, then a tsunami that was outside design spec. That isn't a number of small weaknesses coming together. If you hit something with forces outside design spec then it might break. Not much of a mystery there. From a similar perspective, if you design things for a 1-in-500-year storm, then roughly 1/500th of them can be expected to fail to storms in any given year. No alignment of small circumstances needed.
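
A back-of-the-envelope sketch of that last point, assuming each year is an independent draw with probability 1/T of exceeding a 1-in-T-year design event. The 40-year lifetime and the fleet of 1000 facilities are hypothetical numbers chosen only to make the arithmetic concrete.

```python
def prob_at_least_one(return_period_years, lifetime_years):
    """Chance of at least one design-exceeding event over a lifetime.

    Assumes independent years, each with probability 1/T of exceedance.
    """
    p_year = 1.0 / return_period_years
    return 1.0 - (1.0 - p_year) ** lifetime_years


def expected_failures_per_year(n_facilities, return_period_years):
    """Expected number of facilities hit by an out-of-spec event per year."""
    return n_facilities / return_period_years


# One facility designed for a 1-in-500-year storm, run for 40 years
# (hypothetical lifetime): about 0.077, i.e. roughly an 8% chance.
print(prob_at_least_one(500, 40))

# A hypothetical fleet of 1000 such facilities: 2 expected failures per year.
print(expected_failures_per_year(1000, 500))
```

So an out-of-spec event at some facility somewhere is an expected outcome of the design target, not a surprise that needs many coincidences to explain.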

amelius 3 hours ago | parent | prev [-]

It usually starts with a broken coffee machine.

ragebol 3 hours ago | parent | prev | next [-]

Yep, sounds like "This was bound to happen at some point"

cucumber3732842 3 hours ago | parent [-]

Which on some level is exactly "what the bosses and politicians want to hear"

When it's everybody's fault it's nobody's fault.

darkwater an hour ago | parent | next [-]

In some ways, yes, but it's what reality is. There was probably some last factor kicking in that triggered the cascade, but there were probably many non-happy paths not properly covered by working backup/fallback strategies. So a report could totally still say "it's X's fault", pointing the finger there. Government would blame the owner of X, some public statement about fixing X would be made, and then the ones working in the field should internally push to improve/fix their own (reduced) scope.

I don't know what will come of this report in the coming months/years. I will keep an eye on it though, since I live in Spain :)

drob518 3 hours ago | parent | prev [-]

Exactly.

OgsyedIE 3 hours ago | parent | prev [-]

There are ways to aggregate these into a single resilience score for policy makers with only moderate loss of detail but it's unpopular.