crote 8 hours ago

I strongly recommend watching/reading the entire report, or the summary by Sal Mercogliano of What's Going On In Shipping [0].

Yes, the loose wire was the immediate cause, but there was far more going wrong here. For example:

- The transformer switchover was set to manual rather than automatic, so it didn't automatically fail over to the backup transformer.

- The crew did not routinely practice transformer switchover procedures.

- The two generators were both using a single non-redundant fuel pump (which was never intended to supply fuel to the generators!), which did not automatically restart after power was restored.

- The main engine automatically shut down when the primary coolant pump lost power, rather than using an emergency water supply or letting it overheat.

- The backup generator did not come online in time.

It's a classic Swiss cheese model. A lot of things had to go wrong for this accident to happen. Focusing on that one wire isn't going to solve all the other issues. Wires, just like all other parts, will occasionally fail. One wire failure should never have caused an incident of this magnitude. Sure, there should probably be slightly better procedures for checking the wiring, but next time it'll be a failed sensor, actuator, or controller board.

If we don't focus on providing and verifying defense in depth, we will sooner or later see another incident like this.

[0]: https://www.youtube.com/watch?v=znWl_TuUPp0
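
To put rough numbers on the defense-in-depth point: here is a minimal sketch (Python) where the per-layer failure probabilities are made-up assumptions for illustration, not figures from the report, and the layers are treated as independent, which real shipboard systems are not.

    import math

    # Hedged illustration: assumed, independent per-demand failure probabilities.
    layers = {
        "automatic transformer switchover": 0.01,
        "redundant generator fuel supply": 0.01,
        "main engine rides through cooling loss": 0.01,
        "backup generator online in time": 0.01,
    }

    p_all_fail = math.prod(layers.values())  # every layer must fail at once
    print(f"all layers in place: p ~ {p_all_fail:.0e}")   # ~1e-08

    # Normalization of deviance: the same arithmetic once three of the four
    # layers have been left in manual, bypassed, or never drilled.
    p_one_layer_left = 0.01
    print(f"only one layer left: p ~ {p_one_layer_left:.0e}")  # ~1e-02

Each layer you quietly lose multiplies the risk back up by orders of magnitude, long before anything visibly fails.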

Aurornis 8 hours ago | parent | next [-]

Thanks for the summary for those of us who can't watch video right now.

There are so many layers of failures that it makes you wonder how many other operations on those ships are only working because those fallbacks, automatic switchovers, emergency supplies, and backup systems save the day. We only see the results when all of them fail and the failure happens to cause an external problem big enough that we all notice.

arjie 8 hours ago | parent | next [-]

It seems to be standard "normalization of deviance", to use the language of safety engineering. You have 5 layers of fallbacks, so over time skipping any of the middle layers doesn't cause anything to fail. Eventually you end up with a true safety factor equal only to the last layer. Then that fails, and looking back, "everything had to go wrong".

As Sidney Dekker (of Understanding Human Error fame) says: Murphy's Law is wrong - everything that can go wrong will go right. The problem arises from the operators all assuming that it will keep going right.

I remember reading somewhere that part of Qantas's safety record came from the fact that at one time they had the highest number of reported minor issues. In some sense, you want your error-detection curve to be smooth: as you get closer to catastrophe, your warnings should get more severe. On this ship, it appeared everything was A-OK till it bonked a bridge.

bombcar 7 hours ago | parent [-]

This is the most pertinent lesson from these NTSB crash investigations: it's not just what went wrong at the final disaster, but all the earlier failures that nobody detected, leaving the system down to one layer of defense.

Your car engaging auto-brake to prevent a collision shouldn't be a "whew, glad that didn't happen" moment, but an "oh shit, I need to work on paying attention more" moment.

aidenn0 32 minutes ago | parent | next [-]

I had to disable auto-braking from the RCT[1] sensors in my car because of too many false positives (about 3 a week).

1: rear-cross-traffic i.e. when backing up and cars are coming from the side.

dmurray 6 hours ago | parent | prev [-]

Why then does the NTSB place so much blame on the single wiring issue? Shouldn't they have the context to point to the five things that went wrong in the Swiss cheese, rather than pat themselves on the back for having found the almost-irrelevant detail of

> Our investigators routinely accomplish the impossible, and this investigation is no different...Finding this single wire was like hunting for a loose rivet on the Eiffel Tower.

In the software world, if I had an application that failed when a single DNS query failed, I wouldn't be pointing the blame at DNS and conducting a deep dive into why this particular query timed out. I'd be asking why a single failure was capable of taking down the app for hundreds or thousands of other users.
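
For what it's worth, here is a minimal sketch of that mindset in Python; the retry count, backoff, and last-known-good cache are my own assumptions, not any particular app's design.

    import socket
    import time

    _last_good = {}  # hostname -> cached addresses (last-known-good fallback)

    def resolve(host, port=443, retries=3, backoff=0.2):
        """Resolve a hostname, retrying transient failures and degrading to a
        cached answer so a single failed lookup can't take the whole app down."""
        for attempt in range(retries):
            try:
                infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
                addrs = [info[4][0] for info in infos]
                _last_good[host] = addrs              # refresh the fallback cache
                return addrs
            except socket.gaierror:
                time.sleep(backoff * (2 ** attempt))  # transient: back off, retry
        if host in _last_good:
            return _last_good[host]                   # degrade gracefully
        raise RuntimeError(f"DNS failed for {host} after {retries} attempts")

Whether it's a stale cache, a second resolver, or a retry budget, the point is the same as on the ship: a single component is allowed to fail without the whole system going dark.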

plorg 5 hours ago | parent | next [-]

That seems like a difference between the report and the press release. I'm sure it doesn't help that the current administration likes quick, pat answers.

The YouTube animation they published notes that this also wasn't just one wire: they found many wires on the ship that were terminated and labeled in the same (incorrect) way, which points to an error at the shipbuilder, and potentially to a lack of adequate documentation or training materials from the equipment manufacturer, which is why WAGO received mention and notice.

bombcar 4 hours ago | parent [-]

It’s also immediately actionable: other similar ships can inspect their wiring.

toast0 5 hours ago | parent | prev [-]

The faulty wire is the root cause: if it hadn't triggered the sequence of events, none of the other things would have happened. And it's a tricky thing to track down, so that's an exciting find.

The flushing pump not restarting when power was restored also caused a blackout in port the day before the incident. But looking into why you always get two blackouts when you have one is something anybody could do: open the main system breaker, let the crew restore power, and that flushing pump will likely fail the same way every time... Figuring out why and how the breaker opened in the first place is the neat part, when it's not something obvious.

nothercastle 3 hours ago | parent [-]

Operators always like to just clear the fault and move on; they're under extremely high pressure to keep the schedule and have little incentive to work safely.

crote 8 hours ago | parent | prev | next [-]

Oh, it gets even worse!

The NTSB also had some comments on the ship's equivalent of a black box. It turns out:

- It was impossible to download the data while the recorder was still installed in the ship.

- The manufacturer's software was awful, and the various agencies had a group chat to share 3rd-party software(!).

- The software exported thousands of separate files.

- Audio tracks were mixed to the point of being nearly unusable.

- The black box stopped recording some metrics after the power loss "because it wasn't required to", despite the data still being available.

At least they didn't have anything negative to say about the crew: they reacted timely and adequately - they just didn't stand a chance.

nothercastle 3 hours ago | parent | next [-]

It’s pretty common for black boxes to be load-shed during an emergency. Kind of funny that it was allowed for so long.

MengerSponge 2 hours ago | parent | prev [-]

"they reacted timely and adequately" and yet: they're indefinitely restricted (detained isn't the right word, but you get it) to Baltimore, while the ship is free to resume service.

haddonist 2 hours ago | parent | prev | next [-]

One of the things Sal Mercogliano stressed is that the crew (and possibly other crews of the same line) modified systems in order to save time.

Rather than going through the process of purging high-sulphur fuel that can't be used in USA waters, they had it set up so that some of the generators were fed from USA-approved fuel, which compromised redundancy and automatic failover.

It seems probable that the wire failure would not have caused a catastrophic overall loss of power if the generators had been in the normal configuration.

dboreham 4 hours ago | parent | prev [-]

Also the zeroth failure mode: someone built a bridge that will collapse if any one of the many, many large ships that sail beneath it can't steer itself with high precision.

foobar1962 3 hours ago | parent [-]

Ships were a lot smaller when the bridge was designed and built.

renhanxue 7 hours ago | parent | prev | next [-]

The fuel pump not automatically restarting after the power loss may actually have been an intentional safety feature, to prevent scenarios like pumping fuel into a fire in or around the generators. Still part of the Swiss cheese model, of course.

crote 7 hours ago | parent [-]

It wasn't. They were feeding generators 1 & 2 with the pump intended for flushing the lines while switching between different fuel types.

The regular fuel pumps were set up to automatically restart, which is why a set of them came online to feed generator 3 (which automatically spun up after 1 & 2 failed, and wasn't tied to the fuel-line-flushing pump) after the second blackout.

ChrisMarshallNY 7 hours ago | parent | prev | next [-]

I have found that 99% of all network problems are bad wires.

I remember that the IT guys at my old company used to immediately throw out every Ethernet cable and replace it with one straight out of the bag, as the very first step.

But these ships tend to be houses of cards. They are not taken care of properly, and run on a shoestring budget. Many of them look like floating wrecks.

gerdesj 5 hours ago | parent | next [-]

If I see an RJ45 plug with a broken locking thingie, or bare wires (not just bare copper - any internal wire), I chop the plug off.

If I come across a CATx (solid-core) cable being used as a really long patch lead, then I lose my shit, or perhaps get out a back box, faceplate, and modules, along with a POST tool.

I don't look after floating fires.

jmonty900 6 hours ago | parent | prev | next [-]

I recently had a home network outage. The last thing I tested was the in-wall wiring because I just didn't think that would be the cause. It was. Wiring fails!

potato3732842 5 hours ago | parent | prev [-]

If I had a nickel for every time someone clobbered some critical connectivity with an ill-advised switch configuration, I wouldn't have to work for a living.

And the physical-layer issues I do see are related to ham-fisted people doing unrelated work in the cage.

Actual failures are pretty damn rare.

kfarr 6 hours ago | parent | prev | next [-]

Another case study to add to the maritime chapter of this timeless classic: https://www.amazon.com/Normal-Accidents-Living-High-Risk-Tec...

Like you said (and as the book illustrates well), it's never just one thing: these incidents happen when multiple systems interact, and they often reflect disinvestment in comprehensive safety schemes.

rolph 3 hours ago | parent | prev | next [-]

I've been in an environment like that.

"Nuisance" issues like that get deferred because they aren't really causing a problem, so maintenance spends its time on things that make money, rather than on what some consider spit and polish for things with no prior failures.

FridayoLeary 5 hours ago | parent | prev | next [-]

Just insane how much criminal negligence went on. Even Boeing hardly comes close. What needs to change is obviously a major review of how ships are allowed to operate near bridges and other infrastructure, plus far stricter safety standards like those aircraft face.

pstuart 8 hours ago | parent | prev | next [-]

Hopefully the lesson from this will be received by operators: it's way cheaper to invest in personnel, training, and maintenance than to let the shit hit the fan.

stackskipton 8 hours ago | parent | next [-]

Why? It's cost them $100M (https://www.justice.gov/archives/opa/pr/us-reaches-settlemen...) but rebuilding the bridge is going to cost $5.2 billion, so if gundecking all this maintenance for 20+ years has saved more than $100M, they will do it again.

xp84 7 hours ago | parent | next [-]

From your article - this answered a question I had:

> The settlement does not include any damages for the reconstruction of the Francis Scott Key Bridge. The State of Maryland built, owned, maintained, and operated the bridge, and attorneys on the state’s behalf filed their own claim for those damages. Pursuant to the governing regulation, funds recovered by the State of Maryland for reconstruction of the bridge will be used to reduce the project costs paid for in the first instance by federal tax dollars.

Barbing 4 hours ago | parent [-]

So was the bridge self-insured?

stevenjgarner 6 hours ago | parent | prev | next [-]

Isn't there a big liability insurance payout on this toward the $5.2 billion, and if so, won't the insurer be more motivated to mandate compliance?

nothercastle 3 hours ago | parent [-]

Yes, the insurer will likely be able to charge more.

toast0 8 hours ago | parent | prev | next [-]

The vessel owner may be able to recover some of that from the manufacturer, as the wiring was almost certainly a manufacturing error, and some of the configurations that prolonged the blackout may have been manufacturer choices as well.

potato3732842 5 hours ago | parent [-]

At the end of the day we all just pay for it in terms of insurance costs priced into our goods.

usefulcat 3 hours ago | parent | next [-]

What would be a better solution?

mjevans 19 minutes ago | parent [-]

Regulations requiring that work be done correctly the first time. Also inspections.

I like a government that pays workers to look out for my safety.

genter 4 hours ago | parent | prev [-]

But it's important to "punish" (via punitive fines) the right people, so that they will put some effort into not making that mistake again.

lazide 8 hours ago | parent | prev [-]

Actually, to be even more cynical….

If everyone saved $100M by doing this and it only cost one shipper $100M, then of course everyone else would do it and just hope they aren’t the one who has bad enough luck to hit the bridge.

And statistically, almost all of them will be okay!
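
A hedged back-of-the-envelope; the operator count is an assumption for illustration, and the only dollar figures are the settlement and rebuild numbers mentioned above.

    # Assumed numbers, purely illustrative.
    n_operators = 50             # operators cutting the same corners
    savings_each = 100e6         # what each saves by skimping on maintenance
    penalty = 100e6              # settlement paid by the one unlucky operator
    p_unlucky = 1 / n_operators

    expected_penalty = p_unlucky * penalty   # ~2e6 per operator
    print(expected_penalty < savings_each)   # True: skimping "wins" on expectation

Unless the penalty scales with the damage actually caused (the $5.2 billion bridge, not the $100M settlement), the expected cost of cutting corners stays far below the savings.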

nothercastle 3 hours ago | parent | prev [-]

It’s not, though. These situations are extremely rare, and when they happen, they just close the company and shed the liability.
