tptacek 2 hours ago
I recoil from that last statement not because I have a rooting interest in Cloudflare but because the last several years of working at Fly.io have drilled Richard Cook's "How Complex Systems Fail"† deep into my brain, and what you said runs afoul of Cook #18: Failure free operations require experience with failure. If the exact same thing happens again at Cloudflare, they'll be fair game. But right now I feel people on this thread are doing exactly, precisely, surgically and specifically the thing Richard Cook and the Cook-ites try to get people not to do, which is to see complex system failures as predictable faults with root causes, rather than as part of the process of creating resilient systems.
otterley 2 hours ago
Suppose they did have the cellular architecture today, but every other fact was identical. They'd still have suffered the failure! But it would have been contained, and the damage would have been far less. Fires happen every day. Smoke alarms go off, firefighters get called in, incident response is exercised, and lessons from the situation are learned (with resulting updates to the fire and building codes). Yet even though this happens, entire cities almost never burn down anymore. And we want to keep it that way. As Cook points out, "Safety is a characteristic of systems and not of their components."