tptacek 2 hours ago

We've written at long, tedious length about how hard this problem is.

otterley 2 hours ago | parent [-]

Have a link?

tptacek 2 hours ago | parent [-]

Most recently, a few weeks ago (but you'll find more just a page or two into the blog):

https://fly.io/blog/corrosion/

otterley 2 hours ago | parent [-]

It's great that you're working on regionalization. Yes, it is hard, but 100x harder if you don't start with cellular design in mind. And as I said in the root of the thread, this is a sign that Cloudflare needs to invest in it just like you have been.

tptacek 2 hours ago | parent [-]

I recoil from that last statement not because I have a rooting interest in Cloudflare but because the last several years of working at Fly.io have drilled Richard Cook's "How Complex Systems Fail"† deep into my brain, and what you said runs afoul of Cook #18: Failure free operations require experience with failure.

If the exact same thing happens again at Cloudflare, they'll be fair game. But right now I feel people on this thread are doing exactly, precisely, surgically and specifically the thing Richard Cook and the Cook-ites try to get people not to do, which is to see complex system failures as predictable faults with root causes, rather than as part of the process of creating resilient systems.

https://how.complexsystems.fail/

otterley 2 hours ago | parent [-]

Suppose they did have the cellular architecture today, but every other fact was identical. They'd still have suffered the failure! But it would have been contained, and the damage would have been far less.

Fires happen every day. Smoke alarms go off, firefighters get called in, incident response is exercised, and lessons from the situation are learned (with resulting updates to the fire and building codes).

Yet even though this happens, entire cities almost never burn down anymore. And we want to keep it that way.

As Cook points out, "Safety is a characteristic of systems and not of their components."

HumanOstrich an hour ago | parent | next [-]

What variant of cellular architecture are you referring to? Can you give me a link or a few? I'm fascinated by it and I've led a team to break up a monolithic solution running on AWS into a cellular architecture. The results were good, but not magic. The process of learning from failures did not stop, but it did change (for the better).

No matter what architecture, processes, software, frameworks, and systems you use, or how exhaustively you plan and test for every failure mode, you cannot 100% predict every scenario and claim "cellular architecture fixes this". This includes making 100% of all failures "contained". Not realistic.

otterley an hour ago | parent [-]

If your AWS service is properly regionalized, that’s the minimum amount of cellular architecture required. Did your service ever fail in multiple regions simultaneously?

Cellular architecture within a region is the next level and is more difficult, but is achievable if you adhere to the same principles that prohibit inter-regional coupling:

https://docs.aws.amazon.com/wellarchitected/latest/reducing-...

https://docs.aws.amazon.com/wellarchitected/latest/reducing-...

HumanOstrich an hour ago | parent [-]

You didn't really put any thought into what I said. Thanks for the links.

otterley 44 minutes ago | parent [-]

It wasn't worth thinking about. I'm not going to defend myself against arguments and absolute claims I didn't make. The key word here is mitigation, not perfection.

hedora 15 minutes ago | parent [-]

> If your AWS service is properly regionalized, that’s the minimum amount of cellular architecture required

Amazon has had multi-region outages due to pushing bad configs, so it’s extremely difficult to believe whatever you are proposing solves that exact problem by relying on multiple regions.

Come to think of it, Cloudflare’s outage today is another good counterexample.

tptacek 2 hours ago | parent | prev [-]

Pretty sure he's making my point (or, rather, me his) there. (I'm never going to turn down an opportunity to nerd out about Cookism).