aeldidi 5 hours ago

I'm becoming concerned by the rate at which major software systems seem to be failing lately. For context, last year I only logged four outages that actually disrupted my work; this quarter alone I'm already on my fourth, all within the past few weeks. This is, of course, just an anecdote and not evidence of any wider trend (not to mention that I might not have logged everything last year), but it was enough to nudge me into writing this today (helped by the fact that I suddenly had some downtime). Keep in mind, this isn't necessarily specific to this outage; it's just something that's been on my mind enough to warrant writing about.

It feels like resiliency is becoming a bit of a lost art in networked software. I've spent a good chunk of this year chasing down intermittent failures at work, and I really underestimated how much work goes into shrinking the "blast radius", so to speak, of any bug or outage. Even though we mostly run a monolith, we still depend on a bunch of external pieces like daemons, databases, Redis, S3, monitoring, and third-party integrations, and we generally assumed these things were present and working, which wasn't always the case. My response was to document the failure conditions properly, and once I did, I realized there were many more than we initially thought. Since then we've done things like: move some things to a VPS instead of cloud services, automate deployment more than we already had, greatly improve the test suite and docs to cover these newly considered failure conditions, and generally cut down on moving parts. It was a ton of effort, but the payoff has finally shown up: our records show fewer surprises, which means fewer distractions and a much calmer system overall. Without that unglamorous work, things would've only grown more fragile as complexity crept in. And I worry that, more broadly, we're slowly un-learning how to build systems that stay up even when the inevitable bug or failure shows up.
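To make the "blast radius" idea concrete, here's roughly the shape of what that work looked like for one dependency (a minimal Python sketch with hypothetical names, assuming a Redis cache sitting in front of a database; not our actual code):

    # Treat the cache as optional so a Redis outage degrades performance
    # instead of taking the feature down with it. Requires the "redis" package;
    # get_user_profile/load_from_db are hypothetical names for illustration.
    import logging

    import redis

    cache = redis.Redis(host="localhost", port=6379, socket_timeout=0.2)
    log = logging.getLogger("resilience-sketch")

    def get_user_profile(user_id, load_from_db):
        key = f"user:{user_id}"
        try:
            cached = cache.get(key)
            if cached is not None:
                return cached.decode("utf-8")
        except redis.RedisError as exc:
            # The cache being down is a documented, expected failure condition:
            # log it and fall through to the source of truth.
            log.warning("cache unavailable, falling back to DB: %s", exc)

        value = load_from_db(user_id)
        try:
            cache.set(key, value, ex=300)
        except redis.RedisError:
            pass  # best-effort write-back; never let the cache break the request
        return value

The snippet itself is trivial; the real work was going through every external dependency and giving it an explicit, tested answer to "what happens when this is gone?"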

For completeness, here are the outages that prompted this: the AWS us-east-1 outage in October (took down the Lightspeed R series API), the Azure Front Door outage (prevented Playwright from downloading browsers for tests), today’s Cloudflare outage (took down Lightspeed’s website, which some of our clients rely on), and the GitHub outage affecting basically everyone who uses it as their git host.

HardwareLust 5 hours ago | parent | next [-]

It's money, of course. No one wants to pay for resilience/redundancy. I've launched over a dozen projects going back to 2008; clients simply refuse to pay for it, and you can't force them. They'd rather pinch their pennies, roll the dice, and pray.

stinkbeetle 4 hours ago | parent | next [-]

> It's money, of course.

100%

> No one wants to pay for resilience/redundancy. I've launched over a dozen projects going back to 2008; clients simply refuse to pay for it, and you can't force them. They'd rather pinch their pennies, roll the dice, and pray.

Well, fly by night outfits will do that. Bigger operations like GitHub will try to do the math on what an outage costs vs what better reliability costs, and optimize accordingly.

Look at a big bank or a big corporation's accounting systems: they'll pay millions just for hot standby mainframes or minicomputers that, for most of them, will never be required.

solid_fuel 36 minutes ago | parent | next [-]

> Bigger operations like GitHub will try to do the math on what an outage costs vs what better reliability costs, and optimize accordingly.

Used to, but it feels like there is no corporate responsibility in this country anymore. These monopolies have gotten so large that they don't feel any impact from these issues. Microsoft is huge and doesn't really have large competitors. Google and Apple aren't really competing in the source code hosting space in the same way GitHub is.

Jenk 3 hours ago | parent | prev [-]

I've worked at many big banks and corporations. They are all held together with the proverbial sticky tape, bubblegum, and hope.

They do have multiple layers of redundancy, and thus the big budgets, but the standbys won't be kept hot, or there will be critical flaws that all of the engineers know about but haven't been given permission/funding to fix. And the engineers are managed so badly by the firm that they dgaf either and secretly want the thing to burn.

There will be sustained periods of downtime if their primary system blips.

They will all still be dependent on some hyper-critical system that nobody really understands, whose last change was introduced in 1988, and which (probably) requires a terminal emulator to operate.

stinkbeetle 3 hours ago | parent [-]

I've worked on software used by these companies and have been called in to help with support from time to time. One customer, a top-single-digit public company by market cap (they may have been #1 at the time, a few years ago), had their SAP systems go down once every few days. This wasn't causing a real monetary problem for them because their hot standby took over.

They weren't using mainframes, just "big iron" servers, but each one would have been north of $5 million for the box alone, I guess on a 5-ish year replacement schedule. Then there's all the networking, storage, licensing, support, and internal administration for it, which would easily cost that much again.

Now people will say SAP systems are made entirely of duct tape and bubblegum. But it all worked. This system ran all their sales/purchasing sites and portals and was doing a million dollars every couple of minutes, so it all paid for itself many times over during the course of that bug. Cold standby would not have cut it, especially since these big systems take many minutes to boot and HANA takes even longer to load from storage.
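Rough back-of-envelope in Python, using the revenue rate and outage frequency from above and assuming everything else, just to show the shape of the math:

    # Illustrative numbers only: the revenue rate and outage frequency come from
    # my description above; the minutes saved per outage and the cost doubling
    # for networking/storage/licensing/admin are assumptions.
    revenue_per_minute = 1_000_000 / 2        # "a million dollars every couple of minutes"
    outages_per_year = 365 / 3                # "down once every few days"
    minutes_saved_per_outage = 15             # assumed: boot + HANA load time avoided

    avoided_loss = revenue_per_minute * outages_per_year * minutes_saved_per_outage
    standby_cost_per_year = (5_000_000 / 5) * 2   # ~$5M box on a ~5 year cycle, doubled

    print(f"avoided loss per year: ${avoided_loss:,.0f}")
    print(f"hot standby per year:  ${standby_cost_per_year:,.0f}")

Even with conservative guesses, the avoided loss comes out a couple of orders of magnitude above the standby cost, which is the whole point.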

lopatin 4 hours ago | parent | prev | next [-]

I agree that it's all money.

That's why it's always DNS, right?

> No one wants to pay for resilience/redundancy

These companies do take it seriously, on the software side, but when it comes to configurations, what are you going to do:

Either play it by ear, or literally double your cloud costs for a true, real prod-parallel to mitigate that risk. It looks like even the most critical and prestigious companies in the world are doing the former.

macintux 3 hours ago | parent [-]

> Either play it by ear, or literally double your cloud costs for a true, real prod-parallel to mitigate that risk.

There's also the problem that doubling your cloud footprint to reduce the risk of a single point of failure introduces new risks: more configuration to break, new modes of failure when both infrastructures are accidentally live and processing traffic, etc.

Back when companies typically ran their own datacenters (or otherwise heavily relied on physical devices), I was very skeptical about redundant switches, fearing the redundant hardware would cause more problems than it solved.

ForHackernews 4 hours ago | parent | prev [-]

Why should they? Honestly most of what we do simply does not matter that much. 99.9% uptime is fine in 99.999% of cases.
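For scale (quick sketch, nothing rigorous), three nines still allows the better part of nine hours of downtime a year:

    # How many hours of downtime per year each uptime level permits.
    hours_per_year = 24 * 365
    for uptime in (0.999, 0.9999, 0.99999):
        downtime_hours = hours_per_year * (1 - uptime)
        print(f"{uptime:.3%} uptime -> {downtime_hours:.2f} hours down per year")

That's roughly 8.8 hours at 99.9%, and most users will never notice.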

porridgeraisin 2 hours ago | parent [-]

This is true. But unfortunately the exact same process is used even for critical stuff (the CrowdStrike incident, for example). Maybe there needs to be a separate software engineering process for those things, just like there is for aviation. That means not using the same dev tooling, which is a lot of effort.

roxolotl 2 hours ago | parent | prev | next [-]

To agree with the other comments: it seems likely it's money that has begun to result in a slow "un-learning how to build systems that stay up even when the inevitable bug or failure shows up."

suddenlybananas 5 hours ago | parent | prev [-]

To be deliberately provocative: LLMs are being used more and more widely.

zdragnar 2 hours ago | parent | next [-]

Word on the street is that GitHub was already a giant mess before the rise of LLMs, and it has not improved with the move to Microsoft.

dsagent 2 hours ago | parent [-]

They are also in the process of moving most of the infra from on-prem to Azure. I'm sure we'll see more issues over the next couple of months.

https://thenewstack.io/github-will-prioritize-migrating-to-a...

blibble 4 hours ago | parent | prev [-]

Imagine what it'll be like in 10 years' time.

Microsoft: the film Idiocracy was not supposed to be a manual