HardwareLust 5 hours ago

It's money, of course. No one wants to pay for resilience/redundancy. I've launched over a dozen projects going back to 2008, clients simply refuse to pay for it, and you can't force them. They'd rather pinch their pennies, roll the dice and pray.

stinkbeetle 3 hours ago | parent | next [-]

> It's money, of course.

100%

> No one wants to pay for resilience/redundancy. I've launched over a dozen projects going back to 2008, clients simply refuse to pay for it, and you can't force them. They'd rather pinch their pennies, roll the dice and pray.

Well, fly-by-night outfits will do that. Bigger operations like GitHub will try to do the math on what an outage costs vs what better reliability costs, and optimize accordingly.

Look at a big bank or a big corporation's accounting systems: they'll pay millions just for the hot standby mainframes or minicomputers that, for most of them, would never be required.
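
To make that trade-off concrete, here's a minimal sketch of the break-even math (every figure is a made-up placeholder, not GitHub's or any bank's real numbers):

    # Expected-value comparison: cost of outages vs cost of redundancy.
    # All numbers below are hypothetical placeholders for illustration.
    outages_per_year = 4            # expected incidents without a hot standby
    avg_outage_hours = 2.0          # mean time to recover per incident
    cost_per_hour = 250_000         # revenue/productivity at risk per hour of downtime

    expected_outage_cost = outages_per_year * avg_outage_hours * cost_per_hour

    standby_cost_per_year = 1_500_000           # hardware, licensing, ops for a hot standby
    residual_loss = 0.1 * expected_outage_cost  # failover isn't perfect

    print(f"do nothing:  ${expected_outage_cost:,.0f}/yr expected loss")
    print(f"hot standby: ${standby_cost_per_year + residual_loss:,.0f}/yr (cost + residual loss)")
    # Buy the standby only if the second number is smaller than the first.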

solid_fuel 33 minutes ago | parent | next [-]

> Bigger operations like GitHub will try to do the math on what an outage costs vs what better reliability costs, and optimize accordingly.

Used to, but it feels like there is no corporate responsibility in this country anymore. These monopolies have gotten so large that they don't feel any impact from these issues. Microsoft is huge and doesn't really have large competitors. Google and Apple aren't really competing in the source code hosting space in the same way GitHub is.

Jenk 3 hours ago | parent | prev [-]

I've worked at many big banks and corporations. They are all held together with the proverbial sticky tape, bubblegum, and hope.

They do have multiple layers of redundancy, and thus the big budgets, but those layers won't be kept hot, or there will be some critical flaws that all of the engineers know about but haven't been given permission/funding to fix; and the engineers are so badly managed by the firm that they dgaf either and secretly want the thing to burn.

There will be sustained periods of downtime if their primary system blips.

They will all still be dependent on some hyper-critical system that nobody really understands, whose last change was introduced in 1988, and which (probably) requires a terminal emulator to operate.

stinkbeetle 3 hours ago | parent [-]

I've worked on software used by these companies and have been called in to help with support from time to time. One customer, a top-single-digit public company by market cap (they may have been #1 at the time, a few years ago), had their SAP systems go down once every few days. This wasn't causing a real monetary problem for them because their hot standby took over.

They weren't using mainframes, just "big iron" servers, but each one would have been north of $5 million for the box alone, I guess on a 5ish-year replacement schedule. Then there's all the networking, storage, licensing, support, and internal administration for it, which would easily cost that much again.

Now people will say SAP systems are made entirely of duct tape and bubblegum. But it all worked. This system ran all their sales/purchasing sites and portals and was doing a million dollars every couple of minutes, so it all paid for itself many times over during the course of that bug. Cold standby would not have cut it, especially since these big systems take many minutes to boot and HANA takes even longer to load from storage.
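
Back-of-envelope with the rough numbers above (recovery times are my guesses, and "revenue at risk" should be taken loosely since not every interrupted sale is lost for good):

    # Why a hot standby paid for itself in this case - approximate figures
    # from the anecdote above, plus guessed recovery times.
    revenue_per_minute = 1_000_000 / 2      # "a million dollars every couple of minutes"
    incidents_per_year = 365 / 3            # "down once every few days"
    cold_recovery_minutes = 30              # guess: reboot + reload HANA from storage
    hot_failover_minutes = 1                # guess: standby takes over almost immediately

    at_risk_cold = incidents_per_year * cold_recovery_minutes * revenue_per_minute
    at_risk_hot = incidents_per_year * hot_failover_minutes * revenue_per_minute

    # ~$5M box plus roughly as much again for networking/storage/licensing/support,
    # amortized over a ~5 year replacement cycle.
    standby_cost_per_year = (5_000_000 + 5_000_000) / 5

    print(f"revenue at risk, cold standby: ${at_risk_cold:,.0f}/yr")
    print(f"revenue at risk, hot standby:  ${at_risk_hot:,.0f}/yr")
    print(f"hot standby cost:              ${standby_cost_per_year:,.0f}/yr")

Even if only a small fraction of that at-risk revenue were actually lost, the standby still comes out far ahead.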

lopatin 4 hours ago | parent | prev | next [-]

I agree that it's all money.

That's why it's always DNS, right?

> No one wants to pay for resilience/redundancy

These companies do take it seriously, on the software side, but when it comes to configurations, what are you going to do:

Either play it by ear, or literally double your cloud costs for a true, real prod-parallel to mitigate that risk. It looks like even the most critical and prestigious companies in the world are doing the former.

macintux 2 hours ago | parent [-]

> Either play it by ear, or literally double your cloud costs for a true, real prod-parallel to mitigate that risk.

There's also the problem that doubling your cloud footprint to reduce the risk of a single point of failure introduces new risks: more configuration to break, new modes of failure when both infrastructures are accidentally live and processing traffic, etc.

Back when companies typically ran their own datacenters (or otherwise heavily relied on physical devices), I was very skeptical about redundant switches, fearing the redundant hardware would cause more problems than it solved.

ForHackernews 4 hours ago | parent | prev [-]

Why should they? Honestly, most of what we do simply does not matter that much. 99.9% uptime is fine in 99.999% of cases.
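
For scale, 99.9% is a surprisingly generous downtime budget. A quick sketch of what the nines allow per year (approximate):

    # Downtime budget per year for a few availability targets.
    minutes_per_year = 365.25 * 24 * 60
    for label, availability in [("99.9%", 0.999), ("99.99%", 0.9999), ("99.999%", 0.99999)]:
        budget = minutes_per_year * (1 - availability)
        print(f"{label} uptime allows ~{budget:,.0f} minutes of downtime per year")

That's roughly 8.8 hours a year at three nines versus about 5 minutes at five nines.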

porridgeraisin 2 hours ago | parent [-]

This is true. But unfortunately the exact same process is used even for critical stuff (the CrowdStrike incident, for example). Maybe there needs to be a separate SWE process for those things as well, just like there is for aviation. That means not using the same dev tooling, which is a lot of effort.