SteveNuts 6 hours ago

I have a serious question, not trying to start a flame war.

A. Are these major issues with cloud/SaaS tools becoming more common, or is it just that they get a lot more coverage now? It seems like we see major issues across AWS, GCP, Azure, Github, etc. at least monthly now and I don't remember that being the case in the past.

B. If it's becoming more common, what are the reasons? I can think of a few, but I don't know the answer, so if anyone in-the-know has insight I'd appreciate it.

Operations budget cuts/layoffs? Replacing critical components/workflows with AI? Just overall growing pains, where a service has outgrown what it was engineered for?

Thanks

wnevets 6 hours ago | parent | next [-]

> A. Are these major issues with cloud/SaaS tools becoming more common, or is it just that they get a lot more coverage now? It seems like we see major issues across AWS, GCP, Azure, Github, etc. at least monthly now and I don't remember that being the case in the past.

FWIW Microsoft is convinced moving Github to Azure will fix these outages

Lammy 6 hours ago | parent | next [-]

Everything old is new again.

https://www.zdnet.com/article/ms-moving-hotmail-to-win2000-s...

https://jimbojones.livejournal.com/23143.html

codethief 5 hours ago | parent [-]

From the second link:

> In 2002, the amusement continued when a network security outfit discovered an internal document server wide open to the public internet in Microsoft's supposedly "private" network, and found, among other things, a whitepaper[0] written by the hotmail migration team explaining why unix is superior to windows.

Hahaha, that whitepaper is pure gold!

[0]: https://web.archive.org/web/20040401182755/http://www.securi...

hotsauceror 9 minutes ago | parent [-]

And 25 years later, a significant portion of the issues in that whitepaper remain unresolved. They were still shitting on people like Jeffrey Snover who were making attempts to provide more scalable management technologies. Such a clown show.

tombert an hour ago | parent | prev | next [-]

Microsoft is a company that hasn't even figured out how to get system updating working consistently on their premier operating system in three decades. It seems unlikely to me that somehow moving to Azure is going to make anything more stable.

einsteinx2 6 hours ago | parent | prev | next [-]

The same Azure that just had a major outage this month?

bovermyer 6 hours ago | parent | prev [-]

Microsoft is also convinced that its works are a net benefit for humanity, so I would take that with a grain of salt.

andrewstuart2 6 hours ago | parent [-]

I think it would be pretty hard to argue against that point of view, at least thus far. If DOS/Windows hadn't become the dominant OS someone else would have, and a whole generation of engineers cut their teeth on their parents' Windows PCs.

tombert an hour ago | parent | next [-]

If Microsoft hadn't tried to actively kill all its competition then there's a good chance that we'd have a much better internet. Microsoft is bigger than just an operating system, they're a whole corporation.

Instead they actively tried to murder open standards [1] that they viewed as competitive and normalized the antitrust nightmare that we have now.

I think by nearly any measure, Microsoft is not a net good. They didn't invent the operating system, there were lots of operating systems that came out in the 80's and 90's, many of which were better than Windows, that didn't have the horrible anticompetitive baggage attached to them.

[1] https://en.wikipedia.org/wiki/Embrace,_extend,_and_extinguis...

cdaringe 6 hours ago | parent | prev | next [-]

There are some pretty zany alternative realities in the Multiverses I’ve visited. In one, Xerox PARC never went under and developed computing as a much more accessible commodity. In another, Bell Labs invented a whole category of analog computers that supplanted our universe’s digital computing era. There’s one where IBM goes directly to supercomputers in the 80s. While undoubtedly Microsoft did deliver for many of us, I am hesitant to say that that was the only path. Hell, Steve Jobs existed in the background for a long while there!

bilegeek 5 hours ago | parent | next [-]

I wish things had gone differently too, but a couple of nitpicks:

1.) It's already a miracle Xerox PARC escaped their parent company's management for as long as they did.

3.) IBM was playing catch-up on the supercomputer front since the CDC 6400 in 1964. Arguably, they did finally catch up in the mid-late 80's with the 3090.

noir_lord 6 hours ago | parent | prev | next [-]

AT&T sold Unix machines (actually a rebadged Olivetti for the hardware) and Microsoft had Xenix when Windows wasn't a thing.

So many weird paths we could have gone down it's almost strange Microsoft won.

andrewstuart2 3 hours ago | parent | prev [-]

Yeah, I'm absolutely not saying it was the only path. It's just the path that happened. If not MS maybe it would have been Unix and something else. Either way most everyone today uses UX based on Xerox PARC's, which was generously borrowed by, at this point, pretty much everyone.

switchbak 6 hours ago | parent | prev | next [-]

DOS and Windows kept computing behind for a VERY long time, not sure what you're trying to argue here?

krabizzwainch 6 hours ago | parent | prev | next [-]

What’s funny is that we were some bad timing away from IBM giving the DOS money to Gary Kildall and we’d all be working with CP/M derivatives!

Gary was on a flight when IBM called up Digital Research looking for an OS for the IBM PC. Gary’s wife, Dorothy, wouldn’t sign an NDA without it going through Gary, and supposedly they never got negotiations back on track.

goda90 6 hours ago | parent | prev | next [-]

What if that alternate someone had been better than DOS/Windows and then engineers cut their teeth on that instead?

andrewstuart2 3 hours ago | parent [-]

Then my comment may have been about a different OS. Or I might never have been born. Who knows?

bovermyer 6 hours ago | parent | prev | next [-]

I'm not convinced of your first point. Just because something seems difficult to avoid given the current context does not mean it was the only path available.

Your second point is a little disingenuous. Yes, Microsoft and Windows have been wildly successful from a cultural adoption standpoint. But that's not the point I was trying to argue.

andrewstuart2 3 hours ago | parent [-]

My first comment is simply pointing out that there's always a #1 in anything you can rank. Windows happened to be what won. And I learned how to use a computer on Windows. Do I use it now? No. But I learned on it as did most people whose parents wanted a computer.

tombert an hour ago | parent [-]

The comment you were replying to was about Microsoft.

Even if Windows weren't a dogshit product, which it is, Microsoft is a lot more than just an operating system. In the 90's they actively tried to sabotage any competition in the web space, and held web standards back by refusing to make Internet Explorer actually work.

hobs 4 hours ago | parent | prev [-]

And how does it follow that Microsoft is the good guy in a future where we did it with some other operating system? You could argue, with the same level of evidence, that their system was so terrible that its displacement of other options harmed us all.

junon 6 hours ago | parent | prev | next [-]

Been on GitHub for a long time. It feels like outages are more frequent. It used to be yearly, if at all, that GitHub was noticeably impacted. Now it's monthly, and recently, seemingly weekly.

0x457 5 hours ago | parent | next [-]

Definitely not how I remember it. First, I remember seeing the unicorn page multiple times a day some weeks. There were also times when webhook delivery didn't work, so CircleCI users couldn't kick off any builds.

What changed is how many services GitHub has that can be having issues.

chadac 6 hours ago | parent | prev | next [-]

I suspect that the Azure migration is influencing this one. Just a bunch of legacy stuff being moved around along with Azure not really being the most reliable on top... I can't imagine it's easy.

zackify 6 hours ago | parent | prev | next [-]

There have been 5 between Actions and push/pull issues just this month. It is more often.

cmrdporcupine 6 hours ago | parent | prev [-]

In the early days of GitHub (like before 2010) outages were extremely common.

bovermyer 6 hours ago | parent | next [-]

I agree, for what that's worth.

However, this is an unexpected bell curve. I wonder if GitHub is seeing more frequent adversarial action lately. Alternatively, perhaps there is a premature reliance on new technology at play.

cmrdporcupine 6 hours ago | parent [-]

I pulled my project off github and onto codeberg a couple months ago but this outage still screws me over because I have a Cargo.toml w/ git dependency into github.

I was trying to do a 1.0 release today. Codeberg went down for "10 minutes maintenance" multiple times while I was running my CI actions.

And then github went down.

Cursed.

netghost 6 hours ago | parent | prev | next [-]

I think it was generally news when there were upages and the site was up. Similar with twitter for that matter.

junon 6 hours ago | parent | prev [-]

Not from my recollection. Not like this. BitBucket on the other hand had a several day outage at one point. That one I do recall.

sampullman 6 hours ago | parent [-]

I remember periods of time when GitHub was down every few weeks; my impression is that it's become more stable over the years.

6 hours ago | parent [-]
[deleted]
kkarpkkarp 6 hours ago | parent | prev | next [-]

> If it's becoming more common, what are the reasons?

Someone answered this morning, during the Cloudflare outage, that it's AI vibe coding, and I tend to think there is some truth in this. At some point there might be some tiny grain of AI involved which starts the avalanche that ends like this.

AIorNot 6 hours ago | parent | prev | next [-]

well layoffs across tech probably haven't helped

https://techrights.org/n/2025/08/12/Microsoft_Can_Now_Stop_R...

ever since Musk greenlighted firing people again.. CEOs can't wait to pull the trigger

smsm42 6 hours ago | parent | prev | next [-]

It certainly feels that way, though it may be an instance of availability bias. Not sure what's causing it - maybe extra load from AI bots (certainly a lot of smaller sites complain about it, maybe major providers feel the pain too), maybe some kind of general quality erosion... It's certainly something that is waiting for serious research.

tingletech 5 hours ago | parent | prev | next [-]

Years ago on hackernews I saw a link about probability describing a statistical technique that one could use to answer a question about if a specific type of event was becoming more common or not. Maybe related to the birthday paradox? The gist that I remember is that sometimes a rare event will seem to be happening more often, when in reality there is some cognitive bias that makes it non-intuitive to make that decision without running the numbers. I think it was a blog post that went through a few different examples, and maybe only one of them was actually happening more often.

ambicapter 5 hours ago | parent [-]

If the events are independent, you could use a binomial distribution. Not sure if you can consider these kinds of events to be independent, though.
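To make that concrete, here is a minimal sketch of the binomial check, assuming (and this is a big assumption) that each month is an independent trial with a fixed baseline probability of containing a major outage. The function name and the example numbers are made up for illustration; they are not real outage data.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p).

    n: number of independent periods observed (e.g. months)
    k: number of periods in which an outage was actually seen
    p: assumed baseline probability of an outage in any one period
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Hypothetical numbers: a 10% baseline chance per month, and outages
    # seen in 3 of the last 12 months.
    print(binomial_tail(n=12, k=3, p=0.1))  # roughly 0.11
```

A small tail probability would suggest the recent count is hard to explain by the old baseline alone; a large one means "feels more common" could just be noise. If outages cluster (one incident causing another), the independence assumption breaks and this understates the chance of a bad streak.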

pm90 6 hours ago | parent | prev | next [-]

Github isn't in the same reliability class as the hyperscalers or Cloudflare; it's comically bad now, to the point that at a previous job we invested in building a readonly cache layer specifically to prevent github outages from bringing our system down.
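For anyone wondering what such a layer can look like, here is a rough sketch of the general "serve stale on error" idea. It is not the actual system described above; the class name, the `fetch` callable, and the TTL value are all hypothetical, and a real version would persist to disk or a shared store rather than an in-process dict.

```python
import time
from typing import Callable, Dict, Tuple


class StaleOnErrorCache:
    """Read-through cache that falls back to stale data when upstream fails."""

    def __init__(
        self,
        fetch: Callable[[str], bytes],  # hypothetical upstream reader
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, key: str) -> bytes:
        now = self._clock()
        entry = self._entries.get(key)
        # Fresh hit: answer from the cache without touching upstream.
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        try:
            value = self._fetch(key)
        except Exception:
            # Upstream is down: prefer an expired copy over failing outright.
            if entry is not None:
                return entry[1]
            raise
        self._entries[key] = (now, value)
        return value
```

The trade-off is that readers may see old content during an outage, which is usually acceptable for things like pinned dependencies or config that rarely changes.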

grayhatter 5 hours ago | parent | prev | next [-]

End of year, pre-holiday break, code/project completion for perf review rush.

Be good to your Stability reliability engineers for the next few months... it's downtime season!

Wowfunhappy 6 hours ago | parent | prev | next [-]

I’m more interested in how this and the Cloudflare outage occurred on the same day. Is it really just a coincidence?

dlenski 5 hours ago | parent | prev | next [-]

> Are these major issues with cloud/SaaS tools becoming more common, or is it just that they get a lot more coverage now?

I think that "more coverage" is part of it, but also "more centralization." More and more of the web is centralized around a tiny number of cloud providers, because it's just extremely time-intensive and cost-prohibitive for all but the largest and most specialized companies to run their own datacenters and servers.

Three specific examples: Netflix and Dropbox do run their own datacenters and servers; Strava runs on AWS.

> If it's becoming more common, what are the reasons? I can think of a few, but I don't know the answer, so if anyone in-the-know has insight I'd appreciate it.

I worked at AWS from 2020-2024, and saw several of these outages so I guess I'm "in the know."

My somewhat-cynical take is that a lot of these services have grown enormously in complexity, far outstripping the ability of their staff to understand them or maintain them:

- The OG developers of most of these cloud services have moved on. Knowledge transfer within AWS is generally very poor, because it's not incentivized, and has gotten worse due to remote work and geographic dispersion of service teams.

- Managers at AWS are heavily incentivized to develop "new features" and not to improve the reliability, or even security, of their existing offerings. (I discovered numerous security vulnerabilities in the very-well-known service that I worked for, and was regularly punished-rather-than-rewarded for trying to get attention and resources on this. It was a big part of what drove me to leave Amazon. I'm still sitting on a big pile of zero-day vulnerabilities in ______ and ______.)

- Cloud services in most of the world are basically a 3-way oligopoly between AWS, Microsoft/Azure, and Google. The costs of switching from one provider to another are often ENORMOUS due to a zillion fiddly little differences and behavior quirks ("bugs"). It's not apparent to laypeople — or even to me — that any of these providers are much more or less reliable than the others.

averageRoyalty 6 hours ago | parent | prev | next [-]

I suspect there is more tech out there. 20 years ago we didn't have smartphones. 10 years ago, 20mbit on mobile was a good connection. Gigabit is common now, infrastructure no longer has the hurdles it used to, AI makes coding and design much easier, phones are ubiquitous and usage of them at all times (in the movies, out at dinner, driving) has become super normalised.

I suspect (although have not researched) that global traffic is up, by throughput but also by session count.

This contributes to a lot more awareness. Slack being down wasn't impactful when most tech companies didn't use Slack. An AWS outage was less relevant when the 10 apps (used to be websites) you use most didn't rely on a single AZ in AWS or you were on your phone less.

I think as a society it just has more impact than it used to.

myth_drannon 6 hours ago | parent | prev | next [-]

Looking around, I noticed that many senior, experienced individuals were laid off, sometimes replaced by juniors/contractors without institutional knowledge or experience. That's especially evident in ops/support, where the management believes those departments should have a smaller budget.

sunshine-o 6 hours ago | parent | prev | next [-]

1/ Most of the big corporations moved to big cloud providers in the last 5 years. Most of them started 10 years ago but it really accelerated in the last 5 years. So there is for sure more weight and complexity on cloud providers, and more impact when something goes wrong.

2/ Then we cannot expect big tech to stay as sharp as in the 2000s and 2010s.

There was a time banks had all the smart people, then the telcos had them, etc. But people get older and too comfortable, layers of bad incentives and politics accumulate, and you just become a dysfunctional big mess.

swed420 5 hours ago | parent | prev | next [-]

> B. If it's becoming more common, what are the reasons?

Among other mentioned factors like AI and layoffs: mass brain damage caused by never-ending COVID re-infections.

Since vaccines don't prevent transmission, and each re-infection increases the chances of long COVID complications, the only real protection right now is wearing a proper respirator everywhere you go, and basically nobody is doing that anymore.

There are tons of studies to back this line of reasoning.

__MatrixMan__ 6 hours ago | parent | prev | next [-]

I think it's cancer, and it's getting worse.

xmprt 6 hours ago | parent | prev [-]

One possibility is increased monitoring. In the past, issues that happened weren't reported because they went under the radar. Whereas now, those same issues which only impact a small percentage of users would still result in a status update and postmortem. But take this with a grain of salt because it's just a theory and doesn't reflect any actual data.

A lot of people are pointing to AI vibe coding as the cause, but I think more often than not, incidents happen due to poor maintenance of legacy code. But I guess this may be changing soon as AI written code starts to become "legacy" faster than regular code.

Kostic 6 hours ago | parent [-]

At least with GitHub it's hard to hide when you get "no healthy upstream" on a git push.