| ▲ | GitHub: Git operation failures(githubstatus.com) |
| 330 points by wilhelmklopp 5 hours ago | 271 comments |
| |
|
| ▲ | aeldidi 3 hours ago | parent | next [-] |
| I'm becoming concerned with the rate at which major software systems seem to be failing as of late. For context, last year I only logged four outages that actually disrupted my work; this quarter alone I'm already on my fourth, all within the past few weeks. This is, of course, just an anecdote and not evidence of any wider trend (not to mention that I might not have even logged everything last year), but it was enough to nudge me into writing this today (helped by the fact that I suddenly had some downtime). Keep in mind, this isn't necessarily specific to this outage, just something that's been on my mind enough to warrant writing about it. It feels like resiliency is becoming a bit of a lost art in networked software. I've spent a good chunk of this year chasing down intermittent failures at work, and I really underestimated how much work goes into shrinking the "blast radius", so to speak, of any bug or outage. Even though we mostly run a monolith, we still depend on a bunch of external pieces like daemons, databases, Redis, S3, monitoring, and third-party integrations, and we generally assume that these things are present and working in most places, which wasn't always the case. My response was to better document the failure conditions, and once I did, I realized that there were many more than we initially thought. Since then we've done things like: move some things to a VPS instead of cloud services, automate deployment more than we already had, greatly improve the test suite and docs to include these newly considered failure conditions, and generally cut down on moving parts. It was a ton of effort, but the payoff has finally shown up: our records show fewer surprises, which means fewer distractions and a much calmer system overall. Without that unglamorous work, things would've only grown more fragile as complexity crept in. And I worry that, more broadly, we're slowly un-learning how to build systems that stay up even when the inevitable bug or failure shows up. For completeness, here are the outages that prompted this: the AWS us-east-1 outage in October (took down the Lightspeed R series API), the Azure Front Door outage (prevented Playwright from downloading browsers for tests), today’s Cloudflare outage (took down Lightspeed’s website, which some of our clients rely on), and the GitHub outage affecting basically everyone who uses it as their git host. |
| |
| ▲ | HardwareLust 3 hours ago | parent | next [-] | | It's money, of course. No one wants to pay for resilience/redundancy. I've launched over a dozen projects going back to 2008, clients simply refuse to pay for it, and you can't force them. They'd rather pinch their pennies, roll the dice and pray. | | |
| ▲ | stinkbeetle 2 hours ago | parent | next [-] | | > It's money, of course. 100% > No one wants to pay for resilience/redundancy. I've launched over a dozen projects going back to 2008, clients simply refuse to pay for it, and you can't force them. They'd rather pinch their pennies, roll the dice and pray. Well, fly by night outfits will do that. Bigger operations like GitHub will try to do the math on what an outage costs vs what better reliability costs, and optimize accordingly. Look at a big bank or a big corporation's accounting systems, they'll pay millions just for the hot standby mainframes or minicomputers that, for most of them, would never be required. | | |
| ▲ | Jenk 2 hours ago | parent [-] | | I've worked at many big banks and corporations. They are all held together with the proverbial sticky tape, bubblegum, and hope. They do have multiple layers of redundancies, and thus have the big budgets, but they won't be kept hot, or there will be some critical flaws that all of the engineers know about but they haven't been given permission/funding to fix, and are so badly managed by the firm, they dgaf either and secretly want the thing to burn. There will be sustained periods of downtime if their primary system blips. They will all still be dependent on some hyper-critical system that nobody really knows how it works, the last change was introduced in 1988 and it (probably) requires a terminal emulator to operate. | | |
▲ | stinkbeetle 2 hours ago | parent [-] | | I've worked on software used by these and have been called in to help support from time to time. One customer, which is a top single-digit public company by market cap (they may have been #1 at the time, a few years ago), had their SAP systems go down once every few days. This wasn't causing a real monetary problem for them because their hot standby took over. They weren't using mainframes, just "big iron" servers, but each one would have been north of $5 million for the box alone, I guess on a 5ish year replacement schedule. Then there's all the networking, storage, licensing, support, and internal administration costs for it, which would easily cost that much again. Now people will say SAP systems are made entirely of duct tape and bubblegum. But it all worked. This system ran all their sales/purchasing sites and portals and was doing a million dollars every couple of minutes, so that all paid for itself many times over during the course of that bug. Cold standby would not have cut it. Especially since these big systems take many minutes to boot and HANA takes even longer to load from storage. |
|
| |
| ▲ | lopatin 3 hours ago | parent | prev | next [-] | | I agree that it's all money. That's why it's always DNS right? > No one wants to pay for resilience/redundancy These companies do take it seriously, on the software side, but when it comes to configurations, what are you going to do: Either play it by ear, or literally double your cloud costs for a true, real prod-parallel to mitigate that risk. It looks like even the most critical and prestigious companies in the world are doing the former. | | |
| ▲ | macintux an hour ago | parent [-] | | > Either play it by ear, or literally double your cloud costs for a true, real prod-parallel to mitigate that risk. There's also the problem that doubling your cloud footprint to reduce the risk of a single point of failure introduces new risks: more configuration to break, new modes of failure when both infrastructures are accidentally live and processing traffic, etc. Back when companies typically ran their own datacenters (or otherwise heavily relied on physical devices), I was very skeptical about redundant switches, fearing the redundant hardware would cause more problems than it solved. |
| |
| ▲ | ForHackernews 3 hours ago | parent | prev [-] | | Why should they? Honestly most of what we do simply does not matter that much. 99.9% uptime is fine in 99.999% of cases. | | |
| ▲ | porridgeraisin an hour ago | parent [-] | | This is true. But unfortunately the exact same process is used even for critical stuff (the crowdstrike thing for example). Maybe there needs to be a separate swe process for those things as well, just like there is for aviation. This means not using the same dev tooling, which is a lot of effort. |
|
| |
| ▲ | roxolotl an hour ago | parent | prev | next [-] | | To agree with the comments it seems likely it's money which has begun to result in a slow "un-learning how to build systems that stay up even when the inevitable bug or failure shows up." | |
| ▲ | suddenlybananas 3 hours ago | parent | prev [-] | | To be deliberately provocative, LLMs are being more and more widely used. | | |
| ▲ | zdragnar an hour ago | parent | next [-] | | Word on the street is github was already a giant mess before the rise of LLMs, and it has not improved with the move to MS. | | | |
| ▲ | blibble 3 hours ago | parent | prev [-] | | imagine what it'll be like in 10 years time Microsoft: the film Idiocracy was not supposed to be a manual |
|
|
|
| ▲ | mandus 5 hours ago | parent | prev | next [-] |
| Good thing git was designed as a decentralized revision control system, so you don’t really need GitHub. It’s just a nice convenience |
| |
| ▲ | jimbokun 5 hours ago | parent | next [-] | | As long as you didn't go all in on GitHub Actions. Like my company has. | | |
| ▲ | esafak 4 hours ago | parent | next [-] | | Then your CI host is your weak point. How many companies have multi-cloud or multi-region CI? | |
| ▲ | IshKebab 4 hours ago | parent | prev [-] | | Do you think you'd get better uptime with your own solution? I doubt it. It would just be at a different time. | | |
▲ | wavemode 4 hours ago | parent | next [-] | | Uptime is much, much easier at low scale than at high scale. The reason for buying centralized cloud solutions is not uptime, it's to save the headache of developing and maintaining the thing. | |
| ▲ | tyre 4 hours ago | parent [-] | | My reason for centralized cloud solutions is also uptime. Multi-AZ RDS is 100% higher availability than me managing something. | | |
| ▲ | wavemode 3 hours ago | parent [-] | | Well, just a few weeks ago we weren't able to connect to RDS for several hours. That's way more downtime than we ever had at the company I worked for 10 years ago, where the DB was just running on a computer in the basement. Anecdotal, but ¯\_(ツ)_/¯ | | |
| ▲ | sshine 2 hours ago | parent [-] | | An anecdote that repeats. Most software doesn’t need to be distributed. But it’s the growth paradigm where we build everything on principles that can scale to world-wide low-latency accessibility. A UNIX pipe gets replaced with a $1200/mo. maximum IOPS RDS channel, bandwidth not included in price. Vendor lock-in guaranteed. |
|
|
| |
▲ | jakewins 4 hours ago | parent | prev | next [-] | | “Your own solution” should be that CI isn’t doing anything you can’t do on developer machines. CI is a convenience that runs your Make or Bazel or Just builds (or whatever you prefer), and your production systems work fine without it. I’ve seen that work first hand to keep critical stuff deployable through several CI outages, and it also has the upside of making it trivial to debug “CI issues”, since it’s trivial to run the same target locally | |
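A minimal sketch of that pattern, assuming hypothetical Make targets named lint, test, and build: CI runs the single entry-point script below and nothing else, so a developer can run the exact same thing locally when CI is down.
#!/usr/bin/env sh
# ci.sh -- the only thing the CI job executes; developers run this same script locally
set -eu
make lint
make test
make build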
| ▲ | CGamesPlay an hour ago | parent [-] | | Yes, this, but it’s a little more nuanced because of secrets. Giving every employee access to the production deploy key isn’t exactly great OpSec. |
| |
| ▲ | tcoff91 4 hours ago | parent | prev | next [-] | | Compared to 2025 github yeah I do think most self-hosted CI systems would be more available. Github goes down weekly lately. | | |
| ▲ | Aperocky 4 hours ago | parent [-] | | Aren't they halting all work to migrate to azure? Does not sound like an easy thing to do and feels quite easy to cause unexpected problems. | | |
| ▲ | macintux an hour ago | parent [-] | | I recall the Hotmail acquisition and the failed attempts to migrate the service to Windows servers. |
|
| |
| ▲ | deathanatos 3 hours ago | parent | prev | next [-] | | Yes. I've quite literally run a self-hosted CI/CD solution, and yes, in terms of total availability, I believe we outperformed GHA when we did so. We moved to GHA b/c nobody ever got fired ^W^W^W^W leadership thought eng running CI was not a good use of eng time. (Without much question into how much time was actually spent on it… which was pretty close to none. Self-hosted stuff has high initial cost for the setup … and then just kinda runs.) Ironically, one of our self-hosted CI outages was caused by Azure — we have to get VMs from somewhere, and Azure … simply ran out. We had to swap to a different AZ to merely get compute. The big upside to a self-hosted solution is that when stuff breaks, you can hold someone over the fire. (Above, that would be me, unfortunately.) With Github? Nobody really cares unless it is so big, and so severe, that they're more or less forced to, and even then, the response is usually lackluster. | |
| ▲ | prescriptivist 4 hours ago | parent | prev | next [-] | | It's fairly straightforward to build resilient, affordable and scalable pipelines with DAG orchestrators like tekton running in kubernetes. Tekton in particular has the benefit of being low level enough that it can just be plugged into the CI tool above it (jenkins, argo, github actions, whatever) and is relatively portable. | |
| ▲ | davidsainez 4 hours ago | parent | prev | next [-] | | Doesn’t have to be an in house system, just basic redundancy is fine. eg a simple hook that pushes to both GitHub and gitlab | |
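One way to get that dual-push behavior without writing a hook is git's support for multiple push URLs on a single remote; a sketch, assuming a remote named origin and hypothetical acme/app mirrors already created on both hosts:
# add both hosts as push URLs; a single "git push" then updates GitHub and GitLab
git remote set-url --add --push origin git@github.com:acme/app.git
git remote set-url --add --push origin git@gitlab.com:acme/app.git
# verify: one fetch URL, two push URLs
git remote -v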
| ▲ | nightski 4 hours ago | parent | prev [-] | | I mean yes. We've hosted internal apps that have four nines reliability for over a decade without much trouble. It depends on your scale of course, but for a small team it's pretty easy. I'd argue it is easier than it has ever been because now you have open source software that is containerized and trivial to spin up/maintain. The downtime we do have each year is typically also on our terms, not in the middle of a work day or at a critical moment. |
|
| |
| ▲ | __MatrixMan__ 4 hours ago | parent | prev | next [-] | | This escalator is temporarily stairs, sorry for the convenience. | | |
| ▲ | Akronymus 4 hours ago | parent | next [-] | | Tbh, I personally don't trust a stopped escalator. Some of the videos of brake failures on them scared me off of ever going on them. | | |
| ▲ | collingreen 4 hours ago | parent [-] | | You've ruined something for me. My adult side is grateful but the rest of me is throwing a tantrum right now. I hope you're happy with what you've done. | | |
| ▲ | rvnx 4 hours ago | parent | next [-] | | I read a book about elevators accidents; don't. | | |
| ▲ | yjftsjthsd-h 3 hours ago | parent [-] | | elevators accidents or escalator accidents? | | |
| ▲ | rvnx 3 hours ago | parent [-] | | elevators.
for escalators, make sure not to watch videos of people falling in "the hole". |
|
| |
▲ | Akronymus 3 hours ago | parent | prev [-] | | I am genuinely sorry about that. And no, I am not happy about what I've done. |
|
| |
| ▲ | fishpen0 3 hours ago | parent | prev [-] | | Not really comparable at any compliance or security oriented business. You can't just zip the thing up and sftp it over to the server. All the zany supply chain security stuff needs to happen in CI and not be done by a human or we fail our dozens of audits | | |
| ▲ | __MatrixMan__ 3 hours ago | parent [-] | | Why is it that we trust those zany processes more than each other again? Seems like a good place to inject vulnerabilities to me... |
|
| |
| ▲ | lopatin 4 hours ago | parent | prev | next [-] | | The issue is that GitHub is down, not that git is down. | |
| ▲ | ElijahLynn 5 hours ago | parent | prev | next [-] | | You just lose the "hub" of connecting others and providing a way to collaborate with others with rich discussions. | | |
▲ | parliament32 5 hours ago | parent [-] | | All of those sound achievable by email, which, coincidentally, is also decentralized. | |
▲ | Aurornis 5 hours ago | parent | next [-] | | Some of my open source work is done on mailing lists through e-mail. It's more work and slower. I'm convinced half of the reason they keep it that way is because the barrier to entry is higher and it scares contributors away. | |
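For readers unfamiliar with that workflow, the git side of an email-based flow looks roughly like this (the list address is hypothetical, and git send-email usually ships as a separate package that needs SMTP configured):
# turn the last three commits into a patch series with a cover letter
git format-patch -3 --cover-letter -o outgoing/
# mail the series to the list
git send-email --to=dev@lists.example.org outgoing/*.patch
# a maintainer applies a series saved from their mail client
git am received-series.mbox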
| ▲ | awesome_dude 4 hours ago | parent | prev [-] | | Wait, email is decentralised? You mean, assuming everyone in the conversation is using different email providers. (ie. Not the company wide one, and not gmail... I think that covers 90% of all email accounts in the company...) |
|
| |
| ▲ | Conscat 4 hours ago | parent | prev | next [-] | | I'm on HackerNews because I can't do my job right now. | | | |
| ▲ | keybored 4 hours ago | parent | prev | next [-] | | I don’t use GitHub that much. I think the thing about “oh no you have centralized on GitHub” point is a bit exaggerated.[1] But generally, thinking beyond just pushing blobs to the Internet, “decentralization” as in software that lets you do everything that is Not Internet Related locally is just a great thing. So I can never understand people who scoff at Git being decentralized just because “um, actually you end up pushing to the same repository”. It would be great to also have the continuous build and test and whatever else you “need” to keep the project going as local alternatives as well. Of course. [1] Or maybe there is just that much downtime on GitHub now that it can’t be shrugged off | |
| ▲ | ramon156 5 hours ago | parent | prev | next [-] | | SSH also down | | |
| ▲ | gertlex 5 hours ago | parent | next [-] | | My pushing was failing for reasons I hadn't seen before. I then tried my sanity check of `ssh git@github.com` (I think I'm supposed to throw a -t flag there, but never care to), and that worked. But yes ssh pushing was down, was my first clue. My work laptop had just been rebooted (it froze...) and the CPU was pegged by security software doing a scan (insert :clown: emoji), so I just wandered over to HN and learned of the outage at that point :) | |
| ▲ | kragen 5 hours ago | parent | prev | next [-] | | SSH works fine for me. I'm using it right now. Just not to GitHub! | |
| ▲ | blueflow 5 hours ago | parent | prev [-] | | SSH is as decentralized as git - just push to your own server? No problem. | | |
▲ | jimbokun 5 hours ago | parent [-] | | Well sure, but you can't get any collaborators' commits that were only pushed to GitHub before it went down. Well, you can with some effort. But there's certainly some inconvenience. |
|
| |
| ▲ | stevage 4 hours ago | parent | prev [-] | | Curious whether you actually think this, or was it sarcasm? | | |
▲ | 0x457 4 hours ago | parent [-] | | It was sarcasm, but git itself is a decentralized VCS. Technically speaking, every git checkout is a repo in itself. GitHub doesn't stop me from having the entire repo history up to the last pull, and I can still push either to the company backup server or to my coworker directly. However, since we use github.com for more than just git hosting, it is a SPOF in most cases, and we treat it as a snow day. | |
|
|
|
| ▲ | grepfru_it an hour ago | parent | prev | next [-] |
| There was a comment on another GitHub thread that I replied to. I got a response saying it’s absurd how unreliable Gh is when people depend on it for CI/CD. And I think this is the problem. At GitHub the developers think it’s only a problem because their ci/cd is failing. Oh no, we broke GitHub actions, the actions runners team is going to be mad at us! Instead of, oh no, we broke GitHub actions, half the world is down! That larger view held only by a small sliver of employees is likely why reliability is not a concern. That leads to the every team for themselves mentality. “It’s not our problem, and we won’t make it our problem so we don’t get dinged at review time” (ok that is Microsoft attitude leaking) Then there’s their entrenched status. Real talk, no one is leaving GitHub. So customers will suck it up and live with it while angry employees grumble on an online forum. I saw this same attitude in major companies like Verio and Verisign in the early 2000s. “Yeah we’re down but who else are you going to go to? Have a 20% discount since you complained. We will only be 1% less profitable this quarter due to it” The kang and kodos argument personified. These views are my own and not related to my employer or anyone associated with me. |
|
| ▲ | SteveNuts 5 hours ago | parent | prev | next [-] |
| I have a serious question, not trying to start a flame war. A. Are these major issues with cloud/SaaS tools becoming more common, or is it just that they get a lot more coverage now? It seems like we see major issues across AWS, GCP, Azure, Github, etc. at least monthly now and I don't remember that being the case in the past. B. If it's becoming more common, what are the reasons? I can think of a few, but I don't know the answer, so if anyone in-the-know has insight I'd appreciate it. Operations budget cuts/layoffs?
Replacing critical components/workflows with AI?
Just overall growing pains, where a service has outgrown what it was engineered for? Thanks |
| |
| ▲ | wnevets 5 hours ago | parent | next [-] | | > A. Are these major issues with cloud/SaaS tools becoming more common, or is it just that they get a lot more coverage now? It seems like we see major issues across AWS, GCP, Azure, Github, etc. at least monthly now and I don't remember that being the case in the past. FWIW Microsoft is convinced moving Github to Azure will fix these outages | | |
| ▲ | Lammy 5 hours ago | parent | next [-] | | Everything old is new again. https://www.zdnet.com/article/ms-moving-hotmail-to-win2000-s... https://jimbojones.livejournal.com/23143.html | | |
| ▲ | codethief 4 hours ago | parent [-] | | From the second link: > In 2002, the amusement continued when a network security outfit discovered an internal document server wide open to the public internet in Microsoft's supposedly "private" network, and found, among other things, a whitepaper[0] written by the hotmail migration team explaining why unix is superior to windows. Hahaha, that whitepaper is pure gold! [0]: https://web.archive.org/web/20040401182755/http://www.securi... |
| |
| ▲ | einsteinx2 5 hours ago | parent | prev | next [-] | | The same Azure that just had a major outage this month? | |
| ▲ | bovermyer 5 hours ago | parent | prev [-] | | Microsoft is also convinced that its works are a net benefit for humanity, so I would take that with a grain of salt. | | |
| ▲ | andrewstuart2 5 hours ago | parent [-] | | I think it would be pretty hard to argue against that point of view, at least thus far. If DOS/Windows hadn't become the dominant OS someone would have, and a whole generation of engineers cut their teeth on their parents' windows PCs. | | |
▲ | cdaringe 5 hours ago | parent | next [-] | | There are some pretty zany alternative realities in the Multiverses I’ve visited. Xerox Parc never went under and developed computing as a much more accessible commodity. In another, Bell Labs invented a whole category of analog computers that supplanted our universe’s digital computing era. There’s one where IBM goes directly to super computers in the 80s. While undoubtedly Microsoft did deliver for many of us, I am hesitant to say that that was the only path. Hell, Steve Jobs existed in the background for a long while there! | |
▲ | bilegeek 4 hours ago | parent | next [-] | | I wish things had gone differently too, but a couple of nitpicks: 1.) It's already a miracle Xerox PARC escaped their parent company's management for as long as they did. 2.) IBM had been playing catch-up on the supercomputer front since the CDC 6400 in 1964. Arguably, they did finally catch up in the mid-late 80's with the 3090. | |
▲ | noir_lord 4 hours ago | parent | prev | next [-] | | AT&T sold Unix machines (actually a rebadged Olivetti for the hardware) and Microsoft had Xenix when Windows wasn't a thing. So many weird paths we could have gone down; it's almost strange Microsoft won. | |
| ▲ | andrewstuart2 an hour ago | parent | prev [-] | | Yeah, I'm absolutely not saying it was the only path. It's just the path that happened. If not MS maybe it would have been Unix and something else. Either way most everyone today uses UX based on Xerox Parc's which was generously borrowed by, at this point, pretty much everyone. |
| |
| ▲ | switchbak 4 hours ago | parent | prev | next [-] | | DOS and Windows kept computing behind for a VERY long time, not sure what you're trying to argue here? | |
▲ | krabizzwainch 4 hours ago | parent | prev | next [-] | | What’s funny is that we were some bad timing away from IBM giving the DOS money to Gary Kildall and we’d all be working with CP/M derivatives! Gary was on a flight when IBM called up Digital Research looking for an OS for the IBM-PC. Gary’s wife, Dorothy, wouldn’t sign an NDA without it going through Gary, and supposedly they never got negotiations back on track. | |
| ▲ | goda90 5 hours ago | parent | prev | next [-] | | What if that alternate someone had been better than DOS/Windows and then engineers cut their teeth on that instead? | | |
| ▲ | andrewstuart2 an hour ago | parent [-] | | Then my comment may have been about a different OS. Or I might never have been born. Who knows? |
| |
| ▲ | bovermyer 5 hours ago | parent | prev | next [-] | | I'm not convinced of your first point. Just because something seems difficult to avoid given the current context does not mean it was the only path available. Your second point is a little disingenuous. Yes, Microsoft and Windows have been wildly successful from a cultural adoption standpoint. But that's not the point I was trying to argue. | | |
| ▲ | andrewstuart2 an hour ago | parent [-] | | My first comment is simply pointing out that there's always a #1 in anything you can rank. Windows happened to be what won. And I learned how to use a computer on Windows. Do I use it now? No. But I learned on it as did most people whose parents wanted a computer. |
| |
| ▲ | hobs 2 hours ago | parent | prev [-] | | And how does it follow that microsoft is the good guy in a future where we did it with some other operating system? You could argue that their system was so terrible that its displacement of other options harmed us all with the same level of evidence. |
|
|
| |
▲ | junon 5 hours ago | parent | prev | next [-] | | Been on GitHub for a long time. It feels like outages are happening more often. It used to be yearly, if at all, that GitHub was noticeably impacted. Now it's monthly, and recently, seemingly weekly. | |
| ▲ | chadac 5 hours ago | parent | next [-] | | I suspect that the Azure migration is influencing this one. Just a bunch of legacy stuff being moved around along with Azure not really being the most reliable on top... I can't imagine it's easy. | |
▲ | 0x457 4 hours ago | parent | prev | next [-] | | Definitely not how I remember it. First, I remember seeing the unicorn page multiple times a day some weeks. There were also times when webhook delivery didn't work, so CircleCI users couldn't kick off any builds. What has changed is how many GitHub services can be having issues. | |
▲ | zackify 5 hours ago | parent | prev | next [-] | | there have been 5 between Actions and push/pull issues just this month. it is more often | |
| ▲ | cmrdporcupine 5 hours ago | parent | prev [-] | | In the early days of GitHub (like before 2010) outages were extremely common. | | |
| ▲ | bovermyer 5 hours ago | parent | next [-] | | I agree, for what that's worth. However, this is an unexpected bell curve. I wonder if GitHub is seeing more frequent adversarial action lately. Alternatively, perhaps there is a premature reliance on new technology at play. | | |
| ▲ | cmrdporcupine 4 hours ago | parent [-] | | I pulled my project off github and onto codeberg a couple months ago but this outage still screws me over because I have a Cargo.toml w/ git dependency into github. I was trying to do a 1.0 release today. Codeberg went down for "10 minutes maintenance" multiple times while I was running my CI actions. And then github went down. Cursed. |
| |
| ▲ | netghost 4 hours ago | parent | prev | next [-] | | I think it was generally news when there were upages and the site was up.
Similar with twitter for that matter. | |
| ▲ | junon 5 hours ago | parent | prev [-] | | Not from my recollection. Not like this. BitBucket on the other hand had a several day outage at one point. That one I do recall. | | |
| ▲ | sampullman 5 hours ago | parent [-] | | I remember periods of time when GitHub was down every few weeks, my impression is that it's become more stable over the years. |
|
|
| |
▲ | kkarpkkarp 5 hours ago | parent | prev | next [-] | | > If it's becoming more common, what are the reasons? Someone answered this morning, during the Cloudflare outage, that it's AI vibe coding, and I tend to think there is something true in this. At some point there might be some tiny grain of AI involved which starts the avalanche ending like this. | |
▲ | AIorNot 5 hours ago | parent | prev | next [-] | | well layoffs across tech probably haven't helped https://techrights.org/n/2025/08/12/Microsoft_Can_Now_Stop_R... ever since Musk greenlighted firing people again.. CEOs can't wait to pull the trigger | |
▲ | smsm42 5 hours ago | parent | prev | next [-] | | It certainly feels that way, though it may be an instance of availability bias. Not sure what's causing it - maybe extra load from AI bots (certainly a lot of smaller sites complain about it, maybe major providers feel the pain too), maybe some kind of general quality erosion... It's certainly something that is waiting for serious research. | |
▲ | tingletech 4 hours ago | parent | prev | next [-] | | Years ago on hackernews I saw a link about probability describing a statistical technique that one could use to answer a question about whether a specific type of event was becoming more common or not. Maybe related to the birthday paradox? The gist that I remember is that sometimes a rare event will seem to be happening more often, when in reality there is some cognitive bias that makes it non-intuitive to make that decision without running the numbers. I think it was a blog post that went through a few different examples, and maybe only one of them was actually happening more often. | |
| ▲ | ambicapter 3 hours ago | parent [-] | | If the events are independent, you could use a binomial distribution. Not sure if you can consider these kinds of events to be independent, though. |
| |
▲ | pm90 5 hours ago | parent | prev | next [-] | | Github isn't in the same reliability class as the hyperscalers or cloudflare; it's comically bad now, to the point that at a previous job we invested in building a readonly cache layer specifically to prevent github outages from bringing our system down. | |
| ▲ | grayhatter 4 hours ago | parent | prev | next [-] | | End of year, pre-holiday break, code/project completion for perf review rush. Be good to your Stability reliability engineers for the next few months... it's downtime season! | |
| ▲ | Wowfunhappy 5 hours ago | parent | prev | next [-] | | I’m more interested in how this and the Cloudflare outage occurred on the same day. Is it really just a coincidence? | |
| ▲ | dlenski 3 hours ago | parent | prev | next [-] | | > Are these major issues with cloud/SaaS tools becoming more common, or is it just that they get a lot more coverage now? I think that "more coverage" is part of it, but also "more centralization." More and more of the web is centralized around a tiny number of cloud providers, because it's just extremely time-intensive and cost-prohibitive for all but the largest and most specialized companies to run their own datacenters and servers. Three specific examples: Netflix and Dropbox do run their own datacenters and servers; Strava runs on AWS. > If it's becoming more common, what are the reasons? I can think of a few, but I don't know the answer, so if anyone in-the-know has insight I'd appreciate it. I worked at AWS from 2020-2024, and saw several of these outages so I guess I'm "in the know." My somewhat-cynical take is that a lot of these services have grown enormously in complexity, far outstripping the ability of their staff to understand them or maintain them: - The OG developers of most of these cloud services have moved on. Knowledge transfer within AWS is generally very poor, because it's not incentivized, and has gotten worse due to remote work and geographic dispersion of service teams. - Managers at AWS are heavily incentivized to develop "new features" and not to improve the reliability, or even security, of their existing offerings. (I discovered numerous security vulnerabilities in the very-well-known service that I worked for, and was regularly punished-rather-than-rewarded for trying to get attention and resources on this. It was a big part of what drove me to leave Amazon. I'm still sitting on a big pile of zero-day vulnerabilities in ______ and ______.) - Cloud services in most of the world are basically a 3-way oligopoly between AWS, Microsoft/Azure, and Google. The costs of switching from one provider to another are often ENORMOUS due to a zillion fiddly little differences and behavior quirks ("bugs"). It's not apparent to laypeople — or even to me — that any of these providers are much more or less reliable than the others. | |
▲ | averageRoyalty 4 hours ago | parent | prev | next [-] | | I suspect there is more tech out there. 20 years ago we didn't have smartphones. 10 years ago, 20mbit on mobile was a good connection. Gigabit is common now, infrastructure no longer has the hurdles it used to, AI makes coding and design much easier, phones are ubiquitous and usage of them at all times (in the movies, out at dinner, driving) has become super normalised. I suspect (although have not researched) that global traffic is up, by throughput but also by session count. This contributes to a lot more awareness. Slack being down wasn't impactful when most tech companies didn't use Slack. An AWS outage was less relevant when the 10 apps (used to be websites) you use most didn't rely on a single AZ in AWS or you were on your phone less. I think as a society it just has more impact than it used to. | |
| ▲ | myth_drannon 5 hours ago | parent | prev | next [-] | | Looking around, I noticed that many senior, experienced individuals were laid off, sometimes replaced by juniors/contractors without institutional knowledge or experience. That's especially evident in ops/support, where the management believes those departments should have a smaller budget. | |
| ▲ | sunshine-o 5 hours ago | parent | prev | next [-] | | 1/ Most of the big corporations moved to big cloud providers in the last 5 years. Most of them started 10 years ago but it really accelerated in the last 5 years.
So there is for sure more weight and complexity on cloud providers, and more impact when something goes wrong. 2/ Then we cannot expect big tech to stay as sharp as in the 2000s and 2010s. There was a time banks had all the smart people, then the telcos had them, etc. But people get older and too comfortable, layers of bad incentives and politics accumulate, and you just become a dysfunctional big mess. | |
| ▲ | swed420 4 hours ago | parent | prev | next [-] | | > B. If it's becoming more common, what are the reasons? Among other mentioned factors like AI and layoffs: mass brain damage caused by never-ending COVID re-infections. Since vaccines don't prevent transmission, and each re-infection increases the chances of long COVID complications, the only real protection right now is wearing a proper respirator everywhere you go, and basically nobody is doing that anymore. There are tons of studies to back this line of reasoning. | |
| ▲ | __MatrixMan__ 5 hours ago | parent | prev | next [-] | | I think it's cancer, and it's getting worse. | |
| ▲ | xmprt 5 hours ago | parent | prev [-] | | One possibility is increased monitoring. In the past, issues that happened weren't reported because they went under the radar. Whereas now, those same issues which only impact a small percentage of users would still result in a status update and postmortem. But take this with a grain of salt because it's just a theory and doesn't reflect any actual data. A lot of people are pointing to AI vibe coding as the cause, but I think more often than not, incidents happen due to poor maintenance of legacy code. But I guess this may be changing soon as AI written code starts to become "legacy" faster than regular code. | | |
| ▲ | Kostic 5 hours ago | parent [-] | | At least with GitHub it's hard to hide when you get "no healthy upstream" on a git push. |
|
|
|
| ▲ | captainkrtek 3 hours ago | parent | prev | next [-] |
| Reflecting on the last decade, with my career spanning big tech and startups, I've seen a common arc: Small and scrappy startup -> taking on bigger customers for greater profits / ARR -> re-architecting for "enterprise" customers and resiliency / scale -> more idealism in engineering -> profit chasing -> product bloat -> good engineers leave -> replaced by other engineers -> failures expand. This may be an acceptable lifecycle for individual companies as they each follow the destiny of chasing profits ultimately. Now picture it, though, for all the companies we've architected on top of (AWS, CloudFlare, GCP, etc.). Even these larger organizations are composed of multiple little businesses (eg: EC2 is effectively its own business, people-wise and money-wise). Having worked at a $big_cloud_provider for 7 yrs, I saw this internally on a service level. What started as a foundational service grew in scale and complexity, was architected for resiliency, and then slowly eroded its engineering culture to chase profits. Fundamental services became skeletons of their former selves, all while holding up the internet. There isn't a singular cause here, and I can't say I know what's best, but it's concerning as the internet becomes more centralized into a handful of players. tldr: how much of one's architecture and resiliency is built on the trust of "well (AWS|GCP|CloudFlare) is too big to fail" or "they must be doing things really well"? The various providers are not all that different from other tech companies on the inside. Politics, pressure, profit seeking. |
| |
| ▲ | Esophagus4 2 hours ago | parent [-] | | Well said. I definitely agree (you’re absolutely right!) that the product will get worse through that re-architecting for enterprise transition. But the small product also would not be able to handle any real amount of growth as it was, because it was a mess of tech debt and security issues and manual one-off processes and fragile spaghetti code that only Jeff knows because he wrote it in a weekend, and now he’s gone. So by definition, if a service is large enough to serve a zillion people, it is probably big and bloated and complex. I’m not disagreeing with you, I liked your comment and I’m just rambling. I have worked with several startups and was surprised at how poorly their tech scaled (and how riddled with security issues they were) as we got into it. Nothing will shine a flashlight on all the stress cracks of a system like large-scale growth on the web. | | |
▲ | captainkrtek 2 hours ago | parent [-] | | > So by definition, if a service is large enough to serve a zillion people, it is probably big and bloated and complex. Totally agree with your take as well. I think the unfortunate thing is that there can exist a "Goldilocks zone" for this, where the service is capable of serving a zillion people AND is well architected. Unfortunately it can't seem to last forever. I saw this in my career. More product SKUs were developed, new features/services defined by non-technical PMs, MBAs entered the chat, sales became the new focus over availability, and the engineering culture that made this possible eroded day by day. The years I worked in this "Goldilocks zone" I'd attribute to: - strong technical leadership at the SVP+ level that strongly advocated for security, availability, then features (in that order). - a strong operational culture. Incidents were exciting internally, post mortems shared at a company wide level, no matter how small. - recognition for the engineers who chased ambulances and kept things running beyond their normal job; this inspired others to follow in their footsteps. |
|
|
|
| ▲ | chrsstrm 5 hours ago | parent | prev | next [-] |
| I thought I was going crazy when I couldn't push changes but now it seems it's time to just call it for the day. Back at it tomorrow. |
| |
| ▲ | Mossly 5 hours ago | parent | next [-] | | Seeing auth succeed but push fail was an exercise in hair pulling. | |
| ▲ | curioussquirrel 5 hours ago | parent | prev | next [-] | | Same, even started adding new ssh keys to no avail... (I was getting some nondescript user error first, then unhealthy upstream) | | |
| ▲ | chrsstrm 4 hours ago | parent [-] | | Would love to see a global counter for the number of times ‘ssh -T git@github.com’ was invoked. |
| |
| ▲ | peciulevicius 5 hours ago | parent | prev [-] | | same, i've started pulling my hair out, was about to nuke my setup and set it up all from scratch | | |
| ▲ | keepamovin 5 hours ago | parent [-] | | lol same. Hilarious when this shit goes down that we all rely on like running water. I'm assuming GitHub was hacked by the NSA because someone uploaded "the UFO files" or sth. |
|
|
|
| ▲ | kennysmoothx 5 hours ago | parent | prev | next [-] |
| FYI in an emergency you can edit files directly on GitHub without the need to use git. Edit: ugh... if you rely on GH Actions for workflows, though, actions/checkout@v4 is also currently experiencing the git issues, so no dice if you depend on that. |
| |
| ▲ | ruuda 5 hours ago | parent | next [-] | | FYI in an emergency you can `git push` to and `git pull` from any SSH-capable host without the need to use GitHub. | | |
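A sketch of that fallback, with the hostname and paths made up for illustration; any box you can SSH into that has git installed will do:
# on the SSH host: create an empty bare repository to receive pushes
ssh user@backup.example.com 'git init --bare repos/app.git'
# locally: add it as an extra remote, then push all branches and tags there
git remote add backup user@backup.example.com:repos/app.git
git push backup --all
git push backup --tags
# collaborators with an account on the same host can pull from it directly
git pull backup main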
| ▲ | cluckindan 5 hours ago | parent | next [-] | | FYI in an emergency you can SSH to your server and edit files and the DB directly. Where is your god now, proponents of immutable filesystems?! | | |
| ▲ | egeozcan 5 hours ago | parent | next [-] | | FYI in an emergency, you can buy a plane ticket and send someone to access the server directly. I actually had the privilege of being sent to the server. | | |
▲ | noir_lord 4 hours ago | parent [-] | | Had a coworker have to drive across the country once to hit a power button (many years ago). Because my suggestion that they have a spare ADSL connection for out-of-channel stuff was an unnecessary expense... Till he broke the firewall, knocked a bunch of folks offline across a huge physical site, and locked himself out of everything. The spare line got fitted the next month. |
| |
| ▲ | BadBadJellyBean 5 hours ago | parent | prev [-] | | I love when people do that because they always say "I will push the fix to git later". They never do and when we deploy a version from git things break. Good times. I started packing things into docker containers because of that. Makes it a bit more of a hassle to change things in production. | | |
▲ | noir_lord 4 hours ago | parent [-] | | Depends on the org; at the big ones I've worked for, regular devs, even seniors, don't have anything like the level of access to be able to pull a stunt like that. At the largest place I did have prod creds for everything, because sometimes they are necessary and I had the seniority (sometimes you do need them in an "oh crap" scenario). They were all set up on a second account on my work Mac which had a "danger, Will Robinson" wallpaper, because I know myself: it's far, far too easy to mentally fat-finger when you have two sets of creds. |
|
| |
| ▲ | lenerdenator 5 hours ago | parent | prev [-] | | I'm actually getting "ERROR: no healthy upstream" on `git pull`. They done borked it good. | | |
| ▲ | avree 4 hours ago | parent [-] | | If your remote is set to a git@github.com remote, it won't work. They're just pointing out that you could use git to set origin/your remote to a different ssh capable server, and push/pull through that. |
|
| |
| ▲ | rco8786 5 hours ago | parent | prev | next [-] | | Yup, we were just trying to hotfix prod and ran into this. What is happening to the internet lately. | |
| ▲ | shrikant 5 hours ago | parent | prev | next [-] | | We're not using Github Actions, but CircleCI is also failing git operations on Github (it doesn't recognise our SSH keys). | |
▲ | vielite1310 5 hours ago | parent | prev | next [-] | | True that, and this time GitHub's AI actually has a useful answer: check githubstatus.com | |
| ▲ | lopatin 5 hours ago | parent | prev [-] | | Can you create a branch through GitHub UI? | | |
| ▲ | hobofan 5 hours ago | parent [-] | | Yes. Just start editing a file and when you hit the "commit changes" button it will ask you what name to use for the branch. |
|
|
|
| ▲ | _jab 5 hours ago | parent | prev | next [-] |
| GitHub is pretty easily the most unreliable service I've used in the past five years. Is GitLab better in this regard? At this point my trust in GitHub is essentially zero - they don't deserve my money any longer. |
| |
▲ | ecshafer 5 hours ago | parent | next [-] | | We self-host GitLab, so it's very stable. But GitLab also kind of is enterprise software. It hits every feature checkbox, but the features aren't well integrated, and they are kind of halfway done. I don't think it's as smooth of an experience as GitHub personally, or as feature rich. But GitLab can self-host your project repos, CI/CD, issues, wikis, etc., and it does it at least okay. | |
| ▲ | input_sh 4 hours ago | parent [-] | | I would argue GitLab CI/CD is miles ahead of the dumpster fire that is GitHub Actions. Also the homepage is actually useful, unlike GitHub's. |
| |
▲ | tottenhm 4 hours ago | parent | prev | next [-] | | Frequently use both `github.com` and self-hosted Gitlab. IMHO, it's just... different. Self-hosted Gitlab periodically blocks access for auto-upgrades; Github.com upgrades are usually invisible.
Github.com is periodically hit with the broad/systemic cloud outage; self-hosted Gitlab is more decentralized infra, so you don't have the systemic outages.
With self-hosted Gitlab, you'll likely have to deal with rude bots on your own; Github.com has an ops team that deals with the rude bots.
I'm sure the list goes on. (shrug) | |
| ▲ | noosphr 5 hours ago | parent | prev | next [-] | | You can make it as reliable as you want by hosting it on prem. | | |
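The single-box starting point for that is fairly small; a sketch using the official GitLab CE Docker image (hostname, ports, and volume paths are placeholders, and a production setup still needs TLS, backups, and runners on top):
# GitLab CE in one container, with config/logs/data kept on the host
docker run -d --name gitlab --hostname gitlab.internal \
  -p 80:80 -p 443:443 -p 2222:22 \
  -v /srv/gitlab/config:/etc/gitlab \
  -v /srv/gitlab/logs:/var/log/gitlab \
  -v /srv/gitlab/data:/var/opt/gitlab \
  gitlab/gitlab-ce:latest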
▲ | jakub_g 4 hours ago | parent | next [-] | | > as reliable as you want We self-host GitLab but the team owning it is having a hard time scaling it. From my understanding talking to them, the design of gitaly makes it very hard to scale beyond a certain repo size and number of pushes per day (for reference: our repos are GBs in size, ~1M commits, hundreds of merges per day) | |
▲ | themafia 5 hours ago | parent | prev [-] | | Flashbacks to me pushing hard for GitLab self-hosting a few months ago. The rest of the team did not feel the lift was worth it. I utterly hate being at the mercy of a third party with an afterthought of a "status page" to stare at. | |
| |
| ▲ | geoffbp an hour ago | parent | prev | next [-] | | Gitlab has regular issues (we use Saas) and the support isn’t great. They acknowledge problems, but the same ones happen again and again. It’s very hard to get anything on their roadmap etc. | |
| ▲ | jakub_g 4 hours ago | parent | prev | next [-] | | My company self-hosts GitLab. Gitaly (the git server) is a weekly source of incidents, it doesn't scale well (CPU/memory spikes which end up taking down the web interface and API). However we have pretty big monorepos with hundreds of daily committers, probably not very representative. | |
| ▲ | yoyohello13 5 hours ago | parent | prev | next [-] | | We've been self hosting GitLab for 5 years and it's the most reliable service in our organization. We haven't had a single outage. We use Gitlab CI and security scanning extensively. | | |
▲ | markbnj 4 hours ago | parent [-] | | Ditto, self-hosted for over eight years at my last job. SCM server and 2-4 runners depending on what we needed. Very impressive stability, and when we had to upgrade, their "upgrade path" tooling was a huge help. |
| |
| ▲ | loloquwowndueo 5 hours ago | parent | prev | next [-] | | Forgejo, my dudes. | | | |
| ▲ | JonChesterfield 5 hours ago | parent | prev | next [-] | | Couldn't log into it this morning when cloudflare was down so there's that. | |
| ▲ | cactusfrog 4 hours ago | parent | prev | next [-] | | There’s this Gitlab incident https://www.youtube.com/watch?v=tLdRBsuvVKc | |
| ▲ | tapoxi 4 hours ago | parent | prev [-] | | Another GitLab self-hosting user here, we've run it on Kubernetes for 6 years. It's never gone down for us, maybe an hour of downtime yearly as we upgrade Postgres to a new version. |
|
|
| ▲ | cjonas 5 hours ago | parent | prev | next [-] |
| I didn't really want to work today anyways. First cloudflare, now this... Seems like a sign to get some fresh air |
| |
▲ | dlahoda 5 hours ago | parent [-] | | we depend too much on centralized US tech. we need more sovereignty and decentralization. | |
| ▲ | worldsavior 5 hours ago | parent | next [-] | | How is this related to them being located in the USA? | |
| ▲ | lorenzleutgeb 5 hours ago | parent | prev | next [-] | | Please check out radicle.dev, helping hands always welcome! | | |
| ▲ | letrix 4 hours ago | parent [-] | | > Repositories are replicated across peers in a decentralized manner You lost me there | | |
| ▲ | hungariantoast 4 hours ago | parent [-] | | "Replicated across peers in a decentralized manner" could just as easily be written about regular Git. Radicle just seems to add a peer-to-peer protocol on top that makes it less annoying to distribute a repository. So I don't get why the project has "lost you", but I also suspect you're the kind of person any project could readily afford to lose as a user. | | |
| ▲ | lorenzleutgeb 3 hours ago | parent [-] | | What this is trying to say:
- "peers": participants in the network are peers, i.e. both ends of a connection run the same code, in contrast to a client-and-server architecture, where both sides often run pretty different code. To exemplify: The code GitHub's servers run is very different from the code that your IDE with Git integration runs.
- "replicated across peers": the Git objects in the repository, and "social artifacts" like discussions in issues and revisions in patches, is copied to other peers. This copy is kept up to date by doing Git fetches for you in the background.
- "in a decentralized manner": Every peer/node in the network gets to locally decide which repositories they intend to replicate, i.e. you can talk to your friends and replicate their cool projects. And when you first initialize a repository, you can decide to make it public (which allows everyone to replicate it), or private (which allows a select list of nodes identified by their public key to replicate). There's no centralized authority which may tell you which repositories to replicate or not. I do realize that we're trying to pack quite a bit of information in this sentence/tagline. I think it's reasonably well phrased, but for the uninitiated might require some "unpacking" on their end. If we "lost you" on that tagline, and my explanation or that of hungariantoast (which is correct as well) helped you understand, I would appreciate if you could criticize more constructively and suggest a better way to introduce these features in a similarly dense tagline, or say what else you would think is a meaningful but short explanation of the project. If you don't care to do that, that's okay, but Radicle won't be able to improve just based on "you lost me there". In case you actually understood the sentence just fine and we "lost you" for some other reason, I would appreciate if you could elaborate on the reason. |
|
|
| |
| ▲ | CivBase 5 hours ago | parent | prev [-] | | The sad part is both the web and git were developed as decentralized technologies, both of which we foolishly centralized later. The underlying tech is still decentralized, but what good does that do when we've made everything that uses it dependent on a few centralized services? |
|
|
|
| ▲ | lol768 5 hours ago | parent | prev | next [-] |
| > We are seeing failures for some git http operations and are investigating It's not just HTTPS, I can't push via SSH either. I'm not convinced it's just "some" operations either; every single one I've tried fails. |
| |
| ▲ | deathanatos 2 hours ago | parent | next [-] | | I'm convinced the people who write status pages are incapable of escaping the phrasing "Some users may be experiencing problems". Too much attempting to save face by PR types, instead of just being transparent with information (… which is what would actually save face…) And that's if you get a status page update at all. | |
| ▲ | olivia-banks 5 hours ago | parent | prev [-] | | A friend of mine was able to get through a few minutes ago, apparently. Everyone else I know is still fatal'ing. |
|
|
| ▲ | JonChesterfield 2 hours ago | parent | prev | next [-] |
| What's the local workaround for this? Git is distributed; it should be possible to put something between our servers and github which pulls from github when it's running and otherwise serves whatever it used to have. A cache of some sort. I've found the five year old https://github.com/jonasmalacofilho/git-cache-http-server which is the same sort of idea. I've run a git instance on a local machine which I pull from, where a cron job fetches from upstream into it, which solved the problem of cloning llvm over a slow connection, so it's doable on a per-repo basis. I'd like to replace it globally though because CI looks like "pull from loads of different git repos" and setting it up once per-repo seems dreadful. Once per github/gitlab would be a big step forward. |
|
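A per-repo version of that cache is just a bare mirror plus a cron refresh; the hostname and paths below are made up, and the insteadOf rewrite at the end is one way to approximate the "once per github" redirect on CI machines:
# on the cache host: one bare mirror per upstream repository
git clone --mirror https://github.com/llvm/llvm-project.git /srv/mirror/llvm/llvm-project.git
# cron entry: refresh every 10 minutes; when GitHub is down, the last fetched copy keeps being served
*/10 * * * * git -C /srv/mirror/llvm/llvm-project.git remote update --prune >/dev/null 2>&1
# on CI machines: transparently rewrite GitHub URLs to point at the cache
git config --global url."git@cache.internal:/srv/mirror/".insteadOf "https://github.com/"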
| ▲ | shooker435 5 hours ago | parent | prev | next [-] |
| https://www.githubstatus.com/incidents/5q7nmlxz30sk it's up now (the incident, not the outage) |
|
| ▲ | h4kunamata 36 minutes ago | parent | prev | next [-] |
| With Microsoft (behind GitHub) going full AI mode, expect things to get worse. I worked for one of the largest companies in my country; they had a "catch-up" with GitHub and it is no longer about GitHub as you folks are used to, but about AI, aka Copilot. We are seeing major tech companies such as (but not limited to) Google, AWS and Azure struggling after making public that their code is 30% AI generated (Google). Even Xbox (Microsoft) and its gaming studios got destroyed (COD BO7) for heavy dependency on AI. Don't you find it a coincidence that all of these worldwide system outages are happening right after they proudly shared their heavy dependency on AI? Companies aren't using AI/ML to improve processes but to replace people, full stop.
The AI stock market is having a massive meltdown as we speak, with indications that the AI bubble is bursting. If you as a company wanna keep your productivity at 99.99% from now on:
* GitLab: Self-hosted GitLab/runners
* Datacenter: AWS/GCP/Azure is no longer a safer or cheaper option; there are data center companies such as Equinix which have massive backup plans in place.
I have visited one; they are prepared for a nuclear war, and I am not even being dramatic. If I were starting a new company in 2025, I would go back to a datacenter over AWS/GCP/Azure.
* Self-host everything you can, and no, it does not require 5 days in the office to manage all of that. |
| |
| ▲ | amazingman 33 minutes ago | parent [-] | | I didn't see a case made for self-hosting as the better option, instead I see that proposition being assumed true. Why would it be better for my company to roll its own CI/CD? |
|
|
| ▲ | laurentiurad 5 hours ago | parent | prev | next [-] |
| A lot of failures lately during the aI ReVoLuTiOn. |
| |
|
| ▲ | personjerry 5 hours ago | parent | prev | next [-] |
| Looks like Gemini 3 figured out the best way to save costs on its compute time was to shut down github! |
|
| ▲ | arbol 5 hours ago | parent | prev | next [-] |
| I'm also getting this. Cannot pull or push but can authenticate with SSH:
myrepo git:(fix/context-types-settings) gp
ERROR: user:1234567:user
fatal: Could not read from remote repository.
myrepo git:(fix/context-types-settings) ssh -o ProxyCommand=none git@github.com
PTY allocation request failed on channel 0
Hi user! You've successfully authenticated, but GitHub does not provide shell access.
Connection to github.com closed.
|
| |
|
| ▲ | OptionOfT 5 hours ago | parent | prev | next [-] |
| It is insane how many failures we've been getting lately, especially related to actions. * jobs not being picked up
* jobs not being able to be cancelled
* jobs running but showing up as failed
* jobs showing up as failed but not running
* jobs showing containers as pushed successfully to GitHub's registry, but then we get errors while pulling them
* ID token failures (E_FAIL) and timeouts.
I don't know if this is related to GitHub moving to Azure, or because they're allowing more AI generated code to pass through without proper reviews, or something else, but as a paying customer I am not happy. |
| |
|
| ▲ | bhouston 5 hours ago | parent | prev | next [-] |
| I cannot push/pull to any repos. Scared me for a second, but of course I then checked here. |
|
| ▲ | sgreene570 5 hours ago | parent | prev | next [-] |
| github has had a few of these as of late, starting to get old |
| |
| ▲ | baq 5 hours ago | parent | next [-] | | Remember talking about the exact same thing with very similar wording sometime pre-COVID | |
| ▲ | cluckindan 5 hours ago | parent | prev [-] | | MSFT intentionally degrading operations to get everyone to move onto Azure… oh, wait, they just moved GitHub there, carry on my wayward son! | | |
| ▲ | blasphemers 4 hours ago | parent [-] | | GitHub hasn't been moved onto azure yet, they just announced it's their goal to move over in 2026 |
|
|
|
| ▲ | JLCarveth 5 hours ago | parent | prev | next [-] |
| The last outage was a whole 5 days ago https://news.ycombinator.com/item?id=45915731 |
| |
| ▲ | jmclnx 5 hours ago | parent [-] | | Didn't I hear github is moving to Microsoft Azure ? I wonder if these outages are related to the move. Remember hotmail :) | | |
| ▲ | bhouston 5 hours ago | parent [-] | | Huh? What were they on before? The acquisition by MSFT was 7 years ago; did they maintain their own infrastructure for that long? | |
|
|
|
| ▲ | consumer451 5 hours ago | parent | prev | next [-] |
| We live in a house of cards. I hope that eventually people in power realize this. However, their incentive structures do not seem to be a forcing function for that eventuality. I have been thinking about this a lot lately. What would be a tweak that might improve this situation? |
| |
| ▲ | sznio 3 hours ago | parent | next [-] | | Not exactly for this situation, but I've been thinking about distributed caching of web content. Even if a website is down, someone somewhere most likely has it cached. Why can't I read it from their cache? If I'm trying to reach a static image file, why do I have to get it from the source? I guess I want torrent DHT for the web. | | |
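A rough approximation of that idea works today with a centralized cache rather than a DHT: when the origin is unreachable, fall back to the Internet Archive's Wayback availability API. A minimal sketch, assuming curl and jq are installed and using a hypothetical asset URL:

  # Try the origin first; if it's down, look up the closest archived snapshot.
  URL="https://example.com/static/logo.png"
  curl -fsSL "$URL" -o asset.png || {
    SNAPSHOT=$(curl -fsSL "https://archive.org/wayback/available?url=$URL" \
      | jq -r '.archived_snapshots.closest.url // empty')
    [ -n "$SNAPSHOT" ] && curl -fsSL "$SNAPSHOT" -o asset.png
  }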
| ▲ | consumer451 an hour ago | parent [-] | | That is genuinely interesting. But, let's put all "this nerd talk" into terms that someone in the C-suite could understand. How can C-suite stock RSU/comp/etc be tweaked to make them give a crap about any of this? |
| |
| ▲ | mistercheph 5 hours ago | parent | prev [-] | | Using p2p or self hosted, and accepting the temporary tradeoffs of no network effects. |
|
|
| ▲ | agnishom 3 hours ago | parent | prev | next [-] |
| (epistemic status: frustrated, irrational) This is what happens when they decide that all the budget should be spent on AI stuff rather than solid infra and devops |
|
| ▲ | mendyberger 5 hours ago | parent | prev | next [-] |
| After restarting my computer, reinstalling git, almost ready to reinstall my os, I find out it's not even my fault |
|
| ▲ | keepamovin 5 hours ago | parent | prev | next [-] |
| It's weird to think all of our data lives on physical servers (not "in the cloud") that are fallible and made and maintained by fallible humans, and could fail at any moment. So long to all the data! Good ol' byzantine backups. |
|
| ▲ | whinvik 4 hours ago | parent | prev | next [-] |
| Haha, I don't know if it's a good test or not, but I could not figure out why git pull was failing, and Claude just went crazy trying so many random things. Gemini 3 Pro, after 3 random things, announced GitHub was the issue. |
|
| ▲ | silverwind 5 hours ago | parent | prev | next [-] |
| It's not only http, also ssh. |
|
| ▲ | netsharc 5 hours ago | parent | prev | next [-] |
| I remember a colleague setting up a CI/CD system (on an aaS obviously) depending on Docker, npm, and who knows what else... I thought "I wonder what % of time all those systems are actually up at the same time" |
|
| ▲ | dadof4 5 hours ago | parent | prev | next [-] |
| Same for me:
fatal: unable to access 'https://github.com/repository_example.git/': The requested URL returned error: 500 |
|
| ▲ | bstsb 5 hours ago | parent | prev | next [-] |
| A side effect that isn't immediately obvious: all raw.githubusercontent.com content responds with a "404: Not Found" response. This has broken a few pipeline jobs for me; it seems like they're underplaying this incident. |
| |
| ▲ | pm90 5 hours ago | parent [-] | | Yeah, something major is borked and they're unwilling to admit it. The status page initially claimed "https git operations are affected" when it was clear that SSH was too (it's updated to reflect that now). |
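For pipelines that pull files from raw.githubusercontent.com, as described above, one small mitigation is to pin a tag or commit and fall back to the jsDelivr mirror of the same ref when the raw endpoint is down. The repo, ref, and path below are hypothetical placeholders:

  REPO="someorg/sometool"; REF="v1.2.3"; FILE="scripts/install.sh"
  curl -fsSL "https://raw.githubusercontent.com/$REPO/$REF/$FILE" -o install.sh \
    || curl -fsSL "https://cdn.jsdelivr.net/gh/$REPO@$REF/$FILE" -o install.sh

Vendoring the pinned file into the repository avoids the network dependency entirely, at the cost of manual updates.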
|
|
| ▲ | olivia-banks 5 hours ago | parent | prev | next [-] |
| This is incredibly annoying. I've been trying to fix a deployment action on GitHub for the past bit, so my entire workflow for today has been push, wait, check... push, wait, check... et cetera. |
| |
|
| ▲ | futurestef 5 hours ago | parent | prev | next [-] |
| I wonder how much of this stuff has been caused by AI agents running on the infra? Claude Code is amazing for devops, until it kubectl deletes your ArgoCD root app |
|
| ▲ | shooker435 5 hours ago | parent | prev | next [-] |
| The internet is having one heck of a day! We focus on ecommerce technology, and I can't help but think our customers will be getting nervous pre-BFCM. |
|
| ▲ | 85392_school 5 hours ago | parent | prev | next [-] |
| Seeing "404: Not Found" for all raw files |
|
| ▲ | alexskr 5 hours ago | parent | prev | next [-] |
| Mandatory break time has officially been declared.
Please step away from your keyboard, hydrate, and pretend you were productive today. |
|
| ▲ | mysteria 5 hours ago | parent | prev | next [-] |
| Cloudflare this morning, and now this. A bunch of work isn't getting done today. Maybe this will push more places towards self-hosting? |
|
| ▲ | thinkindie 5 hours ago | parent | prev | next [-] |
| I really can't believe this. I had issues with CircleCI too earlier, soon after the Cloudflare incident was resolved. |
|
| ▲ | zackify 5 hours ago | parent | prev | next [-] |
| This is actually the 5th or 6th time this month. Actions have been degraded constantly, and now push and pull break back to back. |
|
| ▲ | randall 4 hours ago | parent | prev | next [-] |
| can’t go down is better than won’t go down. the problem isn’t with centralized internet services, the problem is a fundamental flaw with http and our centralized client server model. the solution doesn’t exist. i’ll build it in a few years if nobody else does. |
|
| ▲ | clbrmbr 5 hours ago | parent | prev | next [-] |
| Same issue here for me. Downdetector [1] agrees, and the GitHub status page was just updated. |
|
| ▲ | matkv 5 hours ago | parent | prev | next [-] |
| Just as I was wondering why my `git push` wasn't working all of a sudden :D |
|
| ▲ | stuffn 5 hours ago | parent | prev | next [-] |
| The centralized internet continues to show its wonderful benefits. At least Microsoft decided we all deserve a couple-hour break from work. |
|
| ▲ | brovonov 5 hours ago | parent | prev | next [-] |
| Good thing I already moved away from GitHub to a self-hosted Forgejo instance. |
|
| ▲ | sre2025 5 hours ago | parent | prev | next [-] |
| Why are there outages everywhere all the time now? AWS, Azure, GitHub, Cloudflare, etc. Is this the result of "vibe coding"? Because before "vibe coding", I don't remember having this many outages around the clock. Just saying. |
| |
| ▲ | elicash 5 hours ago | parent | next [-] | | I think it has more to do with layoffs. "Why do we need so many people to keep things running!?! We never have downtime!!" | | |
| ▲ | Refreeze5224 5 hours ago | parent | next [-] | | Which is the true reason for AI: reducing payroll costs. | |
| ▲ | themafia 5 hours ago | parent [-] | | This is the reason I detest those who push AI as a technological solution. AI as a field is interesting but highly immature; it's been overhyped to the point of absurdity, and now it is putting real negative pressure on wages. That pressure has carry-over effects, and I agree that we're starting to observe them. |
| |
| ▲ | brovonov 5 hours ago | parent | prev [-] | | Has to be a mix of both. |
| |
| ▲ | noosphr 5 hours ago | parent | prev | next [-] | | They fired a ton of employees with no rhyme or reason to cut costs, this was the predictable outcome. It will get worse if it ever gets better. The funny thing is that the over hiring during the pandemic also had the predictable result of mass lay-offs. Whoever manages HR should be the ones fired after two back to back disasters like this. | | | |
| ▲ | harshalizee 5 hours ago | parent | prev | next [-] | | Could also be that the hack-and-slash layoffs are starting to show their results.
Removing crucial personnel, teams spread thin, combined with low morale industrywide and you've got the perfect recipe for disaster. | |
| ▲ | nawgz 5 hours ago | parent | prev [-] | | AI use being pushed, team sizes being reduced, continued lack of care towards quality… enshittification marches on, gaining speed every day |
|
|
| ▲ | mepage 5 hours ago | parent | prev | next [-] |
| Seeing "ERROR: no healthy upstream" in push/pull operations |
|
| ▲ | Argonaut998 5 hours ago | parent | prev | next [-] |
| Mercury is in retrograde |
|
| ▲ | spapas82 5 hours ago | parent | prev | next [-] |
| Having a self-hosted Gitea server is a godsend in times like this! |
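For anyone wondering what that takes: a minimal single-node sketch with Docker (the port mappings and volume name are arbitrary choices, and backups/TLS are left out):

  # Gitea web UI on :3000, git-over-SSH on :2222.
  docker run -d --name gitea --restart always \
    -p 3000:3000 -p 2222:22 \
    -v gitea-data:/data \
    gitea/gitea:latest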
|
| ▲ | etchalon 5 hours ago | parent | prev | next [-] |
| What is today and who do I blame for it |
| |
| ▲ | baq 5 hours ago | parent [-] | | Computers are great at solving problems that wouldn’t have existed without computers | | |
|
|
| ▲ | kevinlajoye 5 hours ago | parent | prev | next [-] |
| Pain. My guess is that it has to do with the Cloudflare outage this morning. |
|
| ▲ | 0dayman 5 hours ago | parent | prev | next [-] |
| Github is down a lot... |
|
| ▲ | case0x 5 hours ago | parent | prev | next [-] |
| I wish I could say something smart such as “People/Organisations should host their own git servers”, but as someone who had the misfortune of doing that in the past, I'd rather have a non-functional GitHub. |
| |
| ▲ | Mossly 5 hours ago | parent | next [-] | | I've found Gitea to be pretty rock solid, at least for a small team. | | |
| ▲ | gelbphoenix 5 hours ago | parent [-] | | Would even recommend Forgejo (the same project Codeberg also uses as the base for their service) |
| |
| ▲ | mkreis 5 hours ago | parent | prev [-] | | I'm curious to learn from your mistakes, can you please elaborate what went wrong? |
|
|
| ▲ | swedishuser 5 hours ago | parent | prev | next [-] |
| Almost one hour down now. What sets this apart from the recent AWS and Cloudflare issues is that this one appears to be global. |
|
| ▲ | chazeon 5 hours ago | parent | prev | next [-] |
| Seems images on the GitHub web UI are also not showing. |
|
| ▲ | usui 4 hours ago | parent | prev | next [-] |
| It's working again now. |
|
| ▲ | ashishb 4 hours ago | parent | prev | next [-] |
| I have said this before, and I will say this again: GitHub stars[1] are the real lock-in for GitHub. That's why all open-core startups are always requesting you to "star them on GitHub". The VCs look at stars before deciding which open-core startup to invest in. The 4 or 5 9s of reliability simply do not matter as much. 1 - https://news.ycombinator.com/item?id=36151140 |
|
| ▲ | kennysmoothx 5 hours ago | parent | prev | next [-] |
| What a day... |
|
| ▲ | mrguyorama 5 hours ago | parent | prev | next [-] |
| I'm going to awkwardly bring up that we have avoided all GitHub downtime, bugs, and issues by simply not using GitHub. Our git server is hosted by Atlassian; I think we've had one outage in several years? Our self-hosted Jenkins setup is similarly robust: we've had a handful of hours of "can't build" in, again, several years. We are not a company made up of rockstars. We are not especially competent at infrastructure. None of the dev teams have ever had to care about our infrastructure (occasionally we read a wiki or ask someone a question). You don't have to live in this broken world. It's pretty easy not to. We had self-hosted Mercurial and Jenkins before we were bought by the megacorp, and the megacorp's version was even better and more reliable. Self-host. Stop pretending that ignoring complexity is somehow better. |
|
| ▲ | angrydev 5 hours ago | parent | prev | next [-] |
| Ton of people in the comments here wanting to blame AI for these outages. Either you are very new to the industry or have forgotten how frequently they happen. Github in particular was a repeat offender before the MS acquisition. us-east-1 went down many times before LLMs came about. Why act like this is a new thing? |
|
| ▲ | fidotron 5 hours ago | parent | prev | next [-] |
| It used to be having GitHub in the critical path for deployment wasn't so bad, but these days you'd have to be utterly irresponsible to work that way. They need to get a grip on this. |
| |
| ▲ | MattGaiser 5 hours ago | parent [-] | | Eh, the lesson from us-east-1 outage is that you should cling to the big ones instead. You get the convenience + nobody gets mad at you over the failure. | | |
| ▲ | bhouston 5 hours ago | parent [-] | | Everything will have periods of unreliability. The only solution is to be multi-everything (multi-provider for most things), but the costs for that are quite high, and it's hard to see the value. | |
| ▲ | dylan604 5 hours ago | parent [-] | | Yes, but if you are going to provide assurances like SLAs, you need to be aware of your own dependencies and allow for them. If your customers require working with known problem areas, you should add a clause exempting those areas when they are the cause. |
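On the multi-provider point above: git itself makes a cheap version of this easy, because a single remote can carry several push URLs, so every push also lands on a mirror you control. The second URL below is a hypothetical self-hosted mirror:

  # Once any explicit push URL is set, pushes go only to push URLs,
  # so re-add GitHub first and then the mirror.
  git remote set-url --add --push origin git@github.com:myorg/myrepo.git
  git remote set-url --add --push origin git@git.example.internal:myorg/myrepo.git
  git push origin main   # pushes to both hosts

Fetches still come from the original URL, so this only protects pushes; a scheduled mirror job would cover the read side.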
|
|
|
|
| ▲ | pyenvmanger 5 hours ago | parent | prev | next [-] |
| Git push and pull not working. Getting a 500 response. |
|
| ▲ | ssawchenko 5 hours ago | parent | prev | next [-] |
| Same. ERROR: no healthy upstream
fatal: Could not read from remote repository. |
|
| ▲ | imdsm 5 hours ago | parent | prev | next [-] |
| Cloudflare, GitHub... |
|
| ▲ | Aeroi 4 hours ago | parent | prev | next [-] |
| Just realized my world stops when GitHub does. |
|
| ▲ | theoldgreybeard 5 hours ago | parent | prev | next [-] |
| can the internet work for 5 minutes, please? |
|
| ▲ | mepage 5 hours ago | parent | prev | next [-] |
| Seeing "ERROR: no healthy upstream" in push/pull. |
|
| ▲ | SimoncelloCT 5 hours ago | parent | prev | next [-] |
| Same issue, and I need to complete my work :( |
|
| ▲ | whynotmaybe 5 hours ago | parent | prev | next [-] |
| We're gonna need the xkcd "Compiling" comic, but with "cloudflare||github||chatgpt||spotify down". https://xkcd.com/303/ |
|
| ▲ | dogman123 5 hours ago | parent | prev | next [-] |
| hell yea brother |
|
| ▲ | theideaofcoffee 4 hours ago | parent | prev | next [-] |
| Man, I sound like a broken record, but... Love that for them. How many more outages until people start to see that farming out every aspect of their operations maybe, might, could have a big effect on their overall business? What's the breaking point? Then again, the skills to run this stuff properly are getting rarer and rarer, so we'll probably see big incidents like this popping up more and more frequently as time goes on. |
|
| ▲ | arbol 4 hours ago | parent | prev | next [-] |
| It's back |
|
| ▲ | pyenvmanger 5 hours ago | parent | prev | next [-] |
| Git pull and push not working |
|
| ▲ | lenerdenator 5 hours ago | parent | prev | next [-] |
| It would be nice if this were actually broken down bit-by-bit after it happened, if only for paying customers of these cloud services. These companies are supposed to have the top people on site reliability. That these things keep happening and no one really knows why makes me doubt them. Alternatively, the takeaway for today: clearly, Man was not meant to have networked, distributed computing resources. We thought we could gather our knowledge and become omniscient, to be as the Almighty in our faculties. The folly. The hubris. The arrogance. |
|
| ▲ | WesolyKubeczek 5 hours ago | parent | prev | next [-] |
| So that’s how the Azure migration is going. |
|
| ▲ | smashah 5 hours ago | parent | prev | next [-] |
| Spooky day today on the internet. A huge CF outage, the Gemini 3 launch, and now I can't push anything to my repos. |
|
| ▲ | MattGaiser 5 hours ago | parent | prev | next [-] |
| https://www.githubstatus.com/incidents/5q7nmlxz30sk |
|
| ▲ | broosted 5 hours ago | parent | prev | next [-] |
| Can't do git pull or git push; 503 and 500 errors. |
|
| ▲ | saydus 5 hours ago | parent | prev | next [-] |
| Cherry on top will be another AWS outage |
| |
| ▲ | linsomniac 5 hours ago | parent [-] | | Funny you should say that, I'm here looking because our monitoring server is seeing 80-90% packet loss on our wireguard from our data center to EC2 Oregon... | | |
| ▲ | linsomniac 4 hours ago | parent [-] | | FYI: Not AWS. Been doing some more investigation, it looks like it's either at our data center, or something on the path to AWS, because if I fail over to our secondary firewall it takes a slightly different path both internally and externally, but the packet loss goes away. |
|
|
|
| ▲ | _pdp_ 5 hours ago | parent | prev | next [-] |
| Is it just me, or does there seem to be an increased frequency of these types of incidents as of late? |
| |
|
| ▲ | lherron 5 hours ago | parent | prev | next [-] |
| Gemini 3 = Skynet ? |
|
| ▲ | treeroots 5 hours ago | parent | prev | next [-] |
| What else is out there like GitHub? |
| |
|
| ▲ | broosted 5 hours ago | parent | prev | next [-] |
| Can't do git pull or push; 503 and 500 errors. |
|
| ▲ | projproj 5 hours ago | parent | prev [-] |
| Obviously just speculation, but maybe don't let AI write your code... Microsoft CEO says up to 30% of the company’s code was written by AI
https://techcrunch.com/2025/04/29/microsoft-ceo-says-up-to-3... |
| |
| ▲ | tauchunfall 5 hours ago | parent | next [-] | | It's degraded availability of Git operations. The enterprise cloud in the EU, US, and Australia has no issues. If you look at the incident history, disruptions have happened often in the public cloud for years already, since before AI wrote code for them. | |
| ▲ | TimTheTinker 5 hours ago | parent [-] | | The enterprise cloud runs on older stable versions of GitHub's backend/frontend code. |
| |
| ▲ | smsm42 5 hours ago | parent | prev | next [-] | | That sounds very bad, but I guess it depends also on which code it is. And whether Nadella actually knows what he's talking about, too. | |
| ▲ | dollylambda 5 hours ago | parent | prev | next [-] | | Maybe AI is the tech support too | |
| ▲ | Aloisius 5 hours ago | parent | prev | next [-] | | Sweet. 30% of Microsoft's code isn't protected by copyright. Time to leak that. | |
| ▲ | angrydev 5 hours ago | parent | prev [-] | | What a ridiculous comment, as if these outages didn't happen before LLMs became more commonplace. | | |
| ▲ | projproj 2 hours ago | parent | next [-] | | I admit it was a bit ridiculous. However, if Microsoft is going to brag about how much AI code they are using but not also brag about how good the code is, then we are left to speculate. The two outages in two weeks are _possible_ data points and all we have to go on unless they start providing data. | |
| ▲ | malfist 5 hours ago | parent | prev [-] | | What a ridiculous comment, as if these outages haven't been increasing in quantity since LLMs became more commonplace |
|
|