There's a lot of software in between Air Traffic Controller and Facebook. And honestly would Meta be okay with Instagram or Facebook going down even for just a few minutes? I'd think at this point that'd be considered a fairly severe incident.

Even if we ignore criticality, things just get really messy and confusing if you push a bunch of broken stuff and only try to start understanding what's actually going on after it's already causing issues.

▲

bdangubic 4 hours ago | parent [-]

> And honestly would Meta be okay with Instagram or Facebook going down even for just a few minutes?

sure, they coined the term “move fast and break things”

and not every “bug” brings the system down, there is bugs after bugs after bugs in both facebook and insta being pushed to production daily, it is fine… it is (almost) always fine. if you are at a place where “deploying to production” is a “thing” you better be at some super mission-critical-lives-at-stake project or you should find another project to work on.

▲

pixl97 4 hours ago | parent | next [-]

> there is bugs after bugs after bugs

These are the bugs after bugs after bugs after bugs after bugs.

Simply put they are going through dev, QA, and UAT first before they are the bugs that we see. When you're running an organization using software of any size writing bugs that takes the software down is extremely easy, data corruption even easier.

▲

bdangubic 3 hours ago | parent [-]

I wholeheartedly agree. I just don't agree with:

> We live in a world where every line of code written by a human should be reviewed by another human. We can't even do that! Nothing should go straight to prod ever, ever ever, ever

Things should 100% go to prod whenever they need to go to prod. While this in theory makes sense, there is insane amount of ceremony in large number of places I have seen personally where it takes an act of congress to deploy to production all the while it is just ceremony, people are hunting other people with links to PR sent to various slack channels "hey anyone available to take a look at this" and then someone is like "I know nothing about that service/system but I'll look at approve." I would wager a high wager that this "we must review every line of code" - where actually implemented - is largely a ceremony. Today I deployed three services to production without anyone looking at what I did. Deploying to production should absolutely be a non-event in places that are ran well and where right people are doing their jobs.

	▲	alecbz 3 hours ago \| parent \| next [-]
		I'm sure some companies do this poorly but there's lots of places where code review happens on every PR and there's processes and systems in place to make sure it's an easy process (or at least, as easy as it should be). Many large tech companies have things pushed to prod automatically many, many times per day and still have code review for all changes going out.
	▲	fragmede 3 hours ago \| parent \| prev [-]
		Even with code review, a well configured CI/CD system is going to include a wealth of automated unit and integration tests, and then also a complex deploy system involving canaries and ramp-up and blue/green deployment and flags and monitoring and alerts that's backed by a pager and on-call rotation with runbooks. Code review simply will never be perfect and catch 100% of issues, so systems are designed with that in mind. So then then question is what's actually reasonable given today's code generating tools? 0% review seems foolish but 100% seems similarly unreal. Automated code review systems like CodeRabbit are, dare I even say, reasonable as a first line of defense these days. It all comes down too developer velocity balanced with system stability. Error budgets like Google's SRE org is able to enforce against (some) services they support are one way of accomplishing that, but those are hard to put into practice. So then, as you say, it takes an act of Congress to get anything deployed. So in the abstract, imo it all comes down to the quality of the automated CI/CD system, and developers being on call for their service so they feel the pain of service unreliability and don't just throw code over the wall. But it's all talk at this level of abstraction. The reality of a given company's office politics and the amount of leverage the platform teams and whatever passes for SRE there have vs the rest of the company make all the difference.

▲

alecbz 3 hours ago | parent | prev [-]

>sure, they coined the term “move fast and break things”

Yeah I'm aware, but as any company gets larger and has more and more traffic (and money) dependent on their existing systems working, keeping those systems working becomes more and more important.

There's lots of things worth protecting to ensure that people keep using your product that fall short of "lives are at stake". Of course it's a spectrum but lots of large enterprises that aren't saving lives but still care a lot about making sure their software keeps running.