flaminHotSpeedo 3 hours ago

What's the culture like at Cloudflare re: ops/deployment safety?

They saw errors related to a deployment, and because it was related to a security issue, instead of rolling it back they decided to make another deployment with global blast radius?

Not only did they fail to apply the deployment safety 101 lesson of "when in doubt, roll back" but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.

Pure speculation, but to me it sounds like there's more to the story. This sounds like the sort of cowboy decision a team makes when they've either already broken all the rules or weren't following them in the first place.

dkyc 2 hours ago | parent | next [-]

One thing to keep in mind when judging what's 'appropriate' is that Cloudflare was effectively responding to an ongoing security incident outside of their control (the React Server RCE vulnerability). Part of Cloudflare's value proposition is being quick to react to such threats. That changes the equation a bit: every hour you wait to deploy, your customers are actively getting hacked through a known high-severity vulnerability.

In this case it's not just a matter of 'hold back for another day to make sure it's done right', like when adding a new feature to a normal SaaS application. In Cloudflare's case moving slower also comes with a real cost.

That isn't to say it didn't work out badly this time, just that the calculation is a bit different.

flaminHotSpeedo 2 hours ago | parent | next [-]

To clarify, I'm not trying to imply that I definitely wouldn't have made the same decision, or that cowboy decisions aren't ever the right call.

However, this preliminary report doesn't really justify the decision to use the same deployment system responsible for the 11/18 outage. Deployment safety should have been the focus of this report, not the technical details. The question I want answered isn't "are there bugs in Cloudflare's systems", it's "has Cloudflare learned from its recent mistakes to respond appropriately to events".

vlovich123 an hour ago | parent [-]

> doesn't really justify the decision to use the same deployment system responsible for the 11/18 outage

There’s no other deployment system available. There’s a single system for config deployment, and it's all that was available, as they haven't finished the progressive rollout implementation yet.

edoceo 19 minutes ago | parent [-]

Ok, sure. But shouldn't they have some beta/staging/test area they could deploy to, run tests for an hour, and then do the global blast?
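For what it's worth, a staged rollout is usually shaped something like the sketch below. This is purely illustrative (nothing here is from Cloudflare's actual deployment system, which the post doesn't describe), written in Lua only because that's the language of the proxy code quoted from TFA further down the thread:

  -- Hypothetical staged rollout plan: canary first, bake for a while,
  -- check an error budget, and only then widen the blast radius.
  local rollout_plan = {
    { name = "canary",  traffic_pct = 1,   bake_minutes = 60 },
    { name = "partial", traffic_pct = 10,  bake_minutes = 60 },
    { name = "global",  traffic_pct = 100, bake_minutes = 0  },
  }

  -- Halt (and roll back) the moment a stage exceeds its error budget.
  local function should_proceed(observed_error_rate, error_budget)
    return observed_error_rate <= error_budget
  end

An hour of bake time on a canary slice is exactly the "test area" being asked for here, and per the post quoted further down, it is exactly the work Cloudflare says it has not finished deploying yet.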

Already__Taken 2 hours ago | parent | prev | next [-]

The CVE isn't a zero day though; how come Cloudflare weren't at the table for early disclosure?

flaminHotSpeedo 2 hours ago | parent [-]

Do you have a public source about an embargo period for this one? I wasn't able to find one

charcircuit an hour ago | parent | next [-]

Considering there were patched libraries at the time of disclosure, those libraries' authors must have been informed ahead of time.

Pharaoh2 an hour ago | parent | prev [-]

https://react.dev/blog/2025/12/03/critical-security-vulnerab...

Privately disclosed: Nov 29
Fix pushed: Dec 1
Publicly disclosed: Dec 3

drysart an hour ago | parent [-]

Then even in the worst-case scenario, they were addressing this issue two days after it was publicly disclosed. So this wasn't a "rush to fix the zero day ASAP" scenario, which makes it harder to justify ignoring errors that started occurring in a small-scale rollout.

udev4096 2 hours ago | parent | prev [-]

Clownflare did what it does best, mess up and break everything. It will keep happening again and again

toomuchtodo an hour ago | parent [-]

Indeed, but it is what it is. Cloudflare comes out of my budget, and even with downtime, it's better than not paying them. Do I want to deal with what Cloudflare offers myself? I do not; I have higher-value work to focus on. I want to pay someone else to deal with this, and just like when cloud providers are down, it'll be back up eventually. Grab a coffee or beer and hang; we aren't saving lives, we're just building websites. This is not laziness or nihilism, but simply being rational and pragmatic.

liampulles 2 hours ago | parent | prev | next [-]

Rollback is a reliable strategy when the rollback process is well understood. If a rollback process is not well known and well practiced, then it is a risk in itself.

I'm not sure of the nature of the rollback process in this case, but leaning on ill-founded assumptions is a bad practice. I do agree that a global rollout is a problem.

newsoftheday 39 minutes ago | parent [-]

Rollback carries an implicit assumption of complete atomicity; without that, it's only slightly better than a yeet. It's similar to untested backups.

crote an hour ago | parent | prev | next [-]

> They saw errors related to a deployment, and because it was related to a security issue, instead of rolling it back they decided to make another deployment with global blast radius?

Note that the two deployments were of different components.

Basically, imagine the following scenario: a patch for a critical vulnerability gets released; during rollout you get a few reports of it making the screensaver show a corrupt video buffer; you roll out a GPO to use a blank screensaver instead of the intended corporate branding; and then a crash in a script parsing the GPOs on this new value prevents users from logging in.

There's no direct technical link between the two issues. A mitigation of the first one merely exposed a latent bug in the second one. In hindsight it is easy to say that the right approach is obviously to roll back, but in practice a roll forward is often the better choice - both from an ops perspective and from a safety perspective.

Given the above scenario, how many people are genuinely willing to do a full rollback, file a ticket with Microsoft, and hope they'll get around to fixing it some time soon? I think in practice the vast majority of us will just look for a suitable temporary workaround instead.

lukeasrodgers 2 hours ago | parent | prev | next [-]

Roll back is not always the right answer. I can’t speak to its appropriateness in this particular situation of course, but sometimes “roll forward” is the better solution.

flaminHotSpeedo 2 hours ago | parent | next [-]

Like the other poster said, roll back should be the right answer the vast majority of the time. But it's also important to recognize that roll forward should be a replacement for the deployment you decided not to roll back, not a parallel deployment through another system.

I won't say never, but a situation where the right answer (to avoid a rollback that it sounds like was technically fine to do, just undesirable from a security/business perspective) is a parallel deployment through a radioactive, global-blast-radius, near-instantaneous deployment system that is under intense scrutiny after another recent outage should be about as probable as a bowl of petunias in orbit.

crote 41 minutes ago | parent [-]

Is a roll back even possible at Cloudflare's size?

With small deployments it usually isn't too difficult to re-deploy a previous commit. But once you get big enough, you've got enough developers that half a dozen PRs will have been merged between the start of the incident and now. How viable is it to stop the world, undo everything, and start from scratch any time a deployment causes the tiniest issue?

Realistically the best you're going to get is merging a revert of the problematic changeset - but with the intervening merges that's still going to bring the system into a novel state. You're rolling forwards, not backwards.

newsoftheday 37 minutes ago | parent | next [-]

If companies like Cloudflare haven't figured out how to do reliable rollbacks, there seems little hope for any of us.

yuliyp 30 minutes ago | parent | prev [-]

I'd presume they have the ability to deploy a previous artifact vs only tip-of-master.

echelon 2 hours ago | parent | prev [-]

You want to build a world where rolling back is the right thing to do 95% of the time, so that it almost always works and you don't even have to think about it.

During an incident, the incident lead should be able to say to your team's on-call: "can you roll back? If so, roll back", and the on-call engineer should know whether it's okay. By default it should be, if you're writing code mindfully.

Certain well-understood migrations are the only cases where roll back might not be acceptable.

Always keep your services in a "rollback-able", "graceful fail", "fail open" state.

This requires tremendous engineering consciousness across the entire org. Every team must be a diligent custodian of this. And even then, it will sometimes break down.

Never make code changes you can't roll back from (service calls, data write formats, etc.) without good reason and without informing the team.

I've been in the line of billion dollar transaction value services for most of my career. And unfortunately I've been in billion dollar outages.
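To make the "data write formats" point concrete, here's a toy sketch (names and shapes are made up; Lua only to match the snippet quoted from TFA elsewhere in the thread) of a reader that tolerates both the old and the new record shape, so rolling the writer back never strands unreadable data:

  -- Hypothetical record reader that accepts both an old flat shape
  -- { action = "block" } and a newer shape { action = { kind = "block" } }.
  -- Readers that tolerate both formats keep the writer roll-back-able.
  local function read_action(record)
    local action = record.action
    if type(action) == "table" then
      return action.kind   -- new format
    end
    return action          -- old format (or nil if the field is absent)
  end

The usual discipline is: ship the tolerant reader first, then the new writer, and only drop support for the old format once you're sure you'll never roll back past the reader.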

drysart an hour ago | parent [-]

"Fail open" state would have been improper here, as the system being impacted was a security-critical system: firewall rules.

It is absolutely the wrong approach to "fail open" when you can't run security-critical operations.

this_user 3 hours ago | parent | prev | next [-]

The question is perhaps what the shape and status of their tech stack is. Obviously, they are running at massive scale, and they have grown extremely aggressively over the years. What's more, especially over the last few years, they have been adding new product after new product. How much tech debt have they accumulated with that "move fast" approach that is now starting to rear its head?

sandeepkd an hour ago | parent [-]

I think this is probably the bigger root cause, and it is going to show up in different ways in the future. The mere act of adding new products to an existing architecture/system is bound to create knowledge silos around operations and tech debt. There is a good reason why big companies keep smart people on their payroll to change just a couple of lines after a week of debate.

otterley 2 hours ago | parent | prev | next [-]

From the post:

“We have spoken directly with hundreds of customers following that incident and shared our plans to make changes to prevent single updates from causing widespread impact like this. We believe these changes would have helped prevent the impact of today’s incident but, unfortunately, we have not finished deploying them yet.

“We know it is disappointing that this work has not been completed yet. It remains our first priority across the organization.”

NicoJuicy an hour ago | parent | prev | next [-]

Where I work, all teams were notified about the React CVE.

Cloudflare's mitigation made it less of an expedite for us.

ignoramous an hour ago | parent | prev | next [-]

> this sounds like the sort of cowboy decision

Ouch. That's harsh, given that Cloudflare is being over-honest (down to disclosing the disabling of the internal tool) and given the outage's relatively limited impact (both time-wise and in number of customers affected). It was just an unfortunate latent bug: Nov 18 was Rust's unwrap(), Dec 5 it's Lua's turn with its dynamic typing.

Now, the real cowboy decision I want to see is Cloudflare [0] running a company-wide Rust/Lua code-review with Codex / Claude...

cf TFA:

  if rule_result.action == "execute" then
    rule_result.execute.results = ruleset_results[tonumber(rule_result.execute.results_index)]
  end

> This code expects that, if the ruleset has action="execute", the "rule_result.execute" object will exist ... error in the [Lua] code, which had existed undetected for many years ... prevented by languages with strong type systems. In our replacement [FL2 proxy] ... code written in Rust ... the error did not occur.
[0] https://news.ycombinator.com/item?id=44159166
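As a purely hypothetical illustration (this is not Cloudflare's actual module, just a sketch modelled on the quoted snippet), a nil guard is all it would have taken to turn the latent bug into a logged, skipped rule instead of an unhandled exception:

  -- rule_result.execute can be nil even when action == "execute",
  -- e.g. for a rule shape the code never anticipated.
  local function attach_results(rule_result, ruleset_results)
    if rule_result.action == "execute" then
      local execute = rule_result.execute
      if execute ~= nil and execute.results_index ~= nil then
        execute.results = ruleset_results[tonumber(execute.results_index)]
      else
        -- Fail safe: report and skip rather than raise
        -- "attempt to index a nil value".
        io.stderr:write("rules: execute action without execute metadata, skipping\n")
      end
    end
  end

In Rust the same field would presumably be an Option<T> the compiler forces you to handle, which is why, per the post, the FL2 proxy was unaffected.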
nine_k 2 hours ago | parent | prev | next [-]

> more to the story

From a more tinfoil-wearing angle, it may not even be a regular deployment, given the idea of Cloudflare being "the largest MitM attack in history". ("Maybe not even by Cloudflare but by NSA", would say some conspiracy theorists, which is, of course, completely bonkers: NSA is supposed to employ engineers who never let such blunders blow their cover.)

rvz 2 hours ago | parent | prev | next [-]

> Not only did they fail to apply the deployment safety 101 lesson of "when in doubt, roll back" but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.

Also, there seems to have been insufficient testing before deployment, along with some very junior-level mistakes.

> As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following LUA exception:

Where was the testing for this one? If ANY exception happens during rules checking, the deployment should fail and roll back. Instead, they didn't assess that as a likely risk and pressed on with the deployment "fix".

I guess those at Cloudflare are not learning anything from the previous disaster.
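On the "any exception should fail the rollout" point: a related runtime safeguard, sketched below purely as an assumption about how a per-rule evaluation loop might look (the post doesn't show this part of the module), is to run each rule through pcall so one bad rule becomes a counted, alertable failure rather than an unhandled exception on the request path:

  -- Hypothetical per-rule loop: pcall contains a Lua error from any single
  -- rule; a non-zero failure count can then gate (or roll back) the rollout.
  local function evaluate_rules(rules, request, apply_rule)
    local failures = 0
    for _, rule in ipairs(rules) do
      local ok, err = pcall(apply_rule, rule, request)
      if not ok then
        failures = failures + 1
        io.stderr:write("rule evaluation failed: " .. tostring(err) .. "\n")
      end
    end
    return failures
  end

Whether "skip the rule" is acceptable for a security product is its own debate (see the fail-open discussion above), but crashing the proxy is strictly worse.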

deadbabe 3 hours ago | parent | prev | next [-]

As usual, Cloudflare is the man in the arena.

samrus 3 hours ago | parent [-]

There are other men in the arena who aren't tripping on their own feet.

usrnm 3 hours ago | parent [-]

Like who? Which large tech company doesn't have outages?

k8sToGo 2 hours ago | parent | next [-]

It's not about outages, it's about the why. Hardware can fail. Bugs can happen. But to continue a rollout despite warning signs and without understanding the cause and impact is on another level. Especially if it is related to the same problem as last time.

udev4096 2 hours ago | parent [-]

And yet, it's always clownflare breaking everything. Failures are inevitable, which is widely known; that is why we build resilient systems to overcome the inevitable.

deadbabe an hour ago | parent [-]

It is healthy for tech companies to have outages, as they will build experience in resolving them. Success breeds complacency.

nish__ 2 hours ago | parent | prev | next [-]

Google does pretty good.

k__ 2 hours ago | parent | prev [-]

"tripping on their own feet" == "not rolling back"

NoSalt 2 hours ago | parent | prev [-]

Ooh ... I want to be on a cowboy decision making team!!!