Cloudflare outage on November 18, 2025 post mortem(blog.cloudflare.com)
365 points by eastdakota 3 hours ago | 209 comments

Related: Cloudflare Global Network experiencing issues - https://news.ycombinator.com/item?id=45963780 - Nov 2025 (1580 comments)

ojosilva 2 hours ago | parent | next [-]

This is the multi-million dollar .unwrap() story. In a critical path of infrastructure serving a significant chunk of the internet, calling .unwrap() on a Result means you're saying "this can never fail, and if it does, crash the thread immediately." The Rust compiler forced them to acknowledge this could fail (that's what Result is for), but they explicitly chose to panic instead of handling it gracefully. This is a textbook "parse, don't validate" anti-pattern.

I know, this is "Monday morning quarterbacking", but that's what you get for an outage this big that had me tied up for half a day.
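To make the contrast concrete, here's a minimal sketch (the parser and names are hypothetical, not Cloudflare's actual code) of unwrapping versus handling the Err arm:

    use std::num::ParseIntError;

    // Hypothetical stand-in for the feature-file parser; not Cloudflare's code.
    fn parse_feature_count(raw: &str) -> Result<usize, ParseIntError> {
        raw.trim().parse::<usize>()
    }

    fn main() {
        let raw = "not a number";

        // The outage pattern: assert the parse can never fail.
        // let count = parse_feature_count(raw).unwrap(); // panics, thread dies

        // The graceful alternative: handle the Err arm and keep serving.
        let count = match parse_feature_count(raw) {
            Ok(n) => n,
            Err(e) => {
                eprintln!("bad feature file, keeping last-known-good config: {e}");
                0 // fall back to a safe default instead of crashing
            }
        };
        println!("feature count: {count}");
    }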

butvacuum 2 minutes ago | parent | next [-]

To me, it read more as "A/B deployments are pointless if you can't tell whether a downstream failure is related."

wrs 2 hours ago | parent | prev | next [-]

It seems people have a blind spot for unwrap, perhaps because it's so often used in example code. In production code an unwrap or expect should be reviewed exactly like a panic.

It's not necessarily invalid to use unwrap in production code if you would just call panic anyway. But just like every unsafe block needs a SAFETY comment, every unwrap in production code needs an INFALLIBILITY comment. clippy::unwrap_used can enforce this.
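A sketch of that convention; clippy::unwrap_used is a real Clippy lint, the surrounding code is just an illustration:

    // Deny bare unwraps crate-wide; every exception must be justified.
    #![deny(clippy::unwrap_used)]

    fn main() {
        let configured_port = "8080";

        // INFALLIBILITY: the value above is a compile-time literal that always
        // parses as u16, so this unwrap cannot fire.
        #[allow(clippy::unwrap_used)]
        let port: u16 = configured_port.parse().unwrap();

        println!("listening on port {port}");
    }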

dist1ll an hour ago | parent [-]

> every unwrap in production code needs an INFALLIBILITY comment. clippy::unwrap_used can enforce this.

How about indexing into a slice/map/vec? Should every `foo[i]` have an infallibility comment? Because they're essentially `get(i).unwrap()`.

10000truths 26 minutes ago | parent | next [-]

Yes? Funnily enough, I don't often use indexed access in Rust. Either I'm looping over elements of a data structure (in which case I use iterators), or I'm using an untrusted index value (in which case I explicitly handle the error case). In the rare case where I'm using an index value that I can guarantee is never invalid (e.g. graph traversal where the indices are never exposed outside the scope of the traversal), then I create a safe wrapper around the unsafe access and document the invariant.
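A bare-bones sketch of that last pattern, with hypothetical types (the point is the documented invariant on the wrapper, not the specific data structure):

    /// Node ids are only ever created by `Graph::add_node`, so they are
    /// always in bounds for the graph that produced them.
    #[derive(Clone, Copy)]
    struct NodeId(usize);

    struct Graph {
        labels: Vec<String>,
    }

    impl Graph {
        fn add_node(&mut self, label: String) -> NodeId {
            self.labels.push(label);
            NodeId(self.labels.len() - 1)
        }

        // INFALLIBILITY: NodeId is only handed out by add_node above and is
        // never exposed outside this graph, so the index is always valid here.
        fn label(&self, id: NodeId) -> &str {
            &self.labels[id.0]
        }
    }

    fn main() {
        let mut g = Graph { labels: Vec::new() };
        let a = g.add_node("a".to_string());
        println!("{}", g.label(a));
    }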

tux3 42 minutes ago | parent | prev | next [-]

Usually you'd want to write almost all your slice or other container iterations with iterators, in a functional style.

For the 5% of cases that are too complex for standard iterators? I never bother justifying why my indexes are correct, but I don't see why not.

You very rarely need SAFETY comments in Rust because almost all the code you write is safe in the first place. The language also gives you the tools to avoid manual iteration (not just for safety, but because it lets the compiler eliminate bounds checks), so it would actually be quite viable to write these comments, since you only need them when you're doing something unusual.
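A toy example of the same loop both ways (illustrative only):

    fn main() {
        let xs = [3, 1, 4, 1, 5];
        let ys = [9, 2, 6, 5, 3];

        // Indexed form: every xs[i] / ys[i] is a potential panic site
        // (and a potential runtime bounds check).
        let mut dot = 0;
        for i in 0..xs.len() {
            dot += xs[i] * ys[i];
        }

        // Iterator form: no indices, no panic sites, and the compiler can
        // usually drop the bounds checks entirely.
        let dot2: i32 = xs.iter().zip(ys.iter()).map(|(a, b)| a * b).sum();

        assert_eq!(dot, dot2);
        println!("{dot}");
    }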

dist1ll 21 minutes ago | parent [-]

For iteration, yes. But there's other cases, like any time you have to deal with lots of linked data structures. If you need high performance, chances are that you'll have to use an index+arena strategy. They're also common in mathematical codebases.
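A bare-bones version of that index+arena pattern (illustrative only):

    // A tiny arena-backed singly linked list: nodes refer to each other by
    // index into the arena instead of by pointer/Box.
    struct Node {
        value: i32,
        next: Option<usize>, // index of the next node in the arena
    }

    fn main() {
        let mut arena: Vec<Node> = Vec::new();

        // Build 3 -> 2 -> 1 by pushing nodes and linking via indices.
        let mut head: Option<usize> = None;
        for value in 1..=3 {
            arena.push(Node { value, next: head });
            head = Some(arena.len() - 1);
        }

        // Walk the list; the indices come from the arena itself, so the
        // lookups can't go out of bounds.
        let mut cursor = head;
        while let Some(i) = cursor {
            print!("{} ", arena[i].value);
            cursor = arena[i].next;
        }
        println!();
    }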

danielheath an hour ago | parent | prev [-]

I mean... yeah, in general. That's what iterators are for.

vlovich123 an hour ago | parent | prev | next [-]

To be fair, this failed in the non-Rust path too, because the bot management module returned that all traffic was a bot. But yes, FL2 needs to catch panics from individual components, though I'm not sure failing open is necessarily that much better (it was in this case, but the next incident could easily be the result of failing open).

But more generally you could catch the panic at the FL2 layer to make that decision intentional - missing logic at that layer IMHO.
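A sketch of what that intentional decision could look like (std::panic::catch_unwind is real; the module, names, and fail-open choice are made up for illustration):

    use std::panic;

    // Stand-in for a per-request scoring module that might panic internally.
    fn bot_score(path: &str) -> u32 {
        if path.contains("bad-config") {
            panic!("malformed feature file");
        }
        42
    }

    fn handle_request(path: &str) -> u32 {
        // Catch a panic from the component and make the failure mode an
        // explicit policy decision instead of killing the worker thread.
        match panic::catch_unwind(move || bot_score(path)) {
            Ok(score) => score,
            Err(_) => {
                eprintln!("bot module panicked; failing open for {path}");
                0 // or fail closed and block the request -- but decide on purpose
            }
        }
    }

    fn main() {
        println!("{}", handle_request("/ok"));
        println!("{}", handle_request("/bad-config"));
    }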

hedora 30 minutes ago | parent [-]

Catching panic probably isn’t a great idea if there’s any unsafe code in the system. (Do the unsafe blocks really maintain heap invariants if across panics?)

AgentME 10 minutes ago | parent | prev | next [-]

This is assuming that the process could have done anything sensible while it had the malformed feature file. It might be that in this case this was one configuration file of several, and maybe the program could have been built to run with some defaults when it finds this specific configuration invalid. But in the general case, if a program expects a configuration file and can't do anything without it, panicking is a normal thing to do. There's no graceful handling (beyond a nice error message) a program like Nginx could do on a syntax error in its config.

The real issue is further up the chain where the malformed feature file got created and deployed without better checks.

smj-edison 12 minutes ago | parent | prev | next [-]

Isn't the point of this article that pieces of infrastructure don't go down due to single root causes, but due to bad combinations of components that are individually correct? After reading "Engineering a Safer World", I find root cause analysis rather reductionistic, because it wasn't just an unwrap: it was that the payload was larger than normal, because of a query that didn't select by database, because a ClickHouse permissions change made more databases visible. Hard to say "it was just due to an unwrap" imo, especially in terms of how to fix the issue going forward. I think the article lists a lot of good ideas that aren't just "don't unwrap", like enabling more global kill switches for features, or eliminating the ability for core dumps or other error reports to overwhelm system resources.

ChrisMarshallNY 25 minutes ago | parent | prev | next [-]

Swift has implicitly unwrapped optionals (!) and regular optionals (?).

I don't like to use implicit unwrapping. Even things that are guaranteed to be there, I treat as optional (for example, self.view?.isEnabled ?? false in a view controller, instead of self.view.isEnabled).

I always redefine @IBOutlets from:

    @IBOutlet weak var someView: UIView!
to:

    @IBOutlet weak var someView: UIView?
I'm kind of a "belt & suspenders" type of guy.
cvhc an hour ago | parent | prev | next [-]

Some languages and style guides simply forbid throwing exceptions without catching / proper recovery. Google C++ bans exceptions, and the main mechanism for propagating errors is `absl::Status`, which the caller has to check. Not familiar with Rust, but it seems unwrap is exactly the kind of thing that would be banned.

pdimitar 34 minutes ago | parent | next [-]

There are even lints for this, but people get impatient and just override them or fight for them to no longer be the default.

As usual: a people problem, not a tech problem. In recent years a lot of strides have been made. But people will be people.

tonyhart7 16 minutes ago | parent [-]

And people make mistakes.

At some point machines will be better at coding, because code is ultimately machine instructions.

Same as chess: the engine is better than the human grandmaster because it's a tractable mathematical domain.

Coding is no different.

gpm 32 minutes ago | parent | prev [-]

Unwrap is used in places where in C++ you would just have undefined behavior. It wouldn't make any more sense to blanket ban it than it would to ban ever dereferencing a pointer just in case it's null, even if you just checked that it wasn't null.

cherryteastain 23 minutes ago | parent [-]

Rust's Result is the same thing as C++'s std::expected. How is calling std::expected::value undefined behaviour?

gpm 14 minutes ago | parent [-]

Rust's foo: Option<&T> is rust's rough equivalent to C++'s const T* foo. The C++ *foo is equivalent to the rust unsafe{ *foo.unwrap_unchecked() }, or in safe code *foo.unwrap() (which changes the undefined behavior to a panic).

Rust's unwrap isn't the same as std::expected::value. The former panics - i.e. either aborts the program or unwinds depending on context and is generally not meant to be handled. The latter just throws an exception that is generally expected to be handled. Panics and exceptions use similar machinery (at least they can depending on compiler options) but they are not equivalent - for example nested panics in destructors always abort the program.

In code that isn't meant to crash, `unwrap` should be treated as a sign saying "I'm promising that this will never happen." But just like in C++ you promise that pointers you dereference are valid and that signed integers you add don't overflow, making promises like that is a necessary part of productive programming.

shadowgovt 23 minutes ago | parent | prev | next [-]

In addition, it looks like this system wasn't on any kind of 1%/10%/50%/100% rollout gating. Such a rollout would trivially have shown the poison input killing tasks.

ajross 31 minutes ago | parent | prev | next [-]

I'm not completely sure I agree. I mean, I do agree about the .unwrap() culture being a bug trap. But I don't think this example qualifies.

The root cause here was that a file was mildly corrupt (with duplicate entries, I guess). And there was a validation check elsewhere that said "THIS FILE IS TOO BIG".

But if that's a validation failure, well, failing is correct? What wasn't correct was that the failure reached production. What should have happened is that the validation should have been a unified thing and whatever generated the file should have flagged it before it entered production.

And that's not an issue with function return value API management. The software that should have bailed was somewhere else entirely, and even there an unwrap explosion (in a smoke test or pre-release pass or whatever) would have been fine.

arccy an hour ago | parent | prev [-]

if you make it easy to be lazy and panic vs properly handling the error, you've designed a poor language

nine_k an hour ago | parent | next [-]

At Facebook they name certain "escape hatch" functions in a way that inescapably makes them look like a GIANT EYESORE. Stuff like DANGEROUSLY_CAST_THIS_TO_THAT, or INVOKE_SUPER_EXPENSIVE_ACTION_SEE_YOU_ON_CODE_REVIEW. This really drives home the point that such things must not be used except in rare, extraordinary cases.

If unwrap() were named UNWRAP_OR_PANIC(), it would be used much less glibly. Even more, I wish there existed a super strict mode when all places that can panic are treated as compile-time errors, except those specifically wrapped in some may_panic_intentionally!() or similar.

adzm 14 minutes ago | parent | next [-]

> make them look like a GIANT EYESORE

React.__SECRET_INTERNALS_DO_NOT_USE_OR_YOU_WILL_BE_FIRED comes to mind. I did have to reach for this before, but it certainly works for keeping it out of example code, and when reading other implementations the danger is immediately apparent.

At some point it was renamed to __CLIENT_INTERNALS_DO_NOT_USE_OR_WARN_USERS_THEY_CANNOT_UPGRADE which is much less fun.

Nathanba an hour ago | parent | prev [-]

Right, and if the language designers had named it UNWRAP_OR_PANIC() then people would rightfully be asking why on earth we can't just use a try-catch around code and have an easier life.

nine_k 35 minutes ago | parent | next [-]

But a panic can be caught and handled safely (e.g. via the std::panic tools). I'd say that this is the correct use case for exceptions (ask Martin Fowler, of all people).

There is already a try/catch of sorts around that code, which produces the Result type, which you can presumptuously .unwrap() without checking whether it contains an error.

Instead, one should use the question mark operator, which immediately returns the error from the current function if a Result is an error. This is analogous to rethrowing an exception, but only requires typing one character, the "?".
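For instance (a minimal sketch using std::fs; the function and file name are made up):

    use std::fs;
    use std::io;

    // The `?` immediately returns the Err to the caller, much like
    // rethrowing an exception, instead of unwrapping and panicking.
    fn read_config(path: &str) -> Result<String, io::Error> {
        let raw = fs::read_to_string(path)?;
        Ok(raw.trim().to_string())
    }

    fn main() {
        match read_config("app.conf") {
            Ok(cfg) => println!("loaded {} bytes", cfg.len()),
            Err(e) => eprintln!("could not load config: {e}"),
        }
    }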

yoyohello13 43 minutes ago | parent | prev | next [-]

Probably not, since errors as values are way better than exceptions.

nomel 34 minutes ago | parent [-]

How so? An exception is a value that's delivered to the closest conceptually appropriate point that was chosen to handle it, allowing you to keep your "happy path" as clean code and your "exceptional circumstances" path at the level of abstraction that makes sense.

It's way less book-keeping with exceptions, since you intentionally don't have to write code for that exceptional behavior except where it makes sense to. The return-by-value approach necessarily implements the same behavior, where handling is bubbled up to the conceptually appropriate place through returns, but with much more typing involved. Care is required for either, since failing to bubble up properly can happen in both cases (no re-raise for exceptions, no early return after handling for return values).

yoyohello13 13 minutes ago | parent | next [-]

There are many, many pages of text discussing this topic, but having programmed in both styles, exceptions make it too easy for programmers to simply ignore them. Errors as values force you to explicitly handle them there, or toss them up the stack. Maybe some other languages have better exception handling, but in Python it's god awful. In big projects you can basically never know when or how something can fail.

nomel 4 minutes ago | parent [-]

I would claim the opposite. If you don't catch an exception, you'll get a halt.

With return values, you can trivially ignore an error.

    let _ = fs::remove_file("file_doesn't_exist");

    or

    value, error = some_function()
    // carry on without doing anything with error
In the wild, I've seen far more ignored return errors, because of the mechanical burden of having to handle the type at every function call.
pyrolistical 19 minutes ago | parent | prev [-]

Exceptions are hidden control flow, whereas error values are not.

That is the main reason why Zig doesn't have exceptions.

nomel 11 minutes ago | parent [-]

I'd categorize them more as "event handlers" than "hidden". You can't know where the execution will go at a lower level, but that's the entire point: you don't care. You put the handlers at the points where you care.

sfink 39 minutes ago | parent | prev [-]

...and you can? try-catch is usually less ergonomic than the various ways you can inspect a Result.

    try {
      data = some_sketchy_function();
    } catch (e) {
      handle the error;
    }
vs

    result = some_sketchy_function();
    if let Err(e) = result {
      handle the error;
    }
Or better yet, compare the problematic cases where the error isn't handled:

    data = some_sketchy_function();
vs

    data = some_sketchy_function().UNWRAP_OR_PANIC();
In the former (the try-catch version that doesn't try or catch), the lack of handling is silent. It might be fine! You might just depend on your caller using `try`. In the latter, the compiler forces you to use UNWRAP_OR_PANIC (or, in reality, just unwrap) or `data` won't be the expected type and you will quickly get a compile failure.

What I suspect you mean, because it's a better argument, is:

    try {
        sketchy_function1();
        sketchy_function2();
        sketchy_function3();
        sketchy_function4();
    } catch (e) {
        ...
    }
which is fair, although how often is it really the right thing to let all the errors from 4 independent sources flow together and then get picked apart after the fact by inspecting `e`? It's an easier life, but it's also one where subtle problems constantly creep in without the compiler having any visibility into them at all.
JuniperMesos an hour ago | parent | prev | next [-]

It's a little subtler than this. You want it to be easy to not handle an error while developing, so you can focus on getting the core logic correct before error-handling; but you want it to be hard to deploy or release the software without fully handling these checks. Some kind of debug vs release mode with different lints seems like a reasonable approach.
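One way to approximate that split in Rust today, as a sketch (debug_assertions and clippy::unwrap_used are real; the policy itself is just an example, and the deny only bites when Clippy runs against a release build):

    // Tolerate quick-and-dirty unwraps in debug builds, but reject them in
    // release builds (when linted with `cargo clippy --release`).
    #![cfg_attr(not(debug_assertions), deny(clippy::unwrap_used))]

    fn main() {
        let n: i32 = "42".parse().unwrap();
        println!("{n}");
    }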

SchwKatze an hour ago | parent | prev | next [-]

Unwrap isn't a synonym for laziness; it's like an assertion. When you call unwrap() you're saying the Result should NEVER fail, and if it does, it should abort the whole process. What was wrong was the developer's assumption, not the use of unwrap.

SchemaLoad an hour ago | parent | next [-]

It also makes it very obvious in the code that something very dangerous is happening here. As a code reviewer you should see an unwrap() and have alarm bells going off. In other languages, critical errors are a lot more hidden.

kloop 24 minutes ago | parent [-]

I hate that it's a method. That can get lost in a method chain easily enough during a code review.

A function or a keyword would interrupt that and make it less tempting

dietr1ch an hour ago | parent | prev | next [-]

> What was wrong was the developer assumption, not the use of unwrap.

How many times can you truly prove that an `unwrap()` is correct and that you also need that performance edge?

Ignoring the performance aspect, which is often pulled out of a hat, to prove such a thing you need to know the inner workings of the call giving you a `Result`. That knowledge is only valid at the time of writing your `unwrap()`, but won't necessarily hold later.

Also, aren't you implicitly forcing whoever changes the function to check for every smartass dev that decided to `unwrap` at their callsite? That's bonkers.

JuniperMesos an hour ago | parent [-]

I doubt that this unwrap was added for performance reasons; I suspect it was rather added because the developer temporarily didn't want to deal with what they thought was an unlikely error case while they were working on something else; and no other system recognized that the unwrap was left in and flagged it before it was deployed on production servers.

If I were Cloudflare I would immediately audit the codebase for all uses of unwrap (or similar rust panic idioms like expect), ensure that they are either removed or clearly documented as to why it's worth crashing the program there, and then add a linter to their CI system that will fire if anyone tries to check in a new commit with unwrap in it.

Rohansi an hour ago | parent | prev [-]

> when you do unwrap() you're saying the Result should NEVER fail

Returning a Result by definition means the method can fail.

SchwKatze 25 minutes ago | parent | next [-]

Yeah, I think I expressed wrongly here. A more correct version would be: "when you do unwrap() you're saying that an error on this particular path shouldn't be recoverable and we should fail-safe."

Dylan16807 an hour ago | parent | prev [-]

> Returning a Result by definition means the method can fail.

No more than returning an int by definition means the method can return -2.

yoyohello13 41 minutes ago | parent [-]

What? Results have a limited number of possible error states that are well defined.

Dylan16807 39 minutes ago | parent [-]

Some call points to a function that returns a Result will never return an Error.

Some call points to a function that returns an int will never return -2.

Sometimes you know things the type system does not know.

Rohansi 27 minutes ago | parent [-]

The difference is functions which return Result have explicitly chosen to return a Result because they can fail. Sure, it might not fail in the current implementation and/or configuration, but that could change later and you might not know until it causes problems. The type system is there to help you - why ignore it?

Dylan16807 22 minutes ago | parent [-]

Because it would be a huge hassle to go into that library and write an alternate version that doesn't return a Result. So you're stuck with the type system being wrong in some way. You can add error-handling code upfront but it will be dead code at that point in time, which is also not good.

yoyohello13 an hour ago | parent | prev | next [-]

So… basically every language ever?

Except maybe Haskell.

dkersten an hour ago | parent [-]

And Gleam

leshenka an hour ago | parent | prev | next [-]

All languages with few exceptions have these kinds of escape hatches like unwrap

otterley an hour ago | parent | prev [-]

https://en.wikipedia.org/wiki/Crash-only_software

nine_k an hour ago | parent [-]

Works when you have an Erlang-style system that does the graceful handling for you: reporting, restarting.

otterley 2 hours ago | parent | prev | next [-]

> work has already begun on how we will harden them against failures like this in the future. In particular we are:

> Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input

> Enabling more global kill switches for features

> Eliminating the ability for core dumps or other error reports to overwhelm system resources

> Reviewing failure modes for error conditions across all core proxy modules

Absent from this list are canary deployments and incremental or wave-based deployment of configuration files (which are often as dangerous as code changes) across fault isolation boundaries -- assuming CloudFlare has such boundaries at all. How are they going to contain the blast radius in the future?

This is something the industry was supposed to learn from the CrowdStrike incident last year, but it's clear that we still have a long way to go.

Also, enabling global anything (i.e., "enabling global kill switches for features") sounds like an incredibly risky idea. One can imagine a bug in a global switch that transforms disabling a feature into disabling an entire system.

nikcub 2 hours ago | parent | next [-]

They require the bot management config to update and propagate quickly in order to respond to attacks, but this seems like a case where updating a single instance first would have surfaced the panic and stopped the deploy.

I wonder why ClickHouse is used to store the feature flags here, as it has its own duplication footguns[0] which could also easily have led to a query blowing up 2-3x in size. OLTP/SQLite seems more suited, but I'm sure they have their reasons.

[0] https://clickhouse.com/docs/guides/developer/deduplication

HumanOstrich an hour ago | parent [-]

I don't think sqlite would come close to their requirements for permissions or resilience, to name a couple. It's not the solution for every database issue.

Also, the link you provided is for eventual deduplication at the storage layer, not deduplication at query time.

mewpmewp2 2 hours ago | parent | prev | next [-]

It seems they had this continuous rollout for the config service, but the services consuming it were affected even by a small percentage of the config providers being faulty, since they were auto-updating their configs every few minutes. And there seems to be a reason for these updating so fast: presumably having to react to threat actors quickly.

otterley 2 hours ago | parent [-]

It's in everyone's interest to mitigate threats as quickly as possible. But it's of even greater interest that a core global network infrastructure service provider not DOS a significant proportion of the Internet by propagating a bad configuration too quickly. The key here is to balance responsiveness against safety, and I'm not sure they struck the right balance here. I'm just glad that the impact wasn't as long and as severe as it could have been.

tptacek an hour ago | parent [-]

This isn't really "configuration" so much as it is "durable state" within the context of this system.

otterley an hour ago | parent [-]

In my 30 years of reliability engineering, I've come to learn that this is a distinction without a difference.

People think of configuration updates (or state updates, call them what you will) as inherently safer than code updates, but history (and today!) demonstrates that they are not. Yet even experienced engineers will allow changes like these into production unattended -- even ones who wouldn't dare let a single line of code go live without being subject to the full CI/CD process.

tptacek an hour ago | parent | next [-]

Reframe this problem: instead of bot rules being propagated, it's the enrollment of a new customer or a service at an existing customer --- something that must happen at Cloudflare several times a second. Does it still make sense to you to think about that in terms of "pushing new configuration to prod"?

otterley an hour ago | parent [-]

Those aren't the facts before us. Also, CRUD operations relating to a specific customer or user tend not to cause the sort of widespread incidents we saw today.

tptacek an hour ago | parent [-]

They're not; they're a response to your claim that "state" and "configuration" are indistinguishable.

HumanOstrich an hour ago | parent | prev [-]

They narrowed down the actual problem to some Rust code in the Bot Management system that enforced a hard limit on the number of configuration items by returning an error, but the caller was just blindly unwrapping it.

otterley an hour ago | parent [-]

A dormant bug in the code is usually a condition precedent to incidents like these. Later, when a bad input is given, the bug then surfaces. The bug could have laid dormant for years or decades, if it ever surfaced at all.

The point here remains: consider every change to involve risk, and architect defensively.

tptacek an hour ago | parent [-]

They made the classic distributed systems mistake and actually did something. Never leap to thing-doing!

otterley an hour ago | parent [-]

If they're going to yeet configs into production, they ought to at least have plenty of mitigation mechanisms, including canary deployments and fault isolation boundaries. This was my primary point at the root of this thread.

And I hope fly.io has these mechanisms as well :-)

tptacek an hour ago | parent [-]

We've written at long, tedious length about how hard this problem is.

otterley an hour ago | parent [-]

Have a link?

tptacek an hour ago | parent [-]

Most recently, a few weeks ago (but you'll find more just a page or two into the blog):

https://fly.io/blog/corrosion/

otterley an hour ago | parent [-]

It's great that you're working on regionalization. Yes, it is hard, but 100x harder if you don't start with cellular design in mind. And as I said in the root of the thread, this is a sign that CloudFlare needs to invest in it just like you have been.

tptacek 41 minutes ago | parent [-]

I recoil from that last statement not because I have a rooting interest in Cloudflare but because the last several years of working at Fly.io have drilled Richard Cook's "How Complex Systems Fail"† deep into my brain, and what you said runs aground of Cook #18: Failure free operations require experience with failure.

If the exact same thing happens again at Cloudflare, they'll be fair game. But right now I feel people on this thread are doing exactly, precisely, surgically and specifically the thing Richard Cook and the Cook-ites try to get people not to do, which is to see complex system failures as predictable faults with root causes, rather than as part of the process of creating resilient systems.

https://how.complexsystems.fail/

otterley 33 minutes ago | parent [-]

Suppose they did have the cellular architecture today, but every other fact was identical. They'd still have suffered the failure! But it would have been contained, and the damage would have been far less.

Fires happen every day. Smoke alarms go off, firefighters get called in, incident response is exercised, and lessons from the situation are learned (with resulting updates to the fire and building codes).

Yet even though this happens, entire cities almost never burn down anymore. And we want to keep it that way.

As Cook points out, "Safety is a characteristic of systems and not of their components."

tptacek 29 minutes ago | parent [-]

Pretty sure he's making my point (or, rather, me his) there. (I'm never going to turn down an opportunity to nerd out about Cookism).

Scaevolus an hour ago | parent | prev | next [-]

Global configuration is useful for low response times to attacks, but you need to have very good ways to know when a global config push is bad and to be able to rollback quickly.

In this case, the older proxy's "fail-closed" categorization of bot activity was obviously better than the "fail-crash", but every global change needs to be carefully validated to have good characteristics here.

Having a mapping of which services are downstream of which other service configs and versions would make detecting global incidents much easier too, by making the causative threads of changes more apparent to the investigators.

ants_everywhere 14 minutes ago | parent | prev [-]

It's always a config push. People roll out code slowly but don't have the same mechanisms for configs. But configs are code, and this blind spot causes an outsized percentage of these big outages.

SerCe 2 hours ago | parent | prev | next [-]

As always, kudos for releasing a post mortem in less than 24 hours after the outage, very few tech organisations are capable of doing this.

yen223 2 hours ago | parent | next [-]

I'm curious about how their internal policies work such that they are allowed to publish a post mortem this quickly, and with this much transparency.

At any other large-ish company, there would be layers of "stakeholders" that would slow this process down. They would almost never allow code to be published.

eastdakota an hour ago | parent | next [-]

Well… we have a culture of transparency we take seriously. I spent 3 years in law school that many times over my career have seemed like wastes but days like today prove useful. I was in the triage video bridge call nearly the whole time. Spent some time after we got things under control talking to customers. Then went home. I’m currently in Lisbon at our EUHQ. I texted John Graham-Cumming, our former CTO and current Board member whose clarity of writing I’ve always admired. He came over. Brought his son (“to show that work isn’t always fun”). Our Chief Legal Officer (Doug) happened to be in town. He came over too. The team had put together a technical doc with all the details. A tick-tock of what had happened and when. I locked myself on a balcony and started writing the intro and conclusion in my trusty BBEdit text editor. John started working on the technical middle. Doug provided edits here and there on places we weren’t clear. At some point John ordered sushi but from a place with limited delivery selection options, and I’m allergic to shellfish, so I ordered a burrito. The team continued to flesh out what happened. As we’d write we’d discover questions: how could a database permission change impact query results? Why were we making a permission change in the first place? We asked in the Google Doc. Answers came back. A few hours ago we declared it done. I read it top-to-bottom out loud for Doug, John, and John’s son. None of us were happy — we were embarrassed by what had happened — but we declared it true and accurate. I sent a draft to Michelle, who’s in SF. The technical teams gave it a once over. Our social media team staged it to our blog. I texted John to see if he wanted to post it to HN. He didn’t reply after a few minutes so I did. That was the process.

jofzar 12 minutes ago | parent | next [-]

> I texted John to see if he wanted to post it to HN. He didn’t reply after a few minutes so I did

Damn corporate karma farming is ruthless, only a couple minute SLA before taking ownership of the karma. I guess I'm not built for this big business SLA.

philipgross 36 minutes ago | parent | prev | next [-]

You call this transparency, but fail to answer the most important questions: what was in the burrito? Was it good? Would you recommend?

eastdakota 20 minutes ago | parent [-]

Chicken burrito from Coyo Taco in Lisbon. I am not proud of this. It’s worse than ordering from Chipotle. But there are no Chipotle’s in Lisbon… yet.

anurag an hour ago | parent | prev [-]

Appreciate the extra transparency on the process.

tom1337 2 hours ago | parent | prev | next [-]

I mean, the CEO posted the post-mortem, so there aren't that many layers of stakeholders above. For other post-mortems by engineers, Matthew once said that the engineering team runs the blog and that he wouldn't even know how to veto a post even if he wanted to [0]

[0] https://news.ycombinator.com/item?id=45588305

madeofpalk 2 hours ago | parent | prev | next [-]

From what I've observed, it depends on whether you're an "engineering company" or not.

thesh4d0w 2 hours ago | parent | prev [-]

The person who posted both this blog article and the Hacker News post is Matthew Prince, one of the highly technical billionaire founders of Cloudflare. I'm sure if he wants something to happen, it happens.

bayesnet 2 hours ago | parent | prev [-]

And a well-written one at that. Compared to the AWS post-mortem this could be literature.

gucci-on-fleek 2 hours ago | parent | prev | next [-]

> This showed up to Internet users trying to access our customers' sites as an error page indicating a failure within Cloudflare's network.

As a visitor to random web pages, I definitely appreciated this—much better than their completely false “checking the security of your connection” message.

> The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems' permissions

Also appreciate the honesty here.

> On 18 November 2025 at 11:20 UTC (all times in this blog are UTC), Cloudflare's network began experiencing significant failures to deliver core network traffic. […]

> Core traffic was largely flowing as normal by 14:30. We worked over the next few hours to mitigate increased load on various parts of our network as traffic rushed back online. As of 17:06 all systems at Cloudflare were functioning as normal.

Why did this take so long to resolve? I read through the entire article, and I understand why the outage happened, but when most of the network goes down, why wasn't the first step to revert any recent configuration changes, even ones that seem unrelated to the outage? (Or did I just misread something and this was explained somewhere?)

Of course, the correct solution is always obvious in retrospect, and it's impressive that it only took 7 minutes between the start of the outage and the incident being investigated, but it taking a further 4 hours to resolve the problem and 8 hours total for everything to be back to normal isn't great.

eastdakota 2 hours ago | parent [-]

Because we initially thought it was an attack. And then when we figured it out we didn’t have a way to insert a good file into the queue. And then we needed to reboot processes on (a lot) of machines worldwide to get them to flush their bad files.

gucci-on-fleek 2 hours ago | parent | next [-]

Thanks for the explanation! This definitely reminds me of CrowdStrike outages last year:

- A product depends on frequent configuration updates to defend against attackers.

- A bad data file is pushed into production.

- The system is unable to easily/automatically recover from bad data files.

(The CrowdStrike outages were quite a bit worse though, since it took down the entire computer and remediation required manual intervention on thousands of desktops, whereas parts of Cloudflare were still usable throughout the outage and the issue was 100% resolved in a few hours)

tptacek 2 hours ago | parent | prev | next [-]

Richard Cook #18 (and #10) strikes again!

https://how.complexsystems.fail/#18

It'd be fun to read more about how you all procedurally respond to this (but maybe this is just a fixation of mine lately). Like are you tabletopping this scenario, are teams building out runbooks for how to quickly resolve this, what's the balancing test for "this needs a functional change to how our distributed systems work" vs. "instead of layering additional complexity on, we should just have a process for quickly and maybe even speculatively restoring this part of the system to a known good state in an outage".

dbetteridge an hour ago | parent | prev | next [-]

Question from a casual bystander, why not have a virtual/staging mini node that receives these feature file changes first and catches errors to veto full production push?

Or do you have something like this, but the specific DB permission change in this context only failed in production?

forsalebypwner an hour ago | parent [-]

I think the reasoning behind this is because of the nature of the file being pushed - from the post mortem:

"This feature file is refreshed every few minutes and published to our entire network and allows us to react to variations in traffic flows across the Internet. It allows us to react to new types of bots and new bot attacks. So it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly."

tetec1 2 hours ago | parent | prev [-]

Yeah, I can imagine that this insertion was some high-pressure job.

EvanAnderson 2 hours ago | parent | prev | next [-]

It reads a lot like the Crowdstrike SNAFU. Machine-generated configuration file b0rks-up the software that consumes it.

The "...was then propagated to all the machines that make up our network..." followed by "....caused the software to fail." screams for a phased rollout / rollback methodology. I get that "...it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly" but today's outage highlights that rapid deployment isn't all upside.

The remediation section doesn't give me any sense that phased deployment, acceptance testing, and rapid rollback are part of the planned remediation strategy.

tptacek 2 hours ago | parent | next [-]

I don't think this system is best thought of as "deployment" in the sense of CI/CD; it's a control channel for a distributed bot detection system that (apparently) happens to be actuated by published config files (it has a consul-template vibe to it, though I don't know if that's what it is).

EvanAnderson 2 hours ago | parent | next [-]

That's why I likened it Crowdstrike. It's a signature database that blew up the consumer of said database. (You probably caught my post mid-edit, too. You may be replying to the snarky paragraph I felt better of and removed.)

Edit: Similar to Crowdstrike, the bot detector should have fallen-back to its last-known-good signature database after panicking, instead of just continuing to panic.

eastdakota 2 hours ago | parent | prev [-]

That’s correct.

tptacek 2 hours ago | parent [-]

Is it actually consul-template? (I have post-consul-template stress disorder).

navigate8310 2 hours ago | parent | prev | next [-]

I'm amazed that they aren't using a simulator of some sort and are pushing changes directly to production.

Aeolun 2 hours ago | parent | prev [-]

I’m fairly certain it will be after they read this thread. It doesn’t feel like they don’t want to improve, or are incapable of it.

lukan 2 hours ago | parent | prev | next [-]

"Throwing us off and making us believe this might have been an attack was another apparent symptom we observed: Cloudflare’s status page went down. The status page is hosted completely off Cloudflare’s infrastructure with no dependencies on Cloudflare. While it turned out to be a coincidence, it led some of the team diagnosing the issue to believe that an attacker may be targeting both our systems as well as our status page."

Unfortunately they do not share what caused the status page to go down as well. (Does this happen often? Otherwise it seems like a big coincidence.)

eastdakota 2 hours ago | parent | next [-]

We don’t know. Suspect it may just have been a big uptick in load and a failure of its underlying infrastructure to scale up.

dnw 2 hours ago | parent | next [-]

Yes, probably a bunch of automated bots decided to check the status page when they saw failures in production.

reassess_blind an hour ago | parent | prev [-]

The status page is hosted on AWS CloudFront, right? It sure looks like CloudFront was overwhelmed by the traffic spike, which is a bit concerning. Hope we'll see a post from their side.

notatoad 2 hours ago | parent | prev | next [-]

it seems like a good chance that despite thinking their status page was completely independent of cloudfront, enough of the internet is dependent on cloudfront now that they're simply wrong about the status page's independence.

verletzen 2 hours ago | parent [-]

i think you've got cloudflare and cloudfront mixed up.

notatoad a minute ago | parent [-]

ahah oops. yeah, it's a problem. i've got two projects ongoing that each rely on one of them, and i can never keep it straight.

Aeolun 2 hours ago | parent | prev | next [-]

I mean, that would require a postmortem from statuspage.io right? Is that a service operated by cloudflare?

edoceo 4 minutes ago | parent [-]

Atlassian

paulddraper an hour ago | parent | prev [-]

Quite possibly it was due to high traffic.

I don't know Atlassian Statuspage's clientele, but it's possible Cloudflare is a much larger customer than usual.

pdimitar 12 minutes ago | parent | prev | next [-]

While I heavily frown upon using `unwrap` and `expect` in Rust code and make sure to have Clippy tell me about every single usage of them, I also understand that without them Rust might have been seen as an academic curiosity language.

They are escape hatches. Without those your language would never take off.

But here's the thing. Escape hatches are like emergency exits. They are not to be used by your team to go to lunch in a nearby restaurant.

---

Cloudflare should likely invest in better linting and CI/CD alerts. Not to mention isolated testing i.e. deploy this change only to a small subset and monitor, and only then do a wider deployment.

Hindsight is 20/20 and we can all be smartasses after the fact of course. But I am really surprised because lately I am only using Rust for hobby projects and even I know I should not use `unwrap` and `expect` beyond the first iteration phases.

---

I have advocated for this before but IMO Rust at this point will benefit greatly by disallowing those unsafe APIs by default in release mode. Though I understand why they don't want to do it -- likely millions of CI/CD pipelines will break overnight. But in the interim, maybe a rustc flag we can put in our `Cargo.toml` that enables such a stricter mode? Or have that flag just remove all the panicky API _at compile time_ though I believe this might be a Gargantuan effort and is likely never happening (sadly).

In any case, I would expect many other failures from Cloudflare but not _this_ one in particular.

duped 4 minutes ago | parent [-]

This is not a reasonable take to me. unwrap/expect are the idiomatic way to express code paths returning Option/Result as unreachable.

Bubbling up the error or None does not make the program correct. Panicking may be the only reasonable thing to do.

If panicking is guaranteed because of some mistaken input to the system, your failure is in testing.

w10-1 8 minutes ago | parent | prev | next [-]

Unanswered still: why 4.5 hours? It took way too long to discover the cause.

Naively I think there should be a continuous log of configuration changes affecting given servers; code could correlate the change to the outage to generate failure hypotheses.

I appreciate the heroic bottom-up solutions working backwards from symptoms, but there should be the equivalent of a top-down system model status (a bit like the difference between heartbeat as pulse vs EKG monitoring).

ademarre 15 minutes ago | parent | prev | next [-]

I integrated Turnstile with a fail-open strategy that proved itself today. Basically, if the Turnstile JS fails to load in the browser (or in a few specific frontend error conditions), we allow the user to submit the web form with a dummy challenge token. On the backend, we process the dummy token like normal, and if there is an error or timeout checking Turnstile's siteverify endpoint, we fail open.

Of course, some users were still blocked, because the Turnstile JS failed to load in their browser but the subsequent siteverify check succeeded on the backend. But overall the fail-open implementation still lessened the impact to our customers.

Fail-open with Turnstile works for us because we have other bot mitigations that are sufficient to fall back on in the event of a Cloudflare outage.
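A rough sketch of that backend check (assuming the reqwest and serde_json crates; the siteverify endpoint is Turnstile's published one, but treat the details as illustrative rather than the exact implementation described above):

    use std::time::Duration;

    // Returns true if the request should be allowed through.
    fn verify_or_fail_open(secret: &str, token: &str) -> bool {
        let client = match reqwest::blocking::Client::builder()
            .timeout(Duration::from_secs(2))
            .build()
        {
            Ok(c) => c,
            Err(_) => return true, // can't even build a client: fail open
        };

        let resp = client
            .post("https://challenges.cloudflare.com/turnstile/v0/siteverify")
            .form(&[("secret", secret), ("response", token)])
            .send();

        match resp.and_then(|r| r.json::<serde_json::Value>()) {
            // Turnstile answered: trust its verdict.
            Ok(body) => body["success"].as_bool().unwrap_or(false),
            // Timeout or outage on Turnstile's side: fail open and rely on
            // the other bot mitigations mentioned above.
            Err(_) => true,
        }
    }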

cj 12 minutes ago | parent [-]

So to bypass captcha all a user has to do is block the script from loading? I can see that working but only for attacks that aren’t targeted?

ademarre 8 minutes ago | parent [-]

Only if they are able to block the siteverify check performed by our backend server. That's not the kind of attack we are trying to mitigate with Turnstile.

vsgherzi 2 hours ago | parent | prev | next [-]

Why does Cloudflare allow unwraps in their code? I would've assumed they'd have Clippy lints stopping that sort of thing. Why not just match with { Ok(value) => {}, Err(error) => {} }? The function already has a Result return type.

At the bare minimum they could've used an expect("this should never happen, if it does database schema is incorrect").

The whole point of errors as values is preventing this kind of thing.... It wouldn't have stopped the outage but it would've made it easy to diagnose.

If anyone at cloudflare is here please let me in that codebase :)

waterTanuki 2 hours ago | parent [-]

Not a cloudflare employee but I do write a lot of Rust. The amount of things that can go wrong with any code that needs to make a network call is staggeringly high. unwrap() is normal during development phase but there are a number of times I leave an expect() for production because sometimes there's no way to move forward.

SchemaLoad an hour ago | parent | next [-]

Yeah it seems likely that even if there wasn't an unwrap, there would have been some error handling that wouldn't have panicked the process, but would have still left it inoperable if every request was instead going through an error path.

vsgherzi 2 hours ago | parent | prev [-]

I'm in a similar boat; at the very least an expect can give hints about what happened. However, this can also be problematic if you're a library developer. Sometimes Rust is expected to never panic, especially in situations like WASM. This is a major problem for companies like Amazon Prime Video, since they run in a WASM context for their TV app. Any panic crashes everything. Personally I usually either create a custom error type (preferred) or erase it away with Box<dyn Error> (when there's no other option). Random unwraps and expects haunt my dreams.

dzonga 2 hours ago | parent | prev | next [-]

> thread fl2_worker_thread panicked: called Result::unwrap() on an Err value

I don't use Rust, but a lot of Rust people say if it compiles it runs.

Well, Rust won't save you from the usual programming mistakes. Not blaming anyone at Cloudflare here. I love Cloudflare and the awesome tools they put out.

End of day, let's pick languages and tech because of what we love to do. If you love Rust, pick it all day. I actually wanna try it for industrial robot stuff or small controllers etc.

There's no bad language, just occasional hiccups from us users who use those tools.

jryio 2 hours ago | parent | next [-]

You misunderstand what Rust’s guarantees are. Rust has never promised to solve or protect programmers from logical or poor programming. In fact, no such language can do that, not even Haskell.

Unwrapping is a very powerful and important assertion to make in Rust whereby the programmer explicitly states that the value within will not be an error, otherwise panic. This is a contract between the author and the runtime. As you mentioned, this is a human failure, not a language failure.

Pause for a moment and think about what a C++ implementation of a globally distributed network ingress proxy service would look like - and how many memory vulnerabilities there would be… I shudder at the thought… (n.b. nginx)

This is the classic example of how, when something fails, we over-index on the failure cause, while under-indexing on the quadrillions of memory accesses that went off without a single hitch thanks to the borrow checker.

I postulate that whatever this Cloudflare outage cost, whether millions or hundreds of millions of dollars, it has been more than paid for by the savings from safe memory access.

See: https://en.wikipedia.org/wiki/Survivorship_bias

lmm an hour ago | parent | prev | next [-]

> Rust won't save you from the usual programming mistake.

Disagree. Rust is at least giving you an "are you sure?" moment here. Calling unwrap() should be a red flag, something that a code reviewer asks you to explain; you can have a linter forbid it entirely if you like.

No language will prevent you from writing broken code if you're determined to do so, and no language is impossible to write correct code in if you make a superhuman effort. But most of life happens in the middle, and tools like Rust make a huge difference to how often a small mistake snowballs into a big one.

SchemaLoad an hour ago | parent [-]

Yep, unwrap() and unsafe are escape hatches that need very good justifications. It's fine for casual scripts where you don't care if it crashes. For serious production software they should be either banned, or require immense scrutiny.

tptacek 2 hours ago | parent | prev | next [-]

What people are saying is that idiomatic prod rust doesn't use unwrap/expect (both of which panic on the "exceptional" arm of the value) --- instead you "match" on the value and kick the can up a layer on the call chain.

olivia-banks an hour ago | parent [-]

What happens to it up the callstack? Say they propagated it up the stack with `?`. It has to get handled somewhere. If you don't introduce any logic to handle the duplicate databases, what else are you going to do when the types don't match up besides `unwrap`ing, or maybe emitting a slightly better error message? You could maybe ignore that module's error for that request, but if it was a service more critical than bot mitigation you'd still have the same symptom of getting 500'd.

evil-olive 40 minutes ago | parent | next [-]

> What happens to it up the callstack?

as they say in the post, these files get generated every 5 minutes and rolled out across their fleet.

so in this case, the thing farther up the callstack is a "watch for updated files and ingest them" component.

that component, when it receives the error, can simply continue using the existing file it loaded 5 minutes earlier.

and then it can increment a Prometheus metric (or similar) representing "count of errors from attempting to load the definition file". that metric should be zero in normal conditions, so it's easy to write an alert rule to notify the appropriate team that the definitions are broken in some way.

that's not a complete solution - in particular it doesn't necessarily solve the problem of needing to scale up the fleet, because freshly-started instances won't have a "previous good" definition file loaded. but it does allow for the existing instances to fail gracefully into a degraded state.

in my experience, on a large enough system, "this could never happen, so if it does it's fine to just crash" is almost always better served by a metric for "count of how many times a thing that could never happen has happened" and a corresponding "that should happen zero times" alert rule.
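A stripped-down sketch of that shape (hypothetical names; the 200-feature limit mirrors the one described in the post, and a real version would use a proper metrics library instead of a bare counter):

    use std::sync::atomic::{AtomicU64, Ordering};

    // "Count of times a thing that could never happen has happened."
    static CONFIG_LOAD_ERRORS: AtomicU64 = AtomicU64::new(0);

    // Hypothetical parsed form of the published feature file.
    struct FeatureConfig {
        features: Vec<String>,
    }

    fn parse_feature_file(raw: &str) -> Result<FeatureConfig, String> {
        let features: Vec<String> = raw.lines().map(|l| l.to_string()).collect();
        if features.len() > 200 {
            return Err(format!("too many features: {}", features.len()));
        }
        Ok(FeatureConfig { features })
    }

    // Called every few minutes with the newly published file. On failure,
    // keep the previous config and bump a metric that should stay at zero.
    fn reload(current: FeatureConfig, raw: &str) -> FeatureConfig {
        match parse_feature_file(raw) {
            Ok(fresh) => fresh,
            Err(e) => {
                CONFIG_LOAD_ERRORS.fetch_add(1, Ordering::Relaxed);
                eprintln!("bad feature file, keeping previous one: {e}");
                current
            }
        }
    }

    fn main() {
        let config = FeatureConfig { features: vec!["known-good".to_string()] };
        let config = reload(config, "bot-feature-1\nbot-feature-2");
        println!("{} features loaded", config.features.len());
    }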

acdha an hour ago | parent | prev | next [-]

The way I’ve seen this on a few older systems was that they always keep the previous configuration around so it can switch back. The logic is something like this:

1. At startup, load the last known good config.

2. When signaled, load the new config.

3. When that passes validation, update the last-known-good pointer to the new version.

That way something like this makes the crash recoverable on the theory that stale config is better than the service staying down. One variant also recorded the last tried config version so it wouldn’t even attempt to parse the latest one until it was changed again.

For Cloudflare, it’d be tempting to have step #3 be after 5 minutes or so to catch stuff which crashes soon but not instantly.

__turbobrew__ an hour ago | parent | prev | next [-]

Presumably you kick up the error to a level that says “if parsing new config fails, keep the old config”

tptacek an hour ago | parent | prev [-]

Yeah, see, that's what I mean.

Klonoar an hour ago | parent | prev | next [-]

> I don't use Rust, but a lot of Rust people say if it compiles it runs.

Do you grok what the issue was with the unwrap, though...?

Idiomatic Rust code does not use that. The fact that it's allowed in a codebase says more about the engineering practices of that particular project/module/whatever. Whoever put the `unwrap` call there had to contend with the notion that it could panic and they still chose to do it.

It's a programmer error, but Rust at least forces you to recognize "okay, I'm going to be an idiot here". There is real value in that.

metaltyphoon 2 hours ago | parent | prev | next [-]

> Well Rust won't save you from the usual programming mistake

This is not a Rust problem. Someone consciously chose NOT to handle an error, possibly thinking "this will never happen". Then someone else consciously reviewed (I hope so) a PR with an unwrap() and let it slide.

dzonga 2 hours ago | parent | prev [-]

Other people might say "why use unsafe Rust", but we don't know the conditions the original code shipped under, or why the PR was approved.

It could have been a tight deadline, managerial pressure, or just the occasional slip-up.

trengrj 2 hours ago | parent | prev | next [-]

Classic combination of errors:

Having the feature table pivoted (with 200 feature1, feature2, etc columns) meant they had to do meta queries to system.columns to get all the feature columns which made the query sensitive to permissioning changes (especially duplicate databases).

A Crowdstrike style config update that affects all nodes but obviously isn't tested in any QA or staged rollout strategy beforehand (the application panicking straight away with this new file basically proves this).

Finally an error with bot management config files should probably disable bot management vs crash the core proxy.

I'm interested in why they even decided to name ClickHouse, as this error could have been caused by any other database. I can see, though, how the replicas updating and causing flip-flopping results would have been really frustrating for incident responders.

tptacek 2 hours ago | parent [-]

Right, but also this is a pretty common pattern in distributed systems that publish from databases (really any large central source of truth); it might be the defining problem in systems like this. When you're lucky the corner cases are obvious; in the big one we experienced last year, a new row in our database tripped an if-let/mutex deadlock, which our system dutifully (and very quickly) propagated across our entire network.

The solution to that problem wasn't better testing of database permutations or a better staging environment (though in time we did do those things). It was (1) a watchdog system in our proxies to catch arbitrary deadlocks (which caught other stuff later), (2) segmenting our global broadcast domain for changes into regional broadcast domains so prod rollouts are implicitly staged, and (3) a process for operators to quickly restore that system to a known good state in the early stages of an outage.

(Cloudflare's responses will be different than ours, really I'm just sticking up for the idea that the changes you need don't follow obviously from the immediate facts of an outage.)

ed_mercer 2 hours ago | parent | prev | next [-]

Wow. 26 million 5xx HTTP status codes per second over a span of roughly two hours. That's roughly 187 billion HTTP errors that interrupted people (and systems)!

RagingCactus an hour ago | parent | prev | next [-]

Lots of people here are (perhaps rightfully) pointing to the unwrap() call being an issue. That might be true, but to me the fact that a reasonably "clean" panic at a defined line of code was not quickly picked up in any error monitoring system sounds just as important to investigate.

Assuming something similar to Sentry is in use, it should clearly pick up the many process crashes that start occurring right as the downtime starts. And the well-defined clean crashes should in theory also stand out against all the random errors that start occurring all over the system as it goes down, precisely because it's always failing at the exact same point.

testemailfordg2 an hour ago | parent | prev | next [-]

"Customers on our old proxy engine, known as FL, did not see errors, but bot scores were not generated correctly, resulting in all traffic receiving a bot score of zero."

This simply means the exception handling quality of the new FL2 is non-existent, and its code logic is not on par with FL's.

I hope it was not because of AI driven efficiency gains.

lmm 39 minutes ago | parent [-]

In most domains, silently returning 0 in a case where your logic didn't actually calculate the thing you were trying to calculate is far worse than giving a clear error.
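
A toy illustration of why (names hypothetical): a sentinel like 0 is indistinguishable from a real result downstream, while an Option forces the caller to decide what to do when scoring fails.

    // With a sentinel, downstream code can't tell "we scored this as 0" from
    // "scoring failed"; with Option, the failure is explicit at the call site.
    fn score_with_sentinel(ok: bool) -> u8 {
        if ok { 37 } else { 0 } // 0 also happens to mean "definitely a bot"
    }

    fn score_explicit(ok: bool) -> Option<u8> {
        if ok { Some(37) } else { None }
    }

    fn main() {
        println!("{}", score_with_sentinel(false)); // 0: silently blocks real users
        match score_explicit(false) {
            Some(s) => println!("score {s}"),
            None => println!("no score: skip bot rules for this request"),
        }
    }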

avereveard an hour ago | parent | prev | next [-]

Question: customers having issues also couldn't switch their DNS to bypass the service. Why is the control plane updated along with the data plane here? It seems a lot of users could have preserved business continuity if they could have changed their DNS entries temporarily.

habibur an hour ago | parent | prev | next [-]

    On 18 November 2025 at 11:20 UTC (all times in this blog are UTC), Cloudflare's network began experiencing significant failures
    As of 17:06 all systems at Cloudflare were functioning as normal
6 hours / 5 years gives ~99.99% uptime.

tristan-morris 2 hours ago | parent | prev | next [-]

Why call .unwrap() in a function which returns Result<_,_>?

For something so critical, why aren't you using lints to identify, and ideally deny, panic-inducing code? This is one of the biggest strengths of using Rust in the first place for this problem domain.
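
For reference, clippy can enforce exactly this. A sketch of crate-level attributes (checked when running cargo clippy) that a team could tighten or relax to taste:

    // At the crate root (main.rs / lib.rs):
    #![deny(clippy::unwrap_used)]       // no bare .unwrap() on Option/Result
    #![deny(clippy::panic)]             // no explicit panic!() in library paths
    #![warn(clippy::expect_used)]       // allow .expect(), but make it visible in review
    #![warn(clippy::indexing_slicing)]  // foo[i] is get(i).unwrap() in disguise

    fn main() {}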

tptacek 2 hours ago | parent | next [-]

Probably because this case was something more akin to an assert than an error check.

piperswe 11 minutes ago | parent | next [-]

In that case, it should probably be expect rather than unwrap, to document why the assertion should never fail.
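
Something like this (names hypothetical), so the invariant is stated right where it's relied on and shows up in the panic output if it's ever violated:

    // A bare .unwrap() says nothing; .expect() records why failure is "impossible".
    fn feature_count(config: &std::collections::HashMap<String, usize>) -> usize {
        *config
            .get("feature_count")
            .expect("feature_count is written by the config loader before workers start")
    }

    fn main() {
        let mut config = std::collections::HashMap::new();
        config.insert("feature_count".to_string(), 60);
        println!("{}", feature_count(&config));
    }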

marcusb 2 hours ago | parent | prev | next [-]

Rust has debug asserts for that. Using expect with a message about why the condition should not / can't ever happen is idiomatic for cases where you never expect an Err.

This reads to me more like the error type returned by append with names is not (ErrorFlags, i32) and wasn't trivially convertible into that type, so someone left an unwrap in place on an "I'll fix it later" basis, but who knows.

thundergolfer 2 hours ago | parent | prev | next [-]

Fly writes a lot of Rust; do you allow `unwrap()` in your production environment? At Modal we only allow `expect("...")`, and the message should follow the recommended message style[1].

I'm pretty surprised that Cloudflare let an unwrap into prod that caused their worst outage in 6 years.

1. https://doc.rust-lang.org/std/option/enum.Option.html#recomm...

tptacek an hour ago | parent [-]

After The Great If-Let Outage Of 2024, we audited all our code for that if-let/rwlock problem, changed a bunch of code, and immediately added a watchdog for deadlocks. The audit had ~no payoff; the watchdog very definitely did.

I don't know enough about Cloudflare's situation to confidently recommend anything (and I certainly don't know enough to dunk on them, unlike the many Rust experts of this thread) but if I was in their shoes, I'd be a lot less interested in eradicating `unwrap` everywhere and more in making sure that an errant `unwrap` wouldn't produce stable failure modes.

But like, the `unwrap` thing is all programmers here have to latch on to, and there's a psychological self-soothing instinct we all have to seize onto some root cause with a clear fix (or, better yet for dopaminergia, an opportunity to dunk).

A thing I really feel in threads like this is that I'd instinctively have avoided including the detail about an `unwrap` call --- I'd have worded that part more ambiguously --- knowing (because I have a pathological affinity for this community) that this is exactly how HN would react. Maybe ironically, Prince's writing is a little better for not having dodged that bullet.

thundergolfer an hour ago | parent [-]

Fair. I agree that saying "it's the unwrap" and calling it a day is wrong. Recently, actually, we did an exercise on our Worker which was: "assume the worst kind of panic happens; make the Worker be OK with it."

But I do feel strongly that the expect pattern is a highly useful control and that naked unwraps almost always indicate a failure to reason about the reliability of a change. An unwrap in their core proxy system indicates a problem in their change management process (review, linting, whatever).
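
A tiny sketch of that "be OK with the worst panic" shape (hypothetical names, and it assumes panic=unwind): isolate a module behind catch_unwind so a panic degrades one feature instead of the whole proxy. Real code would also log and rate-limit the failures.

    use std::panic::{catch_unwind, AssertUnwindSafe};

    // Stand-in for a scoring module that panics on bad input.
    fn run_bot_module(feature_file: &str) -> i64 {
        assert!(feature_file.len() <= 200, "feature file too large");
        42
    }

    // If the module panics, return None and let the request continue without a score.
    fn bot_score(feature_file: &str) -> Option<i64> {
        catch_unwind(AssertUnwindSafe(|| run_bot_module(feature_file))).ok()
    }

    fn main() {
        println!("{:?}", bot_score("small file"));     // Some(42)
        println!("{:?}", bot_score(&"x".repeat(300))); // None: module failed, proxy keeps serving
    }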

tristan-morris 2 hours ago | parent | prev | next [-]

Oh absolutely, that's how it would have been treated.

Surely an unwrap_or_default() would have been a much better fit: if fetching features fails, continue processing with an empty set of rules rather than stopping the world.
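
A minimal sketch of that fail-open shape (hypothetical names and limit; whether failing open is actually safer here is debated elsewhere in the thread):

    // Parse the feature file, enforcing the limit as a handled error.
    fn load_features(raw: &str) -> Result<Vec<String>, String> {
        let features: Vec<String> = raw.lines().map(str::to_owned).collect();
        if features.len() > 200 {
            return Err(format!("{} features exceeds the limit of 200", features.len()));
        }
        Ok(features)
    }

    fn main() {
        let oversized = "f\n".repeat(500);
        // unwrap_or_default(): on Err, continue with an empty feature set instead of panicking.
        let features = load_features(&oversized).unwrap_or_default();
        println!("proceeding with {} features", features.len()); // 0, but still serving
    }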

stefan_ 2 hours ago | parent | prev [-]

You are saying this would not have happened in a C release build where asserts define to nothing?

Wonder why these old grey beards chose to go with that.

tptacek 2 hours ago | parent | next [-]

I am one of those old grey beards (or at least, I got started shipping C code in the 1990s), and I'd leave asserts in prod serverside code given the choice; better that than a totally unpredictable error path.

ashishb 2 hours ago | parent | prev [-]

> You are saying this would not have happened in a C release build where asserts define to nothing?

Afaik, Go and Java are the only languages that make you pause and explicitly deal with these exceptions.

tristan-morris 2 hours ago | parent [-]

And Rust, but they chose to panic on the error condition. Wild.

sayrer 2 hours ago | parent | prev | next [-]

Yes, can't have .unwrap() in production code (it's ok in tests)

orphea 2 hours ago | parent | next [-]

Like goto, unwrap is just a tool that has its use cases. No need to make a boogeyman out of it.

metaltyphoon an hour ago | parent | next [-]

Yes, it's meant to be used in test code. If you're sure it can't fail, then use .expect(); that way it shows you made a choice and it wasn't just a dev oversight.

fwjafwasd an hour ago | parent | prev | next [-]

Code that can panic should be using .expect() in production.

gishh 2 hours ago | parent | prev [-]

To be fair, if you’re not “this tall” you really shouldn’t consider using goto in a C program. Most people aren’t that tall.

keyle 2 hours ago | parent | prev [-]

unwrap itself isn't the problem...

koakuma-chan 2 hours ago | parent | prev [-]

Why is there a 200 limit on appending names?

zmj an hour ago | parent | next [-]

Everything has a limit. You can define it, or be surprised when you find out what it is.

nickmonad an hour ago | parent | prev [-]

Limits in systems like these are generally good. They mention the reasoning around it explicitly. It just seems like the handling of that limit is what failed and was missed in review.

ulfw 3 minutes ago | parent | prev | next [-]

The internet hasn't been the internet in years. It was originally built to withstand wars. The whole idea of our IP-based internet was to reroute packets should networks go down. Decentralisation was the mantra, and how it differed from early centralised systems such as AOL et al.

This is all gone. The internet is a centralised system in the hands of just a few companies. If AWS goes down, half the internet does. If Azure, Google Cloud, Oracle Cloud, Tencent Cloud or Alibaba Cloud goes down, a large part of the internet does.

Yesterday with Cloudflare down half the sites I tried gave me nothing but errors.

The internet is dead.

cvhc 40 minutes ago | parent | prev | next [-]

I don't get why that SQL query was even used in the first place. It seems it fetches feature names at runtime instead of using a static hardcoded schema. Considering this decides the schema of a global config, I don't think the dynamicity is a good idea.

wildmXranat an hour ago | parent | prev | next [-]

Hold up: when I used C or a similar language for accessing a database and wanted to clamp down on memory usage, to deterministically control how much I allocated, I would explicitly limit the number of rows in the query.

There was never an unbounded "select all rows from some table" without a "fetch first N rows only" or "limit N".

If you knew that this design is rigid, why not leverage the query to actually enforce it?

What am I missing?

alhirzel 23 minutes ago | parent | prev | next [-]

> I worry this is the big botnet flexing.

Even worse - the small botnet that controls everything.

ksajadi 2 hours ago | parent | prev | next [-]

May I just say that Matthew Prince is the CEO of Cloudflare and a lawyer by training (and a very nice guy overall). The quality of this postmortem is great but the fact that it is from him makes one respect the company even more.

yoyohello13 an hour ago | parent | prev | next [-]

People really like to hate on Rust for some reason. This wasn't a Rust problem; no language would have saved them from this kind of issue. If anything, Rust at least surfaced the possible failure at the call site.

I get it, don't pick languages just because they are trendy, but if any company's use case is a perfect fit for Rust, it's Cloudflare.

SchemaLoad an hour ago | parent [-]

Yeah, even if you had handled this without unwrap() and just gone down an error path that didn't panic, the service would likely still be inoperable if every single request took that error path.

sema4hacker an hour ago | parent | prev | next [-]

If you deploy a change to your system, and things start to go wrong that same day, the prime suspect (no matter how unlikely it might seem) should be the change you made.

1970-01-01 31 minutes ago | parent | next [-]

This is an area where they're allowed to think "yet another record-setting DDoS attack" first and "bad config" second.

wildmXranat 31 minutes ago | parent | prev [-]

My first question when faced with an unknown error is "What was the last change and when was it promoted?"

back_to_basics 37 minutes ago | parent | prev | next [-]

While it's certainly worthwhile to discuss the technical and procedural elements that contributed to this service outage, the far more important (and mutually exclusive) aspect to discuss should be:

Why have we built / permitted the building of / Subscribed to such a Failure-intolerant "Network"?

1970-01-01 an hour ago | parent | prev | next [-]

I would have been a bit cheeky and opened with 'It wasn't DNS.'

arjie an hour ago | parent | prev | next [-]

Great post-mortem. Very clear. Surprised that num(panicking threads) didn't show up somewhere in telemetry.

slyall an hour ago | parent | prev | next [-]

Ironically just now I got a Cloudflare "Error code 524" page because blog.cloudflare.com was down

nanankcornering an hour ago | parent | prev | next [-]

Matt, looking forward to you regaining the trust of Elon and his team so they use CF again.

wilg 31 minutes ago | parent [-]

I wish Elon would regain my trust!

sigmar 2 hours ago | parent | prev | next [-]

Wow. What a post mortem. Rather than Monday-morning quarterbacking the many ways this could have been prevented, I'd love to hear people sound off on things that unexpectedly broke. I, for one, did not realize that logging in to Porkbun to edit DNS settings would become impossible during a Cloudflare meltdown.

nullbyte808 2 hours ago | parent | prev | next [-]

I thought it was an internal mess-up. I thought an employee screwed a file up. Old methods are sometimes better than new. AI fails us again!

chatmasta an hour ago | parent | prev | next [-]

Wow, crazy disproportionate drop in the stock price… good buying opportunity for $NET.

rvz 11 minutes ago | parent [-]

Agree.

Cloudflare is very cheap at these prices.

zzzeek 2 hours ago | parent | prev | next [-]

> Instead, it was triggered by a change to one of our database systems' permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system.

And here is the query they used ** (OK, so it's not exactly):

     SELECT * from feature JOIN permissions on feature.feature_type_id = permissions.feature_type_id
someone added a new row to permissions and the JOIN started returning two dupe feature rows for each distinct feature.

** "here is the query" is used for dramatic effect. I have no knowledge of what kind of database they are even using, much less the queries (but I do have an idea).

more edits: OK, apparently it's described later in the post as a query against ClickHouse's table metadata, and because users were granted access to an additional database that was actually the backing store for the one they normally worked with, some row-level-security type of thing doubled up the rows. Not sure why querying system.columns is part of a production-level query, though; seems overly dynamic.

captainkrtek 2 hours ago | parent [-]

I believe they mentioned ClickHouse

nawgz 2 hours ago | parent | prev | next [-]

> a change to one of our database systems' permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system ... to keep [that] system up to date with ever changing threats

> The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail

A configuration error can cause internet-scale outages. What an era we live in

Edit: also, after finishing my reading, I have to express some surprise that this type of error wasn't caught in a staging environment. If the entire error is that "during migration of ClickHouse nodes, the migration -> query -> configuration file pipeline caused configuration files to become illegally large", it seems intuitive to me that doing this same migration in staging would have identified this exact error, no?

I'm not big on distributed systems by any means, so maybe I'm overly naive, but frankly posting a faulty Rust code snippet that was unwrapping an error value without checking for the error didn't inspire confidence for me!

shoo 36 minutes ago | parent | next [-]

The speed and transparency of Cloudflare publishing this post mortem are excellent.

I also found the "remediation and follow up" section a bit lacking, not mentioning how, in general, regressions in query results caused by DB changes could be caught in future before they get widely rolled out.

Even if a staging env didn't have a production-like volume of data to trigger the same failure mode of a bot management system crash, there's also an opportunity to detect that something has gone awry if there were tests that the queries were returning functionally equivalent results after the proposed permission change. A dummy dataset containing a single http_requests_features column would suffice to trigger the dupe results behaviour.

In theory there are a few general ways this kind of issue could be detected, e.g. someone or something doing a before/after comparison to check that the DB permission change did not regress query results for common DB queries, for changes that are expected not to cause functional changes in behaviour.

Maybe it could have been detected with an automated test suite of the form "spin up a new DB, populate it with some curated toy dataset, then run a suite of important queries we must support and check the results are still equivalent (after normalising row order etc.) to known-good golden outputs". This style of regression testing is brittle, burdensome to maintain and error-prone when you need to make functional changes and update what the "golden" outputs are - but it can give a pretty high probability of detecting that a DB change has caused unplanned functional regressions in query output, and you can find out about this in a dev environment or CI before a proposed DB change goes anywhere near production.
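
A rough sketch of that golden-output idea (everything here is hypothetical; a real version would run the metadata query against a disposable ClickHouse seeded with a toy dataset rather than a stub):

    // Normalise rows so comparisons are order-insensitive but duplicates still show up.
    fn normalise(mut rows: Vec<(String, String)>) -> Vec<(String, String)> {
        rows.sort();
        rows
    }

    // Stand-in for "run the column-metadata query against a test database".
    fn query_feature_columns() -> Vec<(String, String)> {
        vec![
            ("feature_a".to_string(), "Float64".to_string()),
            ("feature_b".to_string(), "Float64".to_string()),
        ]
    }

    #[test]
    fn feature_columns_match_golden() {
        let golden = vec![
            ("feature_a".to_string(), "Float64".to_string()),
            ("feature_b".to_string(), "Float64".to_string()),
        ];
        // A permission change that doubles the rows fails this assert in CI,
        // long before any generated config file reaches production.
        assert_eq!(normalise(query_feature_columns()), normalise(golden));
    }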

mewpmewp2 2 hours ago | parent | prev | next [-]

It would only have been caught in staging if there was a similar amount of data in the database. If staging had half the data, it would never have occurred there. It's not super clear how easy it would have been to keep the staging database matching production in terms of quantity and similarity of data, etc.

I think it's quite rare for any company to have the same scale and size of data in staging as in prod.

Aeolun 2 hours ago | parent [-]

> I think it's quite rare for any company to have the same scale and size of data in staging as in prod.

We’re like a millionth the size of Cloudflare and we have automated tests for all (sort of) queries to see what would happen with 20x more data.

Mostly to catch performance regressions, but it would work to catch these issues too.

I guess that doesn’t say anything about how rare it is, because this is also the first company at which I get the time to go to such lengths.

mewpmewp2 2 hours ago | parent [-]

But now consider how much extra data Cloudflare, at its size, would have to keep just for staging, doubling (or more) their costs to make staging match production. They would have to simulate a similar volume of requests on top of themselves constantly, since presumably they have hundreds or thousands of deployments per day.

In this case the database table in question (the ML features) seems modest in size, so naively they could at least have kept staging features in sync with prod, but maybe they didn't consider that 55 rows vs 60 rows (or similar) could be a breaking point given this specific bug.

It is much easier to test with 20x data if you don't have the amount of data Cloudflare probably handles.

Aeolun an hour ago | parent [-]

That just means it takes longer to test. It may not be possible to do it in a reasonable timeframe with the volumes involved, but if you already have 100k servers running to serve 25M requests per second, maybe briefly booting up another 100k isn’t going to be the end of the world?

Either way, you don’t need to do it on every commit, just often enough that you catch these kinds of issues before they go to prod.

norskeld 2 hours ago | parent | prev | next [-]

This wild `unwrap()` kinda took me aback as well. Someone really believed in themselves writing this. :)

Jach 2 hours ago | parent [-]

They only recently rewrote their core in Rust (https://blog.cloudflare.com/20-percent-internet-upgrade/). Given the newness of the system and things like "Over 100 engineers have worked on FL2, and we have over 130 modules", I won't be surprised by further similar incidents.

jmclnx 2 hours ago | parent | prev [-]

I have to wonder if AI was involved with the change.

norskeld 2 hours ago | parent [-]

I don't think this is the case with Cloudflare, but for every recent GitHub outage or performance issue... oh boy, I blame the clankers!

jijji an hour ago | parent | prev | next [-]

This is where change management really shines, because in a change-management environment this would have been prevented by a backout procedure, and it would never have been rolled out to production before going through QA, with peer review happening before that... I don't know if they lack change management, but it's definitely something to think about.

mercnz an hour ago | parent [-]

I think that falls short because it's data rather than code; in a way you need both stringent code and more safeguarded code. It's like if everyone sends you 64kb POSTs because that's all your proxy layer lets in, and someone once checked that sending 128kb gave an error before it reached your app. Then the proxy layer changes, someone sends 128kb, and your app crashes because it asserted on anything over 64kb.

Actually tracking issues with erroneous, overflowing data isn't so much code testing as fuzz testing, brute-force testing, etc., which I think people should do. But that means you need strong test networks, and those test networks may need to be more internet-like to reflect real issues, so the whole testing infrastructure in itself becomes difficult to get right. They have their own tunneling system and so on; they could segregate some of their servers and build a test system with better error diagnosis.

To my mind, though, if they had better error propagation back that really identified what was happening and where, that would be a lot better in general; sure, start doing that on a test network. This is something I've been thinking about in general: I made a simple RPC system for sending real-time Rust tracing logs back from multiple end servers (it lets you use the normal tracing framework with a thin RPC layer), but that's mostly for granular debugging. I've never quite understood why systems like systemd-journald aren't more network-centric when they're going to be big, complex kitchen-sink approaches. Apparently there's D-Bus support, but to my mind there's room for something in between debug level and warning/info: even at, say, a twentieth of the volume of info logging, if we could see things like large files getting close to limits as the system runs, and whether a problem is localised or common, it would help build more resilient systems. Something may already exist along these lines, but I didn't come across anything reasonably passive; I mean, there are debugging tools like dtrace that have been around for ages.

rvz 2 hours ago | parent | prev | next [-]

Great write up.

This is the first significant outage that has involved Rust code, and as you can see, .unwrap() is known to carry the risk of a panic and should never be used in production code.

binarymax 2 hours ago | parent | prev | next [-]

28M 500 errors/sec for several hours from a single provider. Must be a new record.

No other time in history has a single company been responsible for so much commerce and traffic. I wonder what the analogous outages of the pre-internet age would be.

captainkrtek 2 hours ago | parent | next [-]

Something like a major telco going out, for example the AT&T 1990 outage of long distance calling:

> The standard procedures the managers tried first failed to bring the network back up to speed and for nine hours, while engineers raced to stabilize the network, almost 50% of the calls placed through AT&T failed to go through.

> Until 11:30pm, when network loads were low enough to allow the system to stabilize, AT&T alone lost more than $60 million in unconnected calls.

> Still unknown is the amount of business lost by airline reservations systems, hotels, rental car agencies and other businesses that relied on the telephone network.

https://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collap...

alhirzel 20 minutes ago | parent | prev | next [-]

> I wonder what some outage analogs to the pre-internet ages would be.

Lots of things have the sky in common. Maybe comet-induced ice ages...

nullbyte808 2 hours ago | parent | prev | next [-]

Yes, all (or most) eggs should not be in one basket. Perfect opportunity to set up a service that checks Cloudflare and then switches a site's DNS to Akamai as a backup.

manquer 2 hours ago | parent | prev | next [-]

Absolute volume, maybe[1]; as a relative share of global digital communication traffic, the early telegraph era probably has it beat.

In the pre-digital era, the East India Company dwarfed every other company by considerable margins on any metric: commerce controlled, global shipping, communication traffic, private army size, % of GDP, % of workforce employed.

Throughout history the default was the large consolidated organization, like, say, Bell Labs, or Standard Oil before that, and so on; only for brief periods have we enjoyed the benefits of true capitalism.

[1] Although I suspect the recent AWS or MS/Azure downtimes in the last couple of years were likely higher.

adventured 2 hours ago | parent | prev [-]

> No other time in history has one single company been responsible for so much commerce and traffic.

AWS very likely has Cloudflare beat in commerce responsibility. Amazon is equal to ~2.3% of US GDP by itself.

moralestapia 2 hours ago | parent | prev | next [-]

No publicity is bad publicity.

Best post mortem I've read in a while; this thing will be studied for years.

A bit ironic that their internal FL2 tool is supposed to make Cloudflare "faster and more secure" but brought a lot of things down. And yeah, as others have already pointed out, that's a very unsafe use of Rust and should never have made it to production.

issafram 35 minutes ago | parent | prev | next [-]

I give them a pass on lots of things, but this is inexcusable

rawgabbit 2 hours ago | parent | prev | next [-]

> The change explained above resulted in all users accessing accurate metadata about tables they have access to. Unfortunately, there were assumptions made in the past, that the list of columns returned by a query like this would only include the “default” database:

      SELECT
        name,
        type
      FROM system.columns
      WHERE
        table = 'http_requests_features'
      order by name;

> Note how the query does not filter for the database name. With us gradually rolling out the explicit grants to users of a given ClickHouse cluster, after the change at 11:05 the query above started returning “duplicates” of columns because those were for underlying tables stored in the r0 database.
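
One defensive way to consume that result, sketched with hypothetical types: collapse duplicates by column name and fail loudly only on a genuine conflict, so a grant that exposes the same columns through a second database can't double the feature count.

    use std::collections::BTreeMap;
    use std::collections::btree_map::Entry;

    fn dedupe_columns(rows: Vec<(String, String)>) -> Result<Vec<(String, String)>, String> {
        let mut by_name: BTreeMap<String, String> = BTreeMap::new();
        for (name, ty) in rows {
            match by_name.entry(name) {
                Entry::Vacant(slot) => {
                    slot.insert(ty);
                }
                Entry::Occupied(slot) => {
                    // Same column visible via another database (e.g. r0): ignore the
                    // duplicate, but refuse to build a config on a genuine conflict.
                    if *slot.get() != ty {
                        return Err(format!(
                            "column {}: conflicting types {} vs {}",
                            slot.key(), slot.get(), ty
                        ));
                    }
                }
            }
        }
        Ok(by_name.into_iter().collect())
    }

    fn main() {
        let rows = vec![
            ("feature_a".to_string(), "Float64".to_string()),
            ("feature_a".to_string(), "Float64".to_string()), // duplicate row from the r0 database
            ("feature_b".to_string(), "Float64".to_string()),
        ];
        println!("{:?}", dedupe_columns(rows)); // Ok: each column appears once
    }
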
0xbadcafebee 2 hours ago | parent | prev [-]

So, to recap:

  - Their database permissions changed unexpectedly (??)
  - This caused a 'feature file' to be changed in an unusual way (?!)
     - Their SQL query made assumptions about the database; their permissions change thus resulted in queries getting additional results, permitted by the query
  - Changes were propagated to production servers which then crashed those servers (meaning they weren't tested correctly)
     - They hit an internal application memory limit and that just... crashed the app
  - The crashing did not result in an automatic backout of the change, meaning their deployments aren't blue/green or progressive
  - After fixing it, they were vulnerable to a thundering herd problem
  - Customers who were not using bot rules were not affected; Cloudflare's bot scorer generated a constant bot score of 0, meaning all traffic was scored as bots
In terms of preventing this from a software engineering perspective, they made assumptions about how their database queries work (and didn't validate the results), and they ignored their own application limits and didn't program in either a test for whether an input would hit a limit, or some kind of alarm to notify the engineers of the source of the problem.

From an operations perspective, it would appear they didn't test this on a non-production system mimicking production; they then didn't have a progressive deployment; and they didn't have a circuit breaker to stop the deployment or roll back when a newly deployed app started crashing.
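
A back-of-the-envelope sketch of the progressive-deployment point (numbers and names are made up): gate each stage of a rollout on the canary slice's error rate versus the fleet baseline.

    // Halt the rollout if canary nodes regress meaningfully against the baseline.
    fn should_continue(canary_error_rate: f64, baseline_error_rate: f64) -> bool {
        // Hypothetical policy: allow 2x the baseline plus a small absolute margin.
        canary_error_rate <= baseline_error_rate * 2.0 + 0.001
    }

    fn main() {
        let baseline = 0.0004; // 0.04% of requests erroring fleet-wide
        let canary = 0.25;     // 25% erroring on canary nodes after the new feature file lands
        if should_continue(canary, baseline) {
            println!("promote the change to the next stage");
        } else {
            println!("halt and revert to the last known-good feature file");
        }
    }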

tptacek 2 hours ago | parent | next [-]

People jump to say things like "where's the rollback" and, like, probably yeah, but keep in mind that speculative rollback features (that is: rollbacks built before you've experienced the real error modes of the system) are themselves sources of sometimes-metastable distributed system failures. None of this is easy.

paulddraper an hour ago | parent | prev [-]

Looks like you have the perfect window to disrupt them with a superior product.

mercnz 42 minutes ago | parent [-]

Just before this outage I was exploring BunnyCDN, as the idea of Cloudflare taking over DNS still irks me slightly. There are competitors, but there's a certain amount of scale that Cloudflare offers which I think can help performance in general.

That said, in the past I found Cloudflare's performance terrible when I was doing lots of testing. They are predominantly a pull-based system, not push, so if content isn't current the cache-miss performance can be kind of blah. I think their general backhaul paths have improved, but at least from New Zealand they used to seem to do worse than hitting a Los Angeles proxy that then hits the origin. (Google was in a similar position before, where both 8.8.8.8 and www.google.co.nz/.com were faster via Los Angeles than via the normal paths; I think Google was parenting through Asia, so 8.8.8.8 cache misses went somewhere super far away.)

I think now that we have HTTP/3 etc. that kind of performance is a bit simpler to achieve, and that DDoS and bot protection are the real differentiators, and I think Cloudflare's bot protection may work reasonably well in general?