aavshr 5 hours ago

> In short, a latent bug in a service underpinning our bot mitigation capability started to crash after a routine configuration change we made. That cascaded into a broad degradation to our network and other services. This was not an attack.

From the CTO, Source: https://x.com/dok2001/status/1990791419653484646

__turbobrew__ 5 hours ago | parent | next [-]

It still astounds me that the big dogs do not phase config rollouts. Code is data, configs are data, they are one and the same. It was the same issue with the giant CrowdStrike outage last year: they were rawdogging configs globally, a bad config made it out there, and everything went kaboom.

You NEED to phase config rollouts like you phase code rollouts.

crazygringo 3 hours ago | parent | next [-]

The big dogs absolutely do phase config rollouts as a general rule.

There are still two weaknesses:

1) Some configs are inherently global and cannot be phased. There's only one place to set them. E.g. if you run a webapp, this would be configs for the load balancer as opposed to configs for each webserver.

2) Some configs have a cascading effect -- even though a config is applied to only 1% of servers, it affects the other servers they interact with, and a bad thing spreads across the entire network.

creatonez 26 minutes ago | parent [-]

> Some configs are inherently global and cannot be phased

This is also why "it is always DNS". It's not that DNS itself is particularly unreliable, but rather that it is the one area where you can really screw up a whole system by running a single command, even if everything else is insanely redundant.

__turbobrew__ 10 minutes ago | parent [-]

I don’t believe anything inherently requires DNS configs to be global.

You can shard your service behind multiple names:

my-service-1.example.com

my-service-2.example.com

my-service-3.example.com …

Then you can create smoke tests which hit each name as its phase of the DNS change goes out, and if you start getting errors you stop the rollout of the service.
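A minimal sketch of that idea in Python, not anyone's actual tooling. `rollout_by_shard`, `apply_change`, and `smoke_test` are hypothetical names made up for illustration; in practice `apply_change` would update the records for one name and `smoke_test` would send real requests to it. The host names are the ones listed above.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Shard names from the comment above; the list length is arbitrary.
SHARDS = [
    "my-service-1.example.com",
    "my-service-2.example.com",
    "my-service-3.example.com",
]


def rollout_by_shard(
    shards: Iterable[str],
    apply_change: Callable[[str], None],
    smoke_test: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Apply a change one shard at a time, stopping at the first failure.

    Returns (shards that passed, shard that failed or None).
    Both callables are hypothetical hooks supplied by the caller.
    """
    passed: List[str] = []
    for shard in shards:
        apply_change(shard)          # change only this one name
        if not smoke_test(shard):    # probe it before touching the next
            return passed, shard     # halt: later shards are untouched
        passed.append(shard)
    return passed, None
```

With a smoke test that fails on the second name, the third name never receives the change, which is the whole point of sharding the names.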

siegecraft 3 hours ago | parent | prev | next [-]

I think it's uncharitable to jump to the conclusion that just because there was a config-based outage they don't do phased config rollouts. And even more uncharitable to compare them to CrowdStrike.

__turbobrew__ 2 hours ago | parent | next [-]

I have read several Cloudflare postmortems and my confidence in their systems is pretty low. They used to run their entire control plane out of a single datacenter, which is amateur hour for a tech company that has over $60 billion in market cap.

I also don’t understand how it is uncharitable to compare them to CrowdStrike, as both companies run critical systems that affect a large number of people’s lives, and both companies seem to have outages at a similar rate (if anything, Cloudflare breaks more often than CrowdStrike).

cyberpunk 3 hours ago | parent | prev [-]

It seems fairly logical to me? If a config change causes services to crash then rollout stops … at least in every phased rollout system I’ve ever built…
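For illustration only, here is roughly what "crash means the rollout stops" can look like in Python. This is a sketch under assumptions, not a description of Cloudflare's or anyone else's system: `phased_rollout`, `apply_config`, `is_healthy`, and `rollback` are hypothetical names, and the stage fractions are arbitrary.

```python
from typing import Callable, List, Sequence


def phased_rollout(
    servers: Sequence[str],
    stages: Sequence[float],
    apply_config: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Push a config to growing fractions of the fleet.

    After each stage, every updated server is health-checked. If any
    check fails, all updated servers are rolled back and the rollout
    stops. Returns True only if every stage completed.
    """
    updated: List[str] = []
    for fraction in stages:
        # Cumulative number of servers that should be updated by now.
        target = max(1, round(len(servers) * fraction))
        for server in servers[len(updated):target]:
            apply_config(server)
            updated.append(server)
        if not all(is_healthy(s) for s in updated):
            for server in reversed(updated):
                rollback(server)
            return False
    return True
```

Called with stages like `(0.01, 0.10, 0.50, 1.0)` on a 100-server fleet, a config that crashes its host is reverted after touching a single server. It does nothing for the cascading case raised upthread, where the 1% that received the config breaks the other 99%.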

JohnMakin 5 hours ago | parent | prev | next [-]

In a company I am no longer with I argued much the same when we rolled out "global CI/CD" on IAC. You made one change, committed and pushed, wham it's on 40+ server clusters globally. I hated it. The principal was enamored with it, "cattle not pets" and all that, but the result was things slowed down considerably because anyone working with it became so terrified of making big changes.

wbl 4 hours ago | parent | prev | next [-]

Then you get customer visible delays.

immibis an hour ago | parent | prev [-]

Because adversaries adapt quickly, they have a system that deploys their counter-adversary bits quickly without phasing, no matter whether they call them code or configs. See also: CrowdStrike.

JohnMakin 5 hours ago | parent | prev | next [-]

Wish this could rocket to the top of the comment thread. Digging through hundreds of comments speculating about a cyberattack to find this felt silly.

imdsm 5 hours ago | parent | prev [-]

Configuration changes are dangerous for CF it seems, and this one knocked $NET down almost 4% today. I wonder what the industry-wide impact is for each of these outages?

sammy2255 3 hours ago | parent [-]

Premarket was red for all tech stocks today before the outage even happened.