Roll back is not always the right answer. I can’t speak to its appropriateness in this particular situation, of course, but sometimes “roll forward” is the better solution.

▲

flaminHotSpeedo 2 hours ago | parent | next [-]

Like the other poster said, roll back should be the right answer the vast majority of the time. But it's also important to recognize that roll forward should be a replacement for the deployment you decided not to roll back, not a parallel deployment through another system.

I won't say never. But the rollback sounds like it was technically fine to do, just undesirable from a security/business perspective. A situation where the right way to avoid it is a parallel deployment through a radioactive, global-blast-radius, near-instantaneous deployment system that is under intense scrutiny after another recent outage should be about as probable as a bowl of petunias in orbit.

▲

crote 42 minutes ago | parent [-]

Is a roll back even possible at Cloudflare's size?

With small deployments it usually isn't too difficult to re-deploy a previous commit. But once you get big enough, you've got enough developers that half a dozen PRs will have been merged between the start of the incident and now. How viable is it to stop the world, undo everything, and start from scratch any time a deployment causes the tiniest issue?

Realistically the best you're going to get is merging a revert of the problematic changeset, but with the intervening merges that's still going to bring the system into a novel state. You're rolling forwards, not backwards.

▲

newsoftheday 38 minutes ago | parent | next [-]

If companies like Cloudflare haven't figured out how to do reliable rollbacks, there seems little hope for any of us.

▲

yuliyp 32 minutes ago | parent | prev [-]

I'd presume they have the ability to deploy a previous artifact vs only tip-of-master.
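
To illustrate the distinction (a minimal sketch, not a description of Cloudflare's actual tooling): if every deploy records an immutable build artifact, rolling back means redeploying the previously live artifact, and whatever has been merged since doesn't enter into it. The `Deployer` class and the `build-101` style IDs below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Deployer:
    """Hypothetical tracker of which immutable build artifact is live."""

    history: List[str] = field(default_factory=list)

    def deploy(self, artifact_id: str) -> str:
        # Record the artifact as the newest live deployment.
        self.history.append(artifact_id)
        return artifact_id

    @property
    def live(self) -> Optional[str]:
        # The most recently deployed artifact, or None if nothing is deployed.
        return self.history[-1] if self.history else None

    def roll_back(self) -> str:
        # Redeploy the artifact that was live immediately before the current
        # one. No new build from the main branch is involved.
        if len(self.history) < 2:
            raise RuntimeError("no previous artifact to roll back to")
        return self.deploy(self.history[-2])


deployer = Deployer()
deployer.deploy("build-101")   # known-good release
deployer.deploy("build-102")   # release that caused the incident
deployer.roll_back()           # "build-101" is live again
```

Merging a revert and building from tip-of-master would instead produce a brand-new artifact containing every intervening merge, which is the "novel state" described above.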

▲

echelon 2 hours ago | parent | prev [-]

You want to build a world where roll back is the right thing to do 95% of the time. So that it almost always works and you don't even have to think about it.

During an incident, the incident lead should be able to say to your team's on-call: "can you roll back? If so, roll back" and the on-call engineer should know if it's okay. By default it should be, if you're writing code mindfully.

Certain well-understood migrations are the only cases where roll back might not be acceptable.

Always keep your services in a "rollback-able", "graceful fail", "fail open" state.
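
For anyone unfamiliar with the term, here is a rough sketch of what the "fail open" choice looks like in code. The names (`FailureMode`, `allow_request`, `evaluate_rules`) are made up for illustration. Which mode is right depends on what the check protects: an availability-oriented service may prefer to fail open, while an access-control check would usually want to fail closed.

```python
from enum import Enum
from typing import Callable


class FailureMode(Enum):
    OPEN = "open"      # if the check itself breaks, let traffic through
    CLOSED = "closed"  # if the check itself breaks, block traffic


def allow_request(
    request: dict,
    evaluate_rules: Callable[[dict], bool],
    failure_mode: FailureMode,
) -> bool:
    """Return the rule verdict, or the configured fallback if evaluation fails."""
    try:
        return evaluate_rules(request)
    except Exception:
        # The rule engine is broken; the decision now comes from policy,
        # not from the rules themselves.
        return failure_mode is FailureMode.OPEN


def broken_rules(request: dict) -> bool:
    # Stand-in for a rule engine that crashes after a bad deployment.
    raise RuntimeError("rule engine unavailable")


allow_request({"path": "/"}, broken_rules, FailureMode.OPEN)    # request allowed
allow_request({"path": "/"}, broken_rules, FailureMode.CLOSED)  # request blocked
```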

This requires tremendous engineering consciousness across the entire org. Every team must be a diligent custodian of this. And even then, it will sometimes break down.

Never make code changes you can't roll back from (service calls, data write formats, etc.) without a reason and without informing the team.
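
A small sketch of what a roll-back-safe data write format can look like, assuming a JSON record and an additive change. The field names and the `"USD"` default are hypothetical. The old reader ignores fields it doesn't know, and the new reader supplies a default for a field that old records lack, so either release can read what the other wrote.

```python
import json


def write_old(user_id: int, amount_cents: int) -> str:
    # Record format written by the previous release.
    return json.dumps({"user_id": user_id, "amount_cents": amount_cents})


def write_new(user_id: int, amount_cents: int, currency: str) -> str:
    # New release adds a field; nothing is renamed or removed.
    return json.dumps(
        {"user_id": user_id, "amount_cents": amount_cents, "currency": currency}
    )


def read_old(raw: str) -> dict:
    # Previous release: picks out only the fields it knows, ignores extras.
    data = json.loads(raw)
    return {"user_id": data["user_id"], "amount_cents": data["amount_cents"]}


def read_new(raw: str) -> dict:
    # New release: tolerates old records by defaulting the missing field.
    data = json.loads(raw)
    return {
        "user_id": data["user_id"],
        "amount_cents": data["amount_cents"],
        "currency": data.get("currency", "USD"),  # assumed default
    }


# After a roll back, old code still reads records written by the new code.
rolled_back_view = read_old(write_new(7, 1250, "EUR"))
# After rolling forward again, new code reads records written by old code.
rolled_forward_view = read_new(write_old(7, 1250))
```

A rename or a change in a field's meaning would break one of these two directions, which is the kind of change that needs a reason and a heads-up to the team.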

I've worked on services with billions of dollars in transaction value for most of my career. And unfortunately I've been in billion-dollar outages.

▲

drysart an hour ago | parent [-]

"Fail open" state would have been improper here, as the system being impacted was a security-critical system: firewall rules. It is absolutely the wrong approach to "fail open" when you can't run security-critical operations.