jart 4 days ago

fly.io publishes their post-mortems here: https://fly.io/infra-log/

The last post-mortem they wrote is very interesting and full of details. Basically, back in 2016 the heart or keystone component of fly.io's production infrastructure was called consul, which is a highly secure TLS server that tracks shared state and requires that both the server certificate and the client certificate be authenticated. Since it was centralized, it had scaling issues, so fly.io wrote a replacement for it in 2020 called corrosion, and quickly forgot about consul, but didn't have the heart to kill it. Then in October 2024 consul's root signing key expired, which brought down all connectivity, and since it uses bidirectional authentication, they couldn't bring it back online until they deployed new SSL certificates to every machine in their fleet. Somehow they did this in half an hour, but the chain of dominoes had already been set in motion, revealing other weaknesses in their infrastructure that they could eliminate. There was this other internal service whose own independent set of TLS keys had also expired long ago, but they didn't notice until they tried rebooting it as part of the consul rekey, since doing so severed the TCP connections it had established way back when its certificate was valid. Plus, the whole time this was happening, their logging tools were DDoSing their network provider. It took some real heroes to save the company and all their customers too when that many things explode at once.
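To make the expiry failure mode concrete, here is a minimal sketch of the kind of check that flags certificates nearing their notAfter date. It is not anything fly.io is known to run; the service names, dates, and the 30-day threshold are made up for illustration, and it only uses Python's standard `ssl.cert_time_to_seconds` to parse the date format that certificates report.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter string.

    `not_after` uses the format Python's ssl module reports,
    e.g. "Oct 15 00:00:00 2024 GMT". A negative result means expired.
    """
    expires = ssl.cert_time_to_seconds(not_after)  # seconds since epoch, UTC
    return (expires - now.timestamp()) / 86400


def expiring_soon(certs: dict[str, str], now: datetime, warn_days: int = 30) -> list[str]:
    """Names of certificates that expire within `warn_days` (or already have)."""
    return sorted(
        name
        for name, not_after in certs.items()
        if days_until_expiry(not_after, now) < warn_days
    )


if __name__ == "__main__":
    # Hypothetical inventory: names and dates are illustrative only.
    inventory = {
        "consul-ca": "Oct 15 00:00:00 2024 GMT",
        "internal-service": "Jan  1 00:00:00 2022 GMT",
        "corrosion": "Oct 15 00:00:00 2026 GMT",
    }
    now = datetime(2024, 10, 1, tzinfo=timezone.utc)
    for name in expiring_soon(inventory, now):
        print(name, round(days_until_expiry(inventory[name], now), 1))
```

Note that a check like this only helps if it looks at the certificate on disk or in an inventory. A long-lived TCP connection set up while the certificate was still valid keeps working after expiry, which is how the second service in the story went unnoticed until it was restarted.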

ignoramous 4 days ago | parent [-]

On that Consul outage, Fly Infra concludes, "The moral of the story is, no more half-measures."

On their careers page [1], the Fly team goes, "We're not big believers in tech debt."

As an outsider, reads like a cacophony of contradictions?

[1] https://fly.io/docs/hiring/working/#we-re-ruthless-about-doi...

jart 4 days ago | parent | next [-]

No one actually lives up to their principles, but it's still important that we have them.

If you actually do live up to yours, then you need to adopt better principles.

whilenot-dev 4 days ago | parent [-]

No principle is beyond critique, agreed, but it's still the choice to pick this specific principle that tells the whole story. There are so many principles to pick from, and the tech debt pick follows up with a "We have a 3-month “no refactoring” rule for new hires. This isn’t everyone’s preferred work style! We try to be up front about stuff.", which sounds a bit like an additional "perform or else..." principle that just delays ownership of the stuff you're supposed to work with. In the best case that sounds like naive optimism and in the worst case that's gross negligence... neither one speaks "engineering" to me.

tptacek 3 days ago | parent [-]

It is absolutely not a "perform or else" rule. Why are you reading so far into this? We really do have a rule about tech-debt changes, and it's a useful insight into why you might or might not want to work here, which is why we bring it up, despite the possibility it might alienate people; we'd like to be as honest as we can be. Worrying about people reading hustle-culture bullshit into stuff like this is a reason not to be transparent, which sucks.

tptacek 3 days ago | parent | prev | next [-]

All the other comments aside: these aren't even contradictory statements. We really do have no-tech-debt rules, and they generally have not been responsible for our outages. Consul wasn't tech debt; it was a carefully made decision (that I happen to disagree with and enjoy thinking about Mike Ehrmantraut shooting in the face).

We're just people, working on building a thing.

https://www.youtube.com/watch?v=ghNJxYP5Ses

Also: stop calling yourself an "outsider". You follow us as closely as anybody. :)

pajeetz 3 days ago | parent [-]

People hosting their business with a cloud hosting provider don't care about your technical debt. We care about our businesses not going down for several hours and then being gaslit that it's normal and told by the founder to expect more in the future.

tptacek 3 days ago | parent [-]

If you'd be happier without the companies involved in stories commenting here, then by all means get more people to write comments like this and see if you can chase them away. I think you won't have so much luck with me, but it might work with other companies. Nobody is gaslighting you.

Aeolun 4 days ago | parent | prev | next [-]

Two contradictory statements do not read like a 'cacophony' of anything to me xD I think you need a whole lot more than two to do that word justice.

JimDabell 4 days ago | parent | next [-]

“No more half-measures” and “We’re not big believers in tech debt” aren’t even contradictory statements, let alone a cacophony of them.

mattgreenrocks 4 days ago | parent | prev [-]

The comment section doing what it does best!

ignoramous 3 days ago | parent [-]

For brevity I chose to put up only the conclusion from a postmortem (of which I've read plenty by now) and another point from their otherwise comparatively shorter careers page, which imo capture the inherent tension between building out fast & building out right. This is not something I've started complaining about today or yesterday. I've used Fly in prod for 4 years and spilled much ink on this topic on their forums already. Even if I critique, I remain optimistic about Fly despite the seemingly endless list of failure modes that building such complex systems entails: https://community.fly.io/t/fly-down/10224/15

(personally speaking, I'm humble enough because I can hardly build a toy side-project right!)

bdcravens 4 days ago | parent | prev [-]

"full measures" aren't the same thing as tech debt. Complexity isn't even the same thing as tech debt.