| ▲ | Fly.io outage – resolved(status.flyio.net) |
| 241 points by punkpeye 4 days ago | 284 comments |
| |
|
| ▲ | benhoyt 4 days ago | parent | next [-] |
| My fly.io-hosted website went down for 5 minutes (6 hours ago), but then came right back up, and has been up ever since. I use a free monitoring service that checks it every 5 minutes, so it's possible it missed another short bit of downtime. But fly.io has been pretty reliable overall for me! |
| |
| ▲ | nomilk 4 days ago | parent | next [-] | | Would be fascinated to see your data over a period of months. Application uptime is flaky, but what was worse were fly deploys failing for no clear reason. Sometimes layers would just hang and eventually fail for no particular reason; I'd run the same command an hour or two later without any changes and it would just work as expected. I'd love to make a monitoring service to deploy a basic app (i.e. run the fly deploy command) every 5 minutes and see how often those deploys fail or hang. I'd guess ~5% inexplicably fail, which is frustrating unless you've got a lot of spare time. | | |
| ▲ | jrockway 4 days ago | parent | next [-] | | I used to run a service that created k8s clusters on GCP for our customers. We did want to check that that functionality kept working and had a prober test it periodically. It was actually broken a lot. Always good to monitor your dependencies if you have the time. Then when someone complains about an issue in your service, you can check your monitoring to see if your upstream services are broken. If they are, at least you know where to start debugging. | |
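The deploy prober described in the two comments above could be sketched roughly as follows. This is a minimal illustration, not anything either commenter actually ran: `ProbeResult`, `run_probe`, and `failure_rate` are hypothetical names, and the 600-second timeout is an arbitrary assumption for deciding that a deploy has "hung".

```python
import subprocess
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ProbeResult:
    ok: bool
    detail: str


def run_probe(cmd: Sequence[str], timeout_s: float = 600.0) -> ProbeResult:
    """Run one probe command; a non-zero exit or a hang counts as a failure."""
    try:
        proc = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The command hung past the (assumed) timeout.
        return ProbeResult(False, "timed out")
    except OSError as exc:
        # The command could not be started, e.g. the CLI is not installed.
        return ProbeResult(False, f"could not start: {exc}")
    if proc.returncode != 0:
        return ProbeResult(False, f"exit code {proc.returncode}")
    return ProbeResult(True, "ok")


def failure_rate(results: List[ProbeResult]) -> float:
    """Fraction of probes that failed; 0.0 when nothing has run yet."""
    if not results:
        return 0.0
    return sum(1 for r in results if not r.ok) / len(results)
```

A scheduler such as cron could call `run_probe(["fly", "deploy"])` against a throwaway app every 5 minutes and append each result to a log. `failure_rate` over that log would then show whether the "~5% inexplicably fail" guess holds up.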
| ▲ | sanswork 4 days ago | parent | prev | next [-] | | My downtimes from fly are pretty rare but generally global when they happen. In this outage we had no downtime but couldn't deploy for a few hours. I have issues with deploying about once per quarter (I deploy most days across a few apps). | | |
| ▲ | nomilk 4 days ago | parent [-] | | If that’s the case I suspect fly is getting a lot more reliable. I stopped using them about a year ago so haven’t kept up on their reliability since. Glad to hear, it’s good for a competitive market to have many providers, and fly might have issues but hopefully has a bright future | | |
| ▲ | sanswork 4 days ago | parent [-] | | They are definitely getting more reliable. I was an early user and moved off them to self hosted for quite a while because of the frequent downtime in early days. Their support still leaves a lot to be desired even as someone that pays for it but the ease of running and deploying a distributed front end keeps bringing me back. |
|
| |
| ▲ | rozenmd 4 days ago | parent | prev | next [-] | | This may be of interest to you: https://news.ycombinator.com/item?id=42243282 | |
| ▲ | Joel_Mckay 4 days ago | parent | prev [-] | | [flagged] | | |
| ▲ | LorenzoGood 4 days ago | parent [-] | | What does rust have to do with fly.io? | | |
| ▲ | aobdev 4 days ago | parent | next [-] | | Snark aside, Joel is suggesting that because Fly uses rust-based virtualization software they should have a more reliable deployment process. | | |
| ▲ | LorenzoGood 4 days ago | parent | next [-] | | Thanks for clarifying. | |
| ▲ | Joel_Mckay 4 days ago | parent | prev [-] | | [flagged] | | |
| ▲ | nomilk 4 days ago | parent [-] | | By asking directly and someone answering, it solves the problem for the person wondering, but also anyone else wondering (i.e. asking directly scales very nicely). | | |
|
| |
| ▲ | Joel_Mckay 4 days ago | parent | prev [-] | | [flagged] |
|
|
| |
| ▲ | rozenmd 4 days ago | parent | prev | next [-] | | I externally monitor fly.io and its docs here: https://flyio.onlineornot.com/ Looks like it lasted 16 minutes for them. | | |
| ▲ | tptacek 3 days ago | parent [-] | | It wasn't a request routing outage; apps running on Fly.io didn't stop running. It was a deployments outage. For reasons passing understanding (I am reliably informed I'm wrong to complain about this), our website is the same Elixir app as our dashboard, and the dashboard got redeployed at one point. Our website being down is not the same as the whole service being down, though I guess there's a truth-in-advertising poetry to it being down when deployments are busted. | | |
| ▲ | sevenseacat 3 days ago | parent | next [-] | | A lot of apps did stop running - https://community.fly.io/t/fly-io-site-is-currently-inaccess... The entire API was also unusable, not just deployments. | | |
| ▲ | tptacek 2 days ago | parent [-] | | Sorry, you're right: pretty much any time I'm saying deployments are blocked, I'm really saying the API was down. |
| |
| ▲ | itbeho 3 days ago | parent | prev [-] | | I'm not sure if your explanation is comforting or disconcerting. | | |
| ▲ | tptacek 3 days ago | parent | next [-] | | Why not both? Tell me what's comforting and I'll tell you why you shouldn't be comforted; tell me why you're disconcerted and I'll tell you maybe something else. All we can do is be straight about things. | |
| ▲ | pajeetz 3 days ago | parent | prev [-] | | [flagged] | | |
| ▲ | tptacek 3 days ago | parent [-] | | I'm an HN person before I'm a Fly.io person, and as an HN person I find the points you're trying to make --- anybody can see them throughout the thread simply by searching your name --- tedious. As a businessperson, I don't think I have much to gain by genuflecting to the importance of reliability; everybody I care about on this site shares an understanding with us that reliability is important, though apparently not with you that all these systems are fallible. So I'm making the decision not to genuflect, and instead call you out --- you in particular, anonymous, venomous, green-named commenter --- as a writer of boring and facile attempted dunks. | | |
| ▲ | pajeetz 3 days ago | parent [-] | | Are we not allowed to expect reliable uptimes from a cloud provider? What part of "fly.io has a documented history of prolonged downtimes and data redundancy issues" do you disagree with? Are you calling everybody who has had a bad experience with fly.io a liar, including those who, frankly, suffered business and reputation loss as a result of trusting fly.io? | | |
| ▲ | tptacek 3 days ago | parent [-] | | Nobody has called anybody a liar. I'm very comfortable with what I've said thus far on this thread, so maybe we're fine leaving it here. | | |
| ▲ | pajeetz 2 days ago | parent [-] | | is that why you are going through all my comments and flagging and downvoting? you know this just makes you look even worse right? |
| ▲ | davidgl 4 days ago | parent | prev | next [-] | | Same for us, down for ~5 mins, back up and fine, error was 501 | | | |
| ▲ | beezlewax 4 days ago | parent | prev | next [-] | | Do you mind if I ask what monitoring service that is? | | | |
| ▲ | dprotaso 3 days ago | parent | prev | next [-] | | What free monitoring tool do you use? | |
| ▲ | 4 days ago | parent | prev [-] | | [deleted] |
|
|
| ▲ | jart 4 days ago | parent | prev | next [-] |
| fly.io publishes their post-mortems here: https://fly.io/infra-log/ The last post-mortem they wrote is very interesting and full of details. Basically back in 2016 the heart or keystone component of fly.io production infrastructure was called consul, which is a highly secure TLS server that tracks shared state and it requires that both the server certificate and the client certificate be authenticated. Since it was centralized, it had scaling issues, so fly.io wrote a replacement for it in 2020 called corrosion, and quickly forgot about consul, but didn't have the heart to kill it. Then in October 2024 consul's root key signing key expired, which brought down all connectivity, and since it uses bidirectional authentication, they couldn't bring it back online until they deployed new SSL certificates to every machine in their fleet. Somehow they did this in half an hour, but the chain of dominoes had already been set in motion to reveal other weaknesses in their infrastructure that they could eliminate. There was this other internal service whose own independent set of TLS keys had also expired long ago, but they didn't notice until they tried rebooting it as part of the consul rekey, since doing so severed the TCP connections it had established way back when its certificate was valid. Plus the whole time this is happening, their logging tools are DDOSing their network provider. It took some real heroes to save the company and all their customers too when that many things explode at once. |
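The failure mode described above, a certificate or signing key silently passing its expiry date, is the kind of thing a periodic check can flag in advance. The sketch below shows only the date arithmetic for a single server certificate. It is an illustration under assumptions, not Fly.io's tooling: `days_until_expiry`, `needs_rotation`, `fetch_not_after`, and the 30-day warning window are all hypothetical. A mutual-TLS setup with a private CA like the one described would also need the CA and client certificates checked.

```python
import socket
import ssl

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now_epoch: float) -> float:
    """Days between now_epoch and a certificate's notAfter timestamp.

    not_after uses the format returned by SSLSocket.getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT". A negative result means the
    certificate has already expired.
    """
    return (ssl.cert_time_to_seconds(not_after) - now_epoch) / SECONDS_PER_DAY


def needs_rotation(not_after: str, now_epoch: float, warn_days: float = 30.0) -> bool:
    """True when the certificate expires within warn_days (assumed threshold)."""
    return days_until_expiry(not_after, now_epoch) <= warn_days


def fetch_not_after(host: str, port: int = 443, timeout_s: float = 5.0) -> str:
    """Read notAfter from a live server's certificate.

    The default context verifies the chain, so an already-expired
    certificate raises ssl.SSLError during the handshake instead of
    returning a date; that error is itself an alert-worthy signal.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Keeping the arithmetic separate from the network call means the threshold logic can be exercised with fixed timestamps, and the same functions can be pointed at dates read from certificate files rather than live sockets.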
| |
| ▲ | ignoramous 4 days ago | parent [-] | | On that Consul outage, Fly Infra concludes, "The moral of the story is, no more half-measures." On their careers page [1], the Fly team goes, "We're not big believers in tech debt." As an outsider, reads like a cacophony of contradictions? [1] https://fly.io/docs/hiring/working/#we-re-ruthless-about-doi... | | |
| ▲ | jart 4 days ago | parent | next [-] | | No one actually lives up to their principles, but it's still important that we have them. If you actually do live up to yours, then you need to adopt better principles. | | |
| ▲ | whilenot-dev 4 days ago | parent [-] | | Any principle in itself isn't without critique, agree, but it's still the choice being made to pick this specific principle that tells the whole story. There are so many principles to pick from and the tech debt pick follows up with a "We have a 3-month “no refactoring” rule for new hires. This isn’t everyone’s preferred work style! We try to be up front about stuff.", which sounds a bit like an additional perform or else... principle that just delays ownership of the stuff you're supposed to work with. In the best case that sounds like naive optimism and in the worst case that's gross negligence... neither one speaks "engineering" to me. | | |
| ▲ | tptacek 3 days ago | parent [-] | | It is absolutely not a "perform or else" rule. Why are you reading so far into this? We really do have a rule about tech-debt changes, and it's a useful insight into why you might or might not want to work here, which is why we bring it up, despite the possibility it might alienate people; we'd like to be as honest as we can be. Worrying about people reading hustle-culture bullshit into stuff like this is a reason not to be transparent, which sucks. |
|
| |
| ▲ | tptacek 3 days ago | parent | prev | next [-] | | All the other comments aside: these aren't even contradictory statements. We really do have no-tech-debt rules, and they generally have not been responsible for our outages. Consul wasn't tech debt; it was a carefully made decision (that I happen to disagree with and enjoy thinking about Michael Ehrmantraut shooting in the face). We're just people, working on building a thing. https://www.youtube.com/watch?v=ghNJxYP5Ses Also: stop calling yourself an "outsider". You follow us as closely as anybody. :) | | |
| ▲ | pajeetz 3 days ago | parent [-] | | People hosting their business with a cloud hosting provider don't care about your technical debt; we care about our businesses not going down for several hours and then being gaslighted that it's normal and told to expect more in the future by the founder. | | |
| ▲ | tptacek 3 days ago | parent [-] | | If you'd be happier without the companies involved in stories commenting here, then by all means get more people to write comments like this and see if you can chase them away. I think you won't have so much luck with me, but it might work with other companies. Nobody is gaslighting you. |
|
| |
| ▲ | Aeolun 4 days ago | parent | prev | next [-] | | Two contradictory statements do not read like a 'cacophony' of anything to me xD I think you need a whole lot more than two to do that word justice. | | |
| ▲ | JimDabell 3 days ago | parent | next [-] | | “No more half-measures” and “We’re not big believers in tech debt” aren’t even contradictory statements, let alone a cacophony of them. | |
| ▲ | mattgreenrocks 3 days ago | parent | prev [-] | | The comment section doing what it does best! | | |
| ▲ | ignoramous 3 days ago | parent [-] | | For brevity I chose to put up only the conclusion from a postmortem (of which I've read plenty by now) and another point from their otherwise comparatively shorter careers page, which imo capture the inherent tension between building out fast & building out right. This is not something I've started complaining about today or yesterday. I've used Fly in prod for 4 years and spilled much ink on this topic on their forums already. Even if I critique, I remain optimistic about Fly despite the seemingly endless list of failure modes building such complex systems entail: https://community.fly.io/t/fly-down/10224/15 (personally speaking, I'm humble enough because I can hardly build a toy side-project right!) |
|
| |
| ▲ | bdcravens 3 days ago | parent | prev [-] | | "full measures" aren't the same thing as tech debt. Complexity isn't even the same thing as tech debt. |
|
|
|
| ▲ | cryptos 4 days ago | parent | prev | next [-] |
| Fly.io seems to be a bit of a mixed bag: https://news.ycombinator.com/item?id=41917436 https://news.ycombinator.com/item?id=35044516 https://news.ycombinator.com/item?id=34742946 https://news.ycombinator.com/item?id=34229751 If a cloud platform doesn't really provide reliability, I'd say it's probably not worth it. You'd be better off just renting a (virtual) server and saving the cloud tax. |
| |
| ▲ | huijzer 4 days ago | parent | next [-] | | For experiments and hobby projects the value proposition is amazing. Where else can you spin up an independent instance for $1.94 per month?* *Note this is for an instance with only 256MB RAM (https://fly.io/docs/about/pricing/), but it's definitely possible to run non-trivial projects on that. Rust-based web servers like Rocket require only about 10MB RAM. Basic PHP servers should also fit from what I can find. | | |
| ▲ | oefrha 4 days ago | parent | next [-] | | There are plenty of better deals as long as you don’t limit yourself to big clouds and clouds with startup-esque landing pages frequently posted to HN. LowEndTalk may be the most well-known place for finding such deals. (Not saying the typical cheap VPS on LowEndTalk has comparable PaaS features. Only responding to parent’s use case of a single cheap instance.) | |
| ▲ | throwaway63467 4 days ago | parent | prev | next [-] | | Best business model in the world, buy stuff in big bags, put it in smaller ones, sell at a multiple of the original price. Fly is mostly (to my knowledge) reselling Netactuate and OVH servers, their main innovation is the developer experience on top, using Docker on a MicroVM based approach. Of course not only that, but I think it’s their main differentiator. Haven’t used that in a while but Scaleway offered ridiculously cheap dedicated ARM hardware close to these price points, not sure if they still do. | | |
| ▲ | huijzer 14 hours ago | parent [-] | | Isn't that most business models? Make it easier to do something. A car is easier than a bike. A grocery store is easier than talking to a farmer. Ordering a cab is easier than walking. A VPS is easier than managing your own hardware. |
| |
| ▲ | input_sh 4 days ago | parent | prev | next [-] | | Nowhere? Because that's a ridiculously low amount of RAM to offer even in your cheapest offerings? You can easily get 4 GB of RAM for $5 from the likes of Hetzner or Hostinger, so that's 16x more RAM for 2.5x the price. One relatively unknown provider I have used in the past offers 2 GB of RAM for €3.6/month (if paid monthly, €3 if annually), so 8x more RAM for 1.5-2x the price. I'm sure I could find something even cheaper, but I'm just looking at providers I have personally used. BTW that dropdown seems to be sorted cheapest > most expensive. If you go to the bottom of the list the price for that same VPS doubles. | | |
| ▲ | KomoD 4 days ago | parent [-] | | > Nowhere? Because that's a ridiculously low amount of RAM to offer even in your cheapest offerings? There's definitely places that offer it... also 512m I know because I've personally bought such plans and that was $5-10/yr because I didn't need dedicated ipv4. |
| |
| ▲ | TiredOfLife 4 days ago | parent | prev | next [-] | | Oracle free is one 4 core 24gb ram vps + 2 dualcore amd vps. | | |
| ▲ | treesknees 3 days ago | parent [-] | | And actually, it's the resources that are free (CPU, memory, network) and you're allowed to split them up into multiple VMs if you want to. One of my VMs had an uptime of more than 1050 days before the infrastructure rebooted it, so in terms of availability they've certainly surprised me. The only downside I've come across with Oracle Free is that the 'best' regions are typically full. I ended up provisioning my free VMs in another region/country and it works fine. I suppose another downside (if you want to view it this way) is they will delete idle unused free VMs after a certain time period. You have to add a credit card to your account to "upgrade" your account and run free resources indefinitely. While you're not charged for anything, it makes me nervous forking over a CC number to Oracle. |
| |
| ▲ | hobo_mark 4 days ago | parent | prev | next [-] | | One such microVM per month used to be within the free monthly allowance, is that not the case anymore? | |
| ▲ | pc86 3 days ago | parent | prev | next [-] | | Maybe if you're limiting yourself to AWS-wrapper cloud companies. What good is a $2/mo cloud instance if it's down multiple times a month? Just get a $5/mo VPS instead if you're really concerned about a few dollars a month. | | |
| ▲ | cxr 3 days ago | parent [-] | | > What good is a $2/mo cloud instance if it's down multiple times a month? The perverse irony is that the most common reason cited by cloud providers for not letting people set a hard cap on charges is an insistence that surely the last thing you want in the world is for your service to be taken offline, even if it does mean avoiding a $1k–$100k bill at the end of the month. |
| |
| ▲ | hansvm 3 days ago | parent | prev | next [-] | | I used to use Racknerd for that sort of thing, and the costs were around there -- maybe $1.90/mo for a 512MB instance. It was easy to squeeze several hobby projects onto the machine. | |
| ▲ | kelvinjps10 4 days ago | parent | prev | next [-] | | I'm getting 1$ for a 2gb ram vps in ovh for the first year | |
| ▲ | pajeetz 3 days ago | parent | prev | next [-] | | I recommend LowEndTalk. What fly.io is doing is running colocated bare-metal servers and using Firecracker to overcommit (probably via memory ballooning and other disk compression on demand). If you are going to haggle over $2/month, then you are better off just connecting your Raspberry Pi with a WireGuard/Cloudflare tunnel on a residential connection. | |
| ▲ | belter 4 days ago | parent | prev [-] | | Sounds like a Lambda function.... |
| |
| ▲ | zackify 4 days ago | parent | prev | next [-] | | The reliability is very very bad. It was really insane that 2 times in the past few months the main dashboard was down as I’m demoing something. Not to mention the deploy outages and almost daily some random thing was unavailable or delayed. I had to leave a few months ago after the price raises and how many times my boss saw some issue in the project I had with them. They also deprecated and removed their sqlite backup service. Back to GCP and not worrying about so many outages now. | | |
| ▲ | pc86 3 days ago | parent | next [-] | | Now just to worry about GCP getting shut down with a few days' notice. /s But in all seriousness the gall to raise prices before actually fixing the reliability problems is pretty shocking. I understand it's a bit of a chicken-and-egg thing where you maybe are tight on resources but there's no scenario where it's acceptable to have a product with these kinds of problems and then raise prices on existing customers who are putting up with it. | | |
| ▲ | encom 3 days ago | parent [-] | | No /s is needed. Relying on any Google product long term is crazy. | | |
| ▲ | sofixa 3 days ago | parent [-] | | Google's b2b products are relatively stable (relative to their b2c free services). You generally get somewhere like a year of notice if they shut it down. |
|
| |
| ▲ | pajeetz 3 days ago | parent | prev [-] | | There are just so many anecdotes/nightmare stories from people using fly.io here, far more than the ones linked by GP. Expect to see more of these "post-mortem apologies" from fly.io in the future, because it won't be the last. | | |
| ▲ | tptacek 3 days ago | parent [-] | | You're right. It won't. Nobody could claim otherwise. | | |
| ▲ | pajeetz 3 days ago | parent [-] | | "Expect more downtime in the near future, btw please host your business critical applications with our cloud offering" did i read that correctly? | | |
|
|
| |
| ▲ | qeternity 4 days ago | parent | prev | next [-] | | I don't really understand the value prop of fly.io. They seem to have an impressive engineering team despite the outages, but is edge compute really something that 99.9% of devs need? There are tons of large companies that operate out of a single AWS region and those services are used by millions around the globe. It just strikes me as something that enables premature optimization right out of the box. | | |
| ▲ | k__ 4 days ago | parent | next [-] | | It's basically the new Heroku with less lock-in, because it works with Docker. You get edge computing, autoscaling, and load balancing without additional configuration. Not as flexible as AWS, but also much easier to setup and maintain. But the reliability issues suck now and then. | | |
| ▲ | ignoramous 4 days ago | parent | next [-] | | > Not as flexible as AWS Today, Fly.io is more or less in the same market as Lightsail, not AWS. And when you compare it to Lightsail, it blows it away. | | |
| ▲ | watermelon0 3 days ago | parent | next [-] | | Did you count reliability into your assessment here? I'm reading about Fly.io outages multiple times a year, whereas Lightsail seems to be as stable as AWS EC2. | |
| ▲ | mtlynch 4 days ago | parent | prev [-] | | > And when you compare it to Lightsail, it blows it away. This is a bit of a confusing sentence because there are so many pronouns. Do all of the "it"s refer to Fly.io? | | |
| ▲ | dijksterhuis 4 days ago | parent [-] | | > And when you compare [fly.io] to Lightsail, [fly.io] blows [Lightsail] away. |
|
| |
| ▲ | gurgunday 4 days ago | parent | prev | next [-] | | DigitalOcean has been doing this for years, and their value proposition is unmatched IMO. For $5 you get: latest gen CPUs and RAM, HTTPS, DDoS protection, Cloudflare CDN, autoscale, competent support. I'd say the best part is the predictable monthly prices. And while most people probably don't care, they are an established public company, so there is more chance they will exist in 10 years. | | |
| ▲ | dijksterhuis 4 days ago | parent | next [-] | | are global r/w token permissions still a thing, or did the token scopes thing finally come out of beta? also, my experience with support was not the same as yours. they were utterly useless for the most part. for a personal web dev (or similar) project, like, i agree, they’ve got good value. but having worked in a small biz where DO was what they built everything on — no. bad idea. spend more. use aws (graviton ec2 instances)/azure. | |
| ▲ | fragmede 4 days ago | parent | prev [-] | | the $5 droplet is underpowered and can't run anything substantial. it's just the price to get you in the door. | | |
| ▲ | yabones 3 days ago | parent | next [-] | | It doesn't really need to run anything "substantial" though. Running some janky wordpress site with some scabbed-on ecommerce customizations is like 50% of the internet. | |
| ▲ | infecto 3 days ago | parent | prev | next [-] | | a 1vCPU 512mb instance is plenty for most base cases. Maybe you need one additional machine to act as a background worker. I am sure there are some noisy neighbors but to say it's underpowered is silly. | | |
| ▲ | fragmede 3 days ago | parent [-] | | I'm calling it underpowered because the $5 one had trouble running my custom ssh daemon. ssh! the cryptography for that shouldn't chug down the server I'm renting from them. a bigger instance from them isn't having the same problems. |
| |
| ▲ | pajeetz 3 days ago | parent | prev [-] | | you wouldn't be able to run anything substantial with that kind of budget, but Go and PocketBase are on record for supporting 10k concurrent requests per second on a low-powered VPS |
|
| |
| ▲ | nikodotio 4 days ago | parent | prev | next [-] | | This is precisely it. The ease of deploy, https domain configuration, scaling. Additionally, having machines that turn off when not in use is easy to configure, which I never managed on AWS. | | | |
| ▲ | infecto 3 days ago | parent | prev [-] | | I have asked this multiple times but is anyone really using edge compute and getting value out of it? I am certain there are cases but I have not seen any of them written up before. | | |
| ▲ | sofixa 3 days ago | parent | next [-] | | Depends on what you mean by edge compute, but you probably are. 5G towers are a ton of compute on the edge to secure and protect the traffic passing through them. Or if by edge you mean having stuff close to your consumers, every non trivial operation does that. | | |
| ▲ | infecto 3 days ago | parent [-] | | How is it not obvious based on the thread at hand, fly.io. And no not every nontrivial operation does it to the extreme of an envisioned fly.io deployment. |
| |
| ▲ | pier25 3 days ago | parent | prev [-] | | We have an embeddable audio player served globally with very low latency. This wouldn't be possible without edge compute/data. |
|
| |
| ▲ | austinpena 4 days ago | parent | prev | next [-] | | I have an SSR Astro project. Using Fly makes my project fast. For dynamic data I use SWR. I could use Cloudflare workers but it doesn’t play so nice with Astro. I also have a “form submission service” where I receive a Post and send an email. I need maximum uptime to avoid revenue loss. It’s a go service so I deploy ~6 machines across the US to ensure I don’t drop any requests. I haven’t had downtime in years. | |
| ▲ | victorbjorklund 4 days ago | parent | prev | next [-] | | If half your customers are in New York and half in Sydney, it makes your app faster if you run it in both places. There are a lot of things we do for our users that we don't need (no one "needs" SPA etc). But if it is easy to make your app faster for your users, why not? | | | |
| ▲ | jrockway 4 days ago | parent | prev | next [-] | | I would take edge compute if it's free and easy. That's fly.io's value prop. In a world where much web browsing starts with SYN, SYN-ACK, ACK, it is nice if the server is close to you. | |
| ▲ | brainzap 4 days ago | parent | prev | next [-] | | I typed fly launch, fly deploy and my node.js project was deployed. So I guess hobby projects? | |
| ▲ | infecto 3 days ago | parent | prev [-] | | I am going to go out on a limb and say there is no real value prop to fly.io. I could completely be wrong but it always feels like the modern MongoDB. Everyone wants to use it but I am not sure they are extracting value from it and instead it's a shiny toy that is fun to build from. |
| |
| ▲ | tptacek 3 days ago | parent | prev | next [-] | | This is a completely sane way to look at the world and we won't push back on it at all. We're building something extraordinarily difficult, and we're a relatively new company, and we don't have even a fraction of the resources the hyperscalers do, or, in the cases of AWS, GCP, and OCI, had at the time they started. If you're minmaxing for reliability --- which is an absolutely sane way to play --- we're not going to tell you you'd do worse in 2024 UE1. If it helps: all sorts of things can and do go wrong, but the most likely form of disruption you're likely to see here are periods of times when deployments don't work. This outage was a deployments/orchestration outage. We had a total request routing outage several months back, owing to a Rust concurrency landmine we stepped on, but those are very rare. (Deployment and state-update outages are a big deal, and if you deploy to diverse groups of Fly Machines constantly, as we encourage you to do, that being one of the big features of the platform, they can impact your availability. I'm not downplaying them.) | |
| ▲ | pajeetz 3 days ago | parent | prev | next [-] | | fly.io has a very bad reputation for reliability. There doesn't seem to be any damage control beyond Hacker News, and even here the consensus seems to be "don't run anything mission critical on fly.io or expect data redundancy". In fact, you can get almost the same thing fly.io does by running Firecracker on your own bare-metal servers, and cheaper too. I'm afraid the public sentiment towards fly.io has been tainted for good (I can't count how many times they apologized now). | | |
| ▲ | tptacek 3 days ago | parent [-] | | This is the second place you've offered this sentiment. Was it your expectation that we were going to hit some point, sometime in the near future, where we weren't going to have deployment-blocking outages? I'd like to better understand your premise. If it's "I can get more reliability by deploying on a hyperscaler cloud", who ever told you otherwise? | | |
| ▲ | pajeetz 3 days ago | parent [-] | | I see, so you think it's good business practice to basically say "expect more downtimes in the future, who cares about your entire business going down for several hours more than once a year". Gotcha. I'll be sure to pass on the good word. | | |
| ▲ | tptacek 3 days ago | parent [-] | | You'd be happier with a comment saying "there will be no future outages", I see. | | |
| ▲ | pajeetz 3 days ago | parent [-] | | I'm puzzled with your statement here. Frankly, offended by your sarcasm and unprofessional behavior here. | | |
| ▲ | tptacek 3 days ago | parent [-] | | What puzzles you about it? I feel like I'm speaking straightforwardly. |
| ▲ | akoculu 4 days ago | parent | prev | next [-] | | Also: https://news.ycombinator.com/item?id=36808296 | |
| ▲ | ARCarr 3 days ago | parent | prev | next [-] | | I tried out Fly.io and deployed a little test app. I couldn't even access the app, because they put it onto a server that was under "emergency maintenance" and had been that way for twelve days. | |
| ▲ | 3 days ago | parent | prev [-] | | [deleted] |
|
|
| ▲ | punkpeye 4 days ago | parent | prev | next [-] |
| Contrary to the title of the post, Fly.io API remains inaccessible. Meaning, users still cannot access deploys/databases, etc. For accurate updates, follow https://community.fly.io/t/fly-io-site-is-currently-inaccess... |
|
| ▲ | neya 4 days ago | parent | prev | next [-] |
| Personal experience between Fly.io and Railway.com - Railway wins for me hands down. I have used both, and Railway's support is stellar in comparison too. To date, Fly.io has never responded to my query about data deletion, despite my emailing their support address. My Railway app has also stayed online to date without any major downtime. I recommend that anyone looking for a decent replacement try them. |
| |
| ▲ | andai 4 days ago | parent | next [-] | | I've used Railway control panel maybe a total of 10 times in my life and half the time it was having weird issues. Control panel UI not loading or not working, actions failing, deploys randomly failing... I love the idea but in practice it's not something I'd want to use for anything serious. | | |
| ▲ | justjake 3 days ago | parent [-] | | While we've always aimed for great reliability on compute, the dashboard reliability wasn't very good at the start of the year. We ack'd this and then invested pretty heavily in making it stellar, so if you're still having issues please let us know (that should not be the case). Best,
Jake from Railway | | |
| ▲ | andai 2 days ago | parent [-] | | I used Railway as a "set it and forget it" for a client project, and I hadn't heard from him in over a year until some Railway update caused some issues with the deploy (something about group permissions). But support was very helpful in getting that fixed very quickly, so credit there! (And to be fair it did apparently work without any problems for like a year and a half, so credit there too!) |
|
| |
| ▲ | ignoramous 4 days ago | parent | prev | next [-] | | Fly builds on their own hardware. Is Railway doing the same? If not, that'd explain some of why Railway has relatively fewer outages (they're engineering fewer things). I understand that end-users want reliability (and Fly gets a bad rep despite pretty significant investment on this front in the past 2 years), but such outages aren't exclusive to one provider & not the other. Building cloud infra is no one's definition of easy. | | | |
| ▲ | punkpeye 4 days ago | parent | prev [-] | | How does it compare in terms of price? | | |
|
|
| ▲ | shubhamjain 4 days ago | parent | prev | next [-] |
| This is probably 5th or 6th major outage from Fly.io that I have personally seen. Pretty sure there were many others and some just went unnoticed. I recommended the service to a friend, and within two days he faced two outages. Fly.io seriously needs to get it together. Why it hasn’t happened yet is a mystery to me. They have a good product but stability needs to be an absolute top for a hosting service. Everything else is secondary. |
| |
| ▲ | SOLAR_FIELDS 4 days ago | parent | next [-] | | I get this but I think if people can give GitHub a pass for shitting the bed every two weeks maybe Fly should get a bit of goodwill here. I am not affiliated with Fly at all but I do think that people should temper their expectations when even mega corp can’t get it right I guess the secret is to be the incumbent with no suitable replacement. Then you can be complete garbage in terms of reliability and everyone will just hand wave away your poor ops story | | |
| ▲ | ojame 4 days ago | parent | next [-] | | The biggest difference is GitHub in your infrastructure is (nearly always) internal. Fly in your infrastructure is external. Users generally don't see when you have issues with GitHub, but they do generally see when you have issues with Fly. That's the core difference. | | | |
| ▲ | fragmede 4 days ago | parent | prev [-] | | Who's giving GitHub a pass on shitting the bed? They go down often enough that if you don't have an internal git server setup for your CICD to hit, that's on you. | | |
| ▲ | SOLAR_FIELDS 4 days ago | parent | next [-] | | My point is made by your very post - getting off GitHub onto alternatives is not seriously discussed as an option - instead it’s “well, why didn’t you prepare better to deal with your vendor’s poor ops story” | | |
| ▲ | fragmede 4 days ago | parent [-] | | I wasn't going to bring up being on an internally hosted gitlab instead of github, but that would be the "not giving them a pass" part. | | |
| |
| ▲ | 4 days ago | parent | prev [-] | | [deleted] |
|
| |
| ▲ | adityapatadia 4 days ago | parent | prev | next [-] | | We left it about a year ago due to reliability issues. We now use DigitalOcean apps and it's working like a charm. Zero downtime with DO. | | |
| ▲ | subarctic 4 days ago | parent [-] | | You mean their App Platform right? How does the pricing compare to fly? | | |
| ▲ | adityapatadia 4 days ago | parent [-] | | Yes, App Platform. Pricing is a little higher than Fly's but way lower than AWS, and it is fully justified. Zero downtime in the last year. With Fly, we had 3-4 downtimes in 2023 in a span of 4 months. | | |
| ▲ | subarctic 2 days ago | parent [-] | | Ok so for hobby projects it wouldn't make sense to switch then, but glad to hear it works for you. I haven't been in a position where it would make sense - I have hobby projects where I don't care much about reliability, and then there's the infrastructure the company I work for uses and that's all on AWS |
|
|
| |
| ▲ | mcqueenjordan 4 days ago | parent | prev | next [-] | | Reliability is hard when your volume is (presumably) scaling geometrically. | | |
| ▲ | paxys 4 days ago | parent [-] | | Can't use the "reliability is hard" excuse when you are quite literally in the business of selling reliability. | | |
| ▲ | mcqueenjordan 4 days ago | parent [-] | | It’s just not that big of a mystery. It’s not an excuse; it’s just true. Also, they’re not especially selling reliability as much as they’re selling small geo-distributed deployments. |
|
| |
| ▲ | 4 days ago | parent | prev | next [-] | | [deleted] | |
| ▲ | ilrwbwrkhv 4 days ago | parent | prev [-] | | Does anyone use them beyond the free tier? Same with Vercel for example. | | |
| ▲ | gk1 4 days ago | parent | next [-] | | Vercel has revenue of over $100M. So yes at least a few companies use them beyond the free tier. | |
| ▲ | dizhn 4 days ago | parent | prev [-] | | Which company? GitHub? As far as I know fly.io does not have a free tier. |
|
|
|
| ▲ | HellsMaddy 4 days ago | parent | prev | next [-] |
| Suspiciously, Turso started having issues around the same time. Their CEO confirmed on Discord it's due to the Fly outage: > Ok.I caught up with our oncall and This seems related to the Fly.io incident that is reported in our status page. Our login does call things in the Fly.io API > we are already in touch with Fly and will see if we can speed this up |
| |
| ▲ | pier25 4 days ago | parent [-] | | Not the first time Turso goes down because of Fly issues. It must suck to have built a db service and have this downtime. Apparently Turso are going to offer an AWS tier at some point. | | |
|
|
| ▲ | marvin-hansen 4 days ago | parent | prev | next [-] |
| No surprise. About a year ago, I looked at fly.io because of its low pricing and I was wondering where they were cutting corners to still make some money. Ultimately, I found the answer in their tech docs where it was spelled out clearly that a fly instance is hardwired to one physical server and thus cannot fail over in case that server dies. Not sure if that part still is in the official documentation. In practice, that means if a server goes down, they have to load the last snapshot of that instance from the backup and push it onto a new server, update the network path, and pray to god that no more servers fail than there is spare capacity available. Otherwise you have to wait for a restore until the datacenter has mounted a few more boxes in the rack. That explains quite a bit of the randomness of those outage reports, i.e. my app is down vs the other is fine, and mine came back in 5 minutes vs the other took forever. As a business on a budget, I think anything else, e.g. a small civo cluster, serves you better. |
| |
| ▲ | ignoramous 4 days ago | parent | next [-] | | Fly.io can migrate vm+volume now: https://fly.io/docs/reference/machine-migration/ / https://archive.md/rAK0V > a fly instance is hardwired to one physical server and thus cannot fail over I'm having trouble understanding how else this is supposed to be? I understand that live migration is a thing, but even in those cases, a VM is "hardwired" to some physical server, no? | | |
| ▲ | mzi 4 days ago | parent | next [-] | | > I'm having trouble understanding how else this is supposed to be? I understand that live migration is a thing, but even in those cases, a VM is "hardwired" to some physical server, no? You can run your workload (in this case a VM) on top of a scheduler, so if one node goes down the workload is just spun up on another available node. You will have downtime, but it will be limited. | | | |
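The scheduler idea in the comment above can be sketched roughly as follows. This is a toy illustration only, not how Fly.io or any real scheduler works: `reschedule`, the node names, and the "least-loaded node wins" rule are all assumptions made up for the example, and it ignores the storage question entirely.

```python
from collections import defaultdict


def reschedule(placements, failed_node, healthy_nodes):
    """Return new placements with every workload on failed_node moved
    to the least-loaded healthy node (ties broken by node name).

    placements: dict mapping workload name -> node name (hypothetical model).
    """
    # Count how many workloads each node currently runs.
    load = defaultdict(int)
    for node in placements.values():
        load[node] += 1

    new_placements = dict(placements)
    for workload, node in sorted(placements.items()):
        if node != failed_node:
            continue  # workloads on healthy nodes stay where they are
        target = min(healthy_nodes, key=lambda n: (load[n], n))
        new_placements[workload] = target
        load[target] += 1
    return new_placements


# Hypothetical cluster: node-1 dies, its two VMs are restarted elsewhere.
placements = {"vm-a": "node-1", "vm-b": "node-1", "vm-c": "node-2"}
result = reschedule(placements, "node-1", ["node-2", "node-3"])
print(result)
```

The workloads are restarted, not live-migrated, so there is still downtime while they come back up on the new node.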
| ▲ | sofixa 3 days ago | parent | prev [-] | | > I'm having trouble understanding how else this is supposed to be? I understand that live migration is a thing, but even in those cases, a VM is "hardwired" to some physical server, no? They mean the storage part. If your VM's storage(state) is on one server and that server dies, you have to restore from backup. If your VM's storage is on remote shared storage mounted to that server and the server dies, your VM can be restarted elsewhere that has access to that shared storage. In AWS land it's the difference between instance store (local to a server) and EBS (remote, attached locally). There's a tradeoff in that shared storage will be slightly slower due to having to traverse networking, and it's harder to manage properly; but the reliability gain is massive. |
| |
| ▲ | dilyevsky 4 days ago | parent | prev | next [-] | | > Ultimately, I found the answer in their tech docs where it was spelled out clearly that an fly instance is hardwired to one physical server and thus cannot fail over in case that server dies. Majority of EC2 instance types did not have live migration until very recently. Some probably still don't (they don't really spell out how and when it's supposed to work). It is also not free - there's a noticeable brown-out when your VM gets migrated on GCP for example. | | |
| ▲ | ixaxaar 4 days ago | parent [-] | | Can you shed some more light on this "browning out" phenomenon? | | |
| |
| ▲ | pier25 4 days ago | parent | prev | next [-] | | If you want HA on Fly you need to deploy an app to multiple regions (multiple machines). Fly might still go down completely if their proxy layer fails but it's much less common. | | |
| ▲ | sb8244 3 days ago | parent [-] | | The proxy layer was the cause of yesterday's outage according to support. | | |
| ▲ | pier25 3 days ago | parent [-] | | Yes but the previous comment was about hardware failure. |
|
| |
| ▲ | fulafel 4 days ago | parent | prev [-] | | The status tells a story about a high-availability/clustering system failure so I think in this case the problem is rather the complexity of the HA machinery hurting the system's availability vs something like a simple VPS. |
|
|
| ▲ | xyst 4 days ago | parent | prev | next [-] |
| Recurring pattern I notice is outages tend to occur the week of major holidays in US. - MS 365/Teams/Exchange had a blip in the morning - Fly.io with complete outage - then a handful of sites and services impacted due to those outages Usually advocate against “change freezes” but I think a change freeze around major holidays makes sense. Give all teams a recharge/pause/whatever. Don’t put too much pressure on the B-squads that were unfortunate to draw the short stick. |
| |
| ▲ | paxys 4 days ago | parent | next [-] | | Bad code rarely causes outages at this scale. The culprit is always configuration changes. Sure you can try and reduce those as well during the holiday season, but what if a certificate has to be renewed? What if a critical security patch needs to be applied? What if a set of servers need to be reprovisioned? What if a hard disk is running out of space? You cannot plan your way out of operational challenges, regardless of what time of year it is. | | |
| ▲ | oarsinsync 4 days ago | parent | next [-] | | > Sure you can try and reduce those as well during the holiday season, but what if a certificate has to be renewed? What if a critical security patch needs to be applied? What if a set of servers need to be reprovisioned? What if a hard disk is running out of space? Reading this, I see two routine operational issues, one security issue and one hardware issue. You can’t plan your way around security issues or hardware failures, but operational issues you both can and should plan around. Holiday schedules like this are fixed points in time, so there’s absolutely no reason why you can’t plan all routine works to be completed either a week in advance, or a week after, the holiday period. Certificates don’t need to be near the point of expiry to be renewed. Capacity doesn’t need to be at critical levels to be expanded. Ultimately, this is a risk management question (as a sibling has also commented). Is the organisation willing to take on increased risk in exchange for deferring operational expenses? If the operational expense is inevitable (the certificate will need renewing), that seems like an easy answer when it comes to risk management over holidays. If the operational expense is not inevitable (will we really need to expand capacity?), it then becomes a game of probabilities and financials - likelihood of expense being incurred, amount of expense incurred if done ahead of time, impact to business if something goes wrong during a holiday. | |
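The "renew ahead of the freeze" planning described above could be checked with something like the sketch below. Everything in it is hypothetical: the hostnames, the expiry dates, the freeze end date and the 7-day margin are invented for illustration, and a real inventory would come from wherever certificates are actually tracked.

```python
from datetime import date, timedelta


def renew_before_freeze(expiries, freeze_end, margin_days=7):
    """Return the names of certificates that should be renewed before the
    freeze starts: anything expiring on or before freeze_end plus a margin.

    expiries: dict mapping certificate name -> expiry date (hypothetical data).
    """
    deadline = freeze_end + timedelta(days=margin_days)
    return sorted(name for name, expires in expiries.items() if expires <= deadline)


# Hypothetical inventory and a hypothetical holiday freeze ending 2024-12-02.
expiries = {
    "api.example.com": date(2024, 11, 29),   # expires during the freeze
    "mail.example.com": date(2024, 12, 5),   # expires just after the freeze
    "db.example.com": date(2025, 3, 1),      # safely far away
}
to_renew = renew_before_freeze(expiries, freeze_end=date(2024, 12, 2))
print(to_renew)
```

The margin is there so that nothing expires in the first days after the thaw, when a backlog of deferred changes is also landing.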
| ▲ | tptacek 3 days ago | parent | prev | next [-] | | We'll have a postmortem in next week's infra log update, but here it was a particularly ambitious customer app pushing our state sync service into a corner case; it's one we knew about, but the solution (federating regional state sharing clusters rather than running one globally) is taking time to roll out. | |
| ▲ | jimmyl02 4 days ago | parent | prev | next [-] | | I think a good way of looking at it is risk. Is the change (whether it is code or configuration, etc.) worth the risk it brings on. For example if it's a small feature then it probably makes sense to wait and keep things stable. But, if it's something that itself causes larger imminent danger like security patches / hard disk space constraints, then it's worth taking on the risk of change to mitigate the risk of not doing it. At the end of the day no system is perfect and it ends up being judgement calls but I think viewing it as a risk tradeoff is helpful to understand. | |
| ▲ | 4 days ago | parent | prev | next [-] | | [deleted] | |
| ▲ | bobsyourbuncle 4 days ago | parent | prev [-] | | This is a good observation. Do you have any resources I can read up on to make this safer? |
| |
| ▲ | ploxiln 4 days ago | parent | prev | next [-] | | I think you can't avoid the fact that these holiday weeks are different from regular weeks. If you "change freeze" then you also freeze out the little fixes and perf tuning that usually happens across these systems, because they're not "critical". And then inevitably it turns out that there's a special marketing/product push, with special pricing logic that needs new code, and new UI widgets, causing a huge traffic/load surge, and it needs to go out NOW during the freeze, and this is revenue, so it is critical to the business leaders. Most of eng, and all of infra, didn't know about it, because the product team was cramming until the last minute, and it was kinda secret. So it turns out you can freeze the high-quality little fixes, but you can't really freeze the flaky brand-new features ... It's just a struggle, and I still advise to forget the freeze, and try to be reasonable and not rush things (before, during, or after the freeze). | | |
| ▲ | willsmith72 3 days ago | parent | next [-] | | Any big tech company with large peak periods disagrees with you. It's absolutely worth freezing non-critical changes. Urgent business change needs to go through? Sure, be prepared to defend to a vp/exec why it needs to go in now. Urgent security fix? Yep same vp will approve it. It's a no-brainer to stop your typical changes which aren't needed for a couple of weeks. By the way, it doesn't mean your whole pipeline needs to stop. You can still have stuff ready to go to prod or pre prod after the freeze | |
| ▲ | ignoramous 4 days ago | parent | prev [-] | | Some shops conduct game days as the freeze approaches. https://wa.aws.amazon.com/wellarchitected/2020-07-02T19-33-2... / https://archive.md/uaJlR |
| |
| ▲ | cess11 4 days ago | parent | prev | next [-] | | Blip? 365 has an ongoing incident since yesterday morning, european timezone. The reason I know is because I use their compliance tools to secure information in a rather large bankruptcy. | |
| ▲ | vrosas 4 days ago | parent | prev | next [-] | | Then you just get devs rushing out changes before the freeze… | | |
| ▲ | subarctic 4 days ago | parent | next [-] | | As a developer I don't see why I would rush out a change before the freeze when I could just wait until after. Maybe a stakeholder that really wants it would press for it to get out but personally I'd rather wait until after so I'm not fixing a bug during my holiday. | | |
| ▲ | vrosas 4 days ago | parent [-] | | Congrats on not working for the product team I work for |
| |
| ▲ | fragmede 4 days ago | parent | prev [-] | | and stampeding changes in after the thaw, also leading to downtime. so it depends on the org, but doing a freeze is still reasonable policy. Downtime on December 15th is less expensive than on black Friday or cyber Monday for most retailers, so it's just a business decision at that point. |
| |
| ▲ | aaomidi 4 days ago | parent | prev [-] | | What do "Freezes" mean? Like, do you stop renewing your certificates? Do you stop taking in security updates for your software? Sure maybe "unnecessary" changes, but the line gets very gray very fast. | | |
| ▲ | Spivak 4 days ago | parent | next [-] | | It's not very grey, prod becomes as if you told everyone but your ops team to go home and then sent your ops team on a cruise with pagers. If it's not important enough to merit interrupting their vacation you don't do it. | |
| ▲ | fragmede 4 days ago | parent | prev | next [-] | | Certs shouldn't still be done by hand at this point; if another heartbleed comes out in the next 7 days then the risk can be examined, escalated, and the CISO can overrule the freeze. If it's a patch for remote root via Bluetooth drivers on a server that has no Bluetooth hardware, it's gonna wait. You're right that there's a grey line, but crossing that line involves waking up several people and the on call person makes a judgement call. If it's not important enough to wake up several people over, then things stay frozen. | | |
| ▲ | kbolino 3 days ago | parent | next [-] | | There's still a lot of situations where automatic certificate enrollment and renewal is not possible. TLS is not the only use of X.509 certificates, and even then, public facing HTTPS is not the only use of TLS. It needs to get better but it's not there yet. | |
| ▲ | aaomidi 4 days ago | parent | prev [-] | | Right, that's basically what I mean. There are a lot of automated changes happening in the background for services. I guess the whole thing I'm saying is that not every breakage is happening because of a code change. |
| |
| ▲ | vrosas 4 days ago | parent | prev [-] | | No unnecessary code deployments. |
|
|
|
| ▲ | akshayshah 4 days ago | parent | prev | next [-] |
| The series of outages early in 2023 also had some Corrosion-related pain: https://community.fly.io/t/reliability-its-not-great/11253 |
| |
| ▲ | __turbobrew__ 4 days ago | parent [-] | | Seems like rolling their own datastore turned out to be a bad bet. I'm not super familiar with their constraints but ScyllaDB can do eventual consistency and is generally quite flexible.
CouchDB is also an option for multi-leader replication. |
|
|
| ▲ | arusahni 4 days ago | parent | prev | next [-] |
| Oof, hugops to the team. |
|
| ▲ | stevefan1999 4 days ago | parent | prev | next [-] |
| Yep...can confirm my self hosted Bitwarden there is completely FUBAR connection wise even if it is in EA, so it should be a worldwide outage...lemme guess, some internal tooling error, consensus split brain, or does it look like someone leaked BGP routes again? |
| |
|
| ▲ | redslazer 4 days ago | parent | prev | next [-] |
| fly.io just has the weirdest outages. It has issues so regularly we don't even need to run mock outages to make sure our system failovers work. |
| |
| ▲ | duxup 4 days ago | parent | next [-] | | When I worked for a company who worked with big banks / financial institutions we used to run disaster recovery tests. Effectively a simulated outage where the company would try to run off their backup sites. They ran everything from those sites, it was impressive. Once in a while we'd have a real outage that matched the test we ran as recently as the weekend before. I was helping a bank switch over to the DR site(s) one day during such a real outage and I left my mic open when someone asked me what the commotion was on the upper floors of our HQ. I said "super happy fun surprise disaster recovery test for company X". VP of BIG bank was on the line monitoring and laughed "I'm using that one on the executive call in 15, thanks!" Supposedly it got picked up at the bank internally after the VP made the joke and was an unofficial code for such an outage for a long time. | | |
| ▲ | latch 4 days ago | parent | next [-] | | In most BIG banks, "Vice President" is almost an entry-level title. Easily have 1000s of them. For example, this article points out that Goldman Sachs had ~12K VPs out of more than 30K employees: https://web.archive.org/web/20150311012855/https://www.wsj.c... | | |
| ▲ | SteveNuts 3 days ago | parent | next [-] | | Just like all Sales folks have heavily inflated titles, no customer wants to think they're dealing with a junior salesperson/loan officer when you're about to hand over your money. It seems like every vendor sales team I work with is an "executive" or "director of sales" even though in reality they're just regular old salespeople. | |
| ▲ | jart 4 days ago | parent | prev [-] | | VP at Goldman is equivalent to Senior SWE according to levels.fyi and their entry level is Analyst. I'm surprised by the compensation though. I would have thought people working at a place with gold in the name would be making more. Also apparently Morgan Stanley pays their VPs $67k/year. | | |
| ▲ | philipwhiuk 4 days ago | parent | next [-] | | Tech outstripped big finance corps tech a while ago. Traders make loads, not the SWEs | |
| ▲ | bormaj 4 days ago | parent | prev [-] | | That VP comp number seems quite low fwiw | | |
| ▲ | jart 3 days ago | parent [-] | | Yes how much longer till we see Morgan Stanley VPs picketing outside demanding a living wage and humming The Internationale. |
|
|
| |
| ▲ | NetOpWibby 4 days ago | parent | prev [-] | | Thankfully your comment was positive! |
| |
| ▲ | benreesman 4 days ago | parent | prev | next [-] | | In fairness to the fly.io folks (who are extremely serious hackers), they’re standing up a whole cloud provider and they’ve priced it attractively and they’re much customer-friendlier than most alternatives. I don’t envy the difficulty of doing this, but I’m quite confident they’ll iron the bugs out. | | |
| ▲ | redslazer 4 days ago | parent | next [-] | | The tech is impressive and the pricing is attractive which is why we use them. I just wish there was less black magic. | | |
| ▲ | benreesman 4 days ago | parent [-] | | I don’t always agree with @tptacek on social/political issues, and I don’t always agree with @xe on the direction of Nix, but these are legends on the technical side of things. And they’re trying to build an equitable relationship between the user of cloud services and the provider, not fund a private space program. If I were in the market for cloud services I’d highly prize a long-term relationship on mutual benefit and fair dealings over a short-term nuisance of being an early adopter. I strongly suspect your investment in fly is going to pay off. | | |
| ▲ | xena 4 days ago | parent | next [-] | | Xe here. As a sibling comment said, I didn't survive layoffs. If you're looking for someone like me, I'm on the market! | | |
| ▲ | benreesman 4 days ago | parent [-] | | Hiring people is above my pay grade, but I can vouch to my lords and masters and anyone else who cares what I think that a legend is up for grabs. b7r6@b7r6.net | | |
| |
| ▲ | tptacek 3 days ago | parent | prev | next [-] | | I'm several steps removed from day-to-day engineering at this point; the team working on this is much better than I am. It's just a very hard problem; biting it off is something you can certainly blame me for, though. (Also: not a legend, just loud.) | | |
| ▲ | benreesman 3 days ago | parent [-] | | I may be the minority on this view, but I think that it's possible to be both a recognized expert aka legend and loud ("visible" might be a kinder word). When you talk technology, I listen, and I doubt I'm alone in that. Keep up the good work with fly.io! |
| |
| ▲ | verelo 4 days ago | parent | prev | next [-] | | I want to believe, but in the meantime they’re killing the product I’ve been working hard to build trust in with my own customers. There is a limit to my idealism, and it’s well and truly in the past. | |
| ▲ | foldr 4 days ago | parent | prev | next [-] | | I suspect that making a cloud service provider run reliably requires tons of grunt work more than it requires technical heroism from a small number of highly talented individuals. | | | |
| ▲ | reissbaker 4 days ago | parent | prev | next [-] | | FWIW Xe was let go from Fly earlier this year during a round of layoffs. | | | |
| ▲ | throwaway984393 4 days ago | parent | prev [-] | | [dead] |
|
| |
| ▲ | sevenseacat 3 days ago | parent | prev [-] | | In fairness to the fly.io folks, we started using them three years ago-ish and not a lot has changed, bug-wise and downtime-wise. |
| |
| ▲ | 4 days ago | parent | prev [-] | | [deleted] |
|
|
| ▲ | teaearlgraycold 4 days ago | parent | prev | next [-] |
| I'm grateful to HN for keeping me well aware of Fly's issues. I'll never use them. |
| |
| ▲ | kachapopopow 4 days ago | parent [-] | | It's still 99.99+% SLA? Would you really pay 100% more for <0.01% more uptime? | | |
| ▲ | runako 4 days ago | parent | next [-] | | No dog in this fight, all props to the Fly.io team for having the gumption to do what they are doing, I genuinely hope they are successful... > It's still 99.99+% SLA But this is simply not accurate. 99.99% uptime is < 52m 9.8s annually of downtime. They apparently blew well through that today. Looks like they essentially burned through the equivalent of 4 years of 99.99% downtime budget this evening. Four nines is so unforgiving that it's almost the case that if people are required to be in the loop at any point during an incident, you will blow the fourth nine for the whole year in a single incident. Again, I know it's hard. I would not want to be in the space. That fourth nine is really difficult to earn. In the meanwhile, <hugops> to the Fly team as they work to resolve this (and hopefully get some rest). | | |
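The downtime-budget arithmetic behind the comment above can be worked through in a few lines. This sketch assumes a plain 365-day year, which gives roughly 52.6 minutes per year for four nines (the same ballpark as the figure quoted above; the exact number depends on the year and month lengths assumed). The 3.5-hour outage is a hypothetical input chosen for illustration, not a measured duration of this incident.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def downtime_budget_seconds(availability, period_seconds=SECONDS_PER_YEAR):
    """Allowed downtime for a given availability target over a period."""
    return (1.0 - availability) * period_seconds


def years_of_budget_used(outage_seconds, availability):
    """How many years' worth of downtime budget a single outage consumes."""
    return outage_seconds / downtime_budget_seconds(availability)


four_nines_minutes = downtime_budget_seconds(0.9999) / 60
three_nines_hours = downtime_budget_seconds(0.999) / 3600

# Hypothetical outage length, for illustration only.
hypothetical_outage = 3.5 * 3600
years_used = years_of_budget_used(hypothetical_outage, 0.9999)

print(f"99.99% allows about {four_nines_minutes:.2f} minutes per year")
print(f"99.9% allows about {three_nines_hours:.2f} hours per year")
print(f"A 3.5 h outage uses about {years_used:.1f} years of four-nines budget")
```

At three nines the same hypothetical outage would fit inside a single year's budget, which is why the gap between a 99.9% and a 99.99% commitment matters so much in practice.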
| ▲ | fulafel 4 days ago | parent | next [-] | | 99.99+% SLA typically means you get some billing credits for the downtime exceeding 99.99+ availability. So technically you do get a "99.99+% SLA", but you don't get 99.99+% availability. Other circles use "SLO" (where the O stands for objective). (Anyone know what the details in fly.io SLA are?) | | |
| ▲ | fulafel a day ago | parent | next [-] | | Answering myself, https://fly.io/legal/sla-uptime/ says you get some credits for under 99.9% uptime "provided that Customer reports to Fly.io such failure to meet the Uptime Commitment". So at least currently there's no talk of 99.99%. | |
| ▲ | runako 4 days ago | parent | prev [-] | | You are correct in the legal/technical sense! Technically, anyone could offer five- or six-nines and just depend on most customers not to claim the credits :-D Actually hitting/exceeding four nines is still tough. |
| |
| ▲ | xmorse 3 days ago | parent | prev [-] | | My app didn't go down yesterday, this was a downtime related to internal API and some specific regions. |
| |
| ▲ | mrcwinn 4 days ago | parent | prev | next [-] | | This is not my experience at all, as a former paying customer. | |
| ▲ | PUSH_AX 4 days ago | parent | prev | next [-] | | You say that like it's their only issue. Earlier in the year they had a catastrophic outage in LHR, we lost all our data. Yes this is also on me, I'm aware. Still, that's a hard nope from me, we migrated. | |
| ▲ | cj 4 days ago | parent | prev [-] | | I think what a lot of people fail to understand is that there are certain categories of apps that simply “can never go down” Examples include basically any PaaS, IaaS, or any company that provides a mission-critical service to another company (B2B SaaS). If you run a basic B2C CRUD app, maybe it’s not a big deal if your service goes down for 5 minutes. Unfortunately there are quite a few categories of companies where downtime simply isn’t tolerated by customers. (I operate a company with a “zero downtime” expectation from customers - it’s no joke, and I would never use any infrastructure abstraction layer other than AWS, GCP or Azure - preferably AWS us-east-1 because, well, if you know the joke…) | | |
| ▲ | toast0 4 days ago | parent | next [-] | | > I think what a lot of people fail to understand is that there are certain categories of apps that simply “can never go down” I refuse to believe that this category still exists, when I need to keep my county's alternate number for 911 in my address book, because CenturyLink had a 6 hour outage in 2014 and a two day outage in 2018. If the phone company can't manage to keep 911 running anymore, I'd be very surprised if anything does have zero downtime over a ten year period. Personally, nine nines is too hard, so I shoot for eight eights. | |
| ▲ | bri3d 4 days ago | parent | prev | next [-] | | My experience with very large scale B2B SaaS and PaaS has been that customers like to get money, if allowed by contract, by complaining about outages, but that overall, B2B SaaS is actually very forgiving. Most B2B SaaS solutions have very long sales cycles and a high total cost to implement, so there is a lot of inertia to switching that “a few annoying hours of downtime a year” isn’t going to cover. Also, the metric that will drive churn isn’t actually zero downtime, it’s “nearest competitor’s downtime,” which is usually a very different number. | |
| ▲ | macNchz 4 days ago | parent | prev | next [-] | | Every PaaS and IaaS I’ve ever used has had some amount of downtime, often considerably more than 5 minutes, and I’ve run production services on many of them. Plenty of random issues on major cloud providers as well. Certainly plenty of situations with dozens of Twitter posts happening but never any acknowledgement on the AWS status page. Nothing’s perfect. | | |
| ▲ | cj 4 days ago | parent [-] | | Yea, when running services where 5 minutes of downtime results in lots of support tickets, you learn to accept that the incident will happen and learn to manage the incident rather than relying that it will never occur. |
| |
| ▲ | littlestymaar 4 days ago | parent | prev | next [-] | | If your app cannot go down ever, then you cannot use a cloud provider either (because even AWS and Azure do fail sometimes, just look up “Azure down” on HN). But the truth is everybody can afford some level of outage, simply because nobody has the budget to provision an infra that can never fail. | | |
| ▲ | vrosas 4 days ago | parent [-] | | I’ve seen a team try and be truly “multi-cloud” but then ended up with this Frankenstein architecture where instead of being able to weather one cloud going down, their app would die if _any_ cloud had an issue. It was also surprisingly hard to convince people it doesn’t matter how many globally distributed clusters you have if all your data is in us-east. |
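The failure mode described above, where a "multi-cloud" app depends on every cloud at once instead of on any one of them, comes down to series versus parallel availability. A minimal sketch, assuming independent failures (which real clouds do not strictly have) and a made-up 99.9% availability per provider:

```python
from math import prod


def series_availability(parts):
    """Availability when the app needs ALL dependencies up at once."""
    return prod(parts)


def parallel_availability(parts):
    """Availability when ANY ONE dependency being up is enough."""
    return 1.0 - prod(1.0 - a for a in parts)


# Hypothetical: three clouds, each assumed to be 99.9% available.
clouds = [0.999, 0.999, 0.999]

needs_all = series_availability(clouds)    # the "Frankenstein" architecture
needs_any = parallel_availability(clouds)  # true redundancy

print(f"needs all three clouds: {needs_all:.6f}")
print(f"needs any one cloud:    {needs_any:.9f}")
```

Under these assumptions, depending on all three clouds is worse than using a single one, while a single shared dependency (such as all data living in one region) puts the whole design back in the series case.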
| |
| ▲ | MobiusHorizons 4 days ago | parent | prev | next [-] | | you realize all of those services you mention can't give you zero downtime, they would never even advertise that. They have quite good reliability certainly, but on long enough time horizons absolutely no-one has zero downtime. | |
| ▲ | sgrove 4 days ago | parent | prev [-] | | All of your examples have had multiple cases of going down, some for multiple days (2011 AWS was the first really long one I think) - or potentially worse, just deleting all customer data permanently and irretrievably. Meaning empirically, downtime seems to be tolerated by their customers up to some point? |
|
|
|
|
| ▲ | punkpeye 4 days ago | parent | prev | next [-] |
| It is not reflected in their status page, but fly.io itself is not even loading. |
| |
|
| ▲ | MaxfordAndSons 4 days ago | parent | prev | next [-] |
| Kinda funny that they've named their global state store "Corrosion"... not really a word I'd associate with stability and persistence. |
| |
|
| ▲ | mattbee 4 days ago | parent | prev | next [-] |
| It feels like fly is trying to repeat a growth model that worked 20 years ago: throw interesting toys at engineers, then wait for engineers to recommend their services as they move on in their careers. Part of that playbook is the old Move Fast & Break Things. That can still be the right call for young projects, but it has two big problems: 1) AWS successfully moved themselves into the position of "safe" hosting choice, so it's much rarer for engineers to have influence on something that's seen by money men as a humdrum, solved problem; 2) engineers are not the internal influencers they used to be, being laid off left and right the last few years, and without time for hobby projects. (maybe also 3) it's much harder to build a useful free tier on a hosting service, which used to be a necessary marketing expense to reach those engineers). So idk, I feel like the bar is just higher for hosting stability than it used to be, and novelty is a much harder sell, even here. Or rather: if you're going to brag about reinventing so many wheels, they need to not come off the cart as often. |
|
| ▲ | xyst 4 days ago | parent | prev | next [-] |
| I can’t even log in to my old account. Password reset is timing out, yet I still receive the password reset e-mail. The password reset link is broken, with a 500 status code. |
|
| ▲ | DataOverload 4 days ago | parent | prev | next [-] |
| We switched from Fly to CF workers a while ago, and never looked back |
| |
| ▲ | punkpeye 4 days ago | parent | next [-] | | They are fundamentally different. If Cloudflare provided a way to host docker containers with volumes though, that would be game over for so many paas platforms. | | | |
| ▲ | frakkingcylons 4 days ago | parent | prev | next [-] | | I switched from apples to oranges and never looked back. | | | |
| ▲ | pier25 4 days ago | parent | prev | next [-] | | Our stuff on CF Workers has been working non stop for years now. About 6 months ago we migrated our most critical stuff from Fly to CF and boy every time Fly has issues I'm so glad we did. | | |
| ▲ | jpgvm 3 days ago | parent [-] | | Too much custom stuff too quickly, there is a lot of efficiency in vertical integration and a fully cohesive stack but it takes a very long time to stabilize if you take that route. We spent months trying to convince them of problems with their H2 implementation in their LB/proxy (they insisted nginx was at fault, spoiler - it wasn't) but had to leave (we also went to CF, which has its own problems). Eventually one of their employees wrote a long blog post about H2 that made it obvious they finally found and fixed those problems but months too late for my employer at the time. It would have been infinitely better for us if they could have just fixed their stability problems, that abstraction suited us as did their LB/proxy impl and SNI pricing. I wish them well, some really smart folk over there but I can imagine these reliability problems are probably really grinding down morale. |
| |
| ▲ | rstupek 4 days ago | parent | prev | next [-] | | How are they equivalent? | |
| ▲ | 4 days ago | parent | prev | next [-] | | [deleted] | |
| ▲ | eek2121 4 days ago | parent | prev [-] | | congrats on not developing a playbook for the time you have to 'look back'. Providers will fail. good contingencies won't. ...hears faint sound...I SAID GOOD, QUIET YOU! |
|
|
| ▲ | gigapotential 4 days ago | parent | prev | next [-] |
| HUGOPS Everything is going to be 200 OK! |
|
| ▲ | mrcwinn 4 days ago | parent | prev | next [-] |
| I tried Fly early. I was very excited about this service, but I've never had a worse hosting experience. So I left. Coincidentally I tried it again a few days ago. Surely things must be better. Nope. Auth issues in the CLI, frustrations deploying a Docker app to a Fly machine. I wouldn't recommend it to anyone. |
| |
| ▲ | steve_adams_86 4 days ago | parent [-] | | I find their user experience to be exceptional. The only flake I’ve encountered is in uptime and general reliability of services I don’t interface with directly. They’ve done a stellar job on the stuff you actually deal with, but the glue holding your services together seems pretty wobbly. |
|
|
| ▲ | pier25 4 days ago | parent | prev | next [-] |
| My apps on Fly have not gone down this time. |
|
| ▲ | EGreg 4 days ago | parent | prev | next [-] |
| What exactly does flyio.net do? |
| |
| ▲ | HellsMaddy 4 days ago | parent | next [-] | | If you mean specifically flyio.net and not just fly.io the company, I'm guessing they host their status page on a separate domain in case of DNS/registrar issues with their primary domain. | |
| ▲ | stackghost 4 days ago | parent | prev | next [-] | | IIRC their value prop is that they let you rapidly spin up deployments/machines in regions that are closest to your users, the idea being that it will be lower latency and thus better UX. | |
| ▲ | vachina 4 days ago | parent | prev | next [-] | | It’s basically what Heroku used to be but with CDN-like presence. | |
| ▲ | michaelbuckbee 4 days ago | parent | prev | next [-] | | Hosting service that has a lot of interesting distributed features. | |
| ▲ | eek2121 4 days ago | parent | prev [-] | | WEB 2.0. SEE. TOLD YA! THEY SHOULDA UPGRADED TO THAT NEWFANGLED 3.0! ;) |
|
|
| ▲ | Huppie 3 days ago | parent | prev | next [-] |
| It's interesting to see this discussion about fly.io's reliability on a day that (after over three days of downtime) Microsoft Azure finally decided the update of Azure Static Web Apps they deployed last Friday is indeed broken for customers using specific authentication settings... ...with not a single status update from Microsoft in sight. |
|
| ▲ | theideaofcoffee 4 days ago | parent | prev | next [-] |
| Color me not surprised. My few interactions with people there just gave off the impression of them being in a bit over their heads. I don't know how well that translated to their actual ops, but it's difficult to not connect the two when they continue to have major outage after major outage for a product that 'should' be their customers' bedrock upon which they build everything else. |
| |
|
| ▲ | 4 days ago | parent | prev | next [-] |
| [deleted] |
|
| ▲ | 4 days ago | parent | prev | next [-] |
| [deleted] |
|
| ▲ | tamrix 4 days ago | parent | prev | next [-] |
| [flagged] |
| |
| ▲ | viraptor 4 days ago | parent [-] | | They actually hired some serious people. If they keep failing, it's not due to lack of total experience. That experience may not be well utilised, but the potential exists and they're far from kids. |
|
|
| ▲ | travisgriggs 3 days ago | parent | prev | next [-] |
| Don’t a bunch of Elixir/Erlang guys work at fly.io? It’s weird to me that that hallmark of reliability is associated with something that the public sees as unreliable. What gives with that association? |
|
| ▲ | veggieWHITES 4 days ago | parent | prev [-] |
| I was considering these guys the other day until I saw their pricing page: https://fly.io/pricing/ (There's not a single price on there, why even create the page?) |
| |
| ▲ | rascul 4 days ago | parent | next [-] | | There's a link to what appears to be the actual pricing page https://fly.io/docs/about/pricing/ There's also a link to the pricing calculator https://fly.io/calculator | | |
| ▲ | totetsu 4 days ago | parent | next [-] | | Is that calculator hourly or monthly? | | | |
| ▲ | veggieWHITES 2 days ago | parent | prev [-] | | [flagged] | | |
| ▲ | tptacek 2 days ago | parent [-] | | LOL. If you're not charging us for it, is there any other psychoanalysis you might be willing to provide? Alternatively: sometimes, a weirdly-positioned pricing page is just a weirdly-positioned pricing page. | | |
| ▲ | veggieWHITES a day ago | parent [-] | | Psychoanalysis? This is a technology review and discussion website. It's poor design to put a link to a pricing page and not list any prices. You're literally the only GPU provider that does that. I was trying to give you my perspective as a consumer since I know you frequent these forums but I apologize if I have offended you in some way. | | |
| ▲ | akerl_ a day ago | parent [-] | | You didn’t present a consumer perspective, you offered your hot take on why the company has set up the site in the way they did. |
|
|
|
| |
| ▲ | Aeolun 4 days ago | parent | prev | next [-] | | OMG, that's hilarious. I use them, and I know what my prices are, but I'd never noticed that the page called pricing doesn't actually have any. | | |
| ▲ | tptacek 3 days ago | parent [-] | | We've always had public pricing; you can't do a metered cloud provider without a rate sheet. But it's been part of our product documentation, rather than the front page of the website, until recently; there's a whole saga behind it, which gets into whether we offer "plans" or not, how support works, all that jazz, all of which kept us from putting together a marketing pricing page. | | |
| ▲ | Aeolun 3 days ago | parent [-] | | Yeah, I’m not trying to say you didn’t. After all, I wouldn’t have signed up just to find out the price. I just never noticed it wasn’t actually on the pricing page. | | |
| ▲ | tptacek 3 days ago | parent [-] | | I'm overexplainey, because (looks around at whole thread). These aren't fun! Anyways we've been dunking on ourselves for not having a proper pricing page longer than anyone else could have. :) |
|
|
| |
| ▲ | schmichael 4 days ago | parent | prev | next [-] | | The prices are just one click deeper. Hardly a nefarious dark pattern. | |
| ▲ | 4 days ago | parent | prev [-] | | [deleted] |
|