marvin-hansen 4 days ago

No surprise. About a year ago, I looked at fly.io because of its low pricing and wondered where they were cutting corners to still make some money. Ultimately, I found the answer in their tech docs, where it was spelled out clearly that a fly instance is hardwired to one physical server and thus cannot fail over if that server dies. Not sure whether that part is still in the official documentation.

In practice, that means if a server goes down, they have to load the last snapshot of that instance from backup, push it onto a new server, update the network path, and pray to god that no more servers fail than there is spare capacity for. Otherwise you have to wait for your restore until the datacenter has mounted a few more boxes in the rack.

That explains quite a bit of the randomness of those outage reports, i.e. my app is down while the other one is fine, and mine came back in 5 minutes while the other took forever.

As a business on a budget, I think almost anything else, e.g. a small Civo cluster, serves you better.

ignoramous 4 days ago | parent | next [-]

Fly.io can migrate a VM + its volume now: https://fly.io/docs/reference/machine-migration/ / https://archive.md/rAK0V

> a fly instance is hardwired to one physical server and thus cannot fail over

I'm having trouble understanding how else this is supposed to be? I understand that live migration is a thing, but even in those cases, a VM is "hardwired" to some physical server, no?

mzi 4 days ago | parent | next [-]

> I'm having trouble understanding how else this is supposed to be? I understand that live migration is a thing, but even in those cases, a VM is "hardwired" to some physical server, no?

You can run your workload (in this case a VM) on top of a scheduler, so if one node goes down the workload is just spun up on another available node.

You will have downtime, but it will be limited.
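
Roughly the shape of it, as a toy sketch in Python (node and workload names are made up; this isn't Fly's scheduler, just the idea):

    # Toy reconciliation loop: when the node running a workload stops being
    # healthy, the workload gets started on another available node.
    nodes = {"node-a": True, "node-b": True}   # node -> healthy?
    placements = {"web-vm": "node-a"}          # workload -> node it runs on

    def reconcile():
        for workload, node in list(placements.items()):
            if not nodes.get(node, False):                           # node died
                spare = next((n for n, ok in nodes.items() if ok), None)
                if spare is not None:
                    print(f"rescheduling {workload}: {node} -> {spare}")
                    placements[workload] = spare                     # brief downtime, then back up

    nodes["node-a"] = False   # simulate the physical server failing
    reconcile()               # -> rescheduling web-vm: node-a -> node-b

This is essentially what orchestrators like Nomad or Kubernetes do for you.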

ignoramous 4 days ago | parent [-]

> so if one goes down ... just spun up on another

On Fly, one can absolutely set this up. Multiple ways: https://fly.io/docs/apps/app-availability / https://archive.md/SJ32K

sofixa 3 days ago | parent | prev [-]

> I'm having trouble understanding how else this is supposed to be? I understand that live migration is a thing, but even in those cases, a VM is "hardwired" to some physical server, no?

They mean the storage part. If your VM's storage (state) is on one server and that server dies, you have to restore from backup. If your VM's storage is on remote shared storage mounted to that server and the server dies, your VM can be restarted on another server that has access to that shared storage.

In AWS land it's the difference between instance store (local to a server) and EBS (remote, network-attached but presented as a local disk).

There's a tradeoff in that shared storage will be slightly slower due to having to traverse the network, and it's harder to manage properly; but the reliability gain is massive.
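
A toy way to see the two recovery paths in Python (names and numbers invented):

    # Toy model of the two failure modes: with node-local storage the only
    # option is restoring a backup; with network-attached storage the volume
    # can simply be reattached to a surviving host.
    def recover(vm, storage_kind, backup_age_hours):
        if storage_kind == "shared":
            return f"reattach volume, restart {vm} on another host, no data loss"
        return f"restore {vm} from backup, losing up to {backup_age_hours}h of writes"

    print(recover("web-vm", "shared", 6))
    print(recover("web-vm", "local", 6))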

dilyevsky 4 days ago | parent | prev | next [-]

> Ultimately, I found the answer in their tech docs, where it was spelled out clearly that a fly instance is hardwired to one physical server and thus cannot fail over if that server dies.

The majority of EC2 instance types did not have live migration until very recently. Some probably still don't (AWS doesn't really spell out how and when it's supposed to work). It is also not free: there's a noticeable brown-out when your VM gets migrated on GCP, for example.

ixaxaar 4 days ago | parent [-]

Can you shed some more light on this "browning out" phenomenon?

toast0 4 days ago | parent [-]

Here's the GCP doc [1]. Other live migration products are similar.

Generally, you get worse performance while the VM is in the preparing-to-move state, then an actual pause, then worse performance again as the move finishes up. Depending on the networking setup, some inbound packets may be lost or delayed.

[1] https://cloud.google.com/compute/docs/instances/live-migrati...
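
If it helps, the usual pre-copy scheme behind this looks roughly like the toy Python sketch below (numbers invented, not GCP's implementation):

    # Toy pre-copy live migration: copy memory while the guest keeps running
    # (the degraded-performance phase), re-copy whatever it dirtied, and only
    # pause once the remaining dirty set is small (the actual pause).
    def live_migrate(total_pages, stop_copy_limit):
        remaining = total_pages
        round_no = 0
        while remaining > stop_copy_limit:
            print(f"round {round_no}: copy {remaining} pages while the guest runs")
            remaining //= 10     # toy assumption: ~10% of pages get re-dirtied per round
            round_no += 1
        print(f"pause guest, copy final {remaining} pages, resume on the target host")

    live_migrate(total_pages=1_000_000, stop_copy_limit=1_000)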

pier25 4 days ago | parent | prev | next [-]

If you want HA on Fly you need to deploy an app to multiple regions (multiple machines).

Fly might still go down completely if their proxy layer fails, but that's much less common.

sb8244 4 days ago | parent [-]

The proxy layer was the cause of yesterday's outage according to support.

pier25 4 days ago | parent [-]

Yes, but the previous comment was about hardware failure.

fulafel 4 days ago | parent | prev [-]

The status page tells a story of a high-availability/clustering system failure, so I think in this case the problem is rather the complexity of the HA machinery hurting the system's availability, compared to something like a simple VPS.