mattrobenolt 4 days ago

We deal with this by always running 3 nodes in a cluster, one per AZ, and strong backup/restore processes.

So yes, the data per-node is ephemeral, but it is redundant and durable for the whole cluster.

bourbonproof 3 days ago | parent [-]

Do I understand this right: if these 3 nodes shut down for some reason, all data is lost and you have to actually restore from backup instead of just starting the machines again? And even if you only have to restart one node (due to updates or crashes), you also have to restore from backup? If so, why not pick a hosting provider that doesn't wipe the disk when a machine shuts down?

mattrobenolt 3 days ago | parent [-]

It's more than just shutting down. You'd have to have an actual failure. Data isn't lost on a simple restart. It'd require 3 nodes to die in 3 different AZs.

While that's not impossible, the likelihood is very low.

So simply restarting nodes wouldn't trigger restoring from backup, but yes, in our case, replacing nodes entirely does require that node to restore from a backup/WALs and catch back up in replication.

EBS doesn't entirely solve this; you still have failures and still need/want to restore from backups. This is built into our product as a fundamental feature. It's transparent to users, but the upside is that creating backups and restoring from them is exercised multiple times per day for a given database. We aren't afraid of restoring from backups and replacing nodes, whether by choice or by failure. It's the same to us.

We already do all of the same operations on EBS. This is what enables us to use NVMe drives, since we already treat EBS as ephemeral.
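
To make the restart-versus-replace distinction above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the actual implementation: the names `Event`, `Node`, `recovery_steps`, and `cluster_needs_full_restore` are hypothetical, and the step labels are only shorthand for the operations described in the comment.

```python
from dataclasses import dataclass
from enum import Enum


class Event(Enum):
    RESTART = "restart"  # same node comes back, local disk still intact
    REPLACE = "replace"  # node is gone; a fresh node with an empty disk joins


@dataclass
class Node:
    az: str                      # availability zone the node runs in
    has_local_data: bool = True  # False once the ephemeral disk is lost


def recovery_steps(event: Event) -> list[str]:
    """Return the (hypothetical) steps a single node goes through."""
    if event is Event.RESTART:
        # A simple restart keeps local data, so no backup restore is needed.
        return ["catch_up_replication"]
    # A replaced node starts empty: restore a backup, replay WALs,
    # then catch up from the other nodes via replication.
    return ["restore_backup", "replay_wal", "catch_up_replication"]


def cluster_needs_full_restore(nodes: list[Node]) -> bool:
    """Only when every node has lost its disk is the backup the sole copy."""
    return not any(node.has_local_data for node in nodes)


cluster = [Node("az-a"), Node("az-b"), Node("az-c")]
cluster[0].has_local_data = False  # one node fails and is replaced
print(recovery_steps(Event.REPLACE))
print(cluster_needs_full_restore(cluster))
```

In this sketch a single replaced node restores from backup while the other two keep serving; only losing all three disks at once would leave the backup as the only copy of the data.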