fourseventy 4 days ago

The way I understood NVMe drives to work on Google Cloud is that they're ephemeral and your data will be lost if the VM reboots. How do they work in this case?

mattrobenolt 4 days ago

We deal with this by always running 3 nodes in a cluster, one per AZ, and strong backup/restore processes.

So yes, the data per-node is ephemeral, but it is redundant and durable for the whole cluster.

bourbonproof 3 days ago

Do I understand this right: if these 3 nodes shut down for some reason, all data is lost and you have to actually restore from backup instead of just starting the machines again? And even if you have to restart one node (due to updates or crashes), do you also have to restore from backup? If so, why not pick a hosting provider that doesn't wipe the disk when a machine shuts down?

mattrobenolt 3 days ago

It's more than just shutting down. You'd have to have an actual failure. Data isn't lost on a simple restart. It'd require 3 nodes to die in 3 different AZs.

While that's not impossible, the probability is very low.

So simply restarting nodes wouldn't trigger restoring from backup, but yes, in our case, replacing nodes entirely does require that node to restore from a backup/WALs and catch back up in replication.

EBS doesn't entirely solve this either; you still have failures and still need/want to restore from backups. This is built into our product as a fundamental feature. It's transparent to users, and the upside is that creating backups and restoring from them is tested multiple times per day, every day, for a database. We aren't afraid of restoring from backups and replacing nodes, by choice or by failure. It's the same to us.

We already do all of the same operations on EBS. This is the magic that enables us to use NVMe drives, since we treat EBS as ephemeral already.
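
[A minimal, hypothetical Python sketch of the replace-and-catch-up flow described above: restore the newest full backup, replay the WAL written after it, then let the node rejoin replication. All names here are illustrative, not PlanetScale's actual API.]

    from dataclasses import dataclass, field

    @dataclass
    class BackupStore:
        snapshot: dict        # last full backup: key -> value
        snapshot_pos: int     # WAL position the snapshot covers
        wal: list             # ordered log of (pos, key, value)

        def wal_since(self, pos):
            return [e for e in self.wal if e[0] > pos]

    @dataclass
    class Node:
        data: dict = field(default_factory=dict)
        pos: int = 0

        def restore(self, store):
            self.data = dict(store.snapshot)
            self.pos = store.snapshot_pos

        def replay(self, entries):
            for pos, key, value in entries:
                self.data[key] = value
                self.pos = pos

    # Replace a failed node: bulk restore, then WAL replay; after this the
    # node would rejoin the replication stream and serve once caught up.
    store = BackupStore(snapshot={"a": 1}, snapshot_pos=1,
                        wal=[(1, "a", 1), (2, "b", 2), (3, "a", 3)])
    fresh = Node()
    fresh.restore(store)
    fresh.replay(store.wal_since(fresh.pos))
    assert fresh.data == {"a": 3, "b": 2} and fresh.pos == 3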

rcrowley 4 days ago

You don't (typically) lose the data on the ephemeral drive across a reboot but you definitely can (and do!) when there are more permanent hardware failures. (They really happen!) That's why PlanetScale always maintains at least three copies of the data. We guarantee durability via replication, not by trusting the (slow, network-attached) block device.

I did an interview all about PlanetScale Metal a couple of months ago: <https://www.youtube.com/watch?v=3r9PsVwGkg4>

n_u 4 days ago

Hi, thank you for your work on this and for being willing to answer questions about it.

"We guarantee durability via replication". I've starting noticing this pattern more where distributed systems provide durability by replicating data rather than writing it to disk and achieving the best of both worlds. I'm curious

1. Is there a name for this technique?

2. How do you calculate your availability? This blog post[1] has some rough details but I'd love to see the math.

3. I'm guessing a key part of this is putting the replicas in different AZs and assuming failures aren't correlated so you can multiply the probabilities directly. How do you validate that failures across AZs are statistically independent?

Thanks!

[1] https://planetscale.com/blog/planetscale-metal-theres-no-rep...

rcrowley 4 days ago

1. I don't know if there's a single name for this. I will point out that AWS EBS and Google Persistent Disk, as industrial examples of distributed, replicated block devices, also provide durability via replication; they're just providing it at a lower level that ends up sacrificing performance. I'm struggling to come up with a citation, but I think it's either Liskov or Lynch who offered a proof to the effect that durability can be achieved in a distributed system via replication.

2. The thinking laid out in the blog post you linked is how we went about it. You can do the math with your own parameters by computing the probability of a second node failure within the time it takes to recover from a first node failure. These are independent failures, since the nodes are on physically separate hardware in physically separate availability zones; it's only when they happen together that problems arise. The core is this: P(second node failure within the MTTR of the first) = 1 - e^(-(MTTR for a node) / (MTBF for a node))

3. This one's harder. You can run all sorts of tests yourself (<https://rcrowley.org/2019/disasterpiece-theater.html>) and via AWS FIS, but you kind of have to trust the cloud provider (or read their SOC 2 report) to learn how availability zones really work and really fail.

n_u 3 days ago

P(two failures within MTTR for first node) = P(one failure)P(second failure within MTTR of first node|one failure)

independence simplifies things

= P(one failure)P(second failure within MTTR of first node)

= P(one failure) * (1 - e^-λx)

where x = MTTR for first node

λ = 1/MTBF

plugging in the numbers from your blog post

P(one failure within 30 days) = 0.01 (not sure if this part is correct)

MTTR = 5 minutes + 5 hours =~ 5.083 hours

MTBF = 30 days / 0.01 = 3000 days = 72000 hours

0.01 * (1 - e^(-5.083 / 72000)) = 0.0000007 ~= 0.00007 %

I must be doing something wrong because I'm not getting the 0.000001% you have in the blog post. If there's some existing work on this I'd be stoked to read it; I can't quite find a source.

Also, there are two nodes that have the potential to fail while the first is down, but that would make my answer larger, not smaller.

rcrowley 3 days ago

I computed P(node failure within MTTR) = 0.00007, same as you. I extrapolated this to the outage scenario: P(at least two node failures within MTTR) = P(node failure within MTTR)^2 * (1 - P(node failure within MTTR)) + P(node failure within MTTR)^3 = 5.09 * 10^-9, which rounds to 0.0000001%.
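
[For concreteness, a quick Python check of the arithmetic above, using the thread's own parameters (MTTR = 5 hours 5 minutes, MTBF = 72,000 hours). Small differences from the 5.09e-9 figure come down to rounding of the inputs.]

    import math

    mttr = 5 + 5 / 60        # hours: detection + restore time for one node
    mtbf = 30 * 24 / 0.01    # 72,000 hours, from P(failure in 30 days) = 0.01

    # P(a given node fails within one MTTR window), exponential failure model
    p = 1 - math.exp(-mttr / mtbf)            # ~7.06e-05

    # The outage estimate above: at least two of three nodes down at once.
    # Note that p^2 * (1 - p) + p^3 simplifies to exactly p^2.
    p_outage = p**2 * (1 - p) + p**3          # ~5.0e-09
    print(f"p = {p:.3e}, P(outage) = {p_outage:.2e}")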

maxenglander 4 days ago

Hi n_u, PlanetScale engineer here. I'm going to address just the point about durability via replication. I can't speak to what you've seen with other distributed systems, but at PlanetScale we don't do replication instead of writing to disk; we do replication in addition to writing to disk. Best of both worlds.

rcrowley 4 days ago

Good point, Max. I glossed over the "rather than" bit. We do, as you say, write to disks all over the place.

Even writing to one disk, though, isn't good enough. So we write to three and wait until two have acknowledged before we acknowledge that write to the client.
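
[A minimal Python sketch of that write path: fan the write out to three replicas and acknowledge the client once two have confirmed it. This illustrates quorum acknowledgement generally, not PlanetScale's actual implementation; all names are illustrative.]

    from concurrent.futures import ThreadPoolExecutor, as_completed

    class Replica:
        """Toy in-memory replica; apply() returns True once the write is durable."""
        def __init__(self):
            self.data = {}
        def apply(self, key, value):
            self.data[key] = value
            return True

    def quorum_write(replicas, key, value, quorum=2):
        """Return True once `quorum` replicas have durably applied the write."""
        acks = 0
        with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
            futures = [pool.submit(r.apply, key, value) for r in replicas]
            for future in as_completed(futures):
                if future.result():        # replica confirmed the write
                    acks += 1
                    if acks >= quorum:     # durable on a quorum: ack the client
                        return True
        return False                       # quorum not reached: do not ack

    replicas = [Replica() for _ in range(3)]   # one per AZ in the setup above
    assert quorum_write(replicas, "k", "v")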

alexeldeib 4 days ago

Can't speak to GCP specifically, but usually the issue is that the drives are host-attached and can't be migrated, so they need to be wiped on VM termination or migration; that's when you lose data.

Reboots typically don't otherwise do anything special unless they also trigger a host migration, though GCP's live migration docs do mention some support.

GCP mentions that data persists across reboots here: https://cloud.google.com/compute/docs/disks/local-ssd#data_p...

Note that stop/terminate via cloud APIs usually releases host capacity for other customers and triggers a data wipe; a guest-initiated reboot typically will not.