n_u 4 days ago

Hi, thank you for your work on this and being willing to answer questions on it.

"We guarantee durability via replication". I've starting noticing this pattern more where distributed systems provide durability by replicating data rather than writing it to disk and achieving the best of both worlds. I'm curious

1. Is there a name for this technique?

2. How do you calculate your availability? This blog post[1] has some rough details but I'd love to see the math.

3. I'm guessing a key part of this is putting the replicas in different AZs and assuming failures aren't correlated so you can multiply the probabilities directly. How do you validate that failures across AZs are statistically independent?

Thanks!

[1] https://planetscale.com/blog/planetscale-metal-theres-no-rep...

rcrowley 4 days ago | parent | next [-]

1. I don't know if there's a single name for this. I will point out that AWS EBS and Google Persistent Disk, as industrial examples of distributed, replicated block devices, also provide durability via replication. They're just providing it at a lower level that ends up sacrificing performance. I'm struggling to come up with a citation, but I think it was either Liskov or Lynch who offered a proof to the effect that durability can be achieved in a distributed system via replication.

2. The thinking laid out in the blog post you linked to is how we went about it. You can do the math with your own parameters by computing the probability of a second node failure within the time it takes to recover from a first node failure. These are independent failures, being on physically separate hardware in physically separate availability zones. It's only when they happen together that problems arise. The core of it is: P(second node failure within MTTR of first node failure) = 1 - e^( -(MTTR for a node failure) / (MTBF for a node) ). (There's a rough Python sketch of this at the end of this comment.)

3. This one's harder to validate. You can do all sorts of tests yourself (<https://rcrowley.org/2019/disasterpiece-theater.html>) and via AWS FIS, but you kind of have to trust the cloud provider (or read their SOC 2 report) to learn how availability zones really work and really fail.
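
Here's that rough sketch in Python, so you can plug in your own parameters. The MTTR and MTBF values below are illustrative placeholders, not our production figures:

    import math

    # Exponential failure model: P(failure within time t) = 1 - e^(-t / MTBF)
    mttr_hours = 5.0        # placeholder: time to detect a failed node and restore full replication
    mtbf_hours = 72_000.0   # placeholder: mean time between failures for a single node

    # Probability that a second node fails before recovery from the first completes
    p_second_failure = 1 - math.exp(-mttr_hours / mtbf_hours)
    print(f"P(second node failure within MTTR) ~ {p_second_failure:.2e}")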

n_u 3 days ago | parent [-]

P(two failures within MTTR of first node)
= P(one failure) * P(second failure within MTTR of first node | one failure)
= P(one failure) * P(second failure within MTTR of first node)    [independence simplifies things]
= P(one failure) * (1 - e^(-λx))

where x = MTTR for first node and λ = 1/MTBF.

Plugging in the numbers from your blog post:

P(one failure within 30 days) = 0.01 (not sure if this part is correct)

MTTR = 5 minutes + 5 hours ≈ 5.083 hours

MTBF = 30 days / 0.01 = 3000 days = 72000 hours

0.01 * (1 - e^(-5.083 / 72000)) ≈ 0.0000007 ≈ 0.00007%

I must be doing something wrong cuz I'm not getting the 0.000001% you have in the blog post. If there's some existing work on this I'd be stoked to read it; I can't quite find a source.

Also, there are two nodes that have the potential to fail while the first is down, but that would make my answer larger, not smaller.
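
For reference, the same arithmetic as a quick Python check, using the assumed numbers above:

    import math

    # Assumptions from above: P(one node failure within 30 days) = 0.01,
    # MTTR = 5 hours 5 minutes, MTBF = 30 days / 0.01 = 3000 days = 72000 hours
    p_one_failure = 0.01
    mttr_hours = 5 + 5 / 60      # ~5.083 hours
    mtbf_hours = 3000 * 24       # 72000 hours
    lam = 1 / mtbf_hours         # lambda = 1/MTBF

    p_second_within_mttr = 1 - math.exp(-lam * mttr_hours)   # ~0.00007
    p_both = p_one_failure * p_second_within_mttr            # ~7e-7, i.e. ~0.00007%

    print(f"P(second failure within MTTR | one failure) ~ {p_second_within_mttr:.1e}")
    print(f"P(both) ~ {p_both:.1e} ({p_both:.5%})")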

rcrowley 3 days ago | parent [-]

I computed P(node failure within MTTR) = 0.00007, same as you. I extrapolated this to the outage scenario: P(at least two node failures within MTTR) = P(node failure within MTTR)^2 * (1 - P(node failure within MTTR)) + P(node failure within MTTR)^3 = 5.09 * 10^-9, which rounds to the 0.000001% in the blog post.
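
Roughly, in Python (reproducing the combination step above; the exact digits depend on how you round MTTR and MTBF):

    import math

    # Per-node probability of failing within the MTTR window, as above
    mttr_hours = 5 + 5 / 60
    mtbf_hours = 72_000
    p = 1 - math.exp(-mttr_hours / mtbf_hours)   # ~0.00007

    # Two nodes fail while the third survives, plus all three failing together
    p_outage = p**2 * (1 - p) + p**3             # ~5e-9
    print(f"P(at least two node failures within MTTR) ~ {p_outage:.2e}")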

maxenglander 4 days ago | parent | prev [-]

Hi n_u, PlanetScale engineer here. I'm going to address just the point about durability via replication. I can't speak to what you've seen with other distributed systems, but, at PlanetScale, we don't do replication instead of writing to disk; we do replication in addition to writing to disk. Best of both worlds.

rcrowley 4 days ago | parent [-]

Good point, Max. I glossed over the "rather than" bit. We do, as you say, write to disks all over the place.

Even writing to one disk, though, isn't good enough. So we write to three and wait until two have acknowledged before we acknowledge that write to the client.
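
If it helps to picture it, here's a toy sketch of that 2-of-3 acknowledgment pattern. It's purely illustrative, not our actual implementation; replica_write is a stand-in for shipping the write to a replica and getting it onto disk there:

    import concurrent.futures
    import random
    import time

    REPLICAS = ["replica-a", "replica-b", "replica-c"]  # hypothetical replica names
    QUORUM = 2  # acknowledge the client once a majority has the write on disk

    def replica_write(replica: str, payload: bytes) -> str:
        """Stand-in for sending the write to one replica and fsyncing it there."""
        time.sleep(random.uniform(0.001, 0.01))  # simulate network + disk latency
        return replica

    def durable_write(payload: bytes) -> bool:
        """Acknowledge the write once QUORUM replicas have confirmed it."""
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(REPLICAS))
        futures = [pool.submit(replica_write, r, payload) for r in REPLICAS]
        acks = 0
        for future in concurrent.futures.as_completed(futures):
            try:
                future.result()
                acks += 1
            except Exception:
                continue  # one failed replica doesn't block the quorum
            if acks >= QUORUM:
                pool.shutdown(wait=False)  # let the slowest replica finish in the background
                return True
        pool.shutdown(wait=False)
        return False

    if __name__ == "__main__":
        print("acknowledged to client:", durable_write(b"INSERT ..."))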