rcrowley 4 days ago

1. I don't know if there's a single name for this. I will point out that AWS EBS and Google Persistent Disk, as industrial examples of distributed, replicated block devices, also provide durability via replication. They're just providing it at a lower level that ends up sacrificing performance. I'm struggling to come up with a citation, but I think it's either Liskov or Lynch who offered a proof to the effect that durability in a distributed system is achieved via replication.

2. The thinking laid out in the blog post you linked to is how we went about it. You can do the math with your own parameters by computing the probability of a second node failure within the time it takes to recover from a first node failure. These are independent failures, being on physically separate hardware in physically separate availability zones. It's only when they happen together that problems arise. The core is this (a quick numeric sketch follows after point 3): P(second node failure within MTTR for first node failure) = 1 - e^(-(MTTR for a node failure) / (MTBF for a node))

3. This one's harder to test yourself. You can run all sorts of tests on your own (<https://rcrowley.org/2019/disasterpiece-theater.html>) and via AWS FIS, but you kind of have to trust the cloud provider (or read their SOC 2 report) to learn how availability zones really work and really fail.
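
For point 2, here's a minimal Python sketch of that formula, assuming exponentially distributed (memoryless) node failures; the MTTR and MTBF numbers below are illustrative placeholders, not a claim about any particular cluster:

    import math

    def p_failure_within(window_hours, mtbf_hours):
        # P(a given node fails within the window), for a constant
        # failure rate of 1/MTBF (exponential / memoryless model)
        return 1 - math.exp(-window_hours / mtbf_hours)

    # illustrative numbers only: ~5 hour recovery, ~72000 hour MTBF
    print(p_failure_within(5.0, 72000.0))  # ~7e-5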

n_u 3 days ago

P(two failures within MTTR for first node) = P(one failure) * P(second failure within MTTR of first node | one failure)

Independence simplifies things:

= P(one failure) * P(second failure within MTTR of first node)

= P(one failure) * (1 - e^(-λx))

where x = MTTR for first node

λ = 1/MTBF

Plugging in the numbers from your blog post:

P(one failure within 30 days) = 0.01 (not sure if this part is correct)

MTTR = 5 hours + 5 minutes ~= 5.083 hours

MTBF = 30 days / 0.01 = 3000 days = 72000 hours

0.01 * (1 - e^(-5.083 / 72000)) = 0.0000007 ~= 0.00007 %
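
Double-checking that arithmetic with a few lines of Python (just a sketch of the same expression, same inputs as above):

    import math

    p_one = 0.01              # P(one failure within 30 days), per the blog post
    mttr = 5 + 5 / 60         # 5 hours + 5 minutes ~= 5.083 hours
    mtbf = 3000 * 24          # 3000 days = 72000 hours
    lam = 1 / mtbf            # failure rate λ

    p_second = 1 - math.exp(-lam * mttr)   # ~0.00007
    print(p_one * p_second)                # ~0.0000007, i.e. ~0.00007 %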

I must be doing something wrong because I'm not getting the 0.000001% you have in the blog post. If there's some existing work on this I'd be stoked to read it; I can't quite find a source.

Also, there are two nodes that have the potential to fail while the first is down, but that would make my answer larger, not smaller.

rcrowley 3 days ago

I computed P(node failure within MTTR) = 0.00007, same as you. I extrapolated this to the outage scenario: P(at least two node failures within MTTR) = P(node failure within MTTR)^2 * (1 - P(node failure within MTTR)) + P(node failure within MTTR)^3 = 5.09 * 10^-9, which rounds to 0.0000001%.
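
Evaluating that expression directly, as a sketch (the exact output depends on how many digits of P(node failure within MTTR) you carry):

    p = 0.00007  # P(node failure within MTTR), from the thread above

    # P(at least two node failures within MTTR), using the expression above:
    # p^2 * (1 - p) + p^3
    p_outage = p**2 * (1 - p) + p**3
    print(p_outage)  # ~5e-9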