n_u 3 days ago

P(two failures within MTTR for first node) = P(one failure)P(second failure within MTTR of first node|one failure)

independence simplifies things

= P(one failure)P(second failure within MTTR of first node)

= P(one failure) * (1 - e^(-λx))

where x = MTTR for first node

λ = 1/MTBF

plugging in the numbers from your blog post

P(one failure within 30 days) = 0.01 (not sure if this part is correct)

MTTR = 5 minutes + 5 hours ~= 5.083 hours

MTBF = 30 days / 0.01 = 3000 days = 72000 hours

0.01 * (1 - e^(-5.083 / 72000)) ~= 0.0000007 = 0.00007%
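
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the same calculation. It assumes exponentially distributed, independent failures as above. The function and variable names are my own, and the 0.01 figure is the assumption I flagged as uncertain.

```python
import math

def p_fail_within(window_hours: float, mtbf_hours: float) -> float:
    """P(an exponentially distributed failure occurs within the window)."""
    # 1 - e^(-window / MTBF); expm1 keeps precision when the exponent is tiny.
    return -math.expm1(-window_hours / mtbf_hours)

p_first = 0.01                     # assumed P(one failure within 30 days)
mttr_hours = 5 + 5 / 60            # 5 hours + 5 minutes
mtbf_hours = 30 * 24 / p_first     # 30 days / 0.01, in hours

p_second = p_fail_within(mttr_hours, mtbf_hours)
p_both = p_first * p_second        # independence assumed

print(f"MTBF = {mtbf_hours:.0f} hours")
print(f"P(second failure within MTTR) = {p_second:.3e}")
print(f"P(both) = {p_both:.3e} = {p_both * 100:.5f}%")
```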

I must be doing something wrong cuz I'm not getting the 0.000001% you have in the blog post. If there's some existing work on this I'd be stoked to read it, I can't quite find a source.

Also, there are two other nodes that could fail while the first is down, but that would make my answer larger, not smaller.
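
A quick sketch of that last point, under the same exponential and independence assumptions (names are hypothetical): the chance that at least one of the two remaining nodes fails within the MTTR window is 1 - (1 - p)^2, which is close to 2p when p is this small.

```python
import math

def p_fail_within(window_hours: float, mtbf_hours: float) -> float:
    """P(an exponentially distributed failure occurs within the window)."""
    return -math.expm1(-window_hours / mtbf_hours)

# Same numbers as above: MTTR of 5 h 5 min, MTBF of 72000 h.
p = p_fail_within(5 + 5 / 60, 72000)

# At least one of the two surviving nodes fails while the first is down.
p_either = 1 - (1 - p) ** 2

print(f"one specific node: {p:.3e}")
print(f"either of two nodes: {p_either:.3e} (ratio {p_either / p:.4f})")
```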

rcrowley 3 days ago

I computed P(node failure within MTTR) = 0.00007, same as you. I extrapolated this to the outage scenario:

P(at least two node failures within MTTR) = P(node failure within MTTR)^2 * (1 - P(node failure within MTTR)) + P(node failure within MTTR)^3 = 5.09 * 10^-9

which, expressed as a percentage, rounds to 0.000001%.
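
A small Python sketch of that extrapolation, taking p = 0.00007 as given and assuming three independent nodes (the node count and the names are assumptions for illustration). It evaluates the expression as written, which comes out at roughly 5 * 10^-9. For comparison it also evaluates the standard binomial form for "at least two of three", which weights the exactly-two term by 3 because there are three ways to pick the failing pair. The thread does not settle which form the blog post intended.

```python
from math import comb

p = 0.00007   # P(node failure within MTTR), as quoted above
n = 3         # assumed number of nodes

# Expression as written in the comment: p^2 (1 - p) + p^3
as_written = p**2 * (1 - p) + p**3

# Binomial P(at least 2 of n fail): sum over k >= 2 of C(n, k) p^k (1 - p)^(n - k)
binomial = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(2, n + 1))

print(f"as written: {as_written:.3e} ({as_written * 100:.7f}%)")
print(f"binomial:   {binomial:.3e} ({binomial * 100:.7f}%)")
```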