shoo 3 hours ago
> the section about retries doesn't mention correlations. [...] By treating P_{3rdParty}(Failure) as fixed, they're assuming a model in which each try is completely independent: all the failures are due to background noise. But that's totally wrong, as shown by the existence of big outages like the one they're describing

Yes, that jumped out at me as well. A slightly more sophisticated model could be to assume there are two possible causes of a failed 3rd party call: (a) a transient issue - failure can be masked by retrying, and (b) a serious outage - where retrying is likely to find that the 3rd party dependency is still unavailable. Our probabilistic model of this 3rd party dependency could then look something like
I.e. a failed call is 9x more likely to be caused by a transient issue than a serious outage. If the cause was a transient issue we assume independence between sequential attempts like in the article, but if the failure was caused by a serious outage there's only a 5% chance that each sequential retry attempt will succeed.

In contrast with the math sketched in the article, where retrying a 3rd party call with a 10% failure rate 5 times could suffice for a 99.999% success rate, with the above model, which includes a serious-outage failure mode that produces a string of failures, we'd need to retry 135 times after a first failed call to achieve the same 99.999% success rate.

Your points about the overall latency a client is willing to wait for, and about retries causing additional load, are good: in many systems "135 retry attempts" is impractical and would mean "our overall system has failed and is unavailable".

Anyhow, it's still an interesting article. The meat of the argument and logic about 3rd party deps needing to meet some minimum bar of availability to be included still makes sense, but if our failure model considers failure modes like lengthy outages that can cause correlated failure patterns, that raises the bar for how reliable any given 3rd party dep needs to be even further.
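For anyone who wants to check the 135 figure, here is a rough Python sketch of the two-cause model. The parameter values are my reading of the description above (10% first-call failure rate, a 90/10 split between transient and outage causes, a 10% per-retry failure rate in the transient case, 95% in the outage case), and the constant and function names are made up for illustration, so treat it as one plausible reconstruction rather than the exact model.

```python
# Assumed parameters, read off the prose description of the model.
P_FIRST_CALL_FAILS = 0.10         # the article's 10% failure rate
P_TRANSIENT_GIVEN_FAILURE = 0.90  # a failure is 9x more likely transient...
P_OUTAGE_GIVEN_FAILURE = 0.10     # ...than a serious outage
P_RETRY_FAILS_TRANSIENT = 0.10    # independent attempts, as in the article
P_RETRY_FAILS_OUTAGE = 0.95       # only a 5% chance that a retry succeeds


def overall_failure(retries: int) -> float:
    """Probability that the first call and every one of `retries` retries fail."""
    # Mixture over the two causes: within each cause, retries are independent.
    all_retries_fail = (
        P_TRANSIENT_GIVEN_FAILURE * P_RETRY_FAILS_TRANSIENT ** retries
        + P_OUTAGE_GIVEN_FAILURE * P_RETRY_FAILS_OUTAGE ** retries
    )
    return P_FIRST_CALL_FAILS * all_retries_fail


def retries_needed(target_failure: float) -> int:
    """Smallest number of retries that pushes overall failure to the target or below."""
    retries = 0
    while overall_failure(retries) > target_failure:
        retries += 1
    return retries


if __name__ == "__main__":
    # 99.999% success rate means an overall failure probability of at most 1e-5.
    print(retries_needed(1e-5))  # → 135
```

The outage term dominates almost immediately: the transient term shrinks by 10x per retry, while the outage term only shrinks by 5% per retry, so the answer is driven by how long it takes 0.95^n to fall below 1e-3.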