| ▲ | singhsanjay12 2 hours ago | |
Agreed: sliding-window error rates plus client-side circuit breakers (with half-open probes and ramp-up) work really well in practice, and the recovery-speed point is especially important. The one nuance I was trying to call out is what happens at very large scale. These mechanisms operate per client instance, so each client has to see a few failures before it trips its breaker, and then it runs its own probes and ramp-up. That's perfectly reasonable locally, but with hundreds or thousands of clients the aggregate "learning traffic" can be noticeable: each client might send only a little bad traffic before reacting, but multiplied across the fleet it adds up. Similarly, recovery can produce smaller synchronized ramps as many clients independently notice improvement around the same time. So I tend to think of client-side circuit breakers as necessary but not always sufficient at scale. They're great for fast local containment and tail-latency protection, but they work best when paired with a shared signal (LB, mesh control plane, or similar) that can dampen the aggregate effect and smooth recovery globally. | ||
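
To put rough numbers on the aggregate effect, here is a back-of-the-envelope sketch in Python. Everything in it is an illustrative assumption rather than something from a real system: the function names, the fleet size, the trip threshold, the probe interval, and the simplification that a shared signal opens every breaker as soon as a handful of "observer" clients have tripped (propagation delay is ignored).

```python
def fleet_learning_traffic(num_clients: int, failures_to_trip: int) -> int:
    """Failed requests sent fleet-wide before every per-client breaker has tripped.

    Assumes each client must independently observe `failures_to_trip` failures.
    """
    return num_clients * failures_to_trip


def fleet_probe_rate(num_clients: int, probe_interval_s: float) -> float:
    """Aggregate half-open probes per second while the backend stays unhealthy.

    Assumes each open breaker sends one probe every `probe_interval_s` seconds.
    """
    return num_clients / probe_interval_s


def shared_signal_learning_traffic(
    num_clients: int, failures_to_trip: int, observers: int
) -> int:
    """Failed requests when a shared signal (hypothetical) opens all breakers
    once `observers` clients have tripped; propagation delay is ignored."""
    return min(observers, num_clients) * failures_to_trip


if __name__ == "__main__":
    clients = 2000          # assumed fleet size
    trip_threshold = 5      # assumed failures needed to trip one breaker
    probe_interval = 10.0   # assumed seconds between half-open probes

    print(fleet_learning_traffic(clients, trip_threshold))             # 10000
    print(fleet_probe_rate(clients, probe_interval))                   # 200.0
    print(shared_signal_learning_traffic(clients, trip_threshold, 3))  # 15
```

Under these assumed numbers, purely local breakers let about 10,000 failed requests through before the whole fleet has reacted, and they keep roughly 200 probes per second hitting an unhealthy backend. A shared signal fed by three observers cuts the learning cost to 15 requests, which is the dampening effect described above.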