umairnadeem123 2 hours ago

one pattern that works well in practice: combine passive health checks with circuit breaker state machines at the client level. instead of binary healthy/unhealthy, track a sliding window of error rates per backend. once a backend crosses your error threshold, move it to half-open state where it gets 1 in N requests as probes. this gives you sub-second detection without the false-positive problem of aggressive active health checks.
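a minimal sketch of that pattern, with illustrative class/parameter names and thresholds (nothing here comes from a specific library): each backend keeps a sliding window of recent outcomes, and once the error rate crosses the threshold it goes half-open and only receives 1-in-N probe requests until one succeeds.

```python
import collections
import random

class BackendBreaker:
    """Per-backend breaker: sliding window of outcomes, half-open probing.

    window: number of recent requests tracked
    error_threshold: error rate that trips the breaker
    probe_one_in: in half-open state, admit roughly 1 in N requests as probes
    min_samples: don't trip on a near-empty window
    """

    def __init__(self, window=100, error_threshold=0.5, probe_one_in=10,
                 min_samples=5):
        self.outcomes = collections.deque(maxlen=window)  # True = error
        self.error_threshold = error_threshold
        self.probe_one_in = probe_one_in
        self.min_samples = min_samples
        self.half_open = False

    def error_rate(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0

    def allow_request(self):
        if self.half_open:
            # only 1 in N requests reach the backend, as probes
            return random.randrange(self.probe_one_in) == 0
        return True

    def record(self, error):
        self.outcomes.append(error)
        if (not self.half_open
                and len(self.outcomes) >= self.min_samples
                and self.error_rate() > self.error_threshold):
            self.half_open = True
        elif self.half_open and not error:
            # successful probe: recover, and clear the window so stale
            # errors don't immediately re-trip the breaker
            self.half_open = False
            self.outcomes.clear()
```

because detection is driven by the error rate over the last `window` requests rather than a periodic active check, a hot backend trips within a handful of requests, which is where the sub-second detection comes from.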

the article focuses on detection speed but misses the equally important problem of recovery speed. backends that come back after a failure often get thundering-herded by all the clients that simultaneously notice the recovery. connection ramping (slowly increasing traffic to a recovered backend) is just as important as fast detection.
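one simple way to sketch the ramping half (names and numbers here are hypothetical, not from the article): after a backend is marked recovered, scale its admitted traffic share from a small floor up to full weight over a fixed ramp period, and either use the result as a load-balancer weight or compare it to a per-request random draw.

```python
import time

def ramp_weight(recovered_at, ramp_seconds=30.0, floor=0.1, now=None):
    """Slow-start weight for a recovered backend.

    Starts at `floor` (e.g. 10% of normal traffic share) the moment the
    backend recovers and grows linearly to 1.0 over `ramp_seconds`,
    avoiding a thundering herd onto a still-cold backend.
    """
    now = time.monotonic() if now is None else now
    elapsed = max(0.0, now - recovered_at)
    return min(1.0, floor + (1.0 - floor) * (elapsed / ramp_seconds))
```

per-client jitter on `ramp_seconds` (or on the floor) also helps here, since it desynchronizes the many clients that notice the recovery at the same moment.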

singhsanjay12 an hour ago

Agree - sliding window error rates plus client-side circuit breakers (with half-open probes and ramp-up) work really well in practice, and the recovery-speed point is especially important.

The only nuance I was trying to call out is what happens at very large scale. These mechanisms operate per client instance, so each client needs a few failures before it trips its breaker, and then runs its own probes and ramp-up. That's perfectly reasonable locally, but each client's small amount of bad traffic, multiplied across hundreds or thousands of clients, can add up to noticeable aggregate "learning traffic." Similarly, recovery can still produce smaller synchronized ramps as many clients independently notice the improvement at around the same time.
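A back-of-envelope version of that point, with purely illustrative numbers: the fleet-wide learning traffic scales with the number of client instances, because every breaker has to trip independently.

```python
def aggregate_learning_traffic(clients, failures_to_trip):
    """Requests that hit a bad backend before every client's breaker trips.

    Each client instance must observe `failures_to_trip` failures on its
    own before it stops sending; the backend absorbs the sum across the
    fleet. Both arguments are illustrative knobs, not measured values.
    """
    return clients * failures_to_trip

# e.g. 2,000 client instances, each needing 5 failures to trip,
# still send the bad backend thousands of failed requests in aggregate,
# even though each individual client reacted almost immediately.
```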

So I tend to think of client-side circuit breakers as necessary but not always sufficient at scale. They're great for fast local containment and tail-latency protection, but they work best when paired with some shared signal (LB, mesh control plane, or similar) that can dampen the aggregate effect and smooth recovery globally.