sharklasers123 11 hours ago
Isn't there an inherent risk in using an AWS service (Route 53) to do the health check? Wouldn't it make more sense to use a different cloud provider for redundancy?
kondro 5 hours ago
While there appears to be some us-east-1 SPoF for Route 53 updates (as shown recently), the actual health checks themselves run from up to 8 different regions [1], and an endpoint is only treated as unhealthy once 18% or fewer of the health checkers report it as healthy [2], so a failover needs broad agreement rather than a single region's view. AWS has very good isolation between regions and, while Route 53 relies on us-east-1 for control plane updates, health checks and failovers are data plane operations [3] and aren't affected by a us-east-1 outage.

Relying on a single provider always seems like a risk, but the increased complexity of designing systems for multi-cloud will usually result in an increased risk of failure, not a decrease.

[1] us-east-1, us-west-1, us-west-2, eu-west-1, ap-southeast-1, ap-southeast-2, ap-northeast-1 and sa-east-1; the default is to use all of them.
[2] https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/dn...
[3] https://aws.amazon.com/blogs/networking-and-content-delivery...
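To make the 18% figure concrete, here is a minimal sketch of the aggregation rule as I read it in the Route 53 docs: the endpoint counts as healthy when more than 18% of the health checkers report it healthy, and as unhealthy otherwise. The function name, the constant, and the 16-checker example are my own illustration, not an AWS API, and the docs suggest the threshold itself could change.

```python
# Illustrative only: models the documented aggregation rule, not AWS code.
HEALTHY_THRESHOLD = 0.18  # assumption: value as currently documented by AWS


def endpoint_is_healthy(checker_reports: list[bool]) -> bool:
    """Aggregate per-checker results (True = checker saw the endpoint as healthy).

    The endpoint is considered healthy only if strictly more than 18% of the
    checkers report it healthy; otherwise it is unhealthy and a failover
    record set would be eligible to take over.
    """
    if not checker_reports:
        raise ValueError("need at least one checker report")
    healthy_fraction = sum(checker_reports) / len(checker_reports)
    return healthy_fraction > HEALTHY_THRESHOLD


# Hypothetical fleet of 16 checkers spread across regions.
three_of_sixteen = [True] * 3 + [False] * 13   # 18.75% healthy
two_of_sixteen = [True] * 2 + [False] * 14     # 12.5% healthy
```

In this model a handful of checkers that can still reach the endpoint is enough to keep it in service, so a partial outage on the checking side shouldn't trigger a failover by itself.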
wparad 10 hours ago
If the check can't be done, then everything stays stable, so I'm guessing the question is, "What happens if Route 53 does the check and incorrectly reports the result?" In that case, no matter what we are using, there is going to be a critical issue. The best I could suggest at that point would be to have records in your zone that round-robin across different cloud providers, but that comes with its own challenges.

I believe there are some articles around on how AWS plans for failure so that the fallback mechanism actually reduces load on the system rather than making it worse. I think it would require in-depth investigation of the expected failover mode to have a good answer there.

For instance, just to make it more concrete, what sort of failure mode are you expecting to happen with the Route 53 health check? Depending on that, there could be different recommendations.
indigodaddy 10 hours ago
Had the same thought, e.g., if things are really down, can it even do the check?