thisnullptr 4 hours ago

It’s fascinating to me that people think their services are so important that they can’t survive any downtime. Can we all admit that, while annoying, nothing really bad happened even when us-east-1 was down for almost half a working day?

bostik 32 minutes ago | parent | next [-]

As other posters have commented, an external auth service is a very special thing indeed. In modern and/or zero-trust systems if auth doesn't work, then effectively nothing works.

My rule of thumb from past experience is that if you demand 99.9% uptime for your own systems and you run auth in-house, then that auth system must have 99.99% reliability. If you are serving auth for OTHERS, then you have a system that can absolutely never be down, and at that point five nines becomes a baseline requirement.
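
To put rough numbers on that rule of thumb, here is a back-of-the-envelope sketch in Python. It assumes a 365-day year, that auth is a hard serial dependency (if auth is down, the service is down), and that failures are independent. The function names are made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Downtime budget, in minutes per year, implied by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(*components: float) -> float:
    """Availability of a chain where every component must be up.

    Assumes independent failures, which real systems only approximate.
    """
    result = 1.0
    for availability in components:
        result *= availability
    return result


if __name__ == "__main__":
    for target in (0.999, 0.9999, 0.99999):
        print(f"{target:.5f}: {downtime_minutes_per_year(target):.1f} min/year")
    # A 99.9% service sitting behind a 99.99% auth dependency.
    print(f"combined: {serial_availability(0.999, 0.9999):.6f}")
```

Under those assumptions, three nines allows roughly 526 minutes of downtime a year, four nines roughly 53, and five nines roughly 5. A 99.9% service behind 99.99% auth comes out at about 99.89% combined, so the auth dependency alone consumes about a tenth of a three-nines error budget.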

Auth is a critical path component. If your service is in the critical path in both reliability and latency[ß] for third parties, then every one of your failures is magnified by the number of customers getting hit by it.

ß: The current top-voted comment thread includes a mention that latency and response time should also be part of an SLA concern. I agree. For any hot-path system you must always be tracking the latency distribution, both from the service's own viewpoint AND from the point of view of the outside world. The typically useful metrics for that are p95, p99, p999 and max. Yes, max is essential to include: you always want to know what the worst experience someone/something had during any given time window was.
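
As a minimal sketch of what tracking those metrics over one time window could look like: the Python below uses the nearest-rank percentile definition, which is one of several common definitions and not necessarily what any given monitoring stack computes. The function names and the per-mille parameter are hypothetical; per-mille integers are used so that p999 does not go through floating-point rounding.

```python
from typing import Dict, List, Sequence


def nearest_rank(ordered: Sequence[float], per_mille: int) -> float:
    """Nearest-rank quantile of an ascending-sorted sample.

    per_mille is the quantile in thousandths: 950 for p95, 999 for p999.
    """
    if not ordered:
        raise ValueError("need at least one sample")
    # Integer ceiling of len(ordered) * per_mille / 1000, at least rank 1.
    rank = max(1, -(-len(ordered) * per_mille // 1000))
    return ordered[rank - 1]


def latency_summary(samples_ms: List[float]) -> Dict[str, float]:
    """p95, p99, p999 and max for one time window of latency samples."""
    ordered = sorted(samples_ms)
    return {
        "p95": nearest_rank(ordered, 950),
        "p99": nearest_rank(ordered, 990),
        "p999": nearest_rank(ordered, 999),
        "max": ordered[-1],
    }


if __name__ == "__main__":
    # 999 fast requests and one very slow one in the same window.
    window = [20.0] * 999 + [5000.0]
    print(latency_summary(window))
```

In that example window the three percentiles all report 20 ms and only max reveals the 5000 ms request, which is the reason for including max. Running the same summary on samples collected by the service and on samples collected by an external probe gives the inside and outside views.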

catlifeonmars 9 minutes ago | parent | prev | next [-]

[delayed]

shoo 3 hours ago | parent | prev | next [-]

In many contexts you are correct. Further, as someone in that earlier thread about the AWS us-east-1 outage mentioned, customers can be more forgiving of outages if you as the vendor can point to a widespread AWS us-east-1 outage and note that us-east-1 is down for everyone.

But, as JSR_FDED's sibling comment notes & as is spelled out in the article, Authress' business model of offering an auth service means that an outage on their side may entirely brick their clients' customer-facing auth / machine-to-machine auth.

I've worked in megacorp environments where an outage of certain internal services responsible for auth or issuing JWTs would break tens or hundreds of internal services and break various customer-facing flows. In many business contexts a big messy customer-facing outage for a day or so doesn't actually matter, but in some contexts it really can. In terms of blast radius, unavailability of a key auth service depended on by hundreds of things is up there with, I dunno, breaking the network.

JSR_FDED 3 hours ago | parent | prev [-]

If you’re providing auth services to many companies then a failure raises the likelihood of something bad happening to an unacceptable degree.