loloquwowndueo 17 hours ago

There are plenty of other technologies whose failure mode is a total outage; that’s not exclusive to a failed certificate renewal.

A certificate renewal process has several points at which failure can be detected and action taken, and it sounds like this team was relying only on a “failed to renew” alert/monitor.

A broken alerting system is also mentioned: the monitor “didn’t alert for whatever reason”.

If this certificate is so critical, they should also have something that alerts if you’re still serving a certificate with less than 2 weeks validity - by that time you should have already obtained and rotated in a new certificate. This gives plenty of time for someone to manually inspect and fix.
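A minimal sketch of such a check in Python, using only the standard library. The function names and the `WARN_DAYS` constant are made up for illustration; the 14-day threshold is the "2 weeks" from above. It looks at the certificate the service actually serves, not at what the renewal job thinks it did.

```python
import socket
import ssl
from datetime import datetime, timezone

WARN_DAYS = 14  # hypothetical name; the "2 weeks" threshold from the comment


def days_remaining(not_after: str, now: datetime) -> float:
    """Days until expiry, given a notAfter string as returned by getpeercert()."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_attention(not_after: str, now: datetime, warn_days: int = WARN_DAYS) -> bool:
    """True when the served certificate has fewer than warn_days of validity left."""
    return days_remaining(not_after, now) < warn_days


def served_cert_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Fetch notAfter from the certificate the host actually serves.

    With the default context, an already expired or otherwise invalid
    certificate makes the handshake raise ssl.SSLCertVerificationError,
    which a monitor should treat as a failure as well.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `needs_attention(served_cert_not_after(host), datetime.now(timezone.utc))` for each public hostname and raise an alert when it returns `True`. Keeping the date arithmetic separate from the network call makes the threshold logic testable without a live server.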

Sounds like a case of the “nothing in this automated process can fail, so we only need this one trivial monitor, which also can’t fail, so meh” attitude.

tetha 7 hours ago

> If this certificate is so critical, they should also have something that alerts if you’re still serving a certificate with less than 2 weeks validity - by that time you should have already obtained and rotated in a new certificate. This gives plenty of time for someone to manually inspect and fix.

This is also why you want a mix of alerts from the service user's point of view, as well as internal troubleshooting alerts. The user's-point-of-view alerts usually give more value and can be surprisingly simple at times.

"Remaining validity of the certificates offered by the service" is a classic check from the user's point of view. It may not tell you why this is going wrong, but it tells you something is going wrong. This captures a multitude of different possible errors: certs not reloading, the wrong certs being loaded, certs not being issued, DNS going to the wrong instance, new, shorter cert lifecycles, outages at the CA, and so on.

And then you can add further checks into the machinery to speed up the process of finding out why: checks that the cert creation jobs run properly, checks whether the certs on disk / in the secret store are actually loaded, ...

Good alerting solutions might also allow relationships between these alerts to simplify troubleshooting: don't alert on the cert expiry if there is a failed cert renewal cron job; alert on that instead.
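As a rough illustration of that relationship, here is a small Python sketch. The alert names and the `SUPPRESSED_BY` mapping are hypothetical and not any particular tool's API; the idea is only that a symptom alert is dropped when a more specific cause alert is already firing.

```python
# Hypothetical mapping: symptom alert -> cause alerts that should replace it.
SUPPRESSED_BY = {
    "cert_expiring_soon": ["cert_renew_job_failed"],
}


def alerts_to_send(firing, suppressed_by=SUPPRESSED_BY):
    """Return the firing alerts that should actually notify someone.

    A symptom is skipped when at least one of its known causes is
    firing too, so the on-call person sees the cause, not the symptom.
    """
    firing = set(firing)
    result = []
    for name in sorted(firing):
        causes = suppressed_by.get(name, [])
        if any(cause in firing for cause in causes):
            continue  # a more specific alert already covers this
        result.append(name)
    return result
```

If only the expiry alert fires, it still goes out, which covers the case where the renewal job reports success but the service keeps serving the old certificate.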

yearolinuxdsktp 16 hours ago

Additionally, warnings can be built into the clients themselves. If you connect to a host whose cert expires in less than 2 weeks, print a warning in your client. That will be a further incentive to renew certs on time.

SoftTalker 11 hours ago

Wait until they start expiring 47 days from issue (coming soon). Though maybe this will actually help, because it will happen often enough that you (a) won't completely forget how to deal with it and (b) will have more motivation to be proactive.