> the failure mode is the opposite of graceful degradation. It’s not like there’s an increasing percentage of requests that fail as you get closer to the deadline. Instead, in one minute, everything’s working just fine, and in the next minute, every http request fails.

This has given me some interesting food for thought. I wonder how feasible it would be to create a toy webserver that does exactly this (fails an increasing percentage of requests as the deadline approaches)? My thought would be to start failing some requests once the certificate is close enough to expiry that most people would consider it "far too late" (e.g. 4 hours before `notAfter`). From that point on, respond to a growing percentage of requests with a custom HTTP status code (599 for the sake of example).

Probably a lot less useful than just monitoring each webserver endpoint's TLS cert using synthetics, but it's given me an idea for a fun project if nothing else.
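
A minimal sketch of the ramp described above, assuming a linear increase from 0% to 100% failures over the last 4 hours before `notAfter`. The names `failure_probability` and `pick_status` are hypothetical, the linear shape is just one possible choice, and this is only the decision logic, not a webserver:

```python
import random
from datetime import datetime, timedelta, timezone

RAMP_WINDOW = timedelta(hours=4)  # assumption: the ramp starts 4 hours before notAfter
FAILURE_STATUS = 599              # the made-up status code from the comment


def failure_probability(now, not_after, window=RAMP_WINDOW):
    """Fraction of requests to fail: 0.0 outside the window, rising linearly to 1.0 at not_after."""
    remaining = not_after - now
    if remaining >= window:
        return 0.0
    if remaining <= timedelta(0):
        return 1.0
    # timedelta / timedelta yields a float in Python 3
    return 1.0 - remaining / window


def pick_status(now, not_after, rng=random.random):
    """Return 599 for the chosen fraction of requests, 200 otherwise."""
    if rng() < failure_probability(now, not_after):
        return FAILURE_STATUS
    return 200


if __name__ == "__main__":
    not_after = datetime.now(timezone.utc) + timedelta(hours=1)
    print(failure_probability(datetime.now(timezone.utc), not_after))
```

A request handler would call `pick_status` once per request and short-circuit with the error when it returns 599.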

loloquwowndueo 13 hours ago

Your idea shifts monitoring to end users, which doesn’t sound awesome.

Just check the expiration of the active certificate and alert if the time remaining is under a threshold. Say 1 week, assuming you auto-renew at 3 weeks to expiry: still serving a cert with 1 week left is enough signal that something went wrong.

Then you just need to test that your alerting system is reliable. No need to use your users as canaries.
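
A rough sketch of that check using only the Python standard library. The function names and the 7-day default are assumptions for illustration; a real setup would run this on a schedule and hand the result to whatever alerting system is in place:

```python
import socket
import ssl
import time

ALERT_THRESHOLD_DAYS = 7  # the "say 1 week" threshold from the comment


def fetch_not_after(host, port=443, timeout=5.0):
    """Return the notAfter string of the certificate the server is currently serving."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_remaining(not_after, now=None):
    """Days between now (epoch seconds) and a notAfter string like 'Jan  5 09:34:43 2018 GMT'."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400


def should_alert(not_after, now=None, threshold_days=ALERT_THRESHOLD_DAYS):
    """True when the certificate has fewer than threshold_days left."""
    return days_remaining(not_after, now) < threshold_days


if __name__ == "__main__":
    host = "example.com"  # placeholder host
    print(host, should_alert(fetch_not_after(host)))
```

Note that the default context verifies the certificate, so an already expired cert makes the handshake fail with an exception; a monitor should treat that as an alert too.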

thecosmicfrog 13 hours ago

Oh absolutely, I wouldn't use this for any production system. It would be a toy hobby project. I just find the notion of turning a no-degradation failure mode into a gradual-degradation one fascinating for some reason.

johannes1234321 13 hours ago

For a fun project it certainly is a fun idea.

In real life, I guess there are people who don't monitor at all. For them, failing requests would go unnoticed ... and for the others, monitoring must be easy.

But I think the core thing might be to make monitoring SSL certificate lifetime the "obvious" default: all the Grafana dashboards etc. should have such an entry.

Then as soon as I set up a monitoring stack, I get that reminder as well.

firesteelrain 13 hours ago

This canary is a good thought. The main problem the article highlights is that people don’t practice updates enough and assume someone or something else is handling it. You only get better at it the more often it happens, which is partly why long expirations are not ideal.

loloquwowndueo 13 hours ago

It’s not a good thought. Run a single client (Uptime Kuma) and ask it to alert you on expiration proximity, i.e. implement proper monitoring and alerting. No need to randomly degrade your users’ experience and hope they’ll notify you instead of shrugging and going to a site that doesn’t throw made-up HTTP errors at them randomly.

firesteelrain 13 hours ago

If a “canary” is degrading users, it’s misdesigned.

The canary narrows the blast radius and time-to-detection.

loloquwowndueo 12 hours ago

Agreed. That’s exactly what the proposed canary is - misdesigned.