> Alerts should be actionable. If no action can or should be taken, then the alert is not needed.

Also, the best alerts come from looking at actual failures you had, not from trying to make up "good alerts" from thin air. After you have an outage, figure out what alerts would have caught it, and implement those.

muvlon 2 hours ago | parent | next [-]

This is one category of good alerts, but not everything.

I think alerts are to ops what tests are to dev. You have "unit alerts" for some small thing like the disk usage on a single host, "integration alerts" like literally "does the page load?" and then what you describe are "regression alerts", trying to prevent something that went wrong once from going wrong again. These are great but just like you wouldn't have 100% regression tests, I think it's also smart to try to get ahead of failures and have some common sense alerts defined.
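
To make the "unit alert" idea concrete, here is a minimal Python sketch of a disk-usage check on a single host. The function name `disk_alert`, the 90% default threshold, and the optional `usage` argument are assumptions made for illustration; only `shutil.disk_usage` is a real standard-library call.

```python
import shutil


def disk_alert(path="/", threshold=0.90, usage=None):
    """Return an alert message if the disk holding `path` is at or above
    `threshold` (a fraction of total space), otherwise None.

    `usage` is a hypothetical hook for testing: a (total, used, free) tuple
    in bytes. When omitted, the real filesystem is queried.
    """
    if usage is None:
        usage = shutil.disk_usage(path)  # named tuple: (total, used, free)
    total, used, _free = usage
    if total == 0:
        return None  # nothing meaningful to report on an empty filesystem
    fraction_used = used / total
    if fraction_used >= threshold:
        return f"disk {path} is {fraction_used:.0%} full (threshold {threshold:.0%})"
    return None
```

An "integration alert" in this analogy would instead probe the service end to end, for example by requesting the page and checking the response.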

perarneng 3 hours ago | parent | prev | next [-]

"looking at actual failures you had"

Looking at failures others had, and at prior experience from yourself and others, also contributes to good alerts. You don't have to wait for a failure to implement most of them. Most of that knowledge is also trained into most LLMs nowadays. Just ask, verify the sources, then implement. If you get too many alerts, question whether you needed them or whether it's noise. It's constant trimming until you find the perfect alert setup.

esafak 5 hours ago | parent | prev [-]

I know something is going to happen if disk space runs out; I don't need to experience it first.

stackskipton 4 hours ago | parent [-]

Sure, but for every alert, there is an exception.

Elasticsearch, for example, can be configured using ILM policies to fill up the disk and then start deleting old records. I don't need to be woken up for the disk filling up on those nodes.

Even worse is CPU/RAM alerts.

ajanuary an hour ago | parent | next [-]

The number of times I've had to explain how the JVM heap works...

esafak 4 hours ago | parent | prev [-]

Alerts are for when things don't go as expected. You set up log rotation but an agent quietly breaks it or ES introduces a bug in it.
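
One way to reconcile the two positions above is to alert on the broken expectation rather than on the raw metric. The following is only a sketch under assumed names: `HostStatus`, the `self_managed_disk` flag, and the 24-hour cleanup window are all hypothetical and not taken from any real monitoring tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HostStatus:
    name: str
    disk_fraction_used: float
    # Hypothetical flag: retention on this host is expected to fill the disk
    # and then trim old data (e.g. a node with a delete-on-age policy).
    self_managed_disk: bool = False
    # Hours since retention/cleanup last ran; None means "never observed".
    hours_since_last_cleanup: Optional[float] = None


def evaluate(host: HostStatus,
             disk_threshold: float = 0.90,
             max_cleanup_gap_hours: float = 24.0) -> List[str]:
    """Return the alerts that should fire for one host."""
    alerts = []
    if host.self_managed_disk:
        # A full disk is the expected state here, so it is not alert-worthy.
        # The unexpected state is cleanup quietly no longer running.
        gap = host.hours_since_last_cleanup
        if gap is None or gap > max_cleanup_gap_hours:
            alerts.append(f"{host.name}: retention cleanup has not run in the expected window")
    elif host.disk_fraction_used >= disk_threshold:
        alerts.append(f"{host.name}: disk {host.disk_fraction_used:.0%} full")
    return alerts
```

With this shape, a node that is 97% full but trimmed two hours ago stays quiet, while the same node with cleanup stalled for three days pages someone.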