hinkley 4 hours ago

One of the problems described here seems to be that the people building the dashboards aren't the ones adding the instrumentation. Admittedly I've only worked on one project that was all in on telemetry instead of log analysis, and even that one had one foot in Splunk and one in Grafana. But I worked there long enough to see that we mostly only kept telemetry for charts that at least someone on call used regularly. I got most of those out of Splunk, which wasn't that hard; we hadn't bought enough horsepower from them to keep it from jamming up whenever too many people got involved in diagnosing production issues.

Occasionally I convinced them that certain charts were wrong and moved them to other stats to answer the same question, and some of those could go away.

I also wrote a little tool to extract all the stats from our group's dashboards so we could compare what was being used against what was being generated, and I cut, I'd say, about a third? Which is in line with his anecdote. I then gave the tool to Ops and announced it at my skip level's staff meeting so other people could do the same.
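
A minimal sketch of what such a tool might look like, assuming Grafana-style dashboard JSON exports and Prometheus-flavoured metric names (the file names and layout here are hypothetical, not the actual script):

    # Hypothetical sketch, not the commenter's actual tool: list emitted metrics
    # that no dashboard panel references. Assumes Grafana-style dashboard JSON
    # exports and a plain-text file of emitted metric names, one per line.
    import json
    import re
    import sys
    from pathlib import Path

    # Over-matches (it also picks up PromQL function names), which only errs on
    # the side of keeping a metric.
    METRIC_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]+")

    def metrics_in_dashboard(path: Path) -> set[str]:
        """Collect candidate metric names from every panel's query expression."""
        dashboard = json.loads(path.read_text())
        names: set[str] = set()
        for panel in dashboard.get("panels", []):
            for target in panel.get("targets", []):
                expr = target.get("expr") or target.get("query") or ""
                names.update(METRIC_RE.findall(expr))
        return names

    def main() -> None:
        # Usage: python unused_metrics.py emitted_metrics.txt dashboards/*.json
        emitted = set(Path(sys.argv[1]).read_text().split())
        used: set[str] = set()
        for dashboard_file in sys.argv[2:]:
            used |= metrics_in_dashboard(Path(dashboard_file))
        unused = sorted(emitted - used)
        print(f"{len(unused)} of {len(emitted)} emitted metrics appear in no dashboard:")
        print("\n".join(unused))

    if __name__ == "__main__":
        main()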

srean 3 hours ago | parent | next [-]

This.

I also think that a lot of the waste can be done away with by using application-specific codecs. Yes, even gzip compresses logs and metrics by a lot, but one can go further with specialized codecs that home in on the redundancy much faster than a generic lossless compressor eventually would.
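
As a toy illustration of what "application specific" can buy here (assuming log lines that differ only in numeric fields; real log-template codecs are far more careful):

    # Toy example of an application-specific codec: split each log line into a
    # reusable template plus its variable fields, so the template is stored once
    # in a dictionary instead of being rediscovered by a generic compressor on
    # every line. Only plain numbers are treated as variables here.
    import re

    VARIABLE_RE = re.compile(r"\d+(?:\.\d+)?")

    def encode(line: str, templates: dict[str, int]) -> tuple[int, list[str]]:
        """Return (template_id, variable_fields) for one log line."""
        variables = VARIABLE_RE.findall(line)
        template = VARIABLE_RE.sub("<*>", line)
        template_id = templates.setdefault(template, len(templates))
        return template_id, variables

    templates: dict[str, int] = {}
    print(encode("request 4821 served in 12.7 ms", templates))  # (0, ['4821', '12.7'])
    print(encode("request 4822 served in 9.3 ms", templates))   # (0, ['4822', '9.3'])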

However, to build these, one can't have a "throw it over the 3rd-party wall" mode of development.

One way to do this for stable services would be to build high-fidelity (mathematical/statistical) models for the logs and metrics, then serialize only what is non-redundant. This applies particularly well to numeric data, where gzip does not do as well. What we need is the analogue of JPEG for logs and metrics.
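
A minimal sketch of the model-then-serialize-the-residual idea for numeric streams, with a made-up drifting gauge and the simplest possible predictor:

    # Minimal sketch of "model first, then serialize the residual" for numeric
    # metrics. The model here is just "the next sample equals the previous one";
    # the residuals of a slowly drifting gauge are then nearly constant and
    # compress far better than the raw values. The sample data is made up.
    import struct
    import zlib

    def pack(values: list[float]) -> bytes:
        return struct.pack(f"<{len(values)}d", *values)

    def delta_encode(samples: list[float]) -> bytes:
        """Store the first value plus each sample's difference from its predecessor."""
        residuals = [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]
        return pack(residuals)

    # A gauge drifting slowly upward from 250.0.
    samples = [250.0 + 0.01 * i for i in range(10_000)]
    print("raw + zlib:  ", len(zlib.compress(pack(samples))))
    print("delta + zlib:", len(zlib.compress(delta_encode(samples))))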

At my workplace there has been political buy-in on the idea that if a log/metric stream has not been used in 2-3 years, then throw it away and stop collecting it. This rubs me the wrong way, because so many times I have wished there was some historic data for my data-science project. You never know what data you might need in the future. You do, however, know that you do not need redundant data.

binarylogic 3 hours ago | parent | prev [-]

What you're describing is very real and it works to a degree. I've seen this same manual maintenance play out over and over for 10 years: cleaning dashboards, chasing engineers to align on schemas, running cost exercises. It never gets better, only worse.

It's nuts to me that after a decade of "innovation," observability still feels like a tax on engineers. Still a huge distraction. Still requires all this tedious maintenance. And I genuinely think it's rooted in vendor misalignment. The whole industry is incentivized to create more, not give you signal with less.

The post focuses on waste, but the other side of the coin is quality. Removing waste is part of that, but so is aligning on schemas, adhering to standards, catching mistakes before they ship. When data quality is high and stays high automatically, everything you're describing goes away.

That's the real goal.