gmuslera 2 hours ago

This reminded me of a note I heard about backups. You don't want backups: they are a waste of time, bandwidth and disk space, and most if not all of them will end up being discarded without ever being used. What you really want is something to restore from if anything breaks. That is the cost that should matter to you. What if you don't have anything meaningful to restore from?

With observability, what matters is not the volume of data, time and bandwidth spent on it, but being able to understand your system and properly diagnose and solve problems when they happen. Can you do that with less? Even for the next problem, the one you don't know about yet? If you can't, because of information you lack or didn't collect, then maybe spending that much was still not enough.

Of course, some ways of doing it are more efficient (toward the end result) than others. But having the needed information available, even if it is never used, is the real goal here.

binarylogic 2 hours ago | parent [-]

I agree with the framing. The goal isn't less data for its own sake. The goal is understanding your systems and being able to debug when things break.

But here's the thing: most teams aren't drowning in data because they're being thorough. They're drowning because no one knows what's valuable and what's not. Health checks firing every second aren't helping anyone debug anything. Debug logs left in production aren't insurance, they're noise.

The question isn't "can you do with less?" It's "do you even know what you have?" Most teams don't. They keep everything just in case, not because they made a deliberate choice, but because they can't answer the question.

Once you can answer it, you can make real tradeoffs. Keep the stuff that matters for debugging. Cut the stuff that doesn't.
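
A rough way to start answering "what do you have?" is to tally volume per category. This is only a sketch: the `/healthz` path, the ` DEBUG ` level marker, the sample lines and the `volume_by_category` helper are all made up for illustration, and a real inventory would depend on your own log formats.

```python
from collections import Counter

def volume_by_category(lines):
    """Tally bytes per rough category so you can see what dominates."""
    totals = Counter()
    for line in lines:
        # Hypothetical matching rules; adapt to your own formats.
        if "/healthz" in line:
            category = "health_check"
        elif " DEBUG " in line:
            category = "debug"
        else:
            category = "other"
        totals[category] += len(line.encode("utf-8"))
    return totals

# Invented sample lines, not real production data.
sample = [
    "2024-01-01T00:00:00Z INFO GET /healthz 200",
    "2024-01-01T00:00:01Z INFO GET /healthz 200",
    "2024-01-01T00:00:01Z DEBUG cache miss key=user:42",
    "2024-01-01T00:00:02Z ERROR payment failed order=1001",
]

report = volume_by_category(sample)
# Print categories from largest to smallest byte count.
for category, size in report.most_common():
    print(f"{category}: {size} bytes")
```

Even something this crude shows which categories dominate by size, which is the input you need before deciding what to keep and what to cut.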

gmuslera 2 hours ago | parent [-]

There is a lot of crap that is and always will be useless when debugging a problem. But there is also a lot that you don't know whether you will need, at least not yet, not when you are defining what information to collect, and that may become essential when something in particular (usually unexpected) breaks. And then you won't have the past data you didn't collect.

You can take a discovery path: can the data you collect explain how and why the system is running the way it is now? Are there things that are just not relevant, both when things are normal and when they are not? Understanding the system, and all its moving parts, is a good guide for tuning what you collect, what you should not, and what the missing pieces are. And keep cycling through that, because your understanding and your system will keep changing.