| ▲ | zaptheimpaler 2 months ago |
| At my company we seem to have moved a little in the opposite direction of observability 2.0. We moved away from the paid observability tools to something built on OSS with the usual split between metrics, logs and traces. It seems to be mostly for cost reasons. The sheer amount of observability data you can collect in wide events grows incredibly fast and most of it ends up never being read. It sucks but I imagine most companies do the same over time? |
|
| ▲ | wavemode 2 months ago | parent | next [-] |
| > The sheer amount of observability data you can collect in wide events grows incredibly fast and most of it ends up never being read. That just means you have to be smart about retention. You don't need permanent logs of every request that hits your application. (And, even if you do for some reason, archiving logs older than X days to colder, cheaper storage still probably makes sense.) |
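For the AWS case, a minimal sketch of the "archive after X days, expire later" idea using boto3 (bucket name, prefix, and day thresholds are placeholder assumptions):

```python
import boto3

s3 = boto3.client("s3")

# Tier logs to Glacier after 30 days, delete them after a year.
# Bucket name, prefix, and thresholds are illustrative placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-observability-logs",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-then-expire-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```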
| |
| ▲ | motorest 2 months ago | parent [-] | | > That just means you have to be smart about retention. It's not a problem of retention. It's a problem caused by the sheer volume of data. Telemetry data must be stored for over N days in order to be useful, and if you decide to track telemetry data of all types involved in "wide events" throughout this period then you need to make room to persist it. If you're bundling efficient telemetry types like metrics with data-intensive telemetry like logs into events, then the data you need to store quickly adds up. | | |
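To make "quickly adds up" concrete, a back-of-envelope sketch; the traffic rate, event size, and retention window are illustrative assumptions:

```python
# Back-of-envelope: how fast wide events accumulate.
# All numbers are assumptions for illustration.
requests_per_second = 2_000
bytes_per_wide_event = 2_048          # ~2 KB per wide event
retention_days = 30

daily_bytes = requests_per_second * 86_400 * bytes_per_wide_event
total_bytes = daily_bytes * retention_days

print(f"per day:  {daily_bytes / 1e9:,.0f} GB")   # ~354 GB/day
print(f"30 days:  {total_bytes / 1e12:,.1f} TB")  # ~10.6 TB, before replication/indexing
```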
| ▲ | killme2008 2 months ago | parent [-] | | Agree. The new wide-event pipeline should fully utilize cheaper storage options, i.e. object storage like S3, for both hot and cold data while maintaining performance. | | |
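One way such a pipeline can keep object storage cheap is to batch events under date-partitioned keys, so that age-based tiering (like the lifecycle rule above) and time-bounded queries only touch the relevant prefixes. A sketch with an assumed bucket and key layout:

```python
import gzip
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def flush_batch(events: list[dict]) -> None:
    """Write a batch of wide events as gzipped NDJSON under a date-partitioned key."""
    now = datetime.now(timezone.utc)
    # Key layout is an assumption: logs/dt=YYYY-MM-DD/hour=HH/<timestamp>.ndjson.gz
    key = f"logs/dt={now:%Y-%m-%d}/hour={now:%H}/{now:%H%M%S%f}.ndjson.gz"
    body = gzip.compress("\n".join(json.dumps(e) for e in events).encode())
    s3.put_object(Bucket="example-observability-logs", Key=key, Body=body)
```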
| ▲ | gchamonlive 2 months ago | parent | next [-] | | I'm totally in favor of cold storage. Just beware of how you are storing the data, the granularity of the files, and how frequently you think you'll want to access that data in the future, because what kills you in these services is the API cost. Oh, and deleting data also triggers API costs AFAIK, so there is that too... | | |
| ▲ | thewisenerd 2 months ago | parent [-] | | deleting data has a cost. deleting data early after moving it to cold storage has additional costs. |
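A sketch of why early deletion costs extra: archive tiers typically bill a minimum storage duration, so deleting before it elapses charges you for the remainder. The prices and durations below are assumptions, not quoted rates:

```python
# Illustrative early-deletion math; check your region's current pricing.
gb_stored = 5_000                 # 5 TB moved to an archive tier
price_per_gb_month = 0.0036       # assumed archive-tier storage price
minimum_storage_days = 90         # assumed minimum storage duration
deleted_after_days = 30

# Pro-rated charge for the unserved remainder of the minimum duration.
remaining_days = minimum_storage_days - deleted_after_days
early_delete_fee = gb_stored * price_per_gb_month * (remaining_days / 30)
print(f"early-deletion charge: ~${early_delete_fee:,.0f}")  # ~$36 for the remaining 60 days
```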
| |
| ▲ | valyala 2 months ago | parent | prev [-] | | HDD-based persistent disks usually have much lower IO latency compared to S3 (microseconds vs hundreds of milliseconds). This can improve query performance a lot. sc1 HDD-based volumes are cheaper than S3, while st1-based volumes are only 2x more expensive than S3 ( https://aws.amazon.com/ebs/pricing/ ). So there is little economic sense in using S3 over HDD-based persistent volumes. |
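Those ratios roughly line up with published list prices; a quick comparison (treat the prices as a snapshot assumption and verify against the pricing page above):

```python
# Rough monthly storage cost per TB (example us-east-1-style list prices; verify current rates).
prices_per_gb_month = {
    "S3 Standard": 0.023,
    "EBS st1 (throughput HDD)": 0.045,
    "EBS sc1 (cold HDD)": 0.015,
}
for name, price in prices_per_gb_month.items():
    print(f"{name:28s} ${price * 1024:,.0f} / TB-month")
# S3 ~ $24, st1 ~ $46 (about 2x), sc1 ~ $15 (cheaper than S3)
```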
|
|
|
|
| ▲ | NitpickLawyer 2 months ago | parent | prev | next [-] |
| > The sheer amount of observability data you can collect in wide events grows incredibly fast and most of it ends up never being read. Yes! I know of at least 3 anecdotal "oh shit" stories of teams being chewed out by upper management when bills from SaaS observability tools get into the hundreds of thousands because of logging. Turns out that uploading a full stack dump on error can lead to TBs of data that, as you said, most likely no-one will look at ever again. |
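A sketch of how "stack dump per error" reaches that scale; every number here, including the per-GB price, is an illustrative assumption rather than any vendor's actual rate:

```python
# Why "stack dump per error" bills explode. All numbers are illustrative assumptions.
requests_per_day = 200_000_000
error_rate = 0.05                    # a bad deploy erroring on 5% of requests
dump_size_bytes = 512 * 1024         # ~512 KB dump attached to each error
price_per_gb = 2.00                  # assumed SaaS ingest + indexing price

daily_gb = requests_per_day * error_rate * dump_size_bytes / 1e9
print(f"{daily_gb / 1e3:,.1f} TB/day -> ~${daily_gb * 30 * price_per_gb:,.0f}/month")
# ~5.2 TB/day -> ~$315,000/month
```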
| |
| ▲ | incangold 2 months ago | parent [-] | | I agree with the broad point: as an industry we still fail to think of logging as a feature to be specified and tested like everything else. We use logging frameworks to indiscriminately and redundantly dump everything we can think of, instead of adopting a pattern of apps and libraries that produce thoughtful, structured event streams. It's too easy to just chuck another log.info in; having to consider the type and information content of an event results in lower volumes and higher quality of observability data. A small nitpick, but having loads of data that "most likely no-one will look at ever again" is OK to an extent, for the data that are there to diagnose incidents. It's not useful most of the time, until it's really, really useful. But it's a matter of degree, and dumping the same information redundantly is pointless and infuriating. This is one reason why it's nice to create readable specs from telemetry, with traces/spans initiated from test drivers and passed through the stack (rather than trying to make natural language executable the way Cucumber does; that's a lot of work and complexity for non-production code). Then our observability data get looked at many times before there's a production incident, in order to diagnose test failures. And hopefully the attributes we added to diagnose tests are also useful for similar diagnostics in prod. | |
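A sketch of the "spans initiated from test drivers" idea using the OpenTelemetry Python API; the test name, attributes, and place_order call are hypothetical:

```python
from opentelemetry import trace

tracer = trace.get_tracer("acceptance-tests")

def test_checkout_applies_coupon():
    # The root span carries test identity; instrumented HTTP clients propagate
    # the trace context into the services under test, so a failing assertion
    # can be diagnosed from the same telemetry used in production.
    with tracer.start_as_current_span("test.checkout_applies_coupon") as span:
        span.set_attribute("test.suite", "checkout")
        span.set_attribute("test.case", "applies_coupon")
        result = place_order(coupon="WELCOME10")   # hypothetical call into the app
        span.set_attribute("order.total", result.total)
        assert result.total == 90.00
```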
| ▲ | openWrangler 2 months ago | parent [-] | | I'm currently working with Coroot, an open source project trying to solve exactly this problem: logs and other telemetry sources produce more data than any team can reasonably parse manually. Data is automatically collected using eBPF, and Coroot provides root-cause-analysis insights (with things like mapped incident timeframes) to help with anything overlooked in dumps. GitHub here, hope the tool can help some folks in this thread: https://github.com/coroot/coroot |
|
|
|
| ▲ | kushalkamra 2 months ago | parent | prev | next [-] |
| you're correct, i believe. we can identify patterns and highlight the variations, so this data can be put to good use. by aggregating the historical data beyond a certain point, we can also reduce its volume. |
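A minimal sketch of that aggregation idea: rolling raw events up into hourly summaries once they age out of the full-detail window (field names are illustrative assumptions):

```python
from collections import defaultdict

def rollup_hourly(events: list[dict]) -> list[dict]:
    """Collapse raw request events into per-hour, per-endpoint aggregates."""
    buckets: dict[tuple, dict] = defaultdict(
        lambda: {"count": 0, "error_count": 0, "total_ms": 0.0}
    )
    for e in events:
        # ISO timestamp truncated to "YYYY-MM-DDTHH" gives hour granularity.
        key = (e["timestamp"][:13], e["service"], e["endpoint"])
        b = buckets[key]
        b["count"] += 1
        b["error_count"] += e["status"] >= 500
        b["total_ms"] += e["duration_ms"]
    return [
        {"hour": hour, "service": svc, "endpoint": ep,
         "count": b["count"],
         "error_rate": b["error_count"] / b["count"],
         "avg_ms": b["total_ms"] / b["count"]}
        for (hour, svc, ep), b in buckets.items()
    ]
```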
|
| ▲ | magic_hamster 2 months ago | parent | prev [-] |
| Should be easily solved with some kind of retention policy. |