jillesvangurp 2 months ago

Opensearch and Elasticsearch do most/all of what this proposes. And then some.

The mistake many teams make is to worry about storage but not querying. Storing data is the easy part; querying is the hard part. Some columnar data format stored in S3 doesn't solve querying. You need some system that loads all those files, creates indices, or performs some map-reduce logic to get answers out of those files. If you get this wrong, things get really expensive very quickly.

What you indeed want is a database (probably a columnar one) that provides fast access and can query across your data efficiently at scale. That's not observability 2.0 but observability 101. Without that, you have no observability; you just have a lot of data that is hard to query and provides no observability unless you somehow manage to solve that. Yahoo figured that out some 20 years ago when they created Hadoop, HDFS, and all the rest.

The article is right to call out the fragmented landscape here. Many products only provide partial/simplistic solutions and they don't integrate well with each other.

I started out doing some of this stuff more than 10 years ago using Elasticsearch and Kibana. Grafana, the Kibana fork, didn't exist yet. This combination is still a good solution for logging, metrics, and traces. These days, Opensearch (the Elasticsearch fork) is a good alternative. Basically, the blob of JSON used in the article, with a nice mapping, would work fine in either. That's more or less what I did around 2014.

Create a data stream, define some lifecycle policies (data retention, rollups, archive/delete, etc.), and start sending data. Both Opensearch and Elasticsearch now have stateless versions that store data in S3 (or similar bucket-based storage), exactly like the article proposes. I'd recommend going with Elasticsearch; it's a bit richer in features, but Opensearch will do the job.
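For a concrete feel of that setup, here's a minimal sketch against Elasticsearch's REST API. It assumes a local instance on localhost:9200; the policy, template, and stream names are made up, and the OpenSearch equivalent (ISM instead of ILM) differs slightly:

    import requests

    ES = "http://localhost:9200"

    # Hypothetical lifecycle policy: roll over hot indices, delete after
    # 30 days (rollups/archiving omitted for brevity).
    requests.put(f"{ES}/_ilm/policy/logs-30d", json={
        "policy": {"phases": {
            "hot": {"actions": {"rollover": {"max_primary_shard_size": "50gb"}}},
            "delete": {"min_age": "30d", "actions": {"delete": {}}},
        }}
    }).raise_for_status()

    # Composable index template; "data_stream": {} turns matching writes
    # into a data stream with an auto-mapped @timestamp field.
    requests.put(f"{ES}/_index_template/logs-myapp", json={
        "index_patterns": ["logs-myapp-*"],
        "data_stream": {},
        "template": {"settings": {"index.lifecycle.name": "logs-30d"}},
    }).raise_for_status()

    # Start sending data; the first document creates the stream.
    requests.post(f"{ES}/logs-myapp-default/_doc", json={
        "@timestamp": "2025-06-01T12:00:00Z",
        "service": "checkout", "level": "error", "message": "payment failed",
    }).raise_for_status()

From there it's just the article's JSON blobs, one POST per event.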

This is not the only solution in this space but it works well enough.

piterrro 2 months ago | parent | next [-]

> The mistake many teams make is to worry about storage but not querying. Storing data is the easy part. Querying is the hard part. Some columnar data format stored in S3 doesn't solve querying. You need to have some system that loads all those files, creates indices or performs some map reduce logic to get answers out of those files.

That's a nice callout. There's a lack of awareness in our space that producing logs is one thing, but doing it at scale gets pretty tricky. Storing data for effective querying becomes crucial, and this is what the most popular OSS solutions seem to forget; their approach seems to be: we'll index everything and put it into memory for fast and efficient querying.

I'm currently building a storage system just for logs[1] (or any timestamped data, since you can store events too; whatever is written once and indexed by a timestamp) which focuses on data compression and query performance. There's just so much to squeeze if you think things through carefully and pay attention to details, and that can translate to massive savings. Seeing how much money is spent on observability tools at the company I'm currently working for (probably well over $500k per year across Datadog, Sumologic, New Relic, Sentry, and Observe) for approximately 40-50TB of data produced per month just amazes me. That data could easily be compressed to 2-3TB and stored for pennies on S3.

[1] https://logdy.dev/logdy-pro
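To illustrate the order of magnitude (a toy, not a benchmark; real ratios depend entirely on how repetitive the data is), even plain gzip on synthetic JSON log lines easily reaches ~10x:

    import gzip, random, time

    # 100k repetitive machine-generated log lines, roughly 10MB raw.
    lines = "".join(
        f'{{"ts":{time.time() + i:.3f},"level":"info","service":"api",'
        f'"msg":"GET /orders 200","duration_ms":{random.randint(1, 500)}}}\n'
        for i in range(100_000)
    ).encode()

    packed = gzip.compress(lines)
    print(f"raw {len(lines) / 1e6:.1f}MB -> gzip {len(packed) / 1e6:.1f}MB, "
          f"~{len(lines) / len(packed):.0f}x smaller")

Purpose-built columnar codecs do considerably better than this, which is where the 15-20x above comes from.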

valyala 2 months ago | parent | prev [-]

> The mistake many teams make is to worry about storage but not querying

The amount of logs, wide events, and traces which must be stored and queried is frequently measured in hundreds of terabytes or petabytes. A petabyte of data on S3 costs around $20000/month. Storage costs usually exceed compute costs at such a scale, so it is important to compress the ingested observability data efficiently: it occupies less disk space and cuts storage costs.

Efficient storage schemes also help improve the performance of heavy queries which need to process hundreds of terabytes of logs. For example, if you need to calculate the 95th percentile of request duration across hundreds of billions of nginx logs, the database needs to read all these logs. Query performance in this case is limited by storage read bandwidth, since these amounts of logs do not fit in RAM, so the processed logs cannot be cached there (either by the database itself or by the operating system page cache).

Let's estimate the time needed to process 100TB of logs on storage with 10GB/s read bandwidth (high-performance S3 and/or SSD): 100TB / 10GB/s = 10000 seconds = 2 hours 47 minutes. Such query performance is unacceptable in most cases.
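The arithmetic behind both numbers, for the record (the S3 price is the rough standard-tier ballpark used above, not a quote):

    # Ballpark figures from the estimates above.
    s3_usd_per_gb_month = 0.02
    print(f"1PB on S3: ~${s3_usd_per_gb_month * 1_000_000:,.0f}/month")

    data_gb, bandwidth_gb_s = 100_000, 10   # 100TB scanned at 10GB/s
    full_scan_s = data_gb / bandwidth_gb_s
    print(f"full scan: {full_scan_s:.0f}s (~{full_scan_s / 3600:.1f}h)")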

How to optimize the query performance in this case?

1. Compress the data stored on disk. Typical logs compress by 10x or more, so they occupy 10x less storage space than their actual size. In the case above this improves heavy query performance by 10x as well, to 1000 seconds, or around 17 minutes.

2. Use column-oriented storage, i.e. store the data for every column separately. This usually improves the compression ratio, since data from a single column has lower randomness compared to per-log-entry data. It also improves heavy query performance a lot, since the database can read only the data for the requested columns while skipping the rest. Suppose the request duration is stored in a uint64 field with nanosecond precision. Then 100 billion request durations can be stored in 100 billion * 8 bytes = 800GB, which can be read in 800GB / 10GB/s = 80 seconds on storage with 10GB/s read bandwidth. The query duration can be reduced even further if the column data is compressed.
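A toy illustration of why this works (plain Python, with a million rows standing in for 100 billion): the duration column is just a packed array of 8-byte integers, so a p95 query touches 8 bytes per row instead of the whole log entry:

    from array import array
    import random

    n = 1_000_000   # stand-in for 100 billion rows
    # The duration column alone: 8 bytes per row, nothing else is read.
    durations = array("Q", (random.randrange(1, 2_000_000_000) for _ in range(n)))
    print(f"column size: {durations.itemsize * n / 1e6:.0f}MB for {n:,} rows")
    # At 100e9 rows this is 100e9 * 8 bytes = 800GB, as above.

    p95 = sorted(durations)[int(0.95 * n)]
    print(f"p95 request duration: {p95}ns")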

There are other options which help optimize query performance over petabytes of logs (wide events); a toy sketch of the block-skipping idea follows the list:

- Partitioning the data by time (for example, storing it in per-day partitions). This improves performance for queries with time range filters (the majority of practical queries over logs / wide events) by efficiently skipping the partitions outside the requested time range.

- Using bloom filters for skipping blocks of logs that cannot contain a given rarely seen word / phrase (such as a trace_id, user_id, ip, etc). See https://itnext.io/how-do-open-source-solutions-for-logs-work... for details.

- Using min/max indexes for skipping blocks of logs whose value ranges exclude the requested numeric values.
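Here's that sketch: a per-block min/max summary plus a tiny Bloom filter over rare tokens decide whether a block has to be read at all. Toy code with made-up sizes, not a production filter:

    import hashlib
    import random

    class Bloom:
        def __init__(self, bits=8192, hashes=4):
            self.bits, self.hashes, self.v = bits, hashes, 0
        def _pos(self, item):
            d = hashlib.sha256(item.encode()).digest()
            return [int.from_bytes(d[4 * i:4 * i + 4], "big") % self.bits
                    for i in range(self.hashes)]
        def add(self, item):
            for p in self._pos(item):
                self.v |= 1 << p
        def maybe_has(self, item):   # False -> definitely absent
            return all(self.v >> p & 1 for p in self._pos(item))

    class Block:
        def __init__(self, rows):   # rows: (duration_ns, trace_id)
            self.rows = rows
            self.min = min(r[0] for r in rows)   # min/max index
            self.max = max(r[0] for r in rows)
            self.bloom = Bloom()
            for _, trace_id in rows:
                self.bloom.add(trace_id)

    def scan(blocks, trace_id=None, min_duration=None):
        for b in blocks:
            if trace_id and not b.bloom.maybe_has(trace_id):
                continue   # skip block: token cannot be in it
            if min_duration and b.max < min_duration:
                continue   # skip block: all values below threshold
            yield from (r for r in b.rows
                        if (not trace_id or r[1] == trace_id)
                        and (not min_duration or r[0] >= min_duration))

    blocks = [Block([(random.randrange(1, 10**9), f"trace-{i}-{j}")
                     for j in range(1000)]) for i in range(100)]
    hits = list(scan(blocks, trace_id="trace-42-7"))
    print(f"found {len(hits)} row(s); only ~1 of {len(blocks)} blocks read")

Time-based partition pruning from the first bullet is the same trick with min/max over the timestamp column.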

All these techniques are implemented in VictoriaLogs, so it achieves high performance even in a single-node setup running on your laptop / Raspberry Pi. See https://docs.victoriametrics.com/victorialogs/