Remix.run Logo
piterrro 3 days ago

> The mistake many teams make is to worry about storage but not querying. Storing data is the easy part. Querying is the hard part. Some columnar data format stored in S3 doesn't solve querying. You need to have some system that loads all those files, creates indices or performs some map reduce logic to get answers out of those files.

That's a nice callout, there's a lack of awareness in our space that producing logs is one thing, but if you do it on a scale, this stuff gets pretty tricky. Storing for effective query becomes crucial and this is what most popular OSS solutions seem to forget and their approach seem to be: we'll index everything and put it into memory for fast and efficient querying.

I'm currently building a storage system just for logs[1] (or timestamped data because you can store events too, whatever you like that is written once and is indexed by a timestamp) which focuses on: data compression and query performance. There's just so much to squeeze if you think things carefully and pay attention to details. This can translate to massive savings. Seeing how much money is spent on observability tools at the company I'm currently working for (they probably spend well over 500k $ per year on: datadog, sumologic, newrelic, sentry, observe) for approximately 40-50TB of data produced per month - it just amazes me. The data could be compressed to like 2-3TB easily and stored for pennies on S3.

[1] https://logdy.dev/logdy-pro