iw7tdb2kqo9 a day ago

I haven't worked at ClickHouse-level scale.

Can you search log data at this volume? Elasticsearch has query capabilities for small-scale log data, I think.

Why would I use ClickHouse instead of storing historical log data as JSON files?

munchbunny a day ago | parent | next [-]

> Can you search log data at this volume?

(Context: I work at this scale)

Yes. However, as you can imagine, the processing costs can be enormous. If your indexing/ordering/clustering strategy isn't set up well, a single query can easily end up costing you on the order of $1-$10 to do something as simple as "look for records containing this string".

My experience lines up with theirs: at the scale where you are moving petabytes of data, the best optimizations are, unsurprisingly, "touch as little data as few times as possible" and "move as little data as possible". Every serialize/deserialize step and every round of disk/network I/O adds a lot of performance cost, and therefore cost to your wallet.
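To make that concrete, here's a rough sketch of what "set up well" buys you in ClickHouse terms. The table name, columns, and layout are my own assumptions, not anything from the thread: a query that filters on the table's ordering key first can skip most of the data before it ever looks at the message text, while a bare substring search has to read and decompress everything.

    # Sketch only: compares a pruned query against a full scan, assuming a
    # hypothetical `logs` table with ORDER BY (service, timestamp).
    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost")

    # Pruned: the filters match the ordering key, so most granules are skipped
    # before the engine ever touches the `message` column.
    cheap = client.query("""
        SELECT count()
        FROM logs
        WHERE service = 'checkout'
          AND timestamp >= now() - INTERVAL 1 HOUR
          AND message LIKE '%timeout%'
    """)

    # Unpruned: nothing to skip on, so the whole message column gets read and
    # decompressed; this is the $1-$10 query.
    expensive = client.query("SELECT count() FROM logs WHERE message LIKE '%timeout%'")

    print("pruned matches:", cheap.result_rows[0][0])
    print("full-scan matches:", expensive.result_rows[0][0])
    # system.query_log's read_rows / read_bytes columns show how much data
    # each query actually touched.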

Naturally, this can put OTel directly at odds with efficiency, because the OTel collector is an extra I/O and serialization hop. But then again, if you operate at petabyte scale, the money you save by eliminating a single hop can more than pay for an engineer whose only job is to write serializer/deserializer logic.

gnaman 7 hours ago | parent [-]

How do engineers troubleshoot, then? Our engineers would throw up their hands if they were told not to dig through two months' worth of logs for a single issue.

munchbunny 4 hours ago | parent [-]

In practice, at the scale I work at, it's barely feasible to scan one week of logs, let alone two months, because you'll be waiting hours for the result. So you learn strategies that only require scanning a subset of the logs at a time.
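For example (the schema and names here are hypothetical, just to illustrate the idea): instead of grepping weeks of raw logs, you narrow on the cheap, well-ordered columns first and only then pull the full detail.

    # Sketch of the "scan a subset" approach, assuming a hypothetical `logs`
    # table ordered by (service, timestamp): narrow first, then widen.
    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost")

    # The inner query touches one service, one hour, errors only: a tiny slice.
    # The outer query then pulls the full trail for just those requests, still
    # bounded to a few hours instead of two months.
    rows = client.query("""
        SELECT timestamp, service, level, message
        FROM logs
        WHERE request_id IN (
            SELECT DISTINCT request_id
            FROM logs
            WHERE service = 'checkout'
              AND timestamp BETWEEN '2024-05-01 14:00:00' AND '2024-05-01 15:00:00'
              AND level = 'ERROR'
        )
          AND timestamp BETWEEN '2024-05-01 13:00:00' AND '2024-05-01 16:00:00'
        ORDER BY timestamp
    """).result_rows

    for ts, service, level, message in rows:
        print(ts, service, level, message)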

valyala 7 hours ago | parent | prev | next [-]

> Why would I use ClickHouse instead of storing historical log data as JSON files?

There are multiple reasons:

1. Databases optimized for logs (such as ClickHouse or VictoriaLogs) store logs in compressed form, where the values of each log field are grouped and compressed separately (aka column-oriented storage). This takes far less storage space than plain JSON log files, even compressed ones (see the sketch after this list).

2. Databases optimized for logs run typical queries much faster than grep over JSON files. The gains can be 1000x or more, because these databases skip reading data they don't need. See https://chronicles.mad-scientist.club/tales/grepping-logs-re...

3. How are you going to grep 100 petabytes of JSON files? Databases optimized for logs can query that volume because they scale horizontally by adding more storage nodes and storage space.
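A minimal sketch of points 1 and 2 (the table name, columns, and codec choices are illustrative assumptions, not taken from the article): every field is stored as its own column with its own compression, and the partitioning/ordering keys are what let queries skip data. For point 3, the same schema would typically sit behind a Distributed table spread over many shards.

    # Sketch: a column-oriented ClickHouse log table where each field is
    # compressed on its own. Names and codec choices are illustrative.
    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost")

    client.command("""
        CREATE TABLE IF NOT EXISTS logs
        (
            timestamp DateTime64(3) CODEC(Delta, ZSTD(3)),  -- timestamps delta-encode well
            service   LowCardinality(String),               -- dictionary-encoded, tiny on disk
            level     LowCardinality(String),
            message   String CODEC(ZSTD(3))                 -- similar lines compress together
        )
        ENGINE = MergeTree
        PARTITION BY toDate(timestamp)   -- queries can skip whole days
        ORDER BY (service, timestamp)    -- and skip granules within a day
    """)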

sethammons a day ago | parent | prev | next [-]

Scale and cost. We're facing this logging scale at my work. A naive "push JSON into Splunk" approach would cost us over $6M/year, but I can only get maybe 5-10% of that approved.

In the article, they talk about needing 8k CPUs to process their JSON logs, but only 90 CPUs afterward.

h1fra a day ago | parent | prev [-]

A couple of years ago ClickHouse wasn't that good at full-text search; to me that was the biggest drawback. Yes, it's faster and can handle ES-level scale, but depending on your use case it can be much faster to query ES when you're doing FTS or grouping without a pre-built index.
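For what it's worth, the usual ClickHouse workaround is a token bloom-filter skipping index plus hasToken(). It's not a real inverted index like ES has, but it lets whole-word searches skip most blocks. A rough sketch, with the index name and tuning parameters as illustrative assumptions:

    # Sketch: approximate full-text search in ClickHouse via a token bloom
    # filter skipping index. Index name and parameters are illustrative.
    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost")

    # Store a bloom filter of tokens per block of rows, so whole-word searches
    # can skip blocks that definitely don't contain the word.
    client.command("""
        ALTER TABLE logs
        ADD INDEX message_tokens message TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 4
    """)
    # Existing data isn't indexed until you run:
    # ALTER TABLE logs MATERIALIZE INDEX message_tokens

    # hasToken() matches whole tokens and can use the index; LIKE '%foo%' mostly can't.
    result = client.query(
        "SELECT timestamp, message FROM logs"
        " WHERE hasToken(message, 'OutOfMemoryError') LIMIT 20"
    )
    for ts, msg in result.result_rows:
        print(ts, msg)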

valyala 7 hours ago | parent [-]

How much RAM does Elasticsearch need for fast full-text search over 100 petabytes of logs? 100 petabytes is 100 million gigabytes, just in case.