iw7tdb2kqo9 | a day ago
I haven't worked at ClickHouse-level scale. Can you search log data at this volume? ElasticSearch has query capabilities for small-scale log data, I think. Why would I use ClickHouse instead of storing historical log data as JSON files?
munchbunny | a day ago
> Can you search log data at this volume?

(Context: I work at this scale.) Yes. However, as you can imagine, the processing costs can be enormous. If your indexing/ordering/clustering strategy isn't set up well, a single query can easily cost on the order of $1-$10 to do something as simple as "look for records containing this string".

My experience lines up with theirs: at the scale where you are moving petabytes of data, the best optimizations are, unsurprisingly, "touch as little data as few times as possible" and "move as little data as possible". Every time you serialize/deserialize, and every time you perform disk/network I/O, you add significant performance cost and therefore cost to your wallet.

Naturally, this can put OTel directly at odds with efficiency, because the OTel collector is an extra I/O and serialization hop. But then again, if you operate at petabyte scale, the money you save by removing a single hop can more than pay for an engineer whose only job is to write serializer/deserializer logic.
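A minimal sketch of what "set up your ordering/clustering strategy well" can mean in ClickHouse: the table's ORDER BY key decides how much data a query has to touch. This assumes the clickhouse-driver Python package and a local server; the `logs` table and its columns are made up for illustration, not the commenter's actual setup.

```python
# Hypothetical sketch: the MergeTree ORDER BY (sorting) key controls how much
# data a query reads. Assumes clickhouse-driver and a ClickHouse server on
# localhost; the table and column names are illustrative only.
from clickhouse_driver import Client

client = Client(host="localhost")

# Sorting key (service, timestamp): filters on these columns let ClickHouse
# skip whole parts/granules instead of scanning everything.
client.execute("""
    CREATE TABLE IF NOT EXISTS logs (
        timestamp DateTime,
        service   LowCardinality(String),
        level     LowCardinality(String),
        message   String
    )
    ENGINE = MergeTree
    ORDER BY (service, timestamp)
""")

# Cheap: prunes by the sorting key and reads only the matching range.
client.execute("""
    SELECT count()
    FROM logs
    WHERE service = 'checkout'
      AND timestamp >= now() - INTERVAL 1 HOUR
""")

# Expensive: a bare substring match on a non-key column scans the whole
# `message` column -- the "$1-$10 for a simple string lookup" failure mode.
client.execute("""
    SELECT count()
    FROM logs
    WHERE message LIKE '%connection reset%'
""")
```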
valyala | 7 hours ago
> Why would I use ClickHouse instead of storing historical log data as JSON files?

There are multiple reasons:

1. Databases optimized for logs (such as ClickHouse or VictoriaLogs) store logs in compressed form, where the values of every log field are grouped and compressed individually (aka column-oriented storage). This results in smaller storage space compared to plain files with JSON logs, even if those files are compressed.

2. Databases optimized for logs run typical queries much faster than grep over JSON files. Performance gains can be 1000x or more, because these databases skip reading unneeded data. See https://chronicles.mad-scientist.club/tales/grepping-logs-re...

3. How are you going to grep 100 petabytes of JSON files? Databases optimized for logs can handle such volumes because they scale horizontally by adding more storage nodes and storage space.
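A toy, self-contained illustration of point 1, using only the Python standard library: the same synthetic log records are compressed once as JSON lines (row-oriented) and once with each field's values grouped together (column-oriented). The record shape and the zlib codec are arbitrary choices; real log databases use far better codecs, so treat the numbers as directional only.

```python
# Compare row-oriented vs column-oriented compression of the same log records.
import json
import random
import zlib

random.seed(0)
services = ["checkout", "payments", "search", "auth"]
levels = ["INFO", "WARN", "ERROR"]

records = [
    {
        "ts": 1700000000 + i,
        "service": random.choice(services),
        "level": random.choice(levels),
        "message": f"request {random.randrange(10_000)} finished in {random.randrange(500)} ms",
    }
    for i in range(100_000)
]

# Row-oriented: one JSON object per line, compressed as a single stream.
row_blob = "\n".join(json.dumps(r) for r in records).encode()
row_size = len(zlib.compress(row_blob, 6))

# Column-oriented: group each field's values together and compress per column.
col_size = sum(
    len(zlib.compress("\n".join(str(r[field]) for r in records).encode(), 6))
    for field in ("ts", "service", "level", "message")
)

print(f"row-oriented:    {row_size:>10,} bytes")
print(f"column-oriented: {col_size:>10,} bytes")
```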
sethammons | a day ago
Scale and costs. We are facing this logging scale at my work: a naive "push JSON into Splunk" approach would cost us over $6M/year, but I can only get maybe 5-10% of that approved. In the article, they talk about needing 8k CPUs to process their JSON logs, but only 90 CPUs afterward.
h1fra | a day ago
A couple of years ago ClickHouse wasn't that good at full-text search; to me that was the biggest drawback. Yes, it's faster and can handle ES-level scale, but depending on your use case it's much faster to query ES when you do FTS or grouping without a pre-built index.
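For context on what a "pre-built index" for text can look like in ClickHouse, here is a hedged sketch using a tokenbf_v1 bloom-filter skip index and hasToken(); it narrows but does not close the gap with Elasticsearch's inverted index. It assumes the clickhouse-driver package and the hypothetical `logs` table from the sketch above; the index name and tuning parameters are placeholders.

```python
# Sketch: add a token bloom-filter skip index over `message` so text queries
# can rule out granules instead of scanning every row. Assumes clickhouse-driver
# and the hypothetical `logs` table defined earlier; names/params are placeholders.
from clickhouse_driver import Client

client = Client(host="localhost")

# tokenbf_v1(bloom filter size in bytes, number of hash functions, seed).
client.execute("""
    ALTER TABLE logs
    ADD INDEX message_tokens message TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 4
""")
# Build the index for data that already exists in the table.
client.execute("ALTER TABLE logs MATERIALIZE INDEX message_tokens")

# hasToken() can consult the skip index to avoid reading granules that
# cannot contain the token.
client.execute("""
    SELECT count()
    FROM logs
    WHERE hasToken(message, 'timeout')
""")
```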