Remix.run Logo
tudorg 4 hours ago

I have been working on another PG extension for timeseries (https://github.com/xataio/deltax) for a few months, and trying to make it score as good as possible on ClickBench.

This is a project that is simply lot of fun to work on. There are many tricks that can be used to speed-up analytics, besides just type-aware compression:

* for each segment you will keep things like max/min/sum, number of distinct values, bloom filters, etc. For a good amount of common queries, you can answer them just based on this metadata, so you don't need to decompress the columns at all.

* for text column, you compress them differently based on cardinality. Low cardinality (think labels or similar) is dictionary based compression. High cardinality is LZ4.

* Generally the smaller the data on disk, the higher the cold runs performance. This is because you need less IO to load it in memory. I have discovered that on top of the type-aware compression, it's worth doing another round of LZ4. There's also some research that it's sometimes worth doing multiple passes of LZ4.

* Partition and segment pruning. If you can tell from the metadata or bloom filters that the filter doesn't match a partition or segment, you skip the whole thing.

* Push down of filters in the decompression layer. Depending on the compression algorithm, while you decompress you can also filter out the values that you don't need. This avoids passing data and allocating memory for elements that will be later discarded anyway.

* Organization of data on disk is more important than almost anything else. Of course, that's the main point of columnar storage, but there are level of details on how to organize the data so that IO is minimized during queries. I have tried 3-4 different layouts before settling on one.

* For top N type of queries, which are really common in analytics, you want to stop the reading from disk / decompressed as soon as you have enough data to guarantee that you have a correct top N to satisfy the query.

* Parallelize everything: at least ClickBench runs on instances with a lot of CPU cores, so you need to parallelize every step of the way. This is done differently depending on the query type. For example for top N, each worker can take a subset of the segments and get the top N from each of them. Then you combine those in a single result.