willtemperley 3 days ago

Actually, looking at the DuckDB source, I think they reuse a single uint64: they push bits onto it a byte at a time until the bit width is reached, then right-shift that many bits back off once a single value has been extracted. Very neat and presumably quick.
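For anyone wanting to see the idea in code, here is a minimal Python sketch of that accumulator approach. It is not DuckDB's actual code; the function name `unpack_bits` and its parameters are made up for illustration, and it assumes the LSB-first packing order used by Parquet's RLE/bit-packing hybrid encoding.

```python
def unpack_bits(data: bytes, bit_width: int, count: int) -> list[int]:
    """Unpack `count` values of `bit_width` bits each from LSB-first packed bytes.

    Hypothetical helper for illustration only. bit_width is capped at 32 so the
    accumulator never needs more than 64 bits.
    """
    if not 0 < bit_width <= 32:
        raise ValueError("bit_width must be between 1 and 32")

    mask = (1 << bit_width) - 1
    buffer = 0          # stands in for the single uint64 accumulator
    bits_in_buffer = 0  # how many valid bits the accumulator currently holds
    pos = 0             # index of the next unread input byte
    out = []

    while len(out) < count:
        # Push whole bytes on top of the bits already held until one value fits.
        while bits_in_buffer < bit_width:
            if pos >= len(data):
                raise ValueError("ran out of input bytes")
            buffer |= data[pos] << bits_in_buffer
            pos += 1
            bits_in_buffer += 8

        # The low bit_width bits are the next value; shift them back off.
        out.append(buffer & mask)
        buffer >>= bit_width
        bits_in_buffer -= bit_width

    return out
```

The values 0 to 7 packed at a bit width of 3 come out as the three bytes 0x88 0xC6 0xFA, so `unpack_bits(bytes([0x88, 0xC6, 0xFA]), 3, 8)` should give them back in order.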

I've just had so many issues with the total lack of clarity in this format. They tell you a total_compressed_size for a page, and then it turns out the _uncompressed_ page header is included in this, but the documentation barely gives any clues to the layout [1].

The reality:

Each column chunk includes a list of pages written back-to-back, with an optional dictionary page first. Each of these, including the dictionary page, is preceded by an uncompressed PageHeader in Thrift format.
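As a rough sketch of what that layout means when reading, the loop below walks the pages of one column chunk. It does no Thrift decoding: `parse_header` is a hypothetical callback you would have to supply, and `PageHeaderInfo` and `header_size` are names invented here. Only `compressed_page_size` corresponds to a field in the Thrift PageHeader.

```python
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass
class PageHeaderInfo:
    """Hypothetical summary of one decoded PageHeader."""
    header_size: int           # bytes taken up by the uncompressed Thrift header
    compressed_page_size: int  # bytes of page body that follow the header


def iter_pages(
    chunk: bytes,
    parse_header: Callable[[bytes, int], PageHeaderInfo],
) -> Iterator[tuple[PageHeaderInfo, bytes]]:
    """Yield (header, body) for each page stored back-to-back in a column chunk.

    `chunk` is assumed to be exactly the bytes of one column chunk, headers
    included. The body is returned as stored, so it may still be compressed.
    """
    pos = 0
    while pos < len(chunk):
        header = parse_header(chunk, pos)
        if header.header_size <= 0:
            raise ValueError("header parser made no progress")

        body_start = pos + header.header_size
        body_end = body_start + header.compressed_page_size
        if body_end > len(chunk):
            raise ValueError("page body runs past the end of the column chunk")

        yield header, chunk[body_start:body_end]
        pos = body_end  # the next header starts right after this page body
```

The point is only that the header is skipped before any decompression happens, and that the header bytes count toward the chunk's byte range. The dictionary page, if present, is simply the first item yielded.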

It wasn't too hard to write a paragraph about it. It was quite hard looking for magic compression bytes in hex dumps.

Maybe there should be a "minimum workable reference implementation" or something that is slow but easy to understand.

[1] https://parquet.apache.org/docs/file-format/data-pages/colum...