Remix clone Hacker News


	willtemperley 3 days ago
		Actually, looking at the DuckDB source, I think they reuse a single uint64 and push bits onto it a byte at a time until at least bitwidth bits are buffered, then right-shift bitwidth bits back off once a single value has been extracted. Very neat and presumably quick.

		I've just had so many issues with the total lack of clarity in this format. They tell you a total_compressed_size for a page, then it turns out the _uncompressed_ page header is included in this, but the documentation barely gives any clues to the layout [1].

		The reality: each column chunk consists of a list of pages written back-to-back, with an optional dictionary page first. Each of these, including the dictionary page, is preceded by an uncompressed PageHeader in Thrift format.

		It wasn't too hard to write a paragraph about it. It was quite hard looking for magic compression bytes in hex dumps. Maybe there should be a "minimum workable reference implementation" or something that is slow but easy to understand.

		[1] https://parquet.apache.org/docs/file-format/data-pages/colum...
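For anyone who wants to poke at the accumulator trick described above, here is a rough Python sketch of it. This is my reading of the description, not DuckDB's actual code: the function name `unpack_bits` and its parameters are made up for illustration, and it assumes values are packed least-significant-bit first, which I believe is what Parquet's bit-packed runs use.

```python
from typing import List


def unpack_bits(data: bytes, bit_width: int, count: int) -> List[int]:
    """Unpack `count` values of `bit_width` bits each from `data`.

    Hypothetical helper: assumes LSB-first packing. A fixed-width
    implementation could keep `acc` in a single uint64, since it never
    holds more than bit_width - 1 + 8 bits at once.
    """
    if not 0 < bit_width <= 32:
        raise ValueError("bit_width must be between 1 and 32")

    mask = (1 << bit_width) - 1
    acc = 0    # bit accumulator (the "single uint64")
    bits = 0   # number of valid bits currently held in acc
    out: List[int] = []

    for byte in data:
        # Push the next byte on above the bits we already hold.
        acc |= byte << bits
        bits += 8

        # Shift complete values back off the low end.
        while bits >= bit_width and len(out) < count:
            out.append(acc & mask)
            acc >>= bit_width
            bits -= bit_width

        if len(out) == count:
            break

    if len(out) < count:
        raise ValueError("not enough input bytes for the requested count")
    return out


if __name__ == "__main__":
    # Values 0..7 at bit width 3 pack into three bytes under LSB-first packing.
    print(unpack_bits(bytes([0x88, 0xC6, 0xFA]), 3, 8))
    # -> [0, 1, 2, 3, 4, 5, 6, 7]
```

It is slow but easy to follow, which is roughly the kind of reference code I was wishing for. A real decoder would also have to handle the run headers and the RLE half of the hybrid encoding, which this leaves out.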