ignoreusernames 3 days ago

> The reference implementation for Parquet is a gigantic Java library. I'm unconvinced this is a good idea.

I haven't thought much about it, but I believe the ideal reference implementation would be a highly optimized, service-like process that runs alongside your engine and uses Arrow to share zero-copy buffers between the engine and the Parquet service. Parquet predates Arrow by quite a few years, and Java was (unfortunately) the standard for big data stuff back then, so they simply stuck with it.

> The way bit-packing has been implemented is to generate 74,000 lines of Java to read/write every combination of bitwidth, endianness and value-length

I think they did this to avoid Java's dynamic dispatch. In C++ or Rust something very similar would happen, but at the compiler level, which is a much saner way of doing this kind of thing.

willtemperley 3 days ago

Actually, looking at the DuckDB source, I think they reuse a single uint64 and push bits onto it a byte at a time until bitwidth is reached, then right-shift bitwidth bits back off once a single value has been created. Very neat and presumably quick.
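A minimal sketch of that accumulator idea, in Python rather than C++. This is not DuckDB's code: the function name `unpack_bits` is made up for illustration, and it assumes values are packed least-significant-bit first, as in the bit-packed runs of Parquet's RLE/bit-packing hybrid encoding.

```python
from typing import List


def unpack_bits(data: bytes, bit_width: int, count: int) -> List[int]:
    """Unpack `count` values of `bit_width` bits each, packed LSB-first.

    Hypothetical helper for illustration. With bit_width <= 32 the
    accumulator never holds more than 39 bits, so it would fit in a
    uint64 in a language with fixed-width integers.
    """
    if not 0 < bit_width <= 32:
        raise ValueError("bit_width must be in 1..32")
    mask = (1 << bit_width) - 1
    acc = 0    # bit accumulator (the "single uint64")
    bits = 0   # number of valid bits currently held in acc
    pos = 0    # index of the next input byte
    out: List[int] = []
    while len(out) < count:
        # Push whole bytes on top of the accumulator until one value fits.
        while bits < bit_width:
            if pos >= len(data):
                raise ValueError("input ended before all values were read")
            acc |= data[pos] << bits
            bits += 8
            pos += 1
        # Take the low bit_width bits, then shift them back off.
        out.append(acc & mask)
        acc >>= bit_width
        bits -= bit_width
    return out


# The values 0..7 at bit width 3 pack into the three bytes 0x88 0xC6 0xFA.
values = unpack_bits(bytes([0x88, 0xC6, 0xFA]), bit_width=3, count=8)
```

One loop handles every bit width, at the cost of a branch per value; generated per-width code trades that branch for a lot of source.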

I've just had so many issues with the total lack of clarity in this format. They tell you a total_compressed_size for a page, then it turns out the _uncompressed_ page header is included in this, but the documentation barely gives any clues to the layout [1].

The reality:

Each column chunk consists of a list of pages written back-to-back, with an optional dictionary page first. Each of these, including the dictionary page, is preceded by an uncompressed PageHeader in Thrift format.

It wasn't too hard to write a paragraph about it. It was quite hard looking for magic compression bytes in hex dumps.
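The layout described above can be sketched as a simple walk over the column chunk bytes. This is only an illustration under stated assumptions: `iter_pages`, `HeaderReader` and `toy_header` are hypothetical names, and the header decoder is passed in as a callable because real Thrift decoding of PageHeader is out of scope here. The toy decoder below uses a fixed 4-byte length prefix purely so the example runs; it is not the Parquet format.

```python
from typing import Callable, Iterator, Tuple

# Hypothetical signature: decode one header starting at `offset` and return
# (header_length_in_bytes, compressed_page_size). A real reader would decode
# the Thrift-encoded PageHeader at this point.
HeaderReader = Callable[[bytes, int], Tuple[int, int]]


def iter_pages(chunk: bytes, read_header: HeaderReader) -> Iterator[Tuple[int, int, int]]:
    """Yield (header_offset, body_offset, body_length) for each page.

    `chunk` is assumed to hold exactly one column chunk: pages written
    back-to-back, each preceded by its own uncompressed header.
    """
    offset = 0
    while offset < len(chunk):
        header_len, body_len = read_header(chunk, offset)
        if header_len <= 0 or body_len < 0:
            raise ValueError("header reader returned an invalid length")
        body_offset = offset + header_len
        if body_offset + body_len > len(chunk):
            raise ValueError("page runs past the end of the column chunk")
        yield offset, body_offset, body_len
        # The next header starts right after this page's (compressed) body.
        offset = body_offset + body_len


def toy_header(buf: bytes, offset: int) -> Tuple[int, int]:
    # Stand-in for Thrift decoding: a fixed 4-byte little-endian body length.
    return 4, int.from_bytes(buf[offset:offset + 4], "little")


# Two fake pages: a 3-byte body and a 2-byte body, each with a toy header.
chunk = (3).to_bytes(4, "little") + b"abc" + (2).to_bytes(4, "little") + b"de"
pages = list(iter_pages(chunk, toy_header))
consumed = sum((body - start) + length for start, body, length in pages)
```

Here `consumed` counts header bytes plus body bytes, which is the quantity the size field mentioned above turns out to describe.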

Maybe there should be a "minimum workable reference implementation" or something that is slow but easy to understand.

[1] https://parquet.apache.org/docs/file-format/data-pages/colum...

quotemstr 3 days ago

If you're doing IPC to a sidecar to do purely numeric computation that you could just as easily do in-process, something has gone terribly wrong with your software engineering methodology.