ignoreusernames · 3 days ago
> The reference implementation for Parquet is a gigantic Java library.

I'm unconvinced this is a good idea. I haven't thought much about it, but I believe the ideal reference implementation would be a highly optimized, service-like process that you run alongside your engine, using Arrow to share zero-copy buffers between the engine and the Parquet service. Parquet predates Arrow by quite a few years, and Java was (unfortunately) the standard for big-data tooling back then, so they simply stuck with it.

> The way bit-packing has been implemented is to generate 74,000 lines of Java to read/write every combination of bitwidth, endianness and value-length

I think they did this to avoid dynamic dispatch, which is expensive in Java. In C++ or Rust something very similar would happen, but at the compiler level (templates or monomorphization), which is a much saner way of doing this kind of thing.
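A minimal sketch of that monomorphization point in Rust, assuming Parquet's LSB-first bit packing; the function names and the set of dispatched widths here are illustrative, not any real library's API. The compiler stamps out one specialized body per width actually used, so the combinatorial explosion lands in the object file instead of the source tree:

    // One generic unpacker; the compiler emits a specialized copy per WIDTH.
    fn unpack_bits<const WIDTH: u32>(packed: &[u8], count: usize, out: &mut Vec<u32>) {
        let mask: u64 = (1u64 << WIDTH) - 1;
        let mut buf: u64 = 0;  // bit accumulator
        let mut bits: u32 = 0; // how many valid bits are currently in `buf`
        let mut bytes = packed.iter();
        for _ in 0..count {
            while bits < WIDTH {
                // push one byte onto the accumulator, LSB first
                buf |= (*bytes.next().expect("packed input too short") as u64) << bits;
                bits += 8;
            }
            out.push((buf & mask) as u32);
            buf >>= WIDTH; // shift the consumed value back off
            bits -= WIDTH;
        }
    }

    // Runtime dispatch happens once per page, not once per value.
    fn unpack(width: u32, packed: &[u8], count: usize, out: &mut Vec<u32>) {
        match width {
            1 => unpack_bits::<1>(packed, count, out),
            2 => unpack_bits::<2>(packed, count, out),
            3 => unpack_bits::<3>(packed, count, out),
            // ...one arm per supported width, up to 32
            _ => unimplemented!("width not wired up in this sketch"),
        }
    }

    fn main() {
        let mut out = Vec::new();
        // 0b10_01_00_11 holds four 2-bit values, LSB first: 3, 0, 1, 2
        unpack(2, &[0b10_01_00_11], 4, &mut out);
        assert_eq!(out, [3, 0, 1, 2]);
    }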
willtemperley · 3 days ago
Actually, looking at the DuckDB source, I think they reuse a single uint64 as an accumulator: bytes are pushed onto it one at a time until at least bitwidth bits are available, then bitwidth bits are right-shifted back off for each value produced (essentially the accumulator in the sketch above). Very neat, and presumably quick.

I've just had so many issues with the total lack of clarity in this format. They tell you a total_compressed_size for a page, and then it turns out the _uncompressed_ page header is included in it, but the documentation barely gives any clues about the layout [1].

The reality: each column chunk is a list of pages written back-to-back, with an optional dictionary page first. Each of these, the dictionary page included, is prepended with an uncompressed PageHeader in Thrift format. It wasn't too hard to write that paragraph. It was quite hard hunting for magic compression bytes in hex dumps.

Maybe there should be a "minimum workable reference implementation" or something: slow, but easy to understand.

[1] https://parquet.apache.org/docs/file-format/data-pages/colum...
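To make that layout paragraph concrete, here is a sketch of walking one column chunk in Rust. The PageHeader struct is a simplified subset of the Thrift struct in parquet-format, and read_page_header is a hypothetical stand-in for a real Thrift compact-protocol decoder, not an actual API:

    // Simplified subset of parquet-format's Thrift PageHeader; the real
    // struct also carries the page type, uncompressed size, CRC, and more.
    struct PageHeader {
        compressed_page_size: i32,
    }

    // Hypothetical stand-in: decode one uncompressed, Thrift-encoded
    // PageHeader from the front of the input, returning the header and
    // the number of header bytes consumed.
    fn read_page_header(_input: &[u8]) -> (PageHeader, usize) {
        unimplemented!("plug in a real Thrift compact-protocol decoder")
    }

    // One column chunk is framed as:
    //   [PageHeader][page body][PageHeader][page body]...
    // with an optional dictionary page first. The headers are NOT
    // compressed, yet their bytes still count toward the reported
    // total_compressed_size, which is what makes hex-dump spelunking
    // so confusing.
    fn walk_column_chunk(mut chunk: &[u8]) {
        while !chunk.is_empty() {
            let (header, header_len) = read_page_header(chunk);
            let body_len = header.compressed_page_size as usize;
            let _compressed_body = &chunk[header_len..header_len + body_len];
            // ...decompress with the chunk's codec, then decode values...
            chunk = &chunk[header_len + body_len..];
        }
    }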
quotemstr · 3 days ago
If you're doing IPC to a sidecar for purely numeric computation that you could just as easily do in-process, something has gone terribly wrong with your software engineering methodology.