willtemperley | 3 days ago
The reference implementation for Parquet is a gigantic Java library, and I'm unconvinced this is a good idea. Take the RLE encoding, which switches between run-length encoding and bit-packing. The way bit-packing has been implemented is to generate 74,000 lines of Java to read/write every combination of bitwidth, endianness and value-length. I just can't believe this is optimal except maybe in very specific CPU-only cases (e.g. Parquet-Java running on a giant cluster somewhere).

If it were just bit-packing I could easily offload a whole data page to a GPU and not care about having per-bitwidth optimised implementations, but having to switch encodings at random intervals makes this a headache. It would be really nice if actual design documents existed that explained why this is a good idea, based on real-world data patterns.
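For reference, the run framing being complained about works roughly like this: each run starts with a ULEB128 header whose low bit selects between an RLE run and a bit-packed run, so a reader has to switch decoding strategies at arbitrary points within a page. Below is a minimal sketch based on the parquet-format encoding spec; the class and helper names are illustrative only, not parquet-java API.

    import java.io.ByteArrayInputStream;
    import java.io.DataInputStream;
    import java.io.IOException;
    import java.util.Arrays;

    // Sketch of the RLE / bit-packing hybrid run framing described in the
    // parquet-format encoding spec. Class and method names are illustrative
    // only -- this is not parquet-java API.
    public class HybridRunSketch {

        // ULEB128 / unsigned varint, used for the run headers.
        static int readUnsignedVarInt(DataInputStream in) throws IOException {
            int value = 0, shift = 0, b;
            do {
                b = in.readUnsignedByte();
                value |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            return value;
        }

        static void decode(DataInputStream in, int bitWidth, int[] out) throws IOException {
            int byteWidth = (bitWidth + 7) / 8;      // RLE value is padded to whole bytes
            long mask = (1L << bitWidth) - 1;
            int pos = 0;
            while (pos < out.length) {
                int header = readUnsignedVarInt(in);
                if ((header & 1) == 0) {
                    // RLE run: count = header >> 1, followed by one little-endian value.
                    int count = header >>> 1;
                    int value = 0;
                    for (int i = 0; i < byteWidth; i++) {
                        value |= in.readUnsignedByte() << (8 * i);
                    }
                    for (int i = 0; i < count && pos < out.length; i++) {
                        out[pos++] = value;
                    }
                } else {
                    // Bit-packed run: (header >> 1) groups of 8 values, i.e. exactly
                    // bitWidth bytes per group, values packed LSB-first.
                    int groups = header >>> 1;
                    byte[] packed = new byte[groups * bitWidth];
                    in.readFully(packed);
                    for (int i = 0, n = groups * 8; i < n && pos < out.length; i++) {
                        long bitPos = (long) i * bitWidth;
                        int byteIdx = (int) (bitPos >>> 3);
                        int shift = (int) (bitPos & 7);
                        long word = 0;
                        for (int b = 0; b < 8 && byteIdx + b < packed.length; b++) {
                            word |= (packed[byteIdx + b] & 0xFFL) << (8 * b);
                        }
                        out[pos++] = (int) ((word >>> shift) & mask);
                    }
                }
            }
        }

        public static void main(String[] args) throws IOException {
            // An RLE run of five 7s, then one bit-packed group holding 0..7, all at bit width 3.
            byte[] data = {
                (byte) (5 << 1), 7,                                         // RLE header, value
                (byte) ((1 << 1) | 1),                                      // bit-packed header, 1 group
                (byte) 0b10001000, (byte) 0b11000110, (byte) 0b11111010     // spec example bytes for 0..7
            };
            int[] out = new int[13];
            decode(new DataInputStream(new ByteArrayInputStream(data)), 3, out);
            System.out.println(Arrays.toString(out)); // [7, 7, 7, 7, 7, 0, 1, 2, 3, 4, 5, 6, 7]
        }
    }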
ignoreusernames | 3 days ago
> The reference implementation for Parquet is a gigantic Java library. I'm unconvinced this is a good idea.

I haven't thought much about it, but I believe the ideal reference implementation would be a highly optimized "service-like" process that you run alongside your engine, using Arrow to share zero-copy buffers between the engine and the Parquet service. Parquet predates Arrow by quite a few years, and Java was (unfortunately) the standard for big-data stuff back then, so they simply stuck with it.

> The way bit-packing has been implemented is to generate 74,000 lines of Java to read/write every combination of bitwidth, endianness and value-length

I think they did this to avoid the dynamic dispatch nature of Java. In C++ or Rust something very similar would happen, but at the compiler level, which is a much saner way of doing this kind of thing.
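For a concrete picture of what those generated files contain: each specialization bakes the bit width into constant shifts and masks, which is roughly what a C++ template or a monomorphized Rust generic would produce at compile time, while a single generic routine would recompute byte offsets, shifts and the mask from a runtime bitWidth on every element. A hypothetical hand-written equivalent for bit width 3 (illustrative only, not actual parquet-java code):

    import java.util.Arrays;

    public class Unpack3Sketch {

        // Unpacks one group of 8 values packed LSB-first at 3 bits each.
        // The width, shifts and mask are all compile-time constants, so the
        // JIT emits straight-line code with no per-element width arithmetic.
        static void unpack3(byte[] in, int inPos, int[] out, int outPos) {
            int w = (in[inPos] & 0xFF)
                  | (in[inPos + 1] & 0xFF) << 8
                  | (in[inPos + 2] & 0xFF) << 16;
            for (int i = 0; i < 8; i++) {            // constant bound, trivially unrolled
                out[outPos + i] = (w >>> (3 * i)) & 0b111;
            }
        }

        public static void main(String[] args) {
            // Example from the parquet-format spec: values 0..7 at bit width 3.
            byte[] packed = { (byte) 0b10001000, (byte) 0b11000110, (byte) 0b11111010 };
            int[] out = new int[8];
            unpack3(packed, 0, out, 0);
            System.out.println(Arrays.toString(out)); // [0, 1, 2, 3, 4, 5, 6, 7]
        }
    }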
nerdponx | 3 days ago
I'd rather have this file format with an incomplete and confusing reference implementation than not have this file format at all. Parquet was such a tremendous improvement in quality of life over the prior status quo for anyone who needs to move even moderate amounts of data between systems, or anyone who cares about correctness and bug prevention when working with even the tiniest data sets. Maybe HDF5 or ORC would have filled the niche if Parquet hadn't, but I think realistically we would just be stuck with fragile CSV/TSV.
willtemperley | 3 days ago
Addendum: if something is actually decoded by RunLengthBitPackingHybridDecoder but the encoding is called RLE, that's probably a sign it was a bad idea in the first place. Plus it makes it really hard to search for.
quotemstr | 3 days ago
74 KLOC for a decoder? That's ridiculous. Use invokedynamic. Yes, people more typically associate invokedynamic with interpreter implementations and the like, but it's actually perfect for this use case: generate the right code on demand and let the JVM cache it, so that subsequent invocations are just as fast as if you'd written them by hand. Jesus Christ, this isn't 2005 anymore; people need to learn to use the real power of the JVM. It's stuff like this that sets it apart.
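A rough sketch of the kind of thing being suggested: cache one MethodHandle per bit width with the width bound in as a constant, so the JIT can fold it once the call site stabilizes. This is only an approximation of the invokedynamic approach (a full version would spin specialized bytecode at runtime and install it behind a mutable call site); all names here are invented for illustration and are not parquet-java API.

    import java.lang.invoke.MethodHandle;
    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.MethodType;
    import java.util.Arrays;

    public class HandlePerWidth {

        // One specialized handle per bit width, built lazily and cached. Once a
        // handle is constant at a call site, the JIT can fold the bound width,
        // which is roughly what generating per-width code buys you -- without
        // 74k lines of pre-generated source.
        private static final MethodHandle[] CACHE = new MethodHandle[33];

        // Generic implementation, written once (same loop as the earlier sketch).
        static void unpack(int bitWidth, byte[] in, int[] out, int count) {
            long mask = (1L << bitWidth) - 1;
            for (int i = 0; i < count; i++) {
                long bitPos = (long) i * bitWidth;
                int byteIdx = (int) (bitPos >>> 3);
                int shift = (int) (bitPos & 7);
                long word = 0;
                for (int b = 0; b < 8 && byteIdx + b < in.length; b++) {
                    word |= (in[byteIdx + b] & 0xFFL) << (8 * b);
                }
                out[i] = (int) ((word >>> shift) & mask);
            }
        }

        static MethodHandle forBitWidth(int bitWidth) {
            MethodHandle h = CACHE[bitWidth];
            if (h == null) {
                try {
                    MethodHandle generic = MethodHandles.lookup().findStatic(
                            HandlePerWidth.class, "unpack",
                            MethodType.methodType(void.class, int.class, byte[].class, int[].class, int.class));
                    // Bind the width as a constant argument; type becomes (byte[], int[], int) -> void.
                    h = MethodHandles.insertArguments(generic, 0, bitWidth);
                } catch (ReflectiveOperationException e) {
                    throw new AssertionError(e);
                }
                CACHE[bitWidth] = h;
            }
            return h;
        }

        public static void main(String[] args) throws Throwable {
            byte[] packed = { (byte) 0b10001000, (byte) 0b11000110, (byte) 0b11111010 }; // 0..7 at 3 bits
            int[] out = new int[8];
            forBitWidth(3).invokeExact(packed, out, 8);
            System.out.println(Arrays.toString(out)); // [0, 1, 2, 3, 4, 5, 6, 7]
        }
    }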