willtemperley | 3 days ago
The reference implementation for Parquet is a gigantic Java library, and I'm unconvinced this is a good idea. Take the RLE encoding, which switches between run-length encoding and bit-packing. The way bit-packing has been implemented is to generate 74,000 lines of Java to read/write every combination of bitwidth, endianness and value-length. I just can't believe this is optimal except maybe in very specific CPU-only cases (e.g. Parquet-Java running on a giant cluster somewhere).

If it were just bit-packing I could easily offload a whole data page to a GPU and not care about having per-bitwidth optimised implementations, but having to switch encodings at random intervals makes this a headache. It would be really nice if actual design documents existed that explained why this is a good idea, based on real-world data patterns.
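For reference, the run framing being complained about works roughly like this: each run starts with a ULEB128 header whose low bit selects between an RLE run and a bit-packed run, so a reader has to switch decoding strategies at arbitrary points within a page. Below is a minimal sketch based on the parquet-format encoding spec; the class and helper names are illustrative only, not parquet-java API.

    import java.io.ByteArrayInputStream;
    import java.io.DataInputStream;
    import java.io.IOException;
    import java.util.Arrays;

    // Sketch of the RLE / bit-packing hybrid run framing described in the
    // parquet-format encoding spec. Class and method names are illustrative
    // only -- this is not parquet-java API.
    public class HybridRunSketch {

        // ULEB128 / unsigned varint, used for the run headers.
        static int readUnsignedVarInt(DataInputStream in) throws IOException {
            int value = 0, shift = 0, b;
            do {
                b = in.readUnsignedByte();
                value |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            return value;
        }

        static void decode(DataInputStream in, int bitWidth, int[] out) throws IOException {
            int byteWidth = (bitWidth + 7) / 8;      // RLE value is padded to whole bytes
            long mask = (1L << bitWidth) - 1;
            int pos = 0;
            while (pos < out.length) {
                int header = readUnsignedVarInt(in);
                if ((header & 1) == 0) {
                    // RLE run: count = header >> 1, followed by one little-endian value.
                    int count = header >>> 1;
                    int value = 0;
                    for (int i = 0; i < byteWidth; i++) {
                        value |= in.readUnsignedByte() << (8 * i);
                    }
                    for (int i = 0; i < count && pos < out.length; i++) {
                        out[pos++] = value;
                    }
                } else {
                    // Bit-packed run: (header >> 1) groups of 8 values, i.e. exactly
                    // bitWidth bytes per group, values packed LSB-first.
                    int groups = header >>> 1;
                    byte[] packed = new byte[groups * bitWidth];
                    in.readFully(packed);
                    for (int i = 0, n = groups * 8; i < n && pos < out.length; i++) {
                        long bitPos = (long) i * bitWidth;
                        int byteIdx = (int) (bitPos >>> 3);
                        int shift = (int) (bitPos & 7);
                        long word = 0;
                        for (int b = 0; b < 8 && byteIdx + b < packed.length; b++) {
                            word |= (packed[byteIdx + b] & 0xFFL) << (8 * b);
                        }
                        out[pos++] = (int) ((word >>> shift) & mask);
                    }
                }
            }
        }

        public static void main(String[] args) throws IOException {
            // An RLE run of five 7s, then one bit-packed group holding 0..7, all at bit width 3.
            byte[] data = {
                (byte) (5 << 1), 7,                                         // RLE header, value
                (byte) ((1 << 1) | 1),                                      // bit-packed header, 1 group
                (byte) 0b10001000, (byte) 0b11000110, (byte) 0b11111010     // spec example bytes for 0..7
            };
            int[] out = new int[13];
            decode(new DataInputStream(new ByteArrayInputStream(data)), 3, out);
            System.out.println(Arrays.toString(out)); // [7, 7, 7, 7, 7, 0, 1, 2, 3, 4, 5, 6, 7]
        }
    }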
ignoreusernames | 3 days ago
> The reference implementation for Parquet is a gigantic Java library. I'm unconvinced this is a good idea.

I haven't thought much about it, but I believe the ideal reference implementation would be a highly optimized "service-like" process that you run alongside your engine, using Arrow to share zero-copy buffers between the engine and the Parquet service. Parquet predates Arrow by quite a few years, and Java was (unfortunately) the standard for big-data stuff back then, so they simply stuck with it.

> The way bit-packing has been implemented is to generate 74,000 lines of Java to read/write every combination of bitwidth, endianness and value-length

I think they did this to avoid the dynamic dispatch nature of Java. In C++ or Rust something very similar would happen, but at the compiler level, which is a much saner way of doing this kind of thing.
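For a concrete picture of what those generated files contain: each specialization bakes the bit width into constant shifts and masks, which is roughly what a C++ template or a monomorphized Rust generic would produce at compile time, while a single generic routine would recompute byte offsets, shifts and the mask from a runtime bitWidth on every element. A hypothetical hand-written equivalent for bit width 3 (illustrative only, not actual parquet-java code):

    import java.util.Arrays;

    public class Unpack3Sketch {

        // Unpacks one group of 8 values packed LSB-first at 3 bits each.
        // The width, shifts and mask are all compile-time constants, so the
        // JIT emits straight-line code with no per-element width arithmetic.
        static void unpack3(byte[] in, int inPos, int[] out, int outPos) {
            int w = (in[inPos] & 0xFF)
                  | (in[inPos + 1] & 0xFF) << 8
                  | (in[inPos + 2] & 0xFF) << 16;
            for (int i = 0; i < 8; i++) {            // constant bound, trivially unrolled
                out[outPos + i] = (w >>> (3 * i)) & 0b111;
            }
        }

        public static void main(String[] args) {
            // Example from the parquet-format spec: values 0..7 at bit width 3.
            byte[] packed = { (byte) 0b10001000, (byte) 0b11000110, (byte) 0b11111010 };
            int[] out = new int[8];
            unpack3(packed, 0, out, 0);
            System.out.println(Arrays.toString(out)); // [0, 1, 2, 3, 4, 5, 6, 7]
        }
    }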
nerdponx | 3 days ago
I'd rather have this file format with an incomplete and confusing reference implementation than not have this file format at all. Parquet was such a tremendous improvement in quality of life over the prior status quo for anyone who needs to move even moderate amounts of data between systems, or anyone who cares about correctness and bug prevention when working with even the tiniest data sets. Maybe HDF5 or ORC would have filled the niche if Parquet hadn't, but I think realistically we would just be stuck with fragile CSV/TSV.
willtemperley | 3 days ago
Addendum: if something is actually decoded by RunLengthBitPackingHybridDecoder but the encoding is called RLE, that's probably a sign it was a bad idea in the first place. Plus it makes it really hard to search for.
quotemstr | 3 days ago
74 KLOC for a decoder? That's ridiculous. Use invokedynamic. Yes, people more typically associate invokedynamic with interpreter implementations and the like, but it's actually perfect for this use case: generate the right code on demand and let the JVM cache it, so that subsequent invocations are just as fast as if you'd written them by hand. Jesus Christ, this isn't 2005 anymore; people need to learn to use the real power of the JVM. It's stuff like this that sets it apart.
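A rough sketch of the kind of thing being suggested: cache one MethodHandle per bit width with the width bound in as a constant, so the JIT can fold it once the call site stabilizes. This is only an approximation of the invokedynamic approach (a full version would spin specialized bytecode at runtime and install it behind a mutable call site); all names here are invented for illustration and are not parquet-java API.

    import java.lang.invoke.MethodHandle;
    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.MethodType;
    import java.util.Arrays;

    public class HandlePerWidth {

        // One specialized handle per bit width, built lazily and cached. Once a
        // handle is constant at a call site, the JIT can fold the bound width,
        // which is roughly what generating per-width code buys you -- without
        // 74k lines of pre-generated source.
        private static final MethodHandle[] CACHE = new MethodHandle[33];

        // Generic implementation, written once (same loop as the earlier sketch).
        static void unpack(int bitWidth, byte[] in, int[] out, int count) {
            long mask = (1L << bitWidth) - 1;
            for (int i = 0; i < count; i++) {
                long bitPos = (long) i * bitWidth;
                int byteIdx = (int) (bitPos >>> 3);
                int shift = (int) (bitPos & 7);
                long word = 0;
                for (int b = 0; b < 8 && byteIdx + b < in.length; b++) {
                    word |= (in[byteIdx + b] & 0xFFL) << (8 * b);
                }
                out[i] = (int) ((word >>> shift) & mask);
            }
        }

        static MethodHandle forBitWidth(int bitWidth) {
            MethodHandle h = CACHE[bitWidth];
            if (h == null) {
                try {
                    MethodHandle generic = MethodHandles.lookup().findStatic(
                            HandlePerWidth.class, "unpack",
                            MethodType.methodType(void.class, int.class, byte[].class, int[].class, int.class));
                    // Bind the width as a constant argument; type becomes (byte[], int[], int) -> void.
                    h = MethodHandles.insertArguments(generic, 0, bitWidth);
                } catch (ReflectiveOperationException e) {
                    throw new AssertionError(e);
                }
                CACHE[bitWidth] = h;
            }
            return h;
        }

        public static void main(String[] args) throws Throwable {
            byte[] packed = { (byte) 0b10001000, (byte) 0b11000110, (byte) 0b11111010 }; // 0..7 at 3 bits
            int[] out = new int[8];
            forBitWidth(3).invokeExact(packed, out, 8);
            System.out.println(Arrays.toString(out)); // [0, 1, 2, 3, 4, 5, 6, 7]
        }
    }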