Remix.run Logo
aduffy 3 hours ago

That is fair.

FWIW I think if you are just doing pure analytics and nothing else, Parquet will probably continue to do the job for you just fine, and you don't need to touch your workloads at all.

These new formats I think will find a niche where people aren't just running Spark jobs, but doing lots of systems building over large tables. If you're building a PB-scale data warehouse, you care a lot about the file format b/c it is a big factor in your performance curve, and you're willing to ship new experimental codecs in response to new datatypes you want to support that the system wasn't originally designed for, or you want to use a newly invented compressor.