aduffy 3 hours ago

> Also, my main gripe with parquet (single table per file) is not even addressed, so, also the name is a bit hyped up.

This is really more of an expectation that has been put on file formats by the query engines. Spark/DataFusion/DuckDB wouldn't really know what to do with a multi-table file.

> Parquet is unfortunately very good just by virtue of being first, and so widely supported

IMO that is not how technology works. It is great that Parquet is so good at a lot of things, but that does not mean just because it came first that it deserves to be the only analytic file format forever.

> Its main result seems to be improved random access which, although certainly welcome, is not the point of columnar storage, as columnar storage was invented to exchange random access for something else: fast analytics

Fast analytics, as well as newer ML-shaped workloads, are inherently a mix of batch scans and random access.

Some of the authors of F3 previously authored another paper that goes into the details of the shortcomings of Parquet: https://www.vldb.org/pvldb/vol17/p148-zeng.pdf

All of the newer formats that popped up recently (Vortex, Lance, F3 now) have been working on solving the problems outlined in that paper. Lance has some interesting ideas; Vortex focuses on extensibility and performance by replacing all of Parquet's black-box encoders with fully transparent encodings. This solves the tradeoff between bulk and element decoding, allowing you to have efficient full scans and really fast random access.

E.g. LangChain recently rebuilt a system that used to be all Parquet files to use Vortex and saw a massive speedup, which they talk about more here: https://www.langchain.com/blog/introducing-smithdb

Disclaimer: I work on Vortex, so a lot of these questions about "what is the point of building a new format" are things that I have grappled with myself.
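To make the bulk-versus-element decoding tradeoff concrete, here is a minimal Python sketch. It is not Vortex's or Parquet's actual code. The function names, the use of zlib as a stand-in for an opaque block compressor, and the 16-bit frame-of-reference layout are all assumptions chosen for illustration.

```python
import struct
import zlib
from array import array


def encode_opaque(values):
    """Pack int64 values and compress them as a single opaque block."""
    raw = struct.pack(f"<{len(values)}q", *values)
    return zlib.compress(raw)


def get_opaque(block, i):
    """Reading one element requires decompressing the entire block first."""
    raw = zlib.decompress(block)
    return struct.unpack_from("<q", raw, i * 8)[0]


def encode_for(values):
    """Frame-of-reference: keep the minimum plus fixed-width offsets.

    Assumption: max(values) - min(values) fits in an unsigned 16-bit
    integer; array("H", ...) raises OverflowError otherwise.
    """
    base = min(values)
    offsets = array("H", (v - base for v in values))
    return base, offsets


def get_for(encoded, i):
    """Random access is one index plus one add; no block decode needed."""
    base, offsets = encoded
    return base + offsets[i]


def scan_for(encoded):
    """A full scan is a single tight pass over the offsets."""
    base, offsets = encoded
    return [base + o for o in offsets]


# Hypothetical column: 10,000 values clustered just above one million.
values = [1_000_000 + (i * 7) % 500 for i in range(10_000)]
opaque = encode_opaque(values)
transparent = encode_for(values)
```

Both layouts support a full scan, but only the second can return element `i` without touching the rest of the block. That is the property the comment attributes to transparent encodings.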
vouwfietsman 3 hours ago

> DuckDB wouldn't really know what to do with a

Sure it would, you can attach a multi-table SQLite database in DuckDB.

> that does not mean just because it came first

I agree with most of your points; I am not stating my opinion but my observations. I am the target audience here. I want to use this, but I don't really care too much about the file format itself, at least not as much as I care about the data inside. That means access, which means compatibility with my tooling. Compatibility is hard to beat.

This is the Concorde of file formats.
sanderjd 2 hours ago

Yeah, that point about "random access is not the point of columnar formats" fell flat for me for this same reason. Almost since the first day I started using columnar data, I've been interested in solutions that strike this balance between batch and random access. This comes up all the time (in my experience) in data science / ML, where we have use cases for both access patterns against the same data.

So I'm with you: I'm very unconvinced that Parquet (and the various things that are Parquet or essentially-Parquet under the hood) is the end of the line here.
JadeNB an hour ago

> > Parquet is unfortunately very good just by virtue of being first, and so widely supported
>
> IMO that is not how technology works. It is great that Parquet is so good at a lot of things, but that does not mean just because it came first that it deserves to be the only analytic file format forever.

I think you and vouwfietsman (https://news.ycombinator.com/item?id=48649412) are actually saying the same thing in different words: I think their "unfortunately" means "it is unfortunate that, by virtue of coming first, this now has a support lead that will make it difficult for anyone else to catch up."