vouwfietsman 3 hours ago

Not sure why this got so many upvotes, also the landing page is not great, it's better to look at the paper (see link below).

Seems to be a columnar storage format that addresses some shortcomings in parquet. Thing is, though, that of all these formats the real winning feature is compatibility, which is (obviously) very hard to improve on, as anything new immediately loses.

Parquet is unfortunately very good just by virtue of being first, and so widely supported. The most widely used parquet version is the oldest version from 2013 (as per the paper itself), so parquet itself couldn't even supplant parquet. If you want to improve on it, you need to bring some serious results, which I don't think f3 does.

Also, my main gripe with parquet (single table per file) is not even addressed, so, also the name is a bit hyped up.

Also also, it seems to go out of its way to include a compiled wasm binary for decoding, yet requires flatbuffers to parse that blob? Kind of defeats the purpose.

Its main result seems to be improved random access which, although certainly welcome, is not the point of columnar storage, as columnar storage was invented to exchange random access for something else: fast analytics. F3 seems to sacrifice fast analytics for the wasm decoder. I don't get it.

Maybe I'm being too cynical. Can someone help me out here?

https://dl.acm.org/doi/epdf/10.1145/3749163

aduffy 3 hours ago | parent | next [-]

> Also, my main gripe with parquet (single table per file) is not even addressed, so, also the name is a bit hyped up.

This is really more of an expectation that has been put on file formats by the query engines. Spark/Datafusion/DuckDB wouldn't really know what to do with a multi-table file.

> Parquet is unfortunately very good just by virtue of being first, and so widely supported

IMO that is not how technology works. It is great that Parquet is so good at a lot of things, but that does not mean just because it came first that it deserves to be the only analytic file format forever.

> Its main result seems to be improved random access which, although certainly welcome, is not the point of columnar storage, as columnar storage was invented to exchange random access for something else: fast analytics

Fast analytics, as well as newer ML-shaped workloads, are inherently a mix of batch scans and random access.

Some of the authors of F3 previously authored another paper that goes into the details of the shortcomings of Parquet:

https://www.vldb.org/pvldb/vol17/p148-zeng.pdf

All of the newer formats that popped up recently (Vortex, Lance, F3 now) have been working on solving the problems outlined in that paper.

Lance has some interesting ideas. Vortex focuses on extensibility and performance by replacing all of Parquet's black-box encoders with fully transparent encodings. This solves the tradeoff between bulk and element decoding, allowing you to have efficient full scans and really fast random access.

E.g. LangChain recently rebuilt a system that used to be all Parquet files to use Vortex and saw a massive speedup, which they talk about more here: https://www.langchain.com/blog/introducing-smithdb
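To make the bulk vs. element decoding tradeoff above a bit more concrete, here is a minimal sketch of one transparent encoding, run-end encoding, in plain Python. It is only an illustration of the general idea and not Vortex's actual implementation; the function names (`run_end_encode`, `decode_all`, `get`) are made up for this example. Because the run ends are stored as a plain sorted array, a reader can either expand everything in one pass or binary-search for a single element without decoding the rest.

```python
from bisect import bisect_right

def run_end_encode(values: list) -> tuple[list, list[int]]:
    """Collapse consecutive repeats into (run_values, run_ends).

    run_ends[i] is the exclusive end index of run i in the original list.
    """
    run_values: list = []
    run_ends: list[int] = []
    for i, v in enumerate(values):
        if run_values and run_values[-1] == v:
            run_ends[-1] = i + 1  # extend the current run
        else:
            run_values.append(v)  # start a new run
            run_ends.append(i + 1)
    return run_values, run_ends

def decode_all(run_values: list, run_ends: list[int]) -> list:
    """Bulk path: expand every run in a single sequential pass."""
    out: list = []
    start = 0
    for v, end in zip(run_values, run_ends):
        out.extend([v] * (end - start))
        start = end
    return out

def get(run_values: list, run_ends: list[int], index: int):
    """Random-access path: binary search over run_ends, no full decode."""
    length = run_ends[-1] if run_ends else 0
    if not 0 <= index < length:
        raise IndexError(index)
    return run_values[bisect_right(run_ends, index)]

# Hypothetical column with long runs, as in sorted or low-cardinality data.
column = ["a", "a", "a", "b", "b", "c"]
run_values, run_ends = run_end_encode(column)
```

A full scan calls `decode_all` once; a point lookup calls `get`, which costs a binary search over the runs rather than a decode of the whole block.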

Disclaimer: I work on Vortex, so a lot of these questions about "what is the point of building a new format" are things that I have grappled with myself.

vouwfietsman 3 hours ago | parent | next [-]

> DuckDB wouldn't really know what to do with a

Sure it would: you can attach a multi-table SQLite database in DuckDB.
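A rough sketch of what I mean, using only Python's standard `sqlite3` module to build a single file holding two tables. The table names and the `example.db` file name are made up for illustration. The DuckDB side is shown only as a comment, on the assumption that DuckDB's sqlite extension is available; it is not executed here.

```python
import os
import sqlite3
import tempfile

def build_multi_table_db(path: str) -> None:
    """Create one SQLite file that holds two separate tables."""
    con = sqlite3.connect(path)
    try:
        con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
        con.execute(
            "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)"
        )
        con.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "bob")])
        con.executemany(
            "INSERT INTO orders VALUES (?, ?, ?)", [(10, 1, 9.5), (11, 2, 20.0)]
        )
        con.commit()
    finally:
        con.close()

def list_tables(path: str) -> list[str]:
    """Return the names of all tables stored in the file, sorted."""
    con = sqlite3.connect(path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [row[0] for row in rows]

# Hypothetical file location, used only for this example.
db_path = os.path.join(tempfile.mkdtemp(), "example.db")
build_multi_table_db(db_path)

# Assumption: with DuckDB's sqlite extension, the same file can be attached
# and each table queried by name, along these lines:
#   ATTACH 'example.db' AS src (TYPE sqlite);
#   SELECT * FROM src.users;
```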

> that does not mean just because it came first

I agree with most of your points. I am not stating my opinion, just my observations. I am the target audience here: I want to use this, but I don't really care too much about the file format itself, at least not as much as I care about the data inside.

That means access, which means compatibility with my tooling.

Compatibility is hard to beat.

This is the Concorde of file formats.

aduffy 3 hours ago | parent [-]

That is fair.

FWIW I think if you are just doing pure analytics and nothing else, Parquet will probably continue to do the job for you just fine, and you don't need to touch your workloads at all.

These new formats I think will find a niche where people aren't just running Spark jobs, but doing lots of systems building over large tables. If you're building a PB-scale data warehouse, you care a lot about the file format b/c it is a big factor in your performance curve, and you're willing to ship new experimental codecs in response to new datatypes you want to support that the system wasn't originally designed for, or you want to use a newly invented compressor.

sanderjd 2 hours ago | parent | prev | next [-]

Yeah that point about "random access is not the point of columnar formats" fell flat for me for this same reason. Almost since the first day I started using columnar data, I've been interested in solutions that strike this balance between batch and random access. This comes up all the time (in my experience) in data science / ML, where we have use cases for both access patterns against the same data.

So I'm with you, I'm very unconvinced that parquet (and the various things that are parquet or essentially-parquet under the hood) is the end of the line here.

JadeNB an hour ago | parent | prev [-]

> > Parquet is unfortunately very good just by virtue of being first, and so widely supported

> IMO that is not how technology works. It is great that Parquet is so good at a lot of things, but that does not mean just because it came first that it deserves to be the only analytic file format forever.

I think you and vouwfietsman (https://news.ycombinator.com/item?id=48649412) are actually saying the same thing in different words—I think their "unfortunately" means "it is unfortunate that, by virtue of coming first, this now has a support lead that will make it difficult for anyone else to catch up."

qurren 21 minutes ago | parent | prev | next [-]

> my main gripe with parquet (single table per file) is not even addressed

I consider that simplicity to be a feature, not a shortcoming.

I just tar a bunch of parquets if I need multiple tables. It is beautifully simple and easy to read in any language with its tar and parquet libraries.
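Roughly, as a sketch with Python's standard `tarfile` module. The payloads below are placeholder bytes standing in for real Parquet files, and the file and function names are made up for the example; in practice you would hand the extracted bytes to whatever Parquet library you use.

```python
import io
import tarfile

def bundle_tables(tables: dict[str, bytes]) -> bytes:
    """Pack named file payloads into one uncompressed tar archive in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in tables.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()

def list_members(archive: bytes) -> list[str]:
    """Return the member names in archive order."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        return tar.getnames()

def read_table(archive: bytes, name: str) -> bytes:
    """Return the raw bytes of one member (one table's file)."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        member = tar.extractfile(name)
        if member is None:
            raise KeyError(name)
        return member.read()

# Placeholder payloads, not valid Parquet data.
tables = {
    "users.parquet": b"PAR1 users placeholder PAR1",
    "orders.parquet": b"PAR1 orders placeholder PAR1",
}
archive = bundle_tables(tables)
```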

saulpw 2 hours ago | parent | prev | next [-]

> Also, my main gripe with parquet (single table per file) is not even addressed, so, also the name is a bit hyped up.

When I was working with parquet, I imagined a .parquetz file format which was just a zip file containing any number of uncompressed parquet files. So you could sling multiple tables around in a single file, and still use range requests to access them.
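A minimal sketch of why the "uncompressed" part matters, using Python's standard `zipfile` module. With `ZIP_STORED`, each member's bytes sit verbatim inside the archive, so once you know a member's offset and length you can fetch exactly that slice. The payloads are placeholder bytes rather than real Parquet data, and `member_byte_range` is a name invented for this example. A remote client would additionally need a range request for the central directory at the end of the file, which is not shown.

```python
import io
import struct
import zipfile

def bundle_stored(tables: dict[str, bytes]) -> bytes:
    """Write every payload into a zip archive without compression."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, mode="w", compression=zipfile.ZIP_STORED) as zf:
        for name, payload in tables.items():
            zf.writestr(name, payload)
    return buf.getvalue()

def member_byte_range(archive: bytes, name: str) -> tuple[int, int]:
    """Return (offset, length) of a stored member's payload inside the archive."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        info = zf.getinfo(name)
    if info.compress_type != zipfile.ZIP_STORED:
        raise ValueError("member is compressed; its bytes are not directly addressable")
    # The local file header has 30 fixed bytes; the last two 16-bit fields give
    # the lengths of the file name and the extra field that follow it.
    header = archive[info.header_offset : info.header_offset + 30]
    name_len, extra_len = struct.unpack("<HH", header[26:30])
    data_offset = info.header_offset + 30 + name_len + extra_len
    return data_offset, info.file_size

# Placeholder payloads, not valid Parquet data.
tables = {
    "users.parquet": b"PAR1 users placeholder PAR1",
    "orders.parquet": b"PAR1 orders placeholder PAR1",
}
archive = bundle_stored(tables)
offset, length = member_byte_range(archive, "orders.parquet")
# archive[offset : offset + length] is the slice a range request would fetch.
```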

mschuster91 3 hours ago | parent | prev [-]

> Not sure why this got so many upvotes, also the landing page is not great

Frankly, it's a change from the usual ChatGPT-generated slop that most landing pages are these days.