F3(github.com)
515 points by tosh 4 hours ago | 121 comments
vouwfietsman 2 hours ago | parent | next [-]

Not sure why this got so many upvotes, also the landing page is not great, it's better to look at the paper (see link below).

Seems to be a columnar storage format that addresses some shortcomings in parquet. Thing is, though, that of all these formats the real winning feature is compatibility, which is (obviously) very hard to improve on, as anything new immediately loses.

Parquet is unfortunately very good just by virtue of being first, and so widely supported. The most widely used parquet version is the oldest version from 2013 (as per the paper itself), so parquet itself couldn't even supplant parquet. If you want to improve on it, you need to bring some serious results, which I don't think f3 does.

Also, my main gripe with parquet (single table per file) is not even addressed, so, also the name is a bit hyped up.

Also also, it seems to go out of its way to include a compiled wasm binary for decoding, yet requires flatbuffers to parse that blob? Kind of defeats the purpose.

Its main result seems to be improved random access which, although certainly welcome, is not the point of columnar storage, as columnar storage was invented to exchange random access for something else: fast analytics. F3 seems to sacrifice fast analytics for the wasm decoder. I don't get it.

Maybe I'm being too cynical. Can someone help me out here?

https://dl.acm.org/doi/epdf/10.1145/3749163

aduffy 2 hours ago | parent | next [-]

> Also, my main gripe with parquet (single table per file) is not even addressed, so, also the name is a bit hyped up.

This is really more of an expectation that has been put on file formats by the query engines. Spark/Datafusion/DuckDB wouldn't really know what to do with a multi-table file.

> Parquet is unfortunately very good just by virtue of being first, and so widely supported

IMO that is not how technology works. It is great that Parquet is so good at a lot of things, but that does not mean just because it came first that it deserves to be the only analytic file format forever.

> Its main result seems to be improved random access which, although certainly welcome, is not the point of columnar storage, as columnar storage was invented to exchange random access for something else: fast analytics

Fast analytics, as well as newer ML-shaped workloads, are inherently a mix of batch scans and random access.

Some of the authors of F3 previously authored another paper that goes into the details of the shortcomings of Parquet

https://www.vldb.org/pvldb/vol17/p148-zeng.pdf

All of the newer formats that popped up recently (Vortex, Lance, F3 now) have been working on solving the problems outlined in that paper.

Lance has some interesting ideas, Vortex focuses on extensibility and performance by replacing all of Parquet's black-box encoders with fully transparent encodings. This solves the tradeoff between bulk and element decoding, allowing you to have efficient full scans and really fast random access.

E.g. Langchain recently rebuilt a system that used to be all Parquet files to use Vortex and saw a massive speedup, which they talk about more here: https://www.langchain.com/blog/introducing-smithdb

Disclaimer: I work on Vortex, so a lot of these questions about "what is the point of building a new format" are things that I have grappled with myself.

vouwfietsman 2 hours ago | parent | next [-]

> DuckDB wouldn't really know what to do with a

Sure it would, you can attach a multi-table sqlite database in duckdb

> that does not mean just because it came first

I agree with most of your points, I am not stating my opinion but my observations. I am the target audience here, I want to use this, but I don't really care too much about the file format itself, at least not as much as I care about the data inside.

That means access, which means compatibility with my tooling.

Compatibility is hard to beat.

This is the concorde of file formats.

aduffy an hour ago | parent [-]

That is fair.

FWIW I think if you are just doing pure analytics and nothing else, Parquet will probably continue to do the job for you just fine, and you don't need to touch your workloads at all.

These new formats I think will find a niche where people aren't just running Spark jobs, but doing lots of systems building over large tables. If you're building a PB-scale data warehouse, you care a lot about the file format b/c it is a big factor in your performance curve, and you're willing to ship new experimental codecs in response to new datatypes you want to support that the system wasn't originally designed for, or you want to use a newly invented compressor.

sanderjd an hour ago | parent | prev [-]

Yeah that point about "random access is not the point of columnar formats" fell flat for me for this same reason. Almost since the first day I started using columnar data, I've been interested in solutions that strike this balance between batch and random access. This comes up all the time (in my experience) in data science / ML, where we have use cases for both access patterns against the same data.

So I'm with you, I'm very unconvinced that parquet (and the various things that are parquet or essentially-parquet under the hood) are the end of the line here.

saulpw an hour ago | parent | prev | next [-]

> Also, my main gripe with parquet (single table per file) is not even addressed, so, also the name is a bit hyped up.

When I was working with parquet, I imagined a .parquetz file format which was just a zip file containing any number of uncompressed parquet files. So you could sling multiple tables around in a single file, and still use range requests to access them.
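
A minimal sketch of that .parquetz idea, assuming Python's standard zipfile module and placeholder byte strings standing in for real Parquet files (the file names and the build_bundle / member_range helpers are made up for illustration). Because the members are stored uncompressed, each table occupies one contiguous byte range, which is what a client would ask for in a range request:

```python
import io
import struct
import zipfile


def build_bundle(tables: dict[str, bytes]) -> bytes:
    """Pack several already-encoded table files into one uncompressed zip."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, payload in tables.items():
            zf.writestr(name, payload)
    return buf.getvalue()


def member_range(bundle: bytes, name: str) -> tuple[int, int]:
    """Return the (start, end) byte offsets of a stored member's payload."""
    with zipfile.ZipFile(io.BytesIO(bundle)) as zf:
        info = zf.getinfo(name)
    if info.compress_type != zipfile.ZIP_STORED:
        raise ValueError("member is compressed; byte ranges would not map 1:1")
    # A zip local file header has 30 fixed bytes, then the file name and the
    # extra field. Their lengths are two little-endian uint16s at offset 26.
    header = bundle[info.header_offset : info.header_offset + 30]
    name_len, extra_len = struct.unpack("<HH", header[26:30])
    start = info.header_offset + 30 + name_len + extra_len
    return start, start + info.file_size


# Placeholder payloads, not real Parquet data.
bundle = build_bundle({
    "orders.parquet": b"PAR1-orders-bytes-PAR1",
    "users.parquet": b"PAR1-users-bytes-PAR1",
})
start, end = member_range(bundle, "users.parquet")
users_bytes = bundle[start:end]  # what a range request for [start, end) returns
```

In a real deployment the zip central directory would be fetched first (it sits at the end of the file), and the per-table ranges derived from it as above.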

mschuster91 an hour ago | parent | prev [-]

>Not sure why this got so many upvotes, also the landing page is not great

Frankly it's a change from the usual ChatGPT generated slop that most landing pages are these days.

gavinray 3 hours ago | parent | prev | next [-]

This bit is quite genius, rather than depend on a language-specific SDK/lib for working with the formats you can fall back to exported WASM methods if none exist:

  > "Each self-describing F3 file includes both the data and meta-data, as well as WebAssembly (Wasm) binaries to decode the data. Embedding the decoders in each file requires minimal storage (kilobytes) and ensures compatibility on any platform in case native decoders are unavailable. "
jasonjayr 3 hours ago | parent | next [-]

So attackers don't have to craft specially corrupted files? They can just include the code to perform the attack in the data file itself?

weinzierl 3 hours ago | parent | next [-]

WASM has strong tried and proven sandboxing. We basically can build on nearly 30 years of experience. The decoders don't need a lot of access, they can basically be pure functions.

If this will pan out security-wise I don't know. I'm more worried that it will be so slow that no one will use it. Interesting idea, though, and I can see applications outside of the "big data" realm this apparently targets.

ok123456 3 hours ago | parent | next [-]

How do you prevent compression bomb attacks when files can define their own compression functions?

You could have some kind of OOM killer, but that will be a "footgun" that people who are actually doing "big data" will constantly shoot.

This pretty much kills any ingestion pipeline where the source is untrusted.

computomatic 2 hours ago | parent | next [-]

It seems like the WASM is simply a fallback if no other decoder is available. If the data source is untrusted, simply don’t run the WASM decoders.

“Some code is untrusted” does not mean code should never be executed. There are more use cases with trusted sources than untrusted.

ok123456 2 hours ago | parent [-]

So I define the data type to be "asdklfjaslkdfjiolsadfjoiusadfoiasfoikasjfdoisadf" and give you a decoder for it.

johncolanduoni 2 hours ago | parent | prev | next [-]

OOM killing in WebAssembly is trivial, since it’s all in a growable linear memory. All the runtimes I’m aware of have a simple maximum memory setting, and they’ll trap any allocation requests after that point.

blmarket 2 hours ago | parent | next [-]

The attack is not just on the file format itself. Based on the function signature, it's possible for a single decoder to generate an infinite bytestream, which creates a lot of headaches for reader implementations: implementing STRLEN is no longer a trivial question.

Either engines should impose some limit (e.g. VARCHAR(2000) to cap the length at 2000, though some engines support unlimited BLOBs), or the decoder should give a hint about the maximum length it will yield. Unfortunately, the current research-level project does not have such considerations implemented yet...
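
A rough sketch of the engine-side limit, assuming the decoder's output can be consumed as a stream of byte chunks. The bounded_decode helper and both example decoders are hypothetical and not part of F3; the point is only that the reader, not the decoder, decides when to stop:

```python
from typing import Iterable, Iterator


class DecodeBudgetExceeded(Exception):
    """Raised when a decoder yields more bytes than the reader allows."""


def bounded_decode(chunks: Iterable[bytes], max_bytes: int) -> bytes:
    """Collect decoder output, aborting as soon as the byte budget is exceeded."""
    out = bytearray()
    for chunk in chunks:
        if len(out) + len(chunk) > max_bytes:
            raise DecodeBudgetExceeded(f"decoder exceeded {max_bytes} bytes")
        out += chunk
    return bytes(out)


def honest_decoder() -> Iterator[bytes]:
    """A well-behaved decoder that yields a short value and stops."""
    yield b"hello, "
    yield b"world"


def endless_decoder() -> Iterator[bytes]:
    """A hostile decoder that never stops producing output."""
    while True:
        yield b"A" * 1024


value = bounded_decode(honest_decoder(), max_bytes=2000)

try:
    bounded_decode(endless_decoder(), max_bytes=2000)
    bomb_stopped = False
except DecodeBudgetExceeded:
    bomb_stopped = True
```

This only bounds the output size. A decoder that loops forever without yielding anything needs a separate instruction or time limit.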

ok123456 2 hours ago | parent | prev | next [-]

For images, it makes sense: people dealing with 16k x 16k PNGs are uncommon. Give them an error message that tells them the setting to bump. But what should be the threshold for "big data"? I'm sure it will follow Zipf's Law, but the tail will be fatter.

titzer 2 hours ago | parent | prev [-]

And many of them have built-in gas metering, so you can time out the decode if it runs too many instructions.

kibwen 2 hours ago | parent | prev [-]

Denial-of-service is bad, but it's not in the same ballpark, the same sport, the same planet, or the same universe of bad as RCE.

Retr0id 2 hours ago | parent | prev | next [-]

WASM implementations are fairly mature now, but if there was e.g. an image file format with embedded WASM that needed to execute before you could view it, it would become the new low-hanging-fruit target for 0-click RCEs - whether it's exploiting the WASM engine itself or some other attack surface that's influenceable via it (See also, the FORCEDENTRY JBIG2 exploit).

titzer 2 hours ago | parent [-]

That exploit targeted an integer overflow in a bespoke Apple sandboxing mechanism. Bespoke sandboxing mechanisms have weird bugs.

Not that Wasm engines don't have bugs, but the whole point is to have an extremely solid, well-specified and efficient implementation of a widely accepted bytecode format. We can scope down the capabilities given to any program to a minimal set.

Retr0id 2 hours ago | parent [-]

Bugs are near-inevitable, and mitigations are the last line of defence. Scripting engines are excellent for bypassing mitigations (iiuc in the case of the FORCEDENTRY exploit, it was used for adjusting ASLR'd offsets).

As a random example that's an area of personal interest to me, I know of 3 distinct methods of achieving userland ROP execution of the Nintendo Switch 2, and all three rely on the (ab)use of a scripting engine (even if they aren't a vulnerability in the scripting engine itself).

titzer 2 hours ago | parent [-]

Well don't accept code from anyone ever then.

But seriously, if your format requires extensibility to the point that it embeds a bytecode, especially a Turing-complete bytecode, what format are you going to choose? Just design a new one? That's how you end up with a scripting engine with three ROP exploits.

bilekas 3 hours ago | parent | prev | next [-]

> The decoders don't need a lot of access, they can basically be pure functions

They don't currently either, do they? It's the tight coupling of the interface layer, no? I'm not sure this would be faster or more secure, so reliability might be the best use case?

Kiboneu 2 hours ago | parent | prev [-]

> WASM has strong tried and proven sandboxing. We basically can build on nearly 30 years of experience. The decoders don't need a lot of access, they can basically be pure functions.

I've heard that kind of sentiment many times before. It's not a good (thought-terminating) mindset to have for any secure software.

There are several WASM implementations, WASM is just a format. "Pure functions" are pure at a superficial level. Many people say that they don't mutate global state, but they do ... it's just hidden. The decoders "not needing a lot of access" doesn't matter if the WASM engine is pwned through arbitrary code execution inside the environment, or if it's contorted to bypass the access control you are mentioning through various side-effects.

arcfour 3 hours ago | parent | prev | next [-]

Yes...my first thought. No way in hell anyone actually trusts this.

(And as if we didn't trust the compiler enough already!)

Omega359 2 hours ago | parent [-]

Meh, it's not that bad. Pretty simple to block inline wasm and to use well known external decoders.

nine_k 3 hours ago | parent | prev | next [-]

Does WASM have built-in I/O? If not, all that a decoder would be able to do is to decode into a buffer.

0x457 2 hours ago | parent [-]

All WASM can do is transfer a bag of bytes between the module runtime and the host. So yes, it can just decode into a buffer. Even if you use wasm components to give it I/O, you can still make these go to a buffer.

doctorpangloss 3 hours ago | parent | prev [-]

But the WASM runs in the sandbox! It only has access to some files, your display, inputs, ... nothing insecure at all!

gavinray 3 hours ago | parent [-]

WASM runs in a confined memory space allocated for the program. There is no I/O or host address space access.

You need to run a WASI environment for that.

rebeccajae 3 hours ago | parent | prev | next [-]

It sounds neat, but feels like it might fall apart with higher-complexity formats. What does an embedded decoder for a PDF look like? I guess since they are tightly-coupled to the file bytes themselves, the author of the file gets to choose what formats make sense, but not all formats have a one-true-decode-step.

aseipp 3 hours ago | parent [-]

Despite the name seemingly implying otherwise, F3 is an alternative to columnar storage formats like Parquet; the goal is not to support every conceivable encoding of every file type such as a PDF. Think of the use cases being more like "What if you used a specialized compressor and need a custom block decompression algorithm" or "Decode internal format into Arrow output" or something like that.

mort96 2 hours ago | parent | prev | next [-]

I don't understand how that's supposed to work. What does the decoder decode into? That's gonna depend entirely on the kind of data, right? For some formats, it's gonna be a stream of bytes; for others, a 2D plane of pixels; others again will need vertexes, 2D planes of pixels and UV maps; for some, an object graph will make more sense.

gavinray 2 hours ago | parent [-]

It appears as though the WASM decode returns two values -- one indicating the data type as a primitive value, and a second value being the data buffer

Then there is a helper in this case to de-serialize, "primitive_array_from_buffers()"

https://github.com/future-file-format/F3/blob/bd92506447dc13...
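
To illustrate the shape of that contract (a type tag plus a raw buffer), here is a small Python stand-in. It is not the F3 code, which is Rust: the type tags, the little-endian layout, and the values_from_buffer name are all assumptions made for the example:

```python
import struct

# Hypothetical type tags mapped to struct format codes; the real F3 code
# has its own type representation.
FORMATS = {"int32": "i", "int64": "q", "float64": "d"}


def values_from_buffer(type_tag: str, buf: bytes) -> list:
    """Reinterpret a raw little-endian buffer as a list of primitive values."""
    code = FORMATS[type_tag]
    width = struct.calcsize("<" + code)
    if len(buf) % width != 0:
        raise ValueError(f"buffer length {len(buf)} is not a multiple of {width}")
    count = len(buf) // width
    return list(struct.unpack(f"<{count}{code}", buf))


# Pretend this pair is what came back from the decoder.
decoded_type, decoded_buf = "int32", struct.pack("<3i", 1, 2, 3)
column = values_from_buffer(decoded_type, decoded_buf)
```

Under this reading the decoder never needs to know about the host's array types; it hands back bytes and a tag, and the host does the final interpretation.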

cbm-vic-20 3 hours ago | parent | prev | next [-]

Applets redux.

grodes 3 hours ago | parent | prev | next [-]

How is wasm better than C bindings?

gavinray 3 hours ago | parent | next [-]

Many languages don't have ergonomic experiences for working with C ABI's without explicit wrapper code.

Hell, Node.js didn't even get this ability until LAST MONTH:

https://nodejs.org/en/blog/release/v26.1.0

You'd have to write a second library to interface the C ABI with Node via NAPI just to consume it.

bluejekyll 3 hours ago | parent | prev | next [-]

WASM is platform independent.

What do you mean by C bindings? C bindings to what?

grodes 2 hours ago | parent [-]

C bindings to a C implementation

yung_lean 2 hours ago | parent [-]

This isn't using WASM to solve the "how can I make my file format compatible with more programming languages?" problem. This is trying to solve the "how can I add new encodings to my file format without making everyone update their code?" problem. The former would rightly be solved with C bindings that anyone can link with if they want. The latter might not seem like a big deal, but it's been the main blocker advancing the parquet format. Most people end up not caring about new advanced encodings and just write parquet files with the most compatible feature set.

coldtea 2 hours ago | parent | prev [-]

C bindings are not platform independent, nor do they come with a runtime and a sandbox, among other things. Apples to oranges.

andrewstuart2 3 hours ago | parent | prev | next [-]

I would call it clever. I'm not sure I'd call it genius.

When I'm working with data I'm working in a specific set of languages. Usually one. Yeah, other people might be working in other languages, but no individual author really needs a language-agnostic way of accessing data beyond compile time. Add to that the likely runtime boundaries that may need to be crossed instead of e.g. inlined by the compiler because it's in-language and dealing with known offsets or tags (depends on the data format of course). To the other commenter's point, am I going to have to sandbox all data access code just to be sure it's not able to do something unexpected? There's a lot of complexity here. And the inherent risk is going to slow down the operation that should be the simplest and fastest: interpreting bytes.

yung_lean 3 hours ago | parent | next [-]

A big problem with parquet, which this aims to replace, is that it's hard to add new encodings because everyone wants to stay compatible with old readers. Embedding the decoders in the file as WASM solves this problem since in theory, old readers will be able to read new files by just using the provided WASM to decode a column whose format the reader doesn't recognize.

So this is really about making a file that is forwards compatible in a way that lets you push the standards more than existing formats.
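
A sketch of that fallback logic as I understand it, with a plain Python callable standing in for the embedded Wasm blob. The Column class, read_column, the encoding names, and the allow_embedded switch (for the "don't run decoders from untrusted sources" policy mentioned elsewhere in the thread) are all hypothetical:

```python
from dataclasses import dataclass
from typing import Callable, Optional

Decoder = Callable[[bytes], list[int]]


def decode_plain(raw: bytes) -> list[int]:
    """An encoding this (old) reader already understands natively."""
    return list(raw)


# Decoders compiled into the reader itself.
NATIVE_DECODERS: dict[str, Decoder] = {"plain": decode_plain}


@dataclass
class Column:
    encoding: str
    raw: bytes
    embedded_decoder: Optional[Decoder] = None  # stand-in for a Wasm blob


def read_column(col: Column, allow_embedded: bool = True) -> list[int]:
    """Prefer a native decoder; fall back to the one shipped in the file."""
    native = NATIVE_DECODERS.get(col.encoding)
    if native is not None:
        return native(col.raw)
    if allow_embedded and col.embedded_decoder is not None:
        return col.embedded_decoder(col.raw)
    raise LookupError(f"no usable decoder for encoding {col.encoding!r}")


def decode_delta(raw: bytes) -> list[int]:
    """A newer encoding: each byte is the difference from the previous value."""
    out, total = [], 0
    for b in raw:
        total += b
        out.append(total)
    return out


new_file_column = Column("delta_v2", bytes([5, 1, 1, 3]), decode_delta)
values = read_column(new_file_column)
```

A reader that later gains a native delta_v2 decoder would take the first branch and never touch the embedded one.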

coldtea 2 hours ago | parent | prev [-]

>no individual author really needs a language-agnostic way of accessing data beyond compile time.

That's so untrue! People need language-agnostic ways to access data all the time, and people work with data accessing them from multiple languages all the time!

If I have parquet files I can load them in duckdb, in pandas and polars, process them with various independent tools, and loads of other things... and people do that.

This is also why people like something like an SQL database, your data is not locked to some specific language / lib for access.

verdverm 3 hours ago | parent | prev | next [-]

Is embedding executable code into a file a security risk? My assumption is a yes

mirashii 3 hours ago | parent | next [-]

That would be why it chose a VM that is explicitly designed for sandboxing rather than native executable code or similar, the risk can be minimized by reducing the surface area available to that executable code to almost nothing.

jayd16 27 minutes ago | parent [-]

You still have the halting problem to solve to prevent denial of service.

msla 3 hours ago | parent | prev | next [-]

> Is embedding executable code into a file a security risk?

Yes, which is why nobody uses PDFs.

NooneAtAll3 2 hours ago | parent | next [-]

which is why no sane pdf viewer implements executable features*

bguebert 2 hours ago | parent | prev [-]

I mean I disable javascript embedded in pdf and feel like it would have been better to not have that feature. It would spare people from the invoice.pdf email attachment viruses because most people had assumed pdf isn't going to be as bad as an exe.

nine_k 3 hours ago | parent | prev | next [-]

TrueType and OpenType fonts include code executed by a VM to even render them. This wasn't a viable source of attacks so far, due to the properly limited nature of the VMs.

Maybe I would pick the eBPF VM instead, with all its limiting and verifying mechanics.

cmiles74 3 hours ago | parent | next [-]

https://learn.microsoft.com/en-us/security-updates/SecurityB...

> This security update resolves a publicly disclosed vulnerability in Microsoft Windows. The vulnerability could allow remote code execution if a user opens a specially crafted document or visits a malicious Web page that embeds TrueType font files.

> This security update is rated Critical for all supported releases of Microsoft Windows. For more information, see the subsection, Affected and Non-Affected Software, in this section.

> The security update addresses the vulnerability by modifying the way that a Windows kernel-mode driver handles TrueType font files. For more information about the vulnerability, see the Frequently Asked Questions (FAQ) subsection for the specific vulnerability entry under the next section, Vulnerability Information.

tedd4u 3 hours ago | parent | prev [-]

There are many documented, exploited-in-the-wild font-file attacks (one example in [1]). Apple is re-writing their font interpreter specifically to improve security. [2]

[1] https://www.bleepingcomputer.com/news/security/facebook-disc...

[2] https://blakecrosley.com/blog/truetype-hinting-swift-migrati...

gavinray 3 hours ago | parent | prev [-]

There is no concept of "executable" vs "non-executable" content in a file.

A file is a bag of bytes. You can send those bytes to different things, like a text editor's content-stream, or as the input to a WASM interpreter.

What you decide to do with the bytes in a file is your own prerogative. Each byte is whatever you make of it.

jedberg 3 hours ago | parent | next [-]

Sure, but when the standard says "read this file and execute the instructions you find at the beginning" that is more dangerous than "this is a file with data and your program needs to figure out how to read it".

gavinray 3 hours ago | parent [-]

I guess it's a good thing that the F3 standard does not say "read this file and execute the instructions you find at the beginning", then?

The WASM encoders/decoders are embedded resources that exist as byte offsets in the file metadata, not header info.

jedberg 3 hours ago | parent [-]

Ok if you want to be pedantic, the standard says, "if you can't read this file, go to the offset and then execute the code you find" which isn't functionally different from what I said.

ratorx 3 hours ago | parent | prev | next [-]

There’s a big difference in the expected use of a file. If the file is attacker provided, and the fallback path is being used, the attacker can embed whatever WASM payload they want into the file since the file will be “opened” by “execute this offset into the file”.

Compare that to JSON. The parser NEVER needs to execute arbitrary instructions. Parser might have bugs, but it avoids a whole class of issues.

gavinray 3 hours ago | parent | next [-]

  >  the attacker can embed whatever WASM payload they want into the file since the file will be “opened” by “execute this offset into the file”.
And then do what with it?

WASM physically cannot interact with the underlying host or perform I/O -- you need a WASI environment for that.

ratorx 3 hours ago | parent [-]

Putting aside the WASM sandboxing (I’m not familiar enough with it to understand how sandboxing works) there’s a DoS vector at least. Even regexes have had many DoS issues, and I can’t imagine WASM being easier to sandbox for DoS risk.

7373737373 3 hours ago | parent [-]

There exist Wasm interpreters capable of limiting the number of instructions executed.

titzer 2 hours ago | parent [-]

Many can, even if they have JITs, e.g. Wasmtime. Failing that, it's not that hard to add bytecode instrumentation that will count instructions and terminate early. Some execution platforms that utilize Wasm just inject bytecode instrumentation into guest programs before sending them to the Wasm engine. It's relatively easy to do and not that much overhead.
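
For anyone unfamiliar with the idea, here is a toy illustration of instruction counting ("fuel"). It is a made-up three-instruction stack machine, not real Wasm and not any engine's API; the only point is that the host charges one unit per executed instruction and aborts when the budget runs out:

```python
class OutOfFuel(Exception):
    """Raised when a guest program exhausts its instruction budget."""


def run(program: list[tuple[str, object]], fuel: int) -> list[int]:
    """Interpret a tiny stack program, charging one unit of fuel per instruction."""
    stack: list[int] = []
    pc = 0
    while pc < len(program):
        if fuel <= 0:
            raise OutOfFuel("instruction budget exhausted")
        fuel -= 1
        op, arg = program[pc]
        if op == "push":
            stack.append(arg)
            pc += 1
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
            pc += 1
        elif op == "jmp":
            pc = arg  # unconditional jump to an absolute instruction index
        else:
            raise ValueError(f"unknown instruction {op!r}")
    return stack


finite = [("push", 2), ("push", 3), ("add", None)]
infinite = [("jmp", 0)]  # loops forever without a fuel limit

result = run(finite, fuel=10)

try:
    run(infinite, fuel=1000)
    looped_forever_stopped = False
except OutOfFuel:
    looped_forever_stopped = True
```

This sidesteps the halting problem rather than solving it: the host never asks whether the program terminates, it just stops paying for it after a fixed number of steps.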

bguebert 2 hours ago | parent | prev [-]

I mean json might not be the best example since for a long time people would run json through a javascript engine to parse it but I can see your point.

jastanton 3 hours ago | parent | prev | next [-]

gotcha, so the vulnerability will be in some common library where attackers force the wasm fallback path with custom wasm instructions that do something nefarious when executed.

I'd say at worst it's a setup for poor security

outside1234 3 hours ago | parent | prev [-]

I mean can't we say the same thing about sending around a .exe though?

bluejekyll 3 hours ago | parent | next [-]

.exe has bindings to OS ABI and system calls, WASM doesn’t have this by default, it’s up to the VM to provide whatever environment the WASM executable needs, ideally there should be no system calls, no stdio, just instructions on how to interpret the file format.

gavinray 3 hours ago | parent | prev | next [-]

Double-clicking an ".exe" (or running it via a shell) is not the same as "bag of bytes", it's "send these bytes to an executable environment".

Doing `head foo.exe` is quite different than `run foo.exe`

If I encode executable instructions in "image.png" and then send them to an interpreter that runs those instructions, the file extension doesn't matter.

jastanton 3 hours ago | parent | prev [-]

exactly

vouwfietsman 2 hours ago | parent | prev [-]

except you need flatbuffers to access that blob

sph 3 hours ago | parent | prev | next [-]

I don’t know what people are commenting on. I see a README with little to no information about what this is, what problems it solves, just links to its Flatbuffer description and a directory full of source code.

What context am I missing?

burkaman 2 hours ago | parent [-]

There is a linked paper: https://dl.acm.org/doi/epdf/10.1145/3749163

largbae 3 hours ago | parent | prev | next [-]

This could use a bit more "why".

Shortcomings of Parquet are mentioned as overcome by this, which ones? Certainly not wide tool support...

Why should one leave Parquet or ORC for this structure?

altairprime 3 hours ago | parent | next [-]

The ‘why’ is referenced in the bibliography at the end of the readme; this repo is not meant to be consumed standalone. Start with the paper instead:

https://doi.org/10.1145/3749163

dietr1ch 2 hours ago | parent | prev | next [-]

I also had no idea what they were talking about, but there are good points about how hardware-oblivious and somewhat global Parquet is around metadata.

I found this post interesting,

- https://medium.com/@reliabledataengineering/f3-the-future-pr...

skrtskrt 3 hours ago | parent | prev | next [-]

Yeah it seems like most of this can be handled by some more dev hours to Parquet

dj_axl 3 hours ago | parent | prev [-]

Paper mentions Parquet, ORC, Nimble, Lance, TSFile, Bullion, and BtrBlocks.

zerobees 3 hours ago | parent | prev | next [-]

Some folks described it as genius. I guess it's my turn to play the role of an annoying HN skeptic: I find it somewhat silly. Data compression formats are secondary to what you're planning to do with the data once decoded. An audio file is completely different than an SVG image. An embedded VM that decompresses video to raw pixels doesn't magically let you play that video in a text editor, so there's no radically new kind of interoperability. Each new format still needs to be handled in a format-specific way.

I guess one use case is that I come up with a video compression scheme that's better than H.265, but not all platforms support it, so I embed a decoder that would allow me to play it back on legacy hardware. But that also shows the weakness of the idea: it's unlikely that legacy hardware will perform well doing software-only decode for video formats from the future. If we rolled this idea out in the 1990s, it would not have allowed watching Netflix on an i386.

In the same vein, I doubt this would have allowed me to open Word 2021 files in Word 97. There's no 1-to-1 mapping between the data structures. So if this kind of compat isn't slam-dunk, what's the goal?

The downsides are clear. First, it's probably a maintenance nightmare: if your decoder has a bug that needs fixing, how do you patch all the files that already embed it? And then, there's size overhead and security risks. We're adding a considerable attack surface to every format parser. It's more opportunities for remote code execution, resource exhaustion attacks, and so on. Again, this is not always wrong, but what's the benefit?

vouwfietsman 2 hours ago | parent [-]

I don't think you have encountered the problems that this class of formats solves. Try looking up columnar storage formats, the pros and cons are pretty well defined these days. It is not meant for video decoding, indeed.

amluto 3 hours ago | parent | prev | next [-]

One nice thing about some modern formats is that there are tools that read them at extraordinarily high effective speed. For example, DuckDB can do all manner of nifty optimizations while reading its own native format or Parquet. And I’m not sure that those optimizations can be effectively applied to a format that needs a WASM blob to be run to understand it. By the time you run a non-SIMD or even a SIMD-optimized pass over all the data, if that pass doesn’t understand your query, you may have already lost.

I admit I only skimmed the beginning of the paper, and maybe the format is less general than it sounds.

hahahacorn 3 hours ago | parent [-]

My understanding is it’s a fallback mechanism

Groxx 3 hours ago | parent | prev | next [-]

Hm. I can kinda see it replacing self-extracting EXEs, but a lot of why you choose specific file formats is for specific features they offer - any self-describing system can fall into "there are too many competing features and nobody handles them all" exactly as easily as any other format.

Like, can this file be efficiently mmap'd? Maybe if it emulates tar internally, but you don't know until you run it. Can it be seeked to specific bytes to only decompress part? It only supports a pre-release version of ISO-36898533 seeking, and your file library dropped support for it 6 years ago. If I rewrite 1MB in the middle, can it only change those pages on disk (and maybe an index), or do I have to rewrite the whole thing? Well the wasm blob supports 97 different APIs for it (there are 35 copies of one with different names), so it's larger than the data (but nobody paid attention to that), so you have 19 options that you recognize, but your CPU's native WASM accelerator only handles two or three so you've still got to specialize your code heavily.

At least with "*.tar.gz" you have some idea of what's possible.

owentbrown 3 hours ago | parent | prev | next [-]

Nice! The world can always use a better data format.

I think you might get some traction if you post the advantages over parquet and other files directly on the readme, so that if someone goes to https://github.com/future-file-format/f3 they see why they should try it.

Mention the advantages and post metrics. Cherry pick the metrics! There's probably a good use case for this but, from the current readme, it's not clear who should use this and why.

coffeecoders 3 hours ago | parent | prev | next [-]

If I am archiving PBs of data for 10+ years, I don't want to rely on a WASM interpreter being available and performant in the future just to read a file. I want a dead-simple, heavily documented byte specification like Parquet.

Additionally, putting the decoding logic inside a WASM binary introduces an active execution layer into what should be cold storage.

bijowo1676 an hour ago | parent | next [-]

WinRAR format does include RAR VM bytecode as part of the archive to achieve state of the art compression in media files. it was sandboxed and well accepted by everyone.

the same sandboxing capability exists for WASM as well.

it is actually better for long-term archival: you don't need to carry a decompression program, since it will be part of the archive file itself

0xbadcafebee 3 hours ago | parent | prev [-]

You don't want to run a custom 10-year-old data parsing function every time you read a single data record?

anygivnthursday 3 hours ago | parent | prev | next [-]

My concern is, if decode fails I need to debug WASM added by some other party maybe containing random bugs. Maybe a library of standard decoders maintained and tested by the project could help, but then not sure if it kills the advantage of the flexibility it provides.

titzer 2 hours ago | parent [-]

But Wasm has deterministic execution, so if decode fails for you, it should have failed for them. I.e. it's not a problem that your system has introduced; they should be able to reproduce the failure independent of any client.

anygivnthursday 8 minutes ago | parent [-]

Yes, if it comes from some reliable partner I can report bugs to, or something built in-house. In such an environment it's probably fine. And maybe that's the main audience, and not some open data exchange format where your system may be brought down by someone's random decoder.

tbolt 23 minutes ago | parent | prev | next [-]

I think we know how this story ends https://en.wikipedia.org/wiki/OpenDoc

krzyk 3 hours ago | parent | prev | next [-]

File format for what? Text, graphics, compiled code?

ghkbrew 3 hours ago | parent | next [-]

For columnar data storage I think. The description references Parquet and they appear to benchmark against Parquet, Vortex, and Lance.

meindnoch 3 hours ago | parent | prev [-]

The future.

krzyk 2 hours ago | parent [-]

I was afraid that it is for Marty McFlys only.

Qerub 2 hours ago | parent | prev | next [-]

This reminds me of Alan Kay's OOPSLA 1997 presentation "The Computer Revolution Hasn’t Happened Yet" when he describes the Air Force / Burroughs 220 file format from 1961 where the file/tape contained both the data and the procedures to read/write/print them: https://youtu.be/oKg1hTOQXoY?t=1355

alex7o an hour ago | parent | prev | next [-]

So, you know what file format we would still be able to read 100 years from now? CSV, JSON even fits (that is 30 years old now). Even if you don't know the original way it was created, you know what each field is supposed to mean (if done well). Otherwise you are looking at hex-decoded data with no way of knowing how to decode it if you don't have the spec on how and why it was encoded. Msgpack and CBOR are cool, but in 100 years there is no way to decode them.
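
A minimal Python sketch of that difference. The record, its field names, and the binary layout are all made up for illustration. The CSV text carries its own header, while the packed bytes are unreadable without the format string, which plays the role of the spec.

```python
import csv
import io
import struct

# Hypothetical record, invented for illustration.
record = {"station": "KSFO", "temp_c": 18.5, "year": 2026}

# Self-describing text: the header travels with the data.
text = io.StringIO()
writer = csv.DictWriter(text, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
csv_blob = text.getvalue()

# Opaque binary: 4-byte name, little-endian float64, uint16.
# The layout lives only in this format string (the "spec").
SPEC = "<4sdH"
bin_blob = struct.pack(
    SPEC, record["station"].encode("ascii"), record["temp_c"], record["year"]
)

def read_csv(blob: str) -> dict:
    """Recover field names and values from the text alone."""
    return next(csv.DictReader(io.StringIO(blob)))

def read_binary(blob: bytes, spec: str) -> tuple:
    """Decoding is only possible when the spec is known."""
    return struct.unpack(spec, blob)
```

Note that the CSV side only gets the names back: every value comes out as a string, so the types still have to be guessed or documented somewhere.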

thisisauserid 3 hours ago | parent | prev | next [-]

Great! I'll use it.

In the "future."

Nimble? Lance? Also in the future. Maybe.

I'll use Parquet in the present.

dang 2 hours ago | parent | prev | next [-]

One past discussion:

F3: Open-source data file format for the future [pdf] - https://news.ycombinator.com/item?id=45437759 - Oct 2025 (125 comments)

plus this bit:

An Open File Format for storing the information from a forge - https://news.ycombinator.com/item?id=44043253 - May 2025 (1 comment)

drdexebtjl 3 hours ago | parent | prev | next [-]

Probably not a good idea to name your project “future” anything, if you expect that future to become the present.

Also, f3 is already “fight-flash-fraud”.

nine_k 2 hours ago | parent | prev | next [-]

F3 seems to be a reasonable archival data format.

I see many replies criticizing F3 as an operational data format, like Parquet. Of course it can't be made as fast in the general case, or as compatible with the existing infrastructure.

OTOH F3 would be easy to decode into almost any of today's accepted formats, and likely to any of tomorrow's data formats. That's where being self-describing and self-unpacking would be important.

stackskipton an hour ago | parent [-]

What's wrong with just archiving Parquet files? Worried about a lack of support for the file format 50 years in the future? If that was truly the concern, wouldn't it be best to archive it in a more plain-text format like CSV or JSON?

Arainach 3 hours ago | parent | prev | next [-]

This project README is not particularly useful:

It doesn't explain what the project does (a file format for what? Name dropping other things I haven't heard of isn't useful)

There are no examples. It links to a flatbuffer schema which is at least well commented, but is full of deep implementation details.

The point is that within 2-3 minutes I'm not convinced why I should care, and I still don't know enough about what this is to even think back to it if I encounter a scenario in the future where it would be useful.

> designed with efficiency, interoperability, and extensibility in mind. It provides a data organization that rectifies the layout shortcomings of the last-generation formats like Parquet,

This is all marketing speak that says nothing.

> maintaining good interoperability and extensibility (a.k.a future-proof) via embedded Wasm decoders

What does this even mean? Providing a decoder is no guarantee of futureproofness.

adammarples 3 hours ago | parent [-]

Tabular data, it wants to replace Parquet

jayd16 25 minutes ago | parent | prev | next [-]

So a web page, basically?

chatmasta 2 hours ago | parent | prev | next [-]

As appealing as this is, it will never gain traction without some backwards compatibility with Parquet and wide adoption of query engines to implement that backwards compatible path.

mmaunder 2 hours ago | parent | prev | next [-]

A Wasm decoder takes encoded bytes and returns an iterator of Arrow Buffers. In case you were wondering.
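
A rough Python sketch of that contract, not F3's actual API: `decode_column` and the run-length encoding are hypothetical stand-ins for an embedded decoder, and plain `bytes` objects stand in for Arrow buffers.

```python
import struct
from typing import Iterator

def decode_column(encoded: bytes, rows_per_buffer: int = 4) -> Iterator[bytes]:
    """Hypothetical decoder: run-length pairs in, fixed-width buffers out.

    Input is a sequence of (uint32 count, int32 value) little-endian pairs.
    Each yielded buffer holds up to rows_per_buffer little-endian int32 values.
    """
    pending = []
    for count, value in struct.iter_unpack("<Ii", encoded):
        for _ in range(count):
            pending.append(value)
            if len(pending) == rows_per_buffer:
                yield struct.pack(f"<{len(pending)}i", *pending)
                pending = []
    if pending:
        # Flush the final, possibly shorter, buffer.
        yield struct.pack(f"<{len(pending)}i", *pending)

# Three 7s followed by two -1s, encoded as two run-length pairs.
encoded = struct.pack("<IiIi", 3, 7, 2, -1)
buffers = list(decode_column(encoded))
```

The reader only needs to know the output layout. How the bytes were encoded stays the decoder's business, which is the appeal of shipping the decoder with the file.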

mmaunder 2 hours ago | parent | prev | next [-]

Not quite vaporware, but few commits, PRs, history, actual examples etc. It's pretty thin.

meta-level 3 hours ago | parent | prev | next [-]

Don't know why but I had to think of https://xkcd.com/2116/

s1mon 2 hours ago | parent [-]

Somehow I assumed that you were linking to this one: https://xkcd.com/927/

gruntled-worker 2 hours ago | parent | prev | next [-]

Are we positively sure that WASM will prove to be more future-proof than 640K MS-DOS or WinXP, or SNES cartridge files for that matter? On 6/23/26 there are a lot of emulators that run these. Will WASM necessarily beat them on 6/23/2051? Might be a case of xkcd 927.

kridsdale3 2 hours ago | parent [-]

I really enjoy the spirit that moved you to write this comment. I want my AI Agents to adopt this attitude.

gruntled-worker 2 hours ago | parent [-]

I'll happily train them for you but I charge by the hour, not by the token.

BTW, while we're on the topic. I don't do social media. I occasionally type up a text or post on a technical board. Maybe 98% of my textual interaction these days is with LLMs. I would not be surprised if my prose changes to resemble theirs over time. I suppose that's symbiosis for ya. It's possible that your AI-dar might get even more ineffective.

adammarples 3 hours ago | parent | prev | next [-]

No commits in 8 months?

yung_lean 3 hours ago | parent [-]

Yeah this was a research project, it doesn't look like this is getting any adoption

GolDDranks 3 hours ago | parent | prev | next [-]

I love the idea, and I developed something similar myself in the past (https://github.com/golddranks/kobuta), but... this reeks of slop. With Rust code, edition="2021" is a dead giveaway.

ShinyLeftPad 3 hours ago | parent | prev | next [-]

To save a click it's a file format for columnar data specifically (like Parquet), which they very generically named Future-proof File Format. Most of this could fit in the title instead of just "F3"

jauntywundrkind 2 hours ago | parent | prev | next [-]

The wasm decoder thing was also done in Anyblox. https://github.com/AnyBlox https://gienieczko.com/anyblox-paper

Has nimble/velox had any better luck lately? I forget what stories someone shared, but, it seemed to have such big intent, then real trouble actually getting released. I want to say someone was saying the lawyers ended up not letting a lot of the work get released. Nimble is the one competitor benchmarked against here that beats them, and is also extensible (to some degree?), so I'd love to know how things have gone for the past 6-12 months for nimble/velox. https://news.ycombinator.com/item?id=39995112 https://github.com/facebookincubator/nimble/ https://materializedview.io/p/nimble-and-lance-parquet-kille...

antisthenes 3 hours ago | parent | prev | next [-]

The description mentions shortcomings of the previous file types like parquet, but it isn't really evident to me what those shortcomings are, or if the use cases for parquet and F3 have really that much of an overlap to make this comparison valid in the first place.

lowbloodsugar 3 hours ago | parent | prev | next [-]

>via embedded Wasm decoders

runs screaming

MoonWalk 3 hours ago | parent | prev | next [-]

Is what?

corvad 3 hours ago | parent | prev | next [-]

https://xkcd.com/927/

ChrisArchitect 3 hours ago | parent | prev [-]

A more descriptive title would be helpful OP:

F3: Open-source data file format for the future

Previous discussion:

2025 https://news.ycombinator.com/item?id=45437759