aynyc 3 hours ago

So Feather for journaling and Parquet for long-term processing?

yencabulator 2 hours ago | parent

You basically can't do row-by-row appends to any columnar format stored in a single file. You could kludge around it by allocating per-column arenas inside the file, but that still means huge write amplification: instead of writing a row into a single block, you'd have to write a block per column.
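
A quick back-of-the-envelope sketch in Rust of the amplification being described (block size, column count, and row size are made-up example numbers; the point is the ratio, not the absolutes):

    fn main() {
        const BLOCK: u64 = 4096; // typical filesystem block size
        let columns: u64 = 20; // hypothetical table width
        let row_bytes: u64 = 200; // hypothetical encoded row size

        // Row-oriented append: the whole row lands in one (or a few) blocks.
        let row_oriented = row_bytes.div_ceil(BLOCK) * BLOCK;

        // Columnar with per-column arenas: each column's arena lives in a
        // different part of the file, so each column append dirties at
        // least one block of its own.
        let columnar = columns * BLOCK;

        println!(
            "row-oriented: {row_oriented} B, columnar: {columnar} B, \
             amplification: {}x",
            columnar / row_oriented
        );
    }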

amluto 26 minutes ago | parent | next

You can do row-by-row appends to a Feather file (Arrow IPC; the naming is confusing). It works fine. The main problem is that the per-append overhead is kind of silly: it costs over 300 bytes (IIRC) per append.
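
For reference, a minimal sketch of what those appends look like with arrow-rs (file name and schema are made up). Each "append" is a one-row RecordBatch, and every batch is framed with its own metadata, which is where the fixed per-append byte overhead comes from:

    use std::fs::File;
    use std::sync::Arc;

    use arrow::array::{ArrayRef, Float64Array, Int64Array};
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::ipc::writer::StreamWriter;
    use arrow::record_batch::RecordBatch;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Made-up schema: a timestamp and a value per row.
        let schema = Arc::new(Schema::new(vec![
            Field::new("ts", DataType::Int64, false),
            Field::new("value", DataType::Float64, false),
        ]));

        let file = File::create("journal.arrows")?;
        let mut writer = StreamWriter::try_new(file, &schema)?;

        // Each append is a single-row batch carrying its own framing
        // and metadata, hence the fixed per-append overhead.
        for i in 0..3i64 {
            let columns: Vec<ArrayRef> = vec![
                Arc::new(Int64Array::from(vec![i])),
                Arc::new(Float64Array::from(vec![i as f64 * 0.5])),
            ];
            writer.write(&RecordBatch::try_new(schema.clone(), columns)?)?;
        }
        writer.finish()?;
        Ok(())
    }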

I wish there were an industry-standard format, schema-compatible with Parquet, that was actually optimized for this use case.

yencabulator 11 minutes ago | parent

Creating a new record batch for a single row is also a huge kludge, leading to a lot of write amplification. At that point, you're better off storing rows than pretending the data is columnar.

I actually wrote a row storage format that reuses Arrow data types (not Feather), just laying them out row-wise instead of columnar. The validity bits of the different columns are collected into a shared per-row bitmap, and fixed offsets within a record allow extracting any field in a zero-copy fashion. I store those rows in RocksDB, for now. (A rough sketch of the layout idea follows the links below.)

https://git.kantodb.com/kantodb/kantodb/src/branch/main/crat...

https://git.kantodb.com/kantodb/kantodb/src/branch/main/crat...

https://git.kantodb.com/kantodb/kantodb/src/branch/main/crat...
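
A hypothetical, much-simplified sketch of that layout idea (not the actual KantoDB code; fixed-width i64 fields only): a shared validity bitmap at the front, then fixed-width field slots, so any field is readable at a known offset without parsing the rest of the row:

    const NUM_FIELDS: usize = 3;
    const BITMAP_BYTES: usize = (NUM_FIELDS + 7) / 8;
    const FIELD_WIDTH: usize = 8; // i64

    fn encode_row(values: &[Option<i64>; NUM_FIELDS]) -> Vec<u8> {
        let mut buf = vec![0u8; BITMAP_BYTES + NUM_FIELDS * FIELD_WIDTH];
        for (i, v) in values.iter().enumerate() {
            if let Some(v) = v {
                buf[i / 8] |= 1 << (i % 8); // set validity bit
                let off = BITMAP_BYTES + i * FIELD_WIDTH;
                buf[off..off + FIELD_WIDTH].copy_from_slice(&v.to_le_bytes());
            }
        }
        buf
    }

    // Reads only the bitmap byte and the field's fixed slot; no scanning
    // of the preceding fields is needed to locate a value.
    fn get_field(row: &[u8], i: usize) -> Option<i64> {
        if row[i / 8] & (1 << (i % 8)) == 0 {
            return None; // null
        }
        let off = BITMAP_BYTES + i * FIELD_WIDTH;
        Some(i64::from_le_bytes(row[off..off + FIELD_WIDTH].try_into().unwrap()))
    }

    fn main() {
        let row = encode_row(&[Some(42), None, Some(-7)]);
        assert_eq!(get_field(&row, 0), Some(42));
        assert_eq!(get_field(&row, 1), None);
        assert_eq!(get_field(&row, 2), Some(-7));
    }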

gregw2 25 minutes ago | parent | prev

Agreed.

There is still room for an open-source HTAP storage format to be designed and built. :-)