mritchie712 13 hours ago

tldr: this caches your S3 data in EFS.

we run datalakes using DuckLake and this sounds really useful. GCP should follow suit quickly.

hiyer 11 hours ago | parent | next [-]

I was thinking of using it with DuckDB as well, but it seems it would be of limited benefit. Parquet objects are in the MBs, so they would be streamed directly from S3. With raw Parquet objects it might help with S3 listing if you have a lot of them (shaving a couple of seconds off the query). If you are already on DuckLake, DuckDB will use that for getting the list of relevant objects anyway.

wenc 9 hours ago | parent [-]

Maybe the OP is thinking of reading/writing DuckDB's native database files. Those require filesystem semantics for writing. Unfortunately, even NFS or SMB are not sufficiently FS-like for DuckDB.

Parquet files are write-once and immutable (you "append" by adding new files), so DuckDB has no problem with those living on S3.
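A minimal sketch of what that looks like in DuckDB, querying Parquet straight from S3 via the httpfs extension (the bucket and path here are hypothetical):

```sql
-- Load the extension that lets DuckDB issue HTTP range requests to S3
INSTALL httpfs;
LOAD httpfs;

-- Reads immutable Parquet files in place; no filesystem semantics needed
SELECT count(*)
FROM read_parquet('s3://my-bucket/events/*.parquet');
```

This works precisely because each Parquet file never changes after it is written; DuckDB only needs ranged GETs, never in-place writes.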

huntaub 6 minutes ago | parent [-]

What does DuckDB need that NFS/SMB do not provide?

anentropic 12 hours ago | parent | prev [-]

I am curious about this use case

How do you see it helping with DuckLake?

arpinum 5 hours ago | parent [-]

Latency, predicate pushdown.

Pre-compaction, recent data can sit in many small files, and the delete markers will also be small files. Caching brings down fetch times for those, while DuckLake may already have many of the larger blocks in memory or disk cache.

Reading block headers for filtering means lots of small range requests; caching those could speed it up by 10x.
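To illustrate the header-read pattern: with a filter, DuckDB fetches each file's footer and row-group statistics via small ranged GETs and skips non-matching row groups, so those many tiny reads dominate latency. A sketch (hypothetical bucket, table, and column names):

```sql
-- For each matching file, DuckDB issues small range reads for the
-- footer/row-group metadata, checks min/max stats against the predicate,
-- and only then fetches the row groups that can contain matching rows.
SELECT user_id, event_ts
FROM read_parquet('s3://my-bucket/events/*.parquet')
WHERE event_ts >= TIMESTAMP '2024-01-01 00:00:00';
```

Caching the metadata ranges near the compute (e.g. in EFS) is what turns those per-file round trips into local reads.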