s-a-p 13 hours ago

> "making DuckDB potentially a suitable replacement for lakehouse formats such as Iceberg or Delta Lake for medium scale data"

I'm a Data Engineering noob, but DuckDB alone doesn't do metadata & catalog management, which is why they've also introduced DuckLake.

Related question: curious about your experience with DuckLake if you've used it. I'm currently setting up S3 + Iceberg + DuckDB for my company (a startup) and was wondering what to pick between Iceberg and DuckLake.

nchagnet 11 hours ago | parent | next [-]

We're using DuckLake with data storage on Google Cloud Storage and the catalog inside a Postgres database, and it's a breeze! It may not be the most mature product, but it's definitely a good setup for small to medium applications which still require a data lake.
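For the curious, a setup like this is mostly a couple of ATTACH statements. A minimal sketch, assuming the `ducklake` and `postgres` DuckDB extensions are installed and using placeholder bucket/database names (credentials for GCS would additionally need to be configured via a DuckDB secret):

```sql
INSTALL ducklake;
INSTALL postgres;

-- Catalog metadata lives in Postgres; table data files land on GCS.
-- 'ducklake_catalog', the host, and the bucket path are placeholders.
ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=localhost' AS my_lake
    (DATA_PATH 'gs://my-bucket/lake/');

USE my_lake;
CREATE TABLE events (id INTEGER, ts TIMESTAMP);
```

From there the lake tables behave like ordinary DuckDB tables, while the Postgres catalog lets multiple clients attach to the same lake.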

biophysboy 13 hours ago | parent | prev [-]

DuckLake is pretty new, so I guess it would depend on whether you need a more mature, fully-featured option.

pattar 12 hours ago | parent [-]

I went to a talk by the MotherDuck team about why they built DuckLake instead of leaning further into Iceberg. The key takeaway: instead of storing all the table metadata in files on S3 and dealing with the latency and file I/O that entails, they store that info in a database table. Seems like a good idea, and it worked smoothly when I tried it; however, it's not quite in a stable production state yet, still <1.0. They have a nice talk about it on YouTube: https://youtu.be/hrTjvvwhHEQ?si=WaT-rclQHBxnc9qV
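The design difference shows up even in the single-user case. A minimal sketch, assuming the `ducklake` extension and a hypothetical local file name: the catalog here is just a DuckDB database file, so snapshots and table metadata are rows in SQL tables rather than JSON/Avro manifest files that have to be fetched from object storage before planning a query.

```sql
INSTALL ducklake;
LOAD ducklake;

-- All catalog metadata goes into the ordinary DuckDB file
-- 'metadata.ducklake' (placeholder name), not into manifest
-- files alongside the data.
ATTACH 'ducklake:metadata.ducklake' AS my_lake;

CREATE TABLE my_lake.events (id INTEGER, ts TIMESTAMP);
INSERT INTO my_lake.events VALUES (1, now());
```

Swapping the file path for a Postgres connection string gives the multi-client variant without changing how the metadata is modeled.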

willvarfar 9 hours ago | parent [-]

(I work a lot with BigQuery's BigLake adaptor, and it basically caches the metadata from the Iceberg manifests and Parquet footers in Bigtable (this is Google), so query planning is super fast etc. Really helps.)