mrbungie 7 months ago

I was almost going to build a lakehouse* with DuckDB because I low-key love it. It's the easiest and strongest analytical engine I've found yet: it scales from laptops to big metal, runs mostly out-of-core as long as you do sane stuff, and avoids distributed computing for SQL in the process (looking at you, Spark).

That is, until I found out it does not support Iceberg writes[1]. That's a big no-no, as I would need another engine for inserts, and I want a simple stack :(. What a bummer.

[1] https://github.com/duckdb/duckdb_iceberg/issues/37

*That is what they are called now, isn't it? I just can't follow the terms anymore haha.

nicornk 7 months ago

Fivetran tried to upstream write support, but it was not accepted: https://github.com/duckdb/duckdb-iceberg/pull/95

shakna 7 months ago

That sounds less "not accepted" and more "will implement, rewrite required". It was only a couple months ago.

jeadie 7 months ago

This is one of the ideas behind using DuckDB in github.com/spiceai/spiceai

anentropic 7 months ago

That looks like an amazing "swiss army knife"...!

mrbungie 7 months ago

Looks very cool! I will take a look, tysm!

mritchie712 7 months ago

It's coming. They already have Hive-style Parquet writes. Iceberg is more complicated than that, but it's certainly doable.
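
For anyone unfamiliar with the term: "hive style" just means the partition column values are encoded in the directory names as key=value. In DuckDB this is what you get from COPY ... TO ... (FORMAT PARQUET, PARTITION_BY (...)). A minimal stdlib-only sketch of the path convention follows; the helper names, the "events" root and the file name are made up for illustration and are not DuckDB APIs, and no Parquet is actually written.

```python
from pathlib import PurePosixPath


def hive_partition_path(root, partitions, filename):
    """Build a Hive-style path: root/key1=value1/key2=value2/filename.

    Hypothetical helper for illustration only, not part of DuckDB.
    """
    path = PurePosixPath(root)
    for key, value in partitions.items():
        path = path / f"{key}={value}"
    return path / filename


def parse_hive_partitions(path):
    """Recover partition columns from the key=value directory names."""
    parts = PurePosixPath(path).parts[:-1]  # drop the file name
    return dict(part.split("=", 1) for part in parts if "=" in part)


p = hive_partition_path("events", {"year": 2024, "month": 1}, "data_0.parquet")
print(p)                         # events/year=2024/month=1/data_0.parquet
print(parse_hive_partitions(p))  # {'year': '2024', 'month': '1'}
```

The layout itself is the whole "format" here, so a writer only has to drop files into the right directories. Iceberg additionally tracks table state in metadata files, manifests and snapshots that must be committed atomically, which is roughly why write support is more work.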

mrbungie 7 months ago

Yeah, it would just be great if it already did. I hope it supports Iceberg soon, as that would let me swap expensive (and bad) engines like AWS Athena for something more manageable.

Don't get me wrong, I'm just being a tongue-in-cheek egotistical bastard data engineer from hell. DuckDB is a fine piece of software as it is, and those maintainers deserve heaven.

buremba 7 months ago

It's not just for building a new one; it can also complement existing data warehouses/lakehouses: https://github.com/buremba/universql

The flight extension is excellent as it removes the need to write C++ extensions and lets you use your favorite language to develop native DuckDB catalogs. It's straightforward to build data lake connectors and plug them in as a flight catalog, thanks to Airport!

benrutter 7 months ago

I'm curious, did you consider Delta tables? Pretty sure DuckDB supports them nicely. If you did, how come you chose not to go with them?

mrbungie 7 months ago

AFAIR (I might be wrong), AWS and a big chunk of the industry are promoting Iceberg over Delta. Delta is mostly backed by Databricks.

sukhavati 7 months ago

Same here, man. I ended up going with Trino explicitly for writes and data management, and using chDB/DuckDB to process data for front-ends etc. (mostly Ethereum data, so chDB "support" for uint256 is quite important).