dkdcio 2 days ago

I was going to ask if you could then put DuckDB over Postgres for the OLAP query engine -- looks like that's already what it does! very interesting development in the data lake space alongside DuckLake and things

pgguru 2 days ago | parent [-]

You create foreign tables in postgres using either the pg_lake_table wrapper or pg_lake_iceberg.

Once those tables exist, queries against them can either push down entirely to the remote tables, using a Custom Scan to execute and pull the results back into postgres, or we transform/extract the pieces that can be executed remotely using the FDW and treat that as a tuple source.

In both cases, the user does not need to know any of the details and just runs queries inside postgres as they always have.
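
A rough sketch of what that setup looks like — the server, option names, and column layout here are illustrative assumptions, not pg_lake's documented DDL:

    -- Illustrative only: exact wrapper/server/option names are assumptions.
    CREATE EXTENSION IF NOT EXISTS pg_lake_table;

    CREATE SERVER lake_server
      FOREIGN DATA WRAPPER pg_lake_table
      OPTIONS (endpoint 'localhost:5332');      -- hypothetical pgduck_server address

    CREATE FOREIGN TABLE events (
      event_id   bigint,
      first_name text,
      event_ts   timestamptz
    )
    SERVER lake_server
    OPTIONS (path 's3://my-bucket/events/');    -- hypothetical location option

    -- From here on, it is just SQL against a Postgres table:
    SELECT count(*) FROM events WHERE first_name = 'David';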

spenczar5 2 days ago | parent [-]

I think I don't understand postgres enough, so forgive this naive question, but what does pushing down to the remote tables mean? Does it allow parallelism? If I query a very large iceberg table, will this system fan the work out to multiple duckdb executors and gather the results back in?

pgguru 2 days ago | parent [-]

In any query engine you can execute the same query in different ways. The more restrictions you can apply on the DuckDB side, the less data you need to return to Postgres.

For instance, you could compute a `SELECT COUNT(*) FROM mytable WHERE first_name = 'David'` by querying all the rows from `mytable` on the DuckDB side, returning all the rows, and letting Postgres itself count the number of results, but this is extremely inefficient, since that same value can be computed remotely.

In a simple query like this with well-defined semantics that match between Postgres and DuckDB, you can run the query entirely on the remote side, just using Postgres as a go-between.
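
To make that concrete, the two strategies for the query above look roughly like this (a sketch, not actual plan output):

    -- Naive plan: ship the rows, count locally.
    --   remote (DuckDB):  SELECT first_name FROM mytable;
    --   local (Postgres): filter first_name = 'David' and count the rows
    --
    -- Pushed-down plan: ship the answer, not the rows.
    --   remote (DuckDB):  SELECT count(*) FROM mytable WHERE first_name = 'David';
    --   local (Postgres): return the single row that comes back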

Not all functions and operators work in the same way between the two systems, so you cannot just push things down unconditionally; `pg_lake` does some analysis to see what can run on the DuckDB side and what needs to stick around on the Postgres side.
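
A made-up example of a partial pushdown — `some_postgres_only_function` is hypothetical, standing in for anything with no DuckDB equivalent:

    SELECT some_postgres_only_function(first_name), count(*)
    FROM mytable
    WHERE first_name LIKE 'Da%'
    GROUP BY 1;

    -- Sketch of the split: the scan and the LIKE filter can run in DuckDB,
    --   remote: SELECT first_name FROM mytable WHERE first_name LIKE 'Da%';
    -- while the function call (and the aggregation over its results) has to
    -- stay on the Postgres side, treating the remote rows as a tuple source.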

There is only a single "executor" from the perspective of pg_lake, but the pgduck_server embeds a multi-threaded duckdb instance.

How DuckDB executes the portion of the query it gets is up to it; it will often involve parallelism, and it can use metadata about the files it is querying to speed up its own processing without even needing to visit every file. For instance, it can look at the `first_name` predicate in the incoming query and skip any files whose min_value/max_value range could not contain that value.
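
A sketch of that file skipping, with made-up statistics:

    -- Per-file column statistics from the table metadata (illustrative values):
    --   file_a.parquet   first_name  min='Aaron'  max='Cynthia'
    --   file_b.parquet   first_name  min='Dana'   max='Frank'
    --   file_c.parquet   first_name  min='Grace'  max='Zoe'
    --
    -- WHERE first_name = 'David' can only fall inside file_b's range,
    -- so file_a and file_c are skipped without ever being opened.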

spenczar5 2 days ago | parent [-]

Thanks for the detailed answer!

I use DuckDB today to query Iceberg tables. In some particularly gnarly queries (huge DISTINCTs, big sorts, even just selects that touch extremely heavy columns) I have sometimes run out of memory in that DuckDB instance.

I run on hosts without much memory because they are cheap and easy to launch, giving me isolated query parallelism, which is hard to achieve on a single giant host.

To the extent that it's possible, I dream of being able to spread those gnarly OOMing queries across multiple hosts; perhaps the DISTINCTs can be merged, for example. But this seems like a pretty complicated system that needs to be deeply aware of Iceberg partitioning ("hidden" in pg_lake's language), right?
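
Purely hypothetically, the kind of rewrite I'm imagining is something like:

    -- Each cheap host computes a partial DISTINCT over its slice of the table:
    --   SELECT DISTINCT user_id FROM events_slice_N;
    -- and a coordinator merges the partial sets:
    --   SELECT count(*) FROM (
    --     ... UNION of the per-host partial DISTINCT results ...
    --   ) AS merged;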

Is there some component in the postgres world that can help here? I am happy to continue over email, if you prefer, by the way.

pgguru 2 days ago | parent [-]

Well, large analytics queries will always perform better with more memory... :D You can perhaps tune things based on the amount of system memory (IME 80% is what DuckDB targets if not otherwise configured). Your proposed system does sound like it introduces quite a bit of complexity that would be better served just by using hosts with more memory.
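
For example, in a plain DuckDB session the limit can be lowered explicitly — whether and how pgduck_server exposes these settings is a separate question:

    -- DuckDB defaults its memory limit to ~80% of system RAM if not configured.
    SET memory_limit = '4GB';
    SET threads = 4;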

As far as Iceberg is concerned, DuckDB has its own implementation, but we do not use that; pg_lake has its own Iceberg implementation. The partitioning is "hidden" because it is separated out from the schema definition itself and can be changed gradually without the query engine needing to care about the details of how things are partitioned at read time. (For writes, we respect the latest partitioning spec and always write according to that.)
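
For example (the column name and partition transform here are just illustrative): queries filter on the raw column and never mention the partition spec; the Iceberg metadata maps the filter to partition pruning.

    -- Suppose the table's current Iceberg partition spec is day(event_ts).
    SELECT count(*)
    FROM events
    WHERE event_ts >= DATE '2024-01-01'
      AND event_ts <  DATE '2024-02-01';
    -- The reader consults the partition spec in the table metadata and prunes
    -- data files outside that day range; if the spec later evolves (say to
    -- hour(event_ts)), the query text stays exactly the same.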