NeutralCrane 2 days ago

I’m not a data engineer but work in an adjacent role. Is there anyone here who could dumb the use case down? Maybe an example of a problem this solves. I am struggling to understand the value proposition here.

ggregoire 2 days ago

> Maybe an example of a problem this solves.

Some service writes a lot of data in parquet files stored on S3 (e.g. logs), and now you want that data to be queryable from your application as if it was in postgres (e.g. near real-time analytics dashboard). pg_lake allows you to load these parquet files into postgres and query the data. You can also join that data with existing tables in postgres.
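
Roughly what that looks like (table and bucket names are made up, and while `using iceberg` is pg_lake's syntax, I'm going from memory on COPY reading Parquet straight from object storage, so check the docs):

    -- lake-backed table: data lives on S3 as Iceberg/Parquet
    CREATE TABLE request_logs (
        ts          timestamptz,
        user_id     bigint,
        status_code int
    ) USING iceberg;

    -- load the service's existing parquet files (illustrative path)
    COPY request_logs FROM 's3://my-bucket/logs/*.parquet';

    -- join against a regular postgres table for the dashboard
    SELECT u.email, count(*) AS errors
    FROM request_logs l
    JOIN users u ON u.id = l.user_id
    WHERE l.status_code >= 500
    GROUP BY u.email;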

smithclay a day ago

Been experimenting with OpenTelemetry->Parquet conversion lately for logs, metrics, and traces. Lots of related projects popping up in this area. It's powerful and cheap.

- https://github.com/smithclay/otlp2parquet (shameless plug, based on ClickHouse's OTel schema)
- https://github.com/Mooncake-Labs/moonlink (also has OTLP support)
- https://github.com/open-telemetry/otel-arrow (official community project, in early development)

NeutralCrane a day ago

I guess my confusion is that there already are ways to query this data with DuckDB or something like that. So is the magic here that it’s Postgres? What makes being able to query something in Postgres special? And when we say it’s now queryable by Postgres, does this mean that it takes that data and stores it in your PG db? Or it remains in S3 and this is a translation layer for querying with PG?

ch71r22 21 hours ago

Not sure if I have this right, but this is how I understand it:

> So is the magic here that it's Postgres? What makes being able to query something in Postgres special?

There are a bunch of pros and cons to using Postgres vs. DuckDB. The basic difference is OLTP vs. OLAP: Postgres is built for lots of small transactional reads and writes, while DuckDB is built for big analytical scans and aggregations. It seems pg_lake aims to give you the best of both, so you can combine analytics queries with transactional queries.

pg_lake also stores and manages the Iceberg catalog. If you use DuckDB on its own, you'll need an external catalog to get the same guarantees.

I think if you're someone who was happy using Postgres but had to explore alternatives like DuckDB because Postgres couldn't meet your OLAP needs, a solution like pg_lake would make your life a lot simpler. Instead of deploying a whole new OLAP system, you basically just install this extension and create the tables you want OLAP performance from with `create table ... using iceberg`.
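
Something like this, if I understand it (table definitions invented; the `using iceberg` clause is the only pg_lake-specific syntax here, everything else is standard SQL):

    -- ordinary heap table for the transactional side
    CREATE TABLE orders (
        id    bigint PRIMARY KEY,
        total numeric
    );

    -- same database, but backed by Iceberg/Parquet on S3
    CREATE TABLE order_events (
        order_id bigint,
        event    text,
        ts       timestamptz
    ) USING iceberg;

    -- one query can touch both
    SELECT o.id, o.total, count(e.order_id) AS n_events
    FROM orders o
    LEFT JOIN order_events e ON e.order_id = o.id
    GROUP BY o.id, o.total;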

> when we say it’s now queryable by Postgres, does this mean that it takes that data and stores it in your PG db?

Postgres basically stores pointers to the data in S3. These pointers live in the Iceberg catalog that pg_lake manages. The tables managed by pg_lake are special tables defined with `create table ... using iceberg`, which store their data as Iceberg/Parquet files on S3; queries against them are executed partly by the DuckDB engine and partly by the Postgres engine.

It looks like there is good support for copying between the Iceberg/DuckDB/Parquet world and the traditional Postgres world.
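
For example, I'd expect something like this to work, assuming the iceberg access method composes with ordinary CTAS / `insert ... select` the way other Postgres table access methods do (I haven't verified this against pg_lake itself):

    -- push an existing heap table's rows into the lake
    CREATE TABLE events_lake USING iceberg AS
    SELECT * FROM events;

    -- pull a hot slice back into a plain postgres table
    CREATE TABLE recent_events AS
    SELECT * FROM events_lake
    WHERE ts > now() - interval '7 days';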

> Or it remains in S3 and this is a translation layer for querying with PG?

Yes, I think that's right: things stay in S3, and there is a translation layer so Postgres can use DuckDB to interact with the Iceberg tables on S3. If you're updating a table created with `create table ... using iceberg`, I think all the data remains in S3 and is stored in Parquet files, safely/transactionally managed via the Iceberg format.
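
So from the client's point of view an update is just normal SQL (sketch, assuming DML on iceberg tables works as the comment above implies; table and column names reused from my earlier sketch):

    BEGIN;
    UPDATE events_lake SET payload = NULL WHERE user_id = 42;
    DELETE FROM events_lake WHERE ts < '2020-01-01';
    COMMIT;
    -- on commit, the new Parquet files plus updated Iceberg metadata
    -- land on S3 atomically; nothing moves into the local postgres heap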

https://github.com/Snowflake-Labs/pg_lake/blob/main/docs/ice...