| ▲ | dharbin 2 days ago |
| Why would Snowflake develop and release this? Doesn't this cannibalize their main product? |
|
| ▲ | barrrrald 2 days ago | parent | next [-] |
| One thing I admire about Snowflake is a real commitment to self-cannibalization. They were super out front with Iceberg even though it could disrupt them, because that's what customers were asking for and they're willing to bet they'll figure out how to make money in that new world. Video of their SVP of Product talking about it here: https://youtu.be/PERZMGLhnF8?si=DjS_OgbNeDpvLA04&t=1195 |
| |
| ▲ | qaq 2 days ago | parent | next [-] | | Have you interacted with Snowflake teams much? We are using external Iceberg tables with Snowflake. Every interaction pretty much boils down to: you really should not be using Iceberg, you should be using Snowflake for storage. It's also pretty obvious some things are strategically not implemented to push you very strongly in that direction. | | |
| ▲ | barrrrald 2 days ago | parent | next [-] | | Not surprised - this stuff isn’t fully mature yet. But I interact with their team a lot and know they have a commitment to it (I’m the other guy in that video) | |
| ▲ | ozkatz 2 days ago | parent | prev [-] | | Out of curiosity - can you share a few examples of functionality currently not supported with Iceberg but that works well with their internal format? | | |
| ▲ | qaq 2 days ago | parent [-] | | Even partition elimination is pretty primitive. For the query optimizer, Iceberg is really not a primary target. The overall interaction, even with technical people, gives a strong "this is a sales org that happens to own an OLAP db product" vibe. | | |
| ▲ | andiz 5 hours ago | parent [-] | | I have to very much disagree on that.
All pruning techniques in Snowflake work equally well on their proprietary format as well as on Iceberg tables. Iceberg is nowadays a first-class citizen in Snowflake, with pruning working at the file level, row group level, and page level. The same is true for other query optimization techniques. There is even a paper on that: https://arxiv.org/abs/2504.11540

Where pruning differences might arise for Iceberg tables is the structure of the Parquet files and the availability of metadata. Both depend on the writer of the Parquet files. Metadata might be completely missing (e.g., no per-column min/max) or partially missing (e.g., no page indexes), which will indeed impact the perf. This is why it's super important to choose a writer that produces rich metadata. The metadata can be backfilled / recomputed after the fact by the querying engine, but it comes at a cost.

Another aspect is storage optimization: the ability to skip / prune files is intrinsically tied to the storage optimization quality of the table. If the table is neither clustered nor partitioned, or if the table has sub-optimally sized files, all of these things will severely impact any engine's ability to skip files or subsets thereof.

I would be very curious if you can find a query on an Iceberg table that shows a better partition elimination rate in a different system. |
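To make the "choose a writer that produces rich metadata" point concrete, here is a minimal PyArrow sketch; the table and column names are made up, and write_page_index assumes a reasonably recent pyarrow:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "event_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
        "user_id": [17, 42, 99],
        "amount": [3.5, 8.0, 1.25],
    })

    # Sorting ("clustering") on the common filter column keeps min/max ranges
    # narrow per row group, which is what lets an engine skip files and pages.
    table = table.sort_by("event_date")

    pq.write_table(
        table,
        "events.parquet",
        row_group_size=128_000,   # reasonably sized row groups, not one giant group
        write_statistics=True,    # per-column min/max stats for row-group pruning
        write_page_index=True,    # column/offset indexes for page-level pruning
    )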
|
|
| |
| ▲ | blef a day ago | parent | prev [-] | | Supporting Iceberg eventually means having people leave you because they have something better elsewhere, but it's bidirectional: it also means you can welcome people from Databricks because you have feature parity. |
|
|
| ▲ | kentm 2 days ago | parent | prev | next [-] |
| It's not going to scale as well as Snowflake, but it gets you into an Iceberg ecosystem which Snowflake can ingest and process at scale. Analytical data systems are typically trending toward heterogeneous compute with a shared storage backend -- you have large, autoscaling systems to process the raw data down to something that is usable by a smaller, cheaper query engine supporting UIs/services. |
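As a rough sketch of that shape (a small engine querying shared object storage that a bigger system populated), something like DuckDB over Parquet in S3 works; the bucket, path, and credentials below are placeholders:

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    con.execute("SET s3_region = 'us-east-1'")
    con.execute("SET s3_access_key_id = '...'")       # placeholder credentials
    con.execute("SET s3_secret_access_key = '...'")

    # Query Parquet files that an upstream (larger) engine wrote to shared storage.
    rows = con.sql("""
        SELECT event_date, count(*) AS events
        FROM read_parquet('s3://shared-lake/events/*.parquet')
        GROUP BY event_date
        ORDER BY event_date
    """).fetchall()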
| |
| ▲ | hobs 2 days ago | parent [-] | | But if you are used to this type of compute per dollar, what on earth would make you want to move to Snowflake? | | |
| ▲ | kentm 2 days ago | parent [-] | | Different parts of the analytical stack have different performance requirements and characteristics. Maybe none of your stack needs it and so you never need Snowflake at all. More likely, you don't need Snowflake to process queries from your BI tools (Mode, Tableau, Superset, etc), but you do need it to prepare data for those BI tools. It's entirely possible that you have hundreds of terabytes, if not petabytes, of input data that you want to pare down to < 1 TB datasets for querying, and Snowflake can chew through those datasets. There's also third party integrations and things like ML tooling that you need to consider. You shouldn't really consider analytical systems the same as a database backing a service. Analytical systems are designed to funnel large datasets that cover the entire business (cross cutting services and any sharding you've done) into subsequently smaller datasets that are cheaper and faster to query. And you may be using different compute engines for different parts of these pipelines; there's a good chance you're not using only Snowflake but Snowflake and a bunch of different tools. |
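A sketch of that pare-down step, using DuckDB purely for illustration (in practice the heavy step might run in Snowflake, Spark, etc.); paths and columns are made up:

    import duckdb

    con = duckdb.connect()
    # Roll a wide raw dataset down to a small aggregate that a BI tool can
    # query cheaply; the output is a single compact Parquet file.
    con.execute("""
        COPY (
            SELECT region,
                   date_trunc('day', event_ts) AS day,
                   count(*)    AS events,
                   sum(amount) AS revenue
            FROM read_parquet('raw/events/*.parquet')
            GROUP BY region, day
        ) TO 'marts/daily_revenue.parquet' (FORMAT PARQUET)
    """)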
|
|
|
| ▲ | mslot 2 days ago | parent | prev | next [-] |
| When we first developed pg_lake at Crunchy Data and defined GTM, we considered whether it could be a Snowflake competitor, but we quickly realised that did not make sense. Data platforms like Snowflake are built as a central place to collect your organisation's data, do governance, large scale analytics, AI model training and inference, share data within and across orgs, build and deploy data products, etc. These are not jobs for a Postgres server. Pg_lake foremost targets Postgres users who currently need complex ETL pipelines to get data in and out of Postgres, and accidental Postgres data warehouses where you ended up overloading your server with slow analytical queries, but you still want to keep using Postgres. |
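For a sense of the kind of hand-rolled export pipeline being described (not pg_lake's actual API, just an illustration with placeholder connection string, table, and path):

    import psycopg2
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Pull rows out of Postgres...
    conn = psycopg2.connect("postgresql://app:secret@localhost/appdb")
    with conn, conn.cursor() as cur:
        cur.execute("SELECT id, created_at, total FROM orders")
        rows = cur.fetchall()

    # ...and land them as Parquet for the lake / warehouse to pick up.
    ids, created_at, totals = zip(*rows) if rows else ((), (), ())
    table = pa.table({"id": list(ids), "created_at": list(created_at), "total": list(totals)})
    pq.write_table(table, "orders.parquet")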
|
| ▲ | 999900000999 2 days ago | parent | prev [-] |
| It'll probably be really difficult to set up. If it's anything like Supabase, you'll question the existence of God when trying to get it to work properly. You pay them to make it work right. |
| |
| ▲ | pgguru 2 days ago | parent [-] | | For testing, we at least have a Dockerfile to automate the setup of the pgduck_server and a MinIO instance, so it Just Works™ once the extensions are installed in your local Postgres cluster. The configuration mainly involves just defining the default iceberg location for new tables, pointing it to the pgduck_server, and providing the appropriate auth/secrets for your bucket access. |
|