qaq 2 days ago

Have you interacted with Snowflake teams much? We are using external Iceberg tables with Snowflake. Every interaction pretty much boils down to "you really shouldn't be using Iceberg, you should be using Snowflake for storage." It's also pretty obvious that some things are strategically left unimplemented to push you very strongly in that direction.

barrrrald 2 days ago | parent | next [-]

Not surprised - this stuff isn’t fully mature yet. But I interact with their team a lot and know they have a commitment to it (I’m the other guy in that video)

ozkatz 2 days ago | parent | prev [-]

Out of curiosity - can you share a few examples of functionality currently not supported with Iceberg but that works well with their internal format?

qaq 2 days ago | parent [-]

Even partition elimination is pretty primitive. Iceberg is really not a primary target for the query optimizer. The overall interaction, even with technical people, gives off a strong "this is a sales org that happens to own an OLAP DB product" vibe.

andiz 5 hours ago | parent [-]

I have to very much disagree on that. All pruning techniques in Snowflake work equally well on their proprietary format as on Iceberg tables. Iceberg is nowadays a first-class citizen in Snowflake, with pruning working at the file level, row-group level, and page level. The same is true for other query optimization techniques. There is even a paper on that: https://arxiv.org/abs/2504.11540
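The multi-level pruning described above can be sketched in plain Python. This is a hedged illustration, not Snowflake internals: the file/row-group min/max layout below is hypothetical, but it shows how an engine can skip whole files before ever looking at their row groups.

```python
# Hypothetical per-file metadata for one column:
# (file_min, file_max, [(row_group_min, row_group_max), ...])
FILES = [
    (0, 99, [(0, 49), (50, 99)]),
    (100, 199, [(100, 149), (150, 199)]),
    (200, 299, [(200, 249), (250, 299)]),
]

def prune(lo, hi):
    """Count (files, row_groups) that survive the predicate lo <= x <= hi."""
    files = row_groups = 0
    for fmin, fmax, rgs in FILES:
        if fmax < lo or fmin > hi:  # file-level skip: range disjoint
            continue
        files += 1
        for rmin, rmax in rgs:      # row-group-level skip within the file
            if rmax < lo or rmin > hi:
                continue
            row_groups += 1
    return files, row_groups

print(prune(120, 140))  # -> (1, 1): one file and one row group survive
```

In a real engine the same containment test is applied one more level down, against Parquet page indexes, before any data pages are decoded.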

Where pruning differences might arise for Iceberg tables is in the structure of the Parquet files and the availability of metadata. Both depend on the writer of the Parquet files. Metadata might be completely missing (e.g., no per-column min/max) or partially missing (e.g., no page indexes), which will indeed impact performance. This is why it's super important to choose a writer that produces rich metadata. The metadata can be backfilled / recomputed after the fact by the querying engine, but that comes at a cost.
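The backfill cost mentioned above can be made concrete with a small sketch (hypothetical helper names, not any engine's actual API): when min/max stats are missing, producing them requires one full scan of the column chunk; once they exist, every subsequent skip decision is O(1).

```python
def backfill_stats(values):
    """Recompute min/max with a full scan -- the one-time cost of
    missing writer metadata."""
    return (min(values), max(values))

def can_skip(stats, lo, hi):
    """Cheap O(1) skip check once stats exist, for predicate lo <= x <= hi."""
    mn, mx = stats
    return mx < lo or mn > hi

# A column chunk whose writer stored no statistics:
chunk = [17, 42, 5, 88, 23]
stats = backfill_stats(chunk)        # full scan, done once
print(stats)                         # -> (5, 88)
print(can_skip(stats, 100, 200))     # -> True: skippable without re-reading
print(can_skip(stats, 10, 20))       # -> False: chunk must be scanned
```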

Another aspect is storage optimization: the ability to skip / prune files is intrinsically tied to the storage optimization quality of the table. If the table is neither clustered nor partitioned, or if it has sub-optimally sized files, any engine's ability to skip files or subsets thereof will be severely impacted.
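The effect of clustering on file skipping is easy to demonstrate with a toy table (a sketch, not tied to any particular engine): the same predicate touches one file when the data is sorted by the filter column, but nearly every file when rows are scattered, because each file's min/max then spans almost the whole domain.

```python
import random

random.seed(0)
rows = list(range(1000))  # hypothetical table: 1000 rows, 10 files of 100

def min_max_per_file(ordering, files=10):
    """Per-file (min, max) zone metadata for a given physical row order."""
    size = len(ordering) // files
    return [(min(chunk), max(chunk))
            for chunk in (ordering[i * size:(i + 1) * size]
                          for i in range(files))]

def files_scanned(stats, lo, hi):
    """Files whose (min, max) overlaps the predicate lo <= x <= hi."""
    return sum(1 for mn, mx in stats if not (mx < lo or mn > hi))

clustered = min_max_per_file(sorted(rows))  # sorted by the filter column
shuffled = rows[:]
random.shuffle(shuffled)
unclustered = min_max_per_file(shuffled)    # no clustering at all

print(files_scanned(clustered, 250, 260))    # -> 1: a single file overlaps
print(files_scanned(unclustered, 250, 260))  # almost certainly all 10
```

Same data, same metadata format, same predicate; only the physical layout differs, which is why file sizing and clustering dominate pruning quality regardless of engine.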

I would be very curious if you can find a query on an Iceberg table that shows a better partition elimination rate in a different system.