theLiminator 9 hours ago
Yeah, I'm similarly confused.

> "SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable." (over polars/pandas etc)

SQL has nothing to do with fast. Not sure what makes it any more testable than Polars? Future-proof in what way? I guess they mean your SQL dialect won't have breaking changes?
wood_spirit 9 hours ago
I’m also a DuckDB convert. All my notebooks have moved from pandas and Polars to DuckDB. It is faster to write, faster to read (when you return to a notebook after time away) and often faster to run. Certainly not slower to run.

My current habit is to pull big datasets down to Parquet shards and then just query them with a wildcard in DuckDB. I move to BigQuery when doing true “big data”, but a few GB of extract from BQ to a notebook VM disk plus DuckDB is super ergonomic and performant most of the time.

It’s the SQL that I like. As a veteran of the time when the world went mad for NoSQL, it is just so nice to experience the revenge of SQL.
RobinL 7 hours ago
Author here. I wouldn't argue SQL or DuckDB is _more_ testable than Polars. But I think historically people have criticised SQL as being hard to test. DuckDB changes that.

I disagree that SQL has nothing to do with fast. One of the most amazing things to me about SQL is that, since it's declarative, the same code has got faster and faster to execute as we've moved through better and better SQL engines. I've seen this over the past five years of writing and maintaining a record linkage library. It generates SQL that can be executed against multiple backends. My library gets faster year after year without me having to do anything, thanks to improvements in the SQL backends, which handle things like vectorisation and parallelisation for me. I imagine that if I tried to program the routines by hand, the result would be significantly slower, since so much work has gone into optimising SQL engines.

In terms of future-proof: yes, in the sense that the code will still be easy to run in 20 years' time.
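
To make the testing point concrete, here is a minimal sketch of unit-testing a SQL transformation against an in-memory engine. It is not code from the record linkage library mentioned above. It assumes Python's built-in `sqlite3` as a stand-in for DuckDB so that it runs with no extra dependencies; the `orders` table, the query and the `run_total_spend` helper are all hypothetical names made up for illustration.

```python
import sqlite3

# Hypothetical transformation under test: total spend per customer.
TOTAL_SPEND_SQL = """
SELECT customer_id, SUM(amount) AS total_spend
FROM orders
GROUP BY customer_id
ORDER BY customer_id
"""


def run_total_spend(rows):
    """Load rows into a throwaway in-memory database and run the query."""
    con = sqlite3.connect(":memory:")  # nothing touches disk
    try:
        con.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
        con.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        return con.execute(TOTAL_SPEND_SQL).fetchall()
    finally:
        con.close()


def test_total_spend():
    # Small hand-written fixture with a known answer.
    rows = [(1, 10.0), (2, 5.0), (1, 2.5)]
    assert run_total_spend(rows) == [(1, 12.5), (2, 5.0)]


test_total_spend()
```

The same pattern should carry over to an in-process DuckDB connection: build a tiny fixture table, run the SQL, compare against the expected rows. Because the query is plain declarative SQL, it could in principle be pointed at a different engine without rewriting the test, subject to dialect differences.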