anotherpaul 2 hours ago

What's the advantage over using Polars for the same task? It seems to me to be the natural competitor here, and I vastly prefer the Polars syntax over SQL any day. So I'm curious whether I should try DuckDB or stick with Polars.

falconroar 7 minutes ago | parent | next [-]

Polars has all of the benefits of DuckDB (to some degree), but also allows for larger-than-memory datasets.

steve_adams_86 an hour ago | parent | prev | next [-]

Polars would be better in some ways for sure. It was in one of my early prototypes. What put me off was that I was essentially designing my own database, which I didn't trust as much as something like DuckDB.

Polars would let me have a lot of luxuries that are lost at the boundaries between my application and DuckDB, but those are weighed in the tradeoffs I was talking about. I do a lot of parsing at the boundaries to ensure data structures are sound, and otherwise DuckDB is enforcing strict schemas at runtime which provides as much safety as a dataset's schema requires. I do a lot of testing to ensure that I can trust how schemas are built and enforced as well.
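To make "parsing at the boundaries" concrete, here is a minimal sketch in Python. The `Observation` record, its fields, and `parse_observation` are hypothetical names invented for illustration, not anything from the project described above. The idea is only that untrusted input is converted into a checked structure before it ever reaches the database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Hypothetical record shape, used only for illustration."""
    site_id: int
    depth_m: float


def parse_observation(raw: dict) -> Observation:
    """Turn an untrusted dict into a validated Observation, or raise ValueError."""
    try:
        site_id = int(raw["site_id"])
        depth_m = float(raw["depth_m"])
    except (KeyError, TypeError, ValueError) as exc:
        # Missing keys and unconvertible values are all reported the same way.
        raise ValueError(f"malformed observation: {raw!r}") from exc
    if depth_m < 0:
        raise ValueError(f"depth_m must be non-negative, got {depth_m}")
    return Observation(site_id=site_id, depth_m=depth_m)
```

Anything that gets past `parse_observation` has the expected types, so the database schema only has to guard the relational rules.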

Things like foreign keys, expressions that span multiple tables effortlessly, normalization, check constraints, unique constraints, and primary keys work perfectly right off the shelf. It's kind of perfect because the spec I'm supporting is fundamentally about normalized relational data.
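As a rough illustration of those constraints doing the work, here is a minimal sketch. It assumes a hypothetical two-table schema (`site`, `observation`), and it uses Python's built-in `sqlite3` as a stand-in so it runs without extra dependencies. It is not DuckDB itself. The DDL is ordinary SQL, but the `PRAGMA` line is SQLite-specific, and DuckDB's exact behavior and error types may differ.

```python
import sqlite3

# Stand-in engine: sqlite3 from the standard library, NOT DuckDB.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked

# Hypothetical schema: table and column names are invented for illustration.
conn.executescript("""
CREATE TABLE site (
    site_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE
);
CREATE TABLE observation (
    obs_id  INTEGER PRIMARY KEY,
    site_id INTEGER NOT NULL REFERENCES site(site_id),
    depth_m REAL NOT NULL CHECK (depth_m >= 0)
);
""")


def try_insert(sql: str, params: tuple) -> bool:
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "first_site": try_insert("INSERT INTO site VALUES (?, ?)", (1, "Station A")),
    "duplicate_name": try_insert("INSERT INTO site VALUES (?, ?)", (2, "Station A")),
    "orphan_row": try_insert("INSERT INTO observation VALUES (?, ?, ?)", (1, 99, 2.0)),
    "negative_depth": try_insert("INSERT INTO observation VALUES (?, ?, ?)", (2, 1, -1.0)),
    "valid_row": try_insert("INSERT INTO observation VALUES (?, ?, ?)", (3, 1, 4.2)),
}
```

The engine rejects the duplicate name, the orphaned reference, and the negative depth, so none of those rules have to be reimplemented in application code.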

Another consideration was that while Polars is a bit faster, we don't encounter datasets that require more speed. The largest dataset I've processed, including extensive transformations and complex validations (about as complex as they get in this spec), takes ~3 seconds for around 580k rows. That's on an M1 Max with 16GB of RAM, for what it's worth.

Our teams have written countless R scripts to do the same work with less assurance that the outputs are correct, having to relearn the spec each time, and with much worse performance (these people are not developers). So, we're very happy with DuckDB's performance, even though Polars would probably let us do it faster.

Having said that, if someone built the same project and chose Polars I wouldn't think they were wrong to do so. It's a great choice too, which is why your question is a good one.

microflash 2 hours ago | parent | prev [-]

Familiarity with SQL is a plus in my opinion. Also, DuckDB has SDKs in more languages compared to Polars.

steve_adams_86 31 minutes ago | parent [-]

I wasn't all that excited about SQL at first, but I've come around to it. Initially I really wanted to keep all of my data and operations in the application layer, and I'd gone to great lengths to model that to make it possible. I had this vision of all types of operations, queries, and so on being totally type safe and kept in a code-based registry such that I could do things like provide a GUI on top of data and functions I knew were 100% valid at compile time. The only major drawback was that some kinds of changes to the application would require updating the repository.

I still love that idea but SQL turns out to be so battle-proven, reliable, flexible, capable, and well-documented that it's really hard to beat. After giving it a shot for a couple of weeks it became clear that it would yield a way more flexible and capable application. I'm confident enough that I can overcome the rough edges with the right abstractions and some polish over time.