__mharrison__ 5 hours ago

When I teach, I use "big data" for data that won't fit on a single machine. "Small data" fits in memory on a single machine, and "medium data" fits on its disk.

Having said that, DuckDB is awesome. I recently ported a 20-year-old Python app to modern Python. I made the backend swappable between Polars and DuckDB. Got a 40-80x speed improvement. Took 2 days.

ElectricalUnion 3 hours ago | parent | next [-]

The funny thing is that these days you can fit 64 TB of DDR5 in a single physical system (an IBM Power server), so almost all non-data-lake-class data is "small data".

AlotOfReading 2 hours ago | parent [-]

And a single machine can hold petabytes of disk for medium scale. There aren't many datasets exceeding that outside fundamental physics.

ladberg 5 hours ago | parent | prev [-]

I'm curious - what were you doing that Polars was leaving a 40-80x speedup on the table? I've been happy with its speed when held correctly, but it's certainly easy to hold it incorrectly and kill your perf if you're not careful.

__mharrison__ 4 hours ago | parent | next [-]

It was a 20-year-old BI app. Columnar DBs weren't really a thing back then. (MonetDB was brand new but not super stable. I contributed the SQLAlchemy interface for it.)

devnotes77 5 hours ago | parent | prev | next [-]

Polars is fastest when you avoid eager evaluation mid-pipeline. If you see a 40x gap, it's often from calling .collect() inside a loop or applying Python UDFs row-wise.

__mharrison__ 4 hours ago | parent [-]

App is now lazy!

dartharva 4 hours ago | parent | prev [-]

Might be tangential, but in my recent experience Polars kept crashing the Python server with OOM errors whenever I tried to stream data from and into large Parquet files with some basic grouping and aggregation.

Claude suggested just using DuckDB instead, and indeed, it made short work of it.