noo_u 13 hours ago

I'd say the author's thoughts are valid for basic data processing. Outside of that, most of the claims in this article, such as:

"We're moving towards a simpler world where most tabular data can be processed on a single large machine1 and the era of clusters is coming to an end for all but the largest datasets."

become very debatable. Depending on how you want to pivot/scale/augment your data, even datasets that seemingly "fit" on large boxes will quickly OOM you.

The author also has another article where they claim that:

"SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable." (over polars/pandas etc)

This does not map to my experience at all, outside of the realm of nicely parsed datasets that don't require too much complicated analysis or augmentation.

RobinL 12 hours ago | parent | next [-]

Author here. Re: 'SQL should be the first option considered': there are certainly advantages to other dataframe APIs like pandas or polars, and arguably any one of them is better in the moment than SQL. At the moment Polars is ascendant and it's a high quality API.

But the problem is the ecosystem hasn't standardised on any of them, and it's annoying to have to rewrite pipelines from one dataframe API to another.

I also agree you're gonna hit OOM if your data is massive, but my guess is the vast majority of tabular data people process is <10GB, and that'll generally process fine on a single large machine. Certainly in my experience it's common to see Spark being used on datasets that are nowhere near big enough to need it. DuckDB is gaining traction, but a lot of people still seem unaware how quickly you can process multiple GB of data on a laptop nowadays.

I guess my overall position is it's a good idea to think about using DuckDB first, because often it'll do the job quickly and easily. There are a whole host of scenarios where it's inappropriate, but it's a good place to start.
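
To make that concrete, here's a minimal sketch of the "try DuckDB first" step (the file name and columns are made up); a query like this over a few GB of local parquet will usually finish in seconds on a laptop:

```python
import duckdb

# Hypothetical file: a few GB of event data already sitting on the laptop.
top_users = duckdb.sql("""
    SELECT user_id, count(*) AS n_events
    FROM read_parquet('events.parquet')
    GROUP BY user_id
    ORDER BY n_events DESC
    LIMIT 10
""")
top_users.show()  # no cluster, no infrastructure, just a pip install
```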

noo_u 11 hours ago | parent | next [-]

I think both of us are ultimately wary of using the wrong tool for the job.

I see your point, even though my experience has been somewhat the opposite: e.g. a pipeline that worked fast enough (or at all) only because the scale of the data and the requirements allowed it at the time. Then some of those conditions change, the pipeline can no longer meet them, and one has to reverse engineer obscure SQL views/stored procedures/plugins and migrate the whole thing to python or some compiled language.

I work with high density signal data now, and my SQL knowledge occupies the "temporary solution" part of my brain for the most part.

chaps 11 hours ago | parent | prev [-]

Yeah, my experiences match yours and I very, very much work with messy data (FOIA data), though I use postgres instead of duckdb.

Most of the datasets I work with are indeed <10GB but the ones that are much larger follow the same ETL and analysis flows. It helps that I've built a lot of tooling to help with types and memory-efficient inserts. Having to rewrite pipelines because of "that one dataframe API" is exactly what solidified my thoughts around SQL over everything else. So much of my time has been lost trying to get dataframe and non-dataframe libraries to work together.
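
For the curious, the memory-efficient insert side mostly boils down to streaming rows through COPY rather than issuing one INSERT per row; a rough sketch with a made-up connection string and table:

```python
import csv
import io

import psycopg2

# Hypothetical database and table, purely for illustration.
conn = psycopg2.connect("dbname=foia_scratch")
rows = [(1, "request_a"), (2, "request_b")]

# Serialise the rows once and hand them to COPY as a single stream.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(rows)
buf.seek(0)

with conn, conn.cursor() as cur:
    cur.copy_expert(
        "COPY requests (id, label) FROM STDIN WITH (FORMAT csv)",
        buf,
    )
```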

Thing about SQL is that it can be taken just about anywhere, so the time spent improving your SQL skills is almost always well worth it. R and pandas much less so.

gofreddygo 11 hours ago | parent [-]

I advocated for a SQL solution at work this week and it seems to have worked. My boss is wary of the old school style SQL databases with their constraints and just being a pain to work with. As a developer, these pains aren't too hard for me to get used to, automate, or document away, and I've never understood the undue dislike of sql.

The fact that I can use sqlite / a local sql db for all kinds of development and reliably use the same code (with minor updates) in the cloud hosted solution is such a huge benefit that it outweighs anything any other solution has to offer. I'm excited about the sql stuff I learned over 10 years ago being of great use to me in the coming months.

The last product I worked heavily on used a nosql database and it worked fine until you started tweaking it just a little bit: splitting entities, converting data types or updating ids. Most of the data access layer logic dealt with converting data coming in from the database and the guardrails to keep data integrity in check while interacting with the application models. To me this is something so obviously solved years ago with a few lines of constraints. Moving that product over to sql was totally impossible. Learned my lesson, advocated hard for sql. Hoping for better outcomes.

chaps 11 hours ago | parent [-]

I totally understand the apprehension towards SQL. It's esoteric as all hell and it tends to be managed by DBAs who you really only go to whenever there's a problem. Organizationally it's simpler to just do stuff in-memory without having to fiddle around with databases.

__mharrison__ 5 hours ago | parent | prev | next [-]

My experience is that if you want to do ML, viz, or advanced analytics, dataframes give a better experience.

If you are shuffling data around in pipelines, sure, go for SQL.

Readability is in the eye of the beholder. I much prefer dataframes for that, though a good chunk of the internet claims to throw up in their mouths upon seeing it...

alastairr 10 hours ago | parent | prev | next [-]

I'm running duckdb over 500GB of parquet on a largish desktop (50GB RAM) and it's been smooth & fast. I guess OOM issues will matter at some point, but I think it's going to be in the top 1% of real world use cases.

jauntywundrkind 10 hours ago | parent [-]

With a decent SSD (or eight), spilling to disk is really not bad these days! Yes!
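
In duckdb that's just a couple of settings; the limit and path below are illustrative:

```python
import duckdb

con = duckdb.connect()
# Cap working memory and point spill files at a fast SSD.
con.execute("SET memory_limit = '32GB'")
con.execute("SET temp_directory = '/mnt/nvme/duckdb_tmp'")

# A sort/join/aggregation bigger than the limit now spills to disk
# instead of OOMing. (Hypothetical files and columns.)
totals = con.execute("""
    SELECT category, sum(amount) AS total
    FROM read_parquet('big/*.parquet')
    GROUP BY category
""").fetchall()
```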

And if that's still not enough, if you just need to crunch data a couple of times a week, it's not unreasonable to rent a massive cloud box with ridiculous amounts of RAM or RAM+SSD; I7i or I8g boxes. Alas, while there are cheap older-gen Epycs and some amazing cheap motherboards out there, RAM prices for a DIY build are off the charts right now, but so be it.

hnthrowaway0315 12 hours ago | parent | prev | next [-]

SQL is popular because everyone can learn it and start using it after a while. I agree that Python is sometimes a better tool, but I don't see SQL going away anytime soon.

From my experience, the data modelling side is still overwhelmingly in SQL. The ingestion side is definitely mostly Python/Scala though.

physicsguy 7 hours ago | parent | prev | next [-]

You can get instances with 32TiB of RAM on AWS these days.

mr_toad 3 hours ago | parent | next [-]

Which is a lot for a single user, but when you have dozens or hundreds of analysts who all want to run their own jobs on your hundred-terabyte data warehouse, even the largest single machine won't cut it.

RobinL 7 hours ago | parent | prev | next [-]

Exactly - these huge machines are surely eating a lot into the need for distributed systems like Spark. So much less of a headache to run as well.

jeffbee 7 hours ago | parent | prev [-]

That sounds damned near useless for typical data analysis purposes, and I would very much prefer a distributed system to one that would take an hour to fill main memory over its tiny network port. Also, those cost $400/hr and are specifically designed for businesses that have backed themselves into a corner of needing to run a huge SAP HANA instance. I doubt they would even sell you one before you prove you have an SAP license.

For a tiny fraction of the cost you can get numerous nodes with 600Gbps Ethernet ports that can fill their memory in seconds.

theLiminator 9 hours ago | parent | prev [-]

Yeah, I'm similarly confused.

> "SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable." (over polars/pandas etc)

SQL has nothing to do with fast. Not sure what makes it any more testable than polars? Future-proof in what way? I guess they mean your SQL dialect won't have breaking changes?

wood_spirit 9 hours ago | parent | next [-]

I’m also a duckdb convert. All my notebooks have moved from Pandas and polars to Duckdb. It is faster to write and faster to read (after you return to a notebook after time away) and often faster to run. Certainly not slower to run.

My current habit is to suck down big datasets to parquet shards and then just query them with a wildcard in duckdb. I move to bigquery when doing true "big data", but for a few GB of extract from BQ on a notebook VM's disk, duckdb is super ergonomic and performant most of the time.
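
That workflow is about as short as it sounds; something like this, with made-up shard paths and columns:

```python
import duckdb

# Hypothetical extract from BigQuery, dumped as parquet shards on local disk.
duckdb.sql("""
    SELECT event_date, count(*) AS n_events
    FROM read_parquet('extract/shard_*.parquet')
    GROUP BY event_date
    ORDER BY event_date
""").show()
```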

It’s the sql that I like. Being a veteran of when the world went mad for nosql it is just so nice to experience the revenge of sql.

theLiminator 9 hours ago | parent [-]

I personally find polars easier to read/write than SQL, especially when you start doing UDFs with numpy et al. I think for me, duckdb's clear edge is the CLI experience.

> It is faster to write and faster to read

At least on clickbench, polars and duckdb are roughly comparable (with polars edging out duckdb).

erikcw 3 hours ago | parent | next [-]

I use them both depending on which feels more natural for the task, often within the same project. The interop is easy and very high performance thanks to Apache Arrow: `df = duckdb.sql(sql).pl()` and `result = duckdb.sql("SELECT * FROM df")`.
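
A slightly fuller sketch of that pattern, with made-up data (polars and pyarrow installed):

```python
import duckdb
import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "amount": [10.0, 25.5, 7.25]})

# Polars -> DuckDB: the local variable `df` is picked up by name via a
# replacement scan, so no copy is made on the way in.
total = duckdb.sql("SELECT sum(amount) AS total FROM df").pl()

# DuckDB -> Polars: any relation converts back with .pl().
big_rows = duckdb.sql("SELECT * FROM df WHERE amount > 10").pl()

print(total)
print(big_rows)
```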

RobinL 7 hours ago | parent | prev [-]

Author here. I wouldn't argue SQL or duckdb is _more_ testable than polars. But I think historically people have criticised SQL as being hard to test. Duckdb changes that.
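
For example, a transformation can be unit tested against an in-memory duckdb with no infrastructure at all; a sketch with a made-up dedupe query:

```python
import duckdb

def dedupe_latest(con):
    # Hypothetical transformation under test: keep the latest row per id.
    return con.execute("""
        SELECT id, value
        FROM (
            SELECT *,
                   row_number() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
            FROM raw_events
        ) AS ranked
        WHERE rn = 1
        ORDER BY id
    """).fetchall()

def test_dedupe_latest():
    con = duckdb.connect()  # in-memory database, nothing to provision
    con.execute("CREATE TABLE raw_events (id INT, value TEXT, updated_at TIMESTAMP)")
    con.execute("""
        INSERT INTO raw_events VALUES
            (1, 'old',  '2024-01-01'),
            (1, 'new',  '2024-02-01'),
            (2, 'only', '2024-01-15')
    """)
    assert dedupe_latest(con) == [(1, 'new'), (2, 'only')]
```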

I disagree that SQL has nothing to do with fast. One of the most amazing things to me about SQL is that, since it's declarative, the same code has got faster and faster to execute as we've gone through better and better SQL engines. I've seen this through the past five years of writing and maintaining a record linkage library. It generates SQL that can be executed against multiple backends. My library gets faster and faster year after year without me having to do anything, due to improvements in the SQL backends that handle things like vectorisation and parallelisation for me. I imagine if I were to try and program the routines by hand, it would be significantly slower since so much work has gone into optimising SQL engines.

In terms of future proof - yes in the sense that the code will still be easy to run in 20 years time.

theLiminator 7 hours ago | parent [-]

> I disagree that SQL has nothing to do with fast. One of the most amazing things to me about SQL is that, since it's declarative, the same code has got faster and faster to execute as we've gone through better and better SQL engines.

Yeah, but SQL isn't really portable between all query engines; you always have to be speaking the right dialect. Also, SQL isn't the only "declarative" DSL: polars's LazyFrame API is similarly declarative. Technically Ibis's dataframe DSL also works as a multi-frontend declarative query language. Or even Substrait.

Anyway, my point is that SQL is not an inherently faster paradigm than "dataframes"; you're conflating declarative query planning with SQL.
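
To illustrate what I mean: a polars lazy query is just as declarative as SQL; nothing runs until collect(), and the optimiser is free to push filters and projections down first (file and columns here are invented):

```python
import polars as pl

# Building the query only constructs a logical plan; nothing executes yet.
lazy = (
    pl.scan_parquet("events.parquet")
    .filter(pl.col("amount") > 100)
    .group_by("user_id")
    .agg(pl.col("amount").sum().alias("total"))
)

print(lazy.explain())    # the optimised plan, with pushdown already applied
result = lazy.collect()  # only now does the engine execute it
```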