crystal_revenge 6 hours ago

You can assert whatever you want, but Polars is a great answer. The performance improvements are secondary to me compared to the dramatic improvement in interface.

Today all serious DS work will ultimately become data engineering work anyway. The time when DS can just fiddle around in notebooks all day has passed.

this_user 4 hours ago | parent | next [-]

Pandas is widely adopted and deeply integrated into the Python ecosystem. Meanwhile, Polars remains a small niche, and it's one of those hype technologies that will likely be dead in 3 years once most of its users realise that it offers them no actual practical advantages over Pandas.

If you are dealing with huge data sets, you are probably using Spark or something like Dask already where jobs can run in the cloud. If you need speed and efficiency on your local machine, you use NumPy outright. And if you really, really need speed, you rewrite it in C/C++.

Polars is trying to solve an issue that just doesn't exist for the vast majority of users.

stdbrouw 4 hours ago | parent | next [-]

Arguably Spark solves a problem that does not exist anymore: single node performance with tools like DuckDB and Polars is so good that there’s no need for more complex orchestration anymore, and these tools are sufficiently user-friendly that there is little point to switching to Pandas for smaller datasets.

crystal_revenge 2 hours ago | parent | prev | next [-]

> Pandas is widely adopted and deeply integrated into the Python ecosystem.

This is pretty laughable. Yes there are very DS specific tools that make good use of Pandas, but `to_pandas` in Polars trivially solves this. The fact that Pandas always feels like injecting some weird DSL into existing Python code bases is one of the major reasons why I really don't like it.

> If you are dealing with huge data sets, you are probably using Spark or something like Dask already where jobs can run in the cloud. If you need speed and efficiency on your local machine, you use NumPy outright. And if you really, really need speed, you rewrite it in C/C++.

Have you used Polars at all? Or for that matter written significant Pandas outside of a notebook? The number one benefit of Polars, imho, is that Polars works using Expressions that allow you to trivially compose and reuse fundamental logic when working with data in a way that works well with other Python code. This solves the biggest problem with Pandas, which is that it does not abstract well.

Not to mention that Pandas is a really poor dataframe experience outside of its original use case, which was financial time series. The entire multi-index experience is awful and I know that either you are calling 'reset_index' multiple times in your Pandas logic or you have bugs.
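For anyone who hasn't run into this, a small sketch of the pattern being described, with hypothetical column names and data: grouping by two keys moves them out of the columns and into a `MultiIndex`, so the next step usually needs `reset_index`.

```python
import pandas as pd

# Hypothetical data, only for illustration.
df = pd.DataFrame(
    {
        "region": ["east", "east", "west", "west"],
        "product": ["a", "b", "a", "b"],
        "sales": [1, 2, 3, 4],
    }
)

# Grouping by two keys turns them into a MultiIndex on the result,
# so "region" and "product" are no longer regular columns.
grouped = df.groupby(["region", "product"]).sum()

# reset_index moves the index levels back into ordinary columns.
flat = grouped.reset_index()
```

Passing `as_index=False` to `groupby` avoids the round trip in this simple case, but other operations such as `pivot_table` or `concat` with `keys` can reintroduce a multi-index later in the pipeline.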

minimaxir 2 hours ago | parent | prev [-]

> once most of its users realise that it offers them no actual practical advantages over Pandas

What? Speed and better nested data support (arrays/JSON) alone are extremely useful to every data scientist.

My productivity skyrocketed after switching from pandas to polars.

SiempreViernes 2 hours ago | parent | prev [-]

>Today DS work will ultimately become data engineering work anyway.

Oh yeah? Well in my ivory tower the work stops being serious once it becomes engineering, how do you like that elitism?!

crystal_revenge an hour ago | parent [-]

"Data Science" has never been related to academic research, it has always emerged in a business context. I wouldn't say that researchers at DeepMind are "data scientists", they are academic researchers who focus on shipping papers. If you're in a pure research environment, nobody cares if you write everything in Matlab.

But the last startup I was at, which tried to take a similar approach to research, was unable to ship a functioning product and will likely disappear within a year. FAIR has been largely disbanded in favor of the way more shipping-centric MSL, and the people I know at DeepMind are increasingly finding themselves under pressure to actually produce things.

Since you've been hanging out in an ivory tower, you might be unaware that during the peak DS frenzy (2016-2019) there were companies where data scientists were allowed to live entirely in notebooks and it was someone else's problem to ship their notebooks. Today if you have that expectation you won't last long at most companies, if you can even find a job in the first place.

On top of that, I know quite a few people at the major LLM teams and, based on my conversations, all of them are doing pretty serious data engineering work to get things shipped even if they were hired for their modeling expertise. It's honestly hard to even run serious experiments at the scale of modern day LLMs without being pretty proficient at data engineering related tasks.