| ▲ | edschofield 4 hours ago |
| The design of Pandas is inferior in every way to Polars: API, memory use, speed, expressiveness. Pandas has been strictly worse since late 2023 and will never close the gap. Polars is multithreaded by default, written in a low-level language, has a powerful query engine, supports lazy, out-of-memory execution, and isn’t constrained by any compatibility concerns with a warty, eager-only API and pre-Arrow data types that aren’t nullable. It’s probably not worth incurring the pain of a compatibility-breaking Pandas upgrade. Switch to Polars instead for new projects and you won’t look back. |
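To make the point about data types that aren't nullable concrete, here is a minimal sketch, assuming only that pandas is installed. With the default NumPy-backed inference a missing value forces an integer column to float64, while the opt-in `Int64` extension dtype keeps the integers and marks the gap as missing. The variable names are illustrative.

```python
import pandas as pd

# Default NumPy-backed inference: None becomes NaN, so the ints become floats.
legacy = pd.Series([1, None, 3])

# Opt-in nullable extension dtype: values stay integers, the gap is pd.NA.
nullable = pd.Series([1, None, 3], dtype="Int64")

print(legacy.dtype)
print(nullable.dtype)
```

The nullable dtypes exist in pandas but have to be requested explicitly, which is the compatibility constraint the comment is pointing at.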
|
| ▲ | sampo 2 hours ago | parent | next [-] |
| Historically, 18 years ago, Pandas started as a project by someone working in finance to use Python instead of Excel, yet be nicer than using just raw Python dicts and Numpy arrays. For better or worse, like Excel and like the simpler programming languages of old, Pandas lets you overwrite data in place. Prepare some data:
  import pandas as pd
  import polars as pl
  df_pandas = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [10, 20, 30, 40, 50]})
  df_polars = pl.from_pandas(df_pandas)
And then:
  df_pandas.loc[1:3, 'b'] += 1
  df_pandas
     a   b
  0  1  10
  1  2  21
  2  3  31
  3  4  41
  4  5  50
Polars comes from a more modern data engineering philosophy, in which data is immutable. In Polars, if you ever wanted to do such a thing, you'd write a pipeline to process and replace the whole column:
  df_polars = df_polars.with_columns(
      pl.when(pl.int_range(0, pl.len()).is_between(1, 3))
      .then(pl.col("b") + 1)
      .otherwise(pl.col("b"))
      .alias("b")
  )
If you are just interactively playing around with your data, and want to do it in Python and not in Excel or R, Pandas might still hit the spot. Or use Polars, and if need be then temporarily convert the data to Pandas or even to a Numpy array, manipulate, and then convert back.
P.S. Polars has an optimization to overwrite a single value:
  df_polars[4, 'b'] += 5
  df_polars
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 10 │
│ 2 ┆ 21 │
│ 3 ┆ 31 │
│ 4 ┆ 41 │
│ 5 ┆ 55 │
└─────┴─────┘
But as far as I know, it doesn't allow slicing or anything. |
|
| ▲ | satvikpendem 41 minutes ago | parent | prev | next [-] |
| "If I have seen further, it is by standing on the shoulders of giants" - Isaac Newton Polars is great, but it is better precisely because it learned from all the mistakes of Pandas. Don't besmirch the latter just because it now has to deal with the backwards compatibility of those mistakes, because when it first started, it was revolutionary. |
| |
| ▲ | vegabook 27 minutes ago | parent | next [-] | | "revolutionary"? It just copied and pasted the decades-old R (previously "S") dataframe into Python, including all the paradigms (with worse ergonomics since it's not baked into the language). | | |
| ▲ | data-ottawa 3 minutes ago | parent [-] | | No other modern language will compete with R on ergonomics because of how it allows functions to read the context they’re called in, and S expressions are incredibly flexible. The R manual is great. To say pandas just copied it but worse is overly dismissive. The core of pandas has always been indexing/reindexing, split-apply-combine, and slicing views. It’s a different approach than R’s data tables or frames. |
| |
| ▲ | Xunjin 36 minutes ago | parent | prev [-] | | Indeed, even Rust was created by learning from the mistakes of memory management and from known patterns like the famous RAII. |
|
|
| ▲ | data-ottawa 13 minutes ago | parent | prev | next [-] |
| Pandas deserves a ton of respect in my opinion. I built my career on knowing it well and using it daily for a decade, so I’m biased. Pandas created the modern Python data stack when there were not really any alternatives (except R and closed source). The original split-apply-combine paradigm was well thought out, simple, and effective, and the built-in tools to read pretty much anything (including all of your awful csv files and excel tables) and deal with timestamps easily made it fit into tons of workflows. It pioneered a lot, and basically still serves as the foundation and common format for the industry. I always recommend every member of my teams read Modern Pandas by Tom Augspurger when they start, as it covers all the modern concepts you need to get data work done fast and with high quality. The concepts carry over to polars. And I have to thank the pandas team for being a very open and collaborative bunch. They’re humble and smart people, and every PR or issue I’ve interacted with them on has been great. Polars is undeniably great software, it’s my standard tool today. But they did benefit from the failures and hard edges of pandas, pyspark, dask, the tidyverse, and xarray. It’s an advantage pandas didn’t have, and they still pay for that. I’m not trying to take away from polars at all. It’s damn fast; the benchmarks are hard to beat. I’ve been working on my own library and basically every optimization I can think of is already implemented in polars. I do have a concern with their VC funding/commercialization with cloud. The core library is MIT licensed, but knowing they’ll always have this feature wall when you want to scale is not ideal. I think it limits the future of the library a lot, and I think long term someone will fill that niche and the users will leave. |
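For readers who have not met the split-apply-combine paradigm mentioned above, a minimal pandas sketch looks like this. The column names and values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "score": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Split rows by team, apply a mean to each group, combine into one Series.
mean_by_team = df.groupby("team")["score"].mean()

print(mean_by_team.to_dict())
```

The same three steps map onto `group_by(...).agg(...)` in Polars, which is one reason the concepts carry over.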
|
| ▲ | noo_u an hour ago | parent | prev | next [-] |
| Polars took a lot of ideas from Pandas and made them better - calling it "inferior in every way" is all sorts of disrespectful :P Unfortunately, there are a lot of third party libraries that work with Pandas that do not work with Polars, so the switch, even for new projects, should be done with that in mind. |
| |
| ▲ | skylurk an hour ago | parent | next [-] | | Luckily, polars has .to_pandas() so you can still pass pandas dataframes to the libraries that really are still stuck on that interface. I maintain one of those libraries and everything is polars internally. | | |
| ▲ | adolph 6 minutes ago | parent | next [-] | | > pandas dataframes Didn't Pandas move to Arrow, matching Polars, in version 2? | |
| ▲ | noo_u an hour ago | parent | prev [-] | | to_pandas has a dependency on pandas - it is not the biggest of deals, but worth keeping in mind. |
| |
| ▲ | 21 minutes ago | parent | prev [-] | | [deleted] |
|
|
| ▲ | rich_sasha 3 hours ago | parent | prev | next [-] |
| I almost fully agree. I would add that Pandas API is poorly thought through and full of footguns. Where I certainly disagree is the "frame as a dict of time series" setting, and general time series analysis. The feel is also different. Pandas is an interactive data analysis container, poorly suited for production use. Polars I feel is the other way round. |
| |
| ▲ | thelastbender12 2 hours ago | parent | next [-] | | I think that's a fair opinion, but I'd argue against it being poorly thought out - pandas HAS to stick with older api decisions (dating back to before data science was a mature enough field, and it has pandas to thank for much of it) for backwards compatibility. | | |
| ▲ | ohyoutravel 2 hours ago | parent | next [-] | | Well this is like saying Python must maintain backwards compatibility with Python 2 primitives for all time. It’s simply not true. It’s not easy to deprecate an old API, but it’s doable and there are playbooks for it. Pandas is good, I’ve used it extensively, but agree it’s not fit for production use. They could catch up to the state of the art, but that requires them being very opinionated and willing to make some unpopular decisions for the greater good. | | |
| ▲ | cruffle_duffle 9 minutes ago | parent [-] | | Why though? polars sounds like the rewrite! It’s okay to cycle into a new library. Let pandas do its thing and polars slowly take over as new projects overtake. There is nothing wrong with this and it happens all the time. Like jquery, which hasn’t fundamentally changed since I was a wee lad doing web dev. They didn’t make major changes despite their approach to web dev being replaced by newer concepts found on angular, backbone, mustache, and eventually react. And that is a good thing. What I personally don’t want is something like angular that basically radically changed between 1.0 and 2.0. Might as well just call 2.0 something new. Note: I’ve never heard of polars until this comment thread. Can’t wait to try it out. |
| |
| ▲ | ptman 2 hours ago | parent | prev [-] | | 3.0 is the perfect place to break compat |
| |
| ▲ | sirfz 3 hours ago | parent | prev [-] | | I think that's a sane take. Indeed, I think most data analysts find it much easier to use pandas over polars when playing with data (mainly the bracket syntax is faster and mostly sensible) |
|
|
| ▲ | v3ss0n 4 hours ago | parent | prev | next [-] |
| Sounds too much like an advertisement.
Also, we need to watch out when diving into Polars. Polars is a VC-backed open-source project with a cloud offering, which may become an open-core project - we know how those go. |
| |
| ▲ | gkbrk 3 hours ago | parent [-] | | > we know how those go They get forked and stay open source? At least this is what happens to all the popular ones. You can't really un-open-source a project if users want to keep it open-source. | | |
| ▲ | stingraycharles 3 hours ago | parent [-] | | Depends on your definition of popular; plenty of examples where the business interests don't align well with open source. |
|
|
|
| ▲ | rdedev 17 minutes ago | parent | prev | next [-] |
| While polars is better if you work with predefined data formats, pandas is imo still better as a general purpose table container. I work with chemical datasets and this always involves converting SMILES strings to RDKit Molecule objects. Polars cannot do this as simply as calling .map on pandas. Pandas is also much better for EDA. So calling it worse in every instance is not true. If you are doing pure data manipulation then go ahead with polars |
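A rough sketch of the `.map` workflow described above, assuming only pandas. RDKit is not imported here: `FakeMolecule` and `parse_smiles` are hypothetical stand-ins for an RDKit molecule and a parser such as `Chem.MolFromSmiles`.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class FakeMolecule:
    """Hypothetical stand-in for an RDKit molecule object."""
    smiles: str


def parse_smiles(smiles: str) -> FakeMolecule:
    # Placeholder for a real parser such as RDKit's Chem.MolFromSmiles.
    return FakeMolecule(smiles)


smiles = pd.Series(["CCO", "c1ccccc1"], name="smiles")

# Series.map applies the function per element and stores arbitrary
# Python objects in the resulting column.
molecules = smiles.map(parse_smiles)

print(molecules.iloc[0])
```

The resulting column holds ordinary Python objects, which is the "general purpose table container" behaviour the comment values.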
|
| ▲ | lairv 2 hours ago | parent | prev | next [-] |
| I would agree if not for the fact that polars is not compatible with Python multiprocessing when using the default fork method. The following script hangs forever (the pandas equivalent runs):
  import polars as pl
  from concurrent.futures import ProcessPoolExecutor
  pl.DataFrame({"a": [1,2,3], "b": [4,5,6]}).write_parquet("test.parquet")
  def read_parquet():
      x = pl.read_parquet("test.parquet")
      print(x.shape)
  with ProcessPoolExecutor() as executor:
      futures = [executor.submit(read_parquet) for _ in range(100)]
      r = [f.result() for f in futures]
Using a thread pool or the "spawn" start method works, but it makes polars a pain to use inside e.g. a PyTorch dataloader |
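One way to select the "spawn" start method mentioned above, without changing the global default, is to pass a context to the executor. This is a minimal sketch using only the standard library; the helper name `make_spawn_executor` is made up for illustration.

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor


def make_spawn_executor(max_workers: int = 4) -> ProcessPoolExecutor:
    """Build a process pool whose workers start fresh instead of forking."""
    # "spawn" launches a new interpreter per worker, so no locks or threads
    # from the parent process are inherited.
    spawn_ctx = mp.get_context("spawn")
    return ProcessPoolExecutor(max_workers=max_workers, mp_context=spawn_ctx)
```

With "spawn", the submitted function has to be importable by the worker (defined at module top level), and the script entry point should sit under `if __name__ == "__main__":`. PyTorch's `DataLoader` takes a `multiprocessing_context` argument that serves the same purpose.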
| |
| ▲ | skylurk an hour ago | parent | next [-] | | You are not wrong, but for this example you can do something like this to run in threads:
  import polars as pl
  pl.DataFrame({"a": [1, 2, 3]}).write_parquet("test.parquet")
  def print_shape(df: pl.DataFrame) -> pl.DataFrame:
      print(df.shape)
      return df
  lazy_frames = [
      pl.scan_parquet("test.parquet")
      .map_batches(print_shape)
      for _ in range(100)
  ]
  pl.collect_all(lazy_frames, comm_subplan_elim=False)
(comm_subplan_elim is important) | |
| ▲ | ritchie46 an hour ago | parent | prev | next [-] | | Python 3.14 "spawns" by default. However, this is not a Polars issue. Using "fork" can leave ANY MUTEX in the system process invalid (a multi-threaded query engine has plenty of mutexes). It is highly unsafe and relies on the assumption that none of the libraries in your process holds a lock at that time. That's not an assumption for PyTorch dataloaders to make. | |
| ▲ | schmidtleonard an hour ago | parent | prev [-] | | I can't believe parallel processing is still this big of a dumpster fire in python 20 years after multi-core became the rule rather than the exception. Do they really still not have a good mechanism to toss a flag on a for loop to capture embarrassing parallelism easily? | | |
| ▲ | skylurk an hour ago | parent | next [-] | | This is one of the reasons I use polars. | |
| ▲ | ritchie46 an hour ago | parent | prev | next [-] | | Polars does that for you. | |
| ▲ | lairv an hour ago | parent | prev [-] | | Well I think ProcessPoolExecutor/ThreadPoolExecutor from concurrent.futures were supposed to be that |
|
|
|
| ▲ | 2 hours ago | parent | prev | next [-] |
| [deleted] |
|
| ▲ | bhadass 2 hours ago | parent | prev [-] |
| why not just go full bore to duckdb? |
| |
| ▲ | vegabook 13 minutes ago | parent [-] | | because method chaining in Polars is much more composable and ergonomic than SQL once the pipeline gets complex which makes it superior in an exploratory "data wrangling" environment. While DuckDB now has its own new expressions pipeline implementation it's way worse than Polars'. DuckDB has other advantages though but Polars is a much cleaner Pandas replacement. Earlier versions of DuckDB were also crashy whereas polars feels carved out of granite. |
|