curiousgal, 7 hours ago:
Semi-tangent, but I am curious: for those with more experience in Python, do you just pass around generic pandas DataFrames, or do you parse each row into an object and write logic that manipulates those instead?
tomtom1337, 6 hours ago:
Definitely do not parse each row into e.g. pydantic models. You lose the entire performance benefit of pandas/polars by doing this. If you need validation, use a dataframe validation library to ensure that values are within certain ranges. There are not yet good, fast implementations of proper types in Python dataframes (or databases, for that matter) that I am aware of.
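A minimal sketch of what that range validation might look like, assuming pandera as the validation library (the comment doesn't name one) and made-up column names:

```python
import pandas as pd
import pandera as pa

# Hypothetical schema: the columns and ranges are illustrative only.
schema = pa.DataFrameSchema({
    "age": pa.Column(int, pa.Check.in_range(0, 120)),
    "price": pa.Column(float, pa.Check.ge(0)),
})

df = pd.DataFrame({"age": [34, 51], "price": [9.99, 120.0]})
validated = schema.validate(df)  # raises SchemaError if any check fails
```

The validation stays columnar, so the DataFrame never gets exploded into per-row objects.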
lmeyerov, 7 hours ago:
Pass them as immutable values, and try to enforce a schema (e.g. Arrow) to keep them typed and predictable. This is generally easy: make sure the initial data loads get validated, and then basic testing of subsequent operations goes far. If Python had dependent types, that's how I'd think about them, and keeping them typed would be even easier, e.g. against nulls sneaking in unexpectedly and breaking numeric columns.

When using something like Dask, which forces stronger adherence to typings, this can get more painful.
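A rough sketch of validating an initial load against an Arrow schema, with a hypothetical "orders" table (the column names, types, and file path are assumptions for illustration):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical schema; columns and types are made up for the example.
ORDERS_SCHEMA = pa.schema([
    ("order_id", pa.int64()),
    ("amount", pa.float64()),
    ("created_at", pa.timestamp("us")),
])

def load_orders(path: str) -> pa.Table:
    table = pq.read_table(path)
    # cast() raises if columns are missing, reordered, or can't be safely
    # converted, so schema drift fails at the boundary rather than deep
    # inside a pipeline.
    table = table.cast(ORDERS_SCHEMA)
    # Explicit null check: a stray null is exactly the kind of thing that
    # silently breaks a numeric column downstream.
    if table.column("amount").null_count:
        raise ValueError("unexpected nulls in 'amount'")
    return table
```

Once the load is validated, later operations can assume typed, non-null columns without re-checking everywhere.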
adammarples, 6 hours ago:
Speaking personally, I try not to write code that passes around dataframes at all. I only really want to interact with them when I have to, in order to read/write Parquet.
whalesalad, 6 hours ago:
The circumstances where you would use one or the other are vastly different. A dataframe is an optimized data structure for dealing with columnar data: filtering, sorting, aggregating, etc. So if that is what you are dealing with, use a dataframe. The point is more about cleaning and massaging data at the perimeter (coming in and going out) than about which specific tool (a collection of objects vs. a dataframe) is used.
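A rough sketch of that perimeter pattern, with a made-up `Order` dataclass and column names: dataframes do the columnar work inside, and typed objects cross the boundary.

```python
from dataclasses import dataclass
import pandas as pd

# Hypothetical domain object used at the edges of the application.
@dataclass(frozen=True)
class Order:
    order_id: int
    amount: float

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    # Inside the boundary: stay columnar for filtering and aggregation.
    return (
        df[df["amount"] > 0]
        .groupby("order_id", as_index=False)["amount"]
        .sum()
    )

def to_objects(df: pd.DataFrame) -> list[Order]:
    # At the perimeter: hand typed objects to the rest of the code.
    return [
        Order(int(row.order_id), float(row.amount))
        for row in df.itertuples(index=False)
    ]
```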