RobinL 15 hours ago

I think a lot of this comes down to the question: Why aren't tables first class citizens in programming languages?

If you step back, it's kind of weird that there's no mainstream programming language that has tables as first class citizens. Instead, we're stuck learning multiple APIs (polars, pandas) which are effectively programming languages for tables.

R is perhaps the closest, because it has data.frame as a 'first class citizen', but most people don't seem to use it, and use e.g. tibbles from dplyr instead.

The root cause seems to be that we still haven't figured out the best language for manipulating tabular data (i.e. the way of expressing this). It feels like there's been some convergence on some common ideas. Polars is kind of similar to dplyr. But there's no standard, except perhaps SQL.

FWIW, I agree that Python is not great, but I think it's also true that R is not great. I don't agree with the specific comparisons in the piece.

RodgerTheGreat 14 hours ago | parent | next [-]

There are a number of dynamic languages to choose from where tables/dataframes are truly first-class datatypes: perhaps most notably Q[0]. There are also emerging languages like Rye[1] or my own Lil[2].

I suspect that in the fullness of time, mainstream languages will eventually fully incorporate tabular programming in much the same way they have slowly absorbed a variety of idioms traditionally seen as part of functional programming, like map/filter/reduce on collections.

[0] https://en.wikipedia.org/wiki/Q_(programming_language_from_K...

[1] https://ryelang.org/blog/posts/comparing_tables_to_python/

[2] http://beyondloom.com/tools/trylil.html

liveranga 6 hours ago | parent | next [-]

Nushell is another one with tables built-in:

https://www.nushell.sh/book/working_with_tables.html

middayc 11 hours ago | parent | prev [-]

Another page about Rye tables: https://ryelang.org/cookbook/working-with/tables/

OkayPhysicist 9 hours ago | parent | prev | next [-]

There's a number of structures that I think are missing in our major programming languages. Tables are one. Matrices are another. Graphs, and relatedly, state machines are tools that are grossly underused because of bad language-level support. Finally, not a structure per se, but I think most languages that are batteries-included enough to include a regex engine should have a full-fledged PEG parsing engine. Most, if not all, regex horror stories derive from a simple "regex is built in".
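
(As a toy illustration of that failure mode, assuming only Python's built-in re module: a nested quantifier backtracks exponentially on a failing match, so each extra character roughly doubles the running time.)

    import re
    import time

    # Classic catastrophic-backtracking pattern: nested quantifiers over the same
    # character. On a failing input the engine tries every way to split the 'a's
    # between the inner and outer groups before giving up.
    pattern = re.compile(r"(a+)+$")

    for n in (18, 20, 22):
        text = "a" * n + "!"          # the trailing '!' guarantees the match fails
        start = time.perf_counter()
        assert pattern.match(text) is None
        print(f"n={n}: {time.perf_counter() - start:.3f}s")  # roughly doubles per extra 'a'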

What tools are easily available in a language, by default, shape the happy path, and by extension, the entire feel of the language. An example that we've largely come around on is key-value stores. Today, they're table stakes for a standard library. Go back to the '90s, and the most popular languages at best treated them as second-class citizens, more like imported objects than something fundamental like arrays. Sure, you can implement a hash map in any language, or import someone else's implementation, but oftentimes you'll instead end up with nightmarish, hopefully-synchronized arrays, because those are built in and ready at hand.

jltsiren 6 hours ago | parent | next [-]

When there is no clear canonical way of implementing something, adding it to a programming language (or a standard library) is risky. All too often, you realize too late that you made a wrong choice, and then you add a second version. And a third. And so on. And then you end up with a confusing language full of newbie traps.

Graphs are a good example, as they are a large family of related structures. For example, are the edges undirected, directed, or something more exotic? Do the nodes/edges have identifiers and/or labels? Are all nodes/edges of the same type, or are there multiple types? Can you have duplicate edges between the same nodes? Does that depend on the types of the nodes/edges, or on the labels?
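
(A hedged Python sketch of just two points in that design space; the class and field names are invented for illustration:)

    from dataclasses import dataclass, field

    # Design point 1: simple undirected graph, anonymous edges, no duplicates.
    @dataclass
    class SimpleGraph:
        adjacency: dict[int, set[int]] = field(default_factory=dict)

        def add_edge(self, u: int, v: int) -> None:
            self.adjacency.setdefault(u, set()).add(v)
            self.adjacency.setdefault(v, set()).add(u)

    # Design point 2: directed multigraph with typed, labelled edges.
    @dataclass
    class LabelledEdge:
        src: str
        dst: str
        kind: str                              # e.g. "follows", "blocks"
        label: dict = field(default_factory=dict)

    @dataclass
    class MultiDiGraph:
        edges: list[LabelledEdge] = field(default_factory=list)

        def add_edge(self, edge: LabelledEdge) -> None:
            # Duplicate edges between the same pair of nodes are allowed here.
            self.edges.append(edge)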

throwaway2037 9 hours ago | parent | prev [-]

    > There's a number of structures that I think are missing in our major programming languages. Tables are one. Matrices are another.

I disagree. Most programmers will go their entire career and never need a matrix data structure. Sure, they will use libraries that use matrices, but never use them directly themselves. It seems fine that matrices are not a separate data type in most modern programming languages.

OkayPhysicist 9 hours ago | parent [-]

Unless you think "most programmers" === "shitty webapp developers", I strongly disagree. Matrices are first class, important components in statistics, data analysis, graphics, video games, scientific computing, simulation, artificial intelligence and so, so much more.

And all of those programmers are either using specialized languages (suffering problems when they want to turn their program into a shitty web app, for example), or committing crimes against syntax like

    rotation_matrix.matmul(vectorized_cat)
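
(For what it's worth, Python grew a dedicated operator for exactly this case in PEP 465; a small sketch assuming numpy, with the matrix and vector values made up:)

    import numpy as np

    # A 2-D rotation by 90 degrees and a vector to rotate.
    rotation_matrix = np.array([[0.0, -1.0],
                                [1.0,  0.0]])
    vectorized_cat = np.array([1.0, 0.0])

    # Method-call spelling vs. the dedicated matrix-multiplication operator (PEP 465).
    result_method = rotation_matrix.dot(vectorized_cat)
    result_operator = rotation_matrix @ vectorized_cat

    assert np.allclose(result_method, result_operator)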

lock1 7 hours ago | parent | next [-]

That's needlessly aggressive. Ignoring webapps, you could do gamedev without even knowing what a matrix is.

You don't even need such constructs in most native applications, embedded systems, or OS kernel development.

throwaway2037 4 hours ago | parent | next [-]

This is exactly my point. Even in a highly specialised library for pricing securities, the amount of code that uses matrices is surprisingly small.

theamk 4 hours ago | parent | prev [-]

I am working in embedded. Had to optimize weights for an embedded algorithm, decided to use linear regression and thus needed matrices.

And if you do robotics, the chances of encountering a matrix are very high.
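
(A minimal sketch of that kind of weight fit, assuming numpy and entirely made-up calibration data:)

    import numpy as np

    # Made-up calibration data: each row of X is one set of sensor readings,
    # y is the reference value we want the weighted sum to reproduce.
    X = np.array([[1.0, 0.2, 3.1],
                  [0.9, 0.4, 2.8],
                  [1.1, 0.1, 3.3],
                  [1.0, 0.3, 3.0]])
    y = np.array([4.2, 4.0, 4.4, 4.1])

    # Least-squares fit of the weights: minimize ||X w - y||^2.
    weights, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    print(weights)   # the optimized weights for the embedded algorithm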

habinero 3 hours ago | parent | prev [-]

I don't see why the majority of engineers need to cater to your niche use cases. It's a programming language, you can just make the library if it doesn't exist. Nobody's stopping you.

Plus, plenty of third party projects have been incorporated into the Python standard library.

genidoi 2 hours ago | parent | prev | next [-]

This is an interesting observation. One possible explanation for the lack of robust first-class table manipulation in mainstream languages is the large variance in real-world table sizes and the mutually exclusive subproblems that come with each jump in order-of-magnitude row count.

The problems one might encounter dealing with a 1m-row table are quite different from those of a 1b-row table, and a 1b-row table is a rounding error compared to the problems a 1t-row table presents. A standard library needs to support these massive variations at least somewhat gracefully, and that's not a trivial API surface to design.

don-bright 3 hours ago | parent | prev | next [-]

Every copy of Microsoft Excel includes Power Query, which is written in the M language and has tables as a type. Programs are essentially transformations of table columns and rows. Not sure if it's mainstream, but it is widely available. The M language is also included in other tools like Power BI and Power Automate.

riskassessment 10 hours ago | parent | prev | next [-]

> R is perhaps the closest, because it has data.frame as a 'first class citizen', but most people don't seem to use it, and use e.g. tibbles from dplyr instead.

Everyone in R uses data.frame because tibble (and data.table) inherits from data.frame. This means that "first class" (base R) functions work directly on tibble/data.table. It also makes it trivial to convert between tibble, data.table, and data.frames.

maest 7 hours ago | parent | prev | next [-]

> Why aren't tables first class citizens in programming languages?

They are in q/kdb and it's glorious. SQL expressions are also first-class citizens, and it makes it very pleasant to write code.

paddleon 14 hours ago | parent | prev | next [-]

> R is perhaps the closest, because it has data.frame as a 'first class citizen', but most people don't seem to use it, and use e.g. tibbles from dplyr instead.

You're forgetting R's data.table, https://cran.r-project.org/web/packages/data.table/vignettes...,

which is amazing. Tibbles only won because they fought the docs/onboarding battle better, and dplyr ended up getting industry buy-in.

elehack 11 hours ago | parent | next [-]

And readability. data.table is very capable, but the incantations to use it are far less obvious (both for reading and writing) than dplyr.

But you can have the best of both worlds with https://dtplyr.tidyverse.org/, using data.table's performance improvements with dplyr syntax.

extr 11 hours ago | parent | prev [-]

Yeah data.table is just about the best-in-class tool/package for true high-throughput "live" data analysis. Dplyr is great if you are learning the ropes, or want to write something that your colleagues with less experience can easily spot check. But in my experience if you chat with people working in the trenches of banks, lenders, insurance companies, who are running hundreds of hand-spun crosstabs/correlational analyses daily, you will find a lot of data.table users.

Relevant to the author's point, Python is pretty poor for this kind of thing. Pandas is a perf mess. Polars, duckdb, dask etc, are fine perhaps for production data pipelines but quite verbose and persnickety for rapid iteration. If you put a gun to my head and told me to find some nuggets of insight in some massive flat files, I would ask for an RStudio cloud instance + data.table hosted on a VM with 256GB+ of RAM.
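
(For a concrete feel of the Python side, the same tiny group-by in pandas and recent polars, with invented column names; the data.table equivalent is a single bracket expression like DT[, mean(exposure), by = segment]:)

    import pandas as pd
    import polars as pl

    # Made-up example: average exposure per segment.
    pdf = pd.DataFrame({"segment": ["a", "a", "b"], "exposure": [1.0, 2.0, 3.0]})
    print(pdf.groupby("segment", as_index=False)["exposure"].mean())

    # The same thing in polars (group_by in recent versions; older ones spell it groupby).
    pldf = pl.DataFrame({"segment": ["a", "a", "b"], "exposure": [1.0, 2.0, 3.0]})
    print(pldf.group_by("segment").agg(pl.col("exposure").mean()))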

riidom 10 hours ago | parent | prev | next [-]

PyTorch was originally just Torch, written in Lua. I didn't follow it too closely at the time, but apparently due to popular demand it was redone in Python, and voilà: PyTorch.

brikym an hour ago | parent | prev | next [-]

This. I really, really want some kind of data frame with actual compile-time typing my LSP/IDE can understand. Kusto Query Language (Azure Data Explorer) has it, and the auto-completion and error checking are extremely useful. But Kusto Query Language is really just limited to one cloud product.

nextos 14 hours ago | parent | prev | next [-]

I don't think this is the real problem. In R and Julia tables are great, and they are libraries. The key is that these languages are very expressive and malleable.

Simplifying a lot, R is heavily inspired by Scheme, with some lazy evaluation added on top. Julia is another take at the design space first explored by Dylan.

Iwan-Zotow 6 hours ago | parent [-]

R was a clone of S.

kelipso 15 hours ago | parent | prev | next [-]

People use data.table in R too (my favorite among those, but it’s been a few years). data.table compared to dplyr is quite a contrast in terms of the language used to manipulate tabular data.

jna_sh 15 hours ago | parent | prev | next [-]

I know the primary data structure in Lua is called a table, but I’m not very familiar with them, or with whether they map to what’s expected from tables in data science.

Jtsummers 15 hours ago | parent | next [-]

Lua's tables are associative arrays, at least fundamentally. There's more to it than that, but it's not the same as the tables/data frames people are using with pandas and similar systems. You could build that kind of framework on top of Lua's tables, though.

https://www.lua.org/pil/2.5.html
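
(A rough Python analogue of that idea, a toy column-oriented frame built on plain associative arrays; all names and data here are made up:)

    # Toy "data frame" built on associative arrays: a dict of equal-length columns.
    class TinyFrame:
        def __init__(self, columns: dict):
            lengths = {len(v) for v in columns.values()}
            assert len(lengths) == 1, "all columns must have the same length"
            self.columns = columns

        def filter(self, predicate):
            # Keep the rows for which predicate(row_dict) is true.
            n_rows = len(next(iter(self.columns.values())))
            keep = [i for i in range(n_rows)
                    if predicate({k: v[i] for k, v in self.columns.items()})]
            return TinyFrame({k: [v[i] for i in keep] for k, v in self.columns.items()})

    frame = TinyFrame({"name": ["ada", "bob"], "score": [92, 61]})
    passed = frame.filter(lambda row: row["score"] >= 70)
    print(passed.columns)   # {'name': ['ada'], 'score': [92]}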

TheSoftwareGuy 15 hours ago | parent | prev | next [-]

IIRC those are basically hash tables, which are first-class citizens in many languages already

IgorPartola 10 hours ago | parent | prev | next [-]

SQL is not just about a table but multiple tables and their relationships. If it was just about running queries against a single table then basic ordering, filtering, aggregation, and annotation would be easy to achieve in almost any language.

As soon as you start doing things like joins, it gets complicated, but in theory you could do something like an ORM-style API to do most things. Using just operators, you quickly run into the fact that you have to overload (abuse) operators or write a new language with different operator semantics:

  orders * customers | (customers.id == orders.customer_id) | (orders.amount > Decimal('10.00'))
Where * means cross product/outer join and | means filter. Once you add an ordering operator, a group by, etc. you basically get SQL with extra steps.
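
(A hedged Python sketch of how far operator overloading gets you today; the class and table names below are invented, and the "filter" here just renders a SQL-ish string to keep the sketch tiny:)

    class Condition:
        def __init__(self, column, op, value):
            self.column, self.op, self.value = column, op, value

    class Column:
        def __init__(self, name):
            self.name = name

        # Abuse comparison operators: they build Condition objects instead of booleans
        # (which also breaks normal equality/hashing for Column).
        def __eq__(self, other):
            return Condition(self.name, "=", other)

        def __gt__(self, other):
            return Condition(self.name, ">", other)

    class Table:
        def __init__(self, name):
            self.name = name

        def __or__(self, condition):   # '|' as "filter", matching the pseudocode above
            return (f"SELECT * FROM {self.name} "
                    f"WHERE {condition.column} {condition.op} {condition.value!r}")

    orders = Table("orders")
    amount = Column("amount")
    print(orders | (amount > 10))   # SELECT * FROM orders WHERE amount > 10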

But it would be nice to have it built in so talking to a database would be a bit more native.

sgarland 7 hours ago | parent [-]

Every time I see stuff like this (Google’s new SQL-ish language with pipes comes to mind), I am baffled. SQL to me is eminently readable, and flows beautifully.

For reference, I think the same is true of Python, so it’s not like I’m a Perl wizard or something.

RA_Fisher 9 hours ago | parent | prev | next [-]

R’s the best, because it’s been a statistical analysis language from the beginning in 1974 (and was built and developed for the purpose of analysis / modeling). Also, the tidyverse is marvelous. It provides major productivity gains in organizing and augmenting the data. Then there’s ggplot, the undisputed best graphical visualization system, plus built-ins like barplot() or plot().

But ultimately, data analysis is going beyond Python and R into the realm of Stan and PyMC3, probabilistic programming languages. That’s because we want to do nested integrals, and those software ecosystems (among other probabilistic programming languages) provide the best way to do it. They allow us to understand complex situations and make good / valuable decisions.
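
(A minimal PyMC-style sketch of what that looks like, assuming the pymc package and made-up measurements; API details differ slightly between PyMC3 and current PyMC:)

    import numpy as np
    import pymc as pm   # 'pymc3' in older releases

    observed = np.array([4.8, 5.1, 4.9, 5.3, 5.0])   # made-up measurements

    with pm.Model():
        # Prior over the unknown mean, likelihood over the observations.
        mu = pm.Normal("mu", mu=0.0, sigma=10.0)
        pm.Normal("obs", mu=mu, sigma=1.0, observed=observed)

        # Sampling integrates over the posterior instead of us writing the
        # nested integrals by hand. Current PyMC returns ArviZ InferenceData;
        # older PyMC3 returns a trace object instead.
        idata = pm.sample(1000, tune=1000, chains=2)

    print(idata.posterior["mu"].mean())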

kevinhanson 15 hours ago | parent | prev | next [-]

this is my biggest complaint about SAS--everything is either a table or text.

most procs use tables as both input and output, and you better hope the tables have the correct columns.

you want a loop? you either get an implicit loop over rows in a table, write something using syscalls on each row in a table, or you're writing macros (all text).

m_mueller 2 hours ago | parent | prev | next [-]

Fortran gives you that and more: it has first-class multidimensional arrays, including matrix operations.

127 10 hours ago | parent | prev | next [-]

Because there's no obvious universal optimal data structure for heterogeneous N-dimensional data with varying distributions? You can definitely do that, but it requires an order of magnitude more resource use as a baseline.

alexnewman 11 hours ago | parent | prev | next [-]

APL is great.

smartmic 2 hours ago | parent | next [-]

Agreed. I once used it for data preparation for a data science project (GNU APL). After a steep learning curve, it felt very much like writing math formulas — it was fun and concise, and I liked it very much. However, it has zero adoption in today's data science landscape. Sharing your work is basically impossible. If you're doing something just for yourself, though, I would probably give it a chance again.

7thaccount 10 hours ago | parent | prev [-]

Perfect solution for doing analysis on tables. Wes McKinney (the inventor of pandas) is rumored to have been inspired by it too.

My problem with APL is 1.) the syntax is less amazing at other, more mundane stuff, and 2.) the only production-worthy versions are all commercial. I'm not creating something that requires me to pay for a development license as well as distribution royalties.

CivBase 15 hours ago | parent | prev | next [-]

What is a table other than an array of structs?

thom 14 hours ago | parent | next [-]

It’s not that you can’t model data that way (or indeed with structs of arrays), it’s just that the user experience starts to suck. You might want a dataset bigger than RAM, or that you can transparently back by the filesystem, RAM or VRAM. You might want to efficiently index and query the data. You might want to dynamically join and project the data with other arrays of structs. You might want to know when you’re multiplying data of the wrong shapes together. You might want really excellent reflection support. All of this is obviously possible in current languages because that’s where it happens, but it could definitely be easier and feel more of a first class citizen.

RobinL 14 hours ago | parent | prev | next [-]

I would argue that's about how the data is stored. What I'm trying to express is the idea of the programming language itself supporting high-level tabular abstractions/transformations such as grouping, aggregation, joins, and so on.
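
(To make the contrast concrete, this is roughly what a join plus a grouped aggregation costs in plain Python today, with invented data:)

    from collections import defaultdict

    customers = [{"id": 1, "region": "north"}, {"id": 2, "region": "south"}]
    orders = [{"customer_id": 1, "amount": 10.0},
              {"customer_id": 1, "amount": 5.0},
              {"customer_id": 2, "amount": 7.5}]

    # "Join": index one side by key, then look up while scanning the other.
    by_id = {c["id"]: c for c in customers}
    joined = [{**o, "region": by_id[o["customer_id"]]["region"]} for o in orders]

    # "Group by + aggregate": accumulate totals per region by hand.
    totals = defaultdict(float)
    for row in joined:
        totals[row["region"]] += row["amount"]

    print(dict(totals))   # {'north': 15.0, 'south': 7.5}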

p1necone 11 hours ago | parent | next [-]

Implementing all of those things is an order of magnitude more complex than any other first-class primitive datatype in most languages, and there's no obvious "one right way" to do it that would fit everyone's use cases. Libraries and standalone databases seem like the way to do it, and that's what we do now.

camdenreslink 14 hours ago | parent | prev | next [-]

Sounds a lot like LINQ in .NET (which is usually compatible with ORMs actually querying tables).

CivBase 14 hours ago | parent | prev [-]

Ah, that makes more sense. Thanks for the clarification.

FridgeSeal 9 hours ago | parent | prev | next [-]

Well it could be a struct of arrays.

Nitpicking aside, a nice library for doing “table stuff” without “the whole ass big table framework” would be nice.

It’s not hard to roll this stuff by hand, but again, a nicer way wouldn’t be bad.

ModernMech 10 hours ago | parent | prev [-]

The difference is semantics.

What is a paragraph but an array of sentences? What is a sentence but an array of words? What's a word but an array of letters? You can do this all the way down. Eventually you need to assign meaning to things, and when you do, it helps to know what the thing actually is, specifically, because an array of structs can be many things that aren't a table.

ModernMech 10 hours ago | parent | prev | next [-]

It makes sense from a historical perspective. Tables are a thing in many languages, just not the ones that mainstream devs use. In fact, if you rank programming languages by usage outside of devs, the top languages all have a table-ish metaphor (SQL, Excel, R, Matlab).

The languages devs use are largely Algol-derived. Algol was a language used to express algorithms, which were largely abstractions over Turing machines, which are based around an infinite 1D tape of memory. This model of 1D memory was built into early computers, early operating systems, and early languages. We call it "mechanical sympathy".

Meanwhile, other languages at the same time were invented that weren't tied so closely to the machine, but were more for the purpose of doing science and math. They didn't care as much about this 1D view of the world. Early languages like Fortran and Matlab had notions of 2D data matrices because math and science had notions of 2D data matrices. Languages like C were happy to support these things by using an array of pointers because that mapped nicely to their data model.

The same thing can be said for 1-based and 0-based indexing -- languages like Matlab, R, and Excel are 1-based because that's how people index tables; whereas languages like C and Java are 0-based because that's how people index memory.

constantcrying 12 hours ago | parent | prev | next [-]

>Why aren't tables first class citizens in programming languages?

Matlab has them; in fact, it has multiple competing concepts of them.

dm319 9 hours ago | parent | prev | next [-]

Dplyr is quite happy with data.frame. R is built around tabular data. Other statistical languages are too, such as Stata.

getnormality 7 hours ago | parent | prev [-]

Saying that SQL is the standard for manipulating tabular data is like saying that COBOL is the standard for financial transactions. It may be true based on current usage, but nobody thinks it's a good idea long term. They're both based on the outdated idea that a programming language should look like pidgin English rather than math.

Iwan-Zotow 6 hours ago | parent [-]

In R, data.table is basically SQL in another shape.