Excellent article - except that the author probably should have gated their substantiation of the claim behind a cliffhanger, as other commenters have mentioned.

The author's priorities are sensible, and indeed with that set of priorities, it makes sense to end up near R. However, they're not universal among data scientists. I've been a data scientist for eight years, and have found that this kind of plotting and dataframe wrangling is only part of the work. I find there is usually also some file juggling, parsing, and what the author calls "logistics". And R is terrible at logistics. It's also bad at writing maintainable software.

If you care more about logistics and maintenance, your conclusion is pushed towards Python - which still does okay in the dataframes department. If you're ALSO frequently concerned about speed, you're pushed towards Julia.

None of these are wrong priorities. I wish Julia was better at being R, but it isn't, and it's very hard to be both R and useful for general programming.

Edit: Oh, and I should mention: I also teach and supervise students, and I KEEP seeing students use pandas to solve non-table problems, like trying to represent a graph as a dataframe. Apparently some people are heavily drawn to use dataframes for everything - if you're one of those people, reevaluate your tools, but also, R is probably for you.

▲

ActorNightly 12 hours ago | parent | next [-]

>Excellent article

Except its not. Data science in python pretty much requires you to use numpy. So his example of mean/variance code is a dumb comparison. Numpy has mean and variance functions built in for arrays.

Even when using raw python in his example, some syntax can be condesed quite a bit:

groups = defaultdict(list) [groups[(row['species'], row['island'])].append(row['body_mass_g']) for row in filtered]

It takes the same amount of mental effort to learn python/numpy as it does with R. The difference is, the former allows you to integrate your code into any other applicaiton.

	▲	dragonwriter 10 hours ago \| parent \| next [-]
		> Numpy has mean and variance functions built in for arrays. Even outside of Numpy, the stdlib has the statistics packages which provides mean, variance, population/sample standard deviation, and other statistics functions for normal iterables. The attempt to make Python out-of-the-box code look bad was either deliberately constructed to exaggerate the problems complained of, or was the product of a very convenient ignorance of the applicable parts of Python and its stdlib.
	▲	ModernMech 10 hours ago \| parent \| prev [-]
		I dunno. Numpy has its own data types, its own collections, its own semantics which are all different enough from Python, I think it's fair to consider it a DSL on its own. It'd be one thing if it was just, operator overloading to provide broadcasting for python, but Numpy's whole existence is to patch the various shortcomings Python has in DS.

▲

a_bonobo 3 hours ago | parent | prev [-]

>I find there is usually also some file juggling, parsing, [...]

I'd say I'm 50/50 Python/R for exactly this reason: I write Python code on HPC or a server to parse many, many files, then I get some kind of MB-scale summary data I analyse locally in R.

R is not good at looping over hundreds of files in the gigabytes, Python is not good at making pretty insights from the summary. A tool for every task.