lottin 5 days ago

Looking at the R code in this article, I'm having a hard time understanding the appeal of tidyverse.

ngriffiths 5 days ago | parent | next [-]

For me the appeal is less that tidyverse is great and more that the R standard library is horrible. It's full of esoteric names, inconsistent use and order of parameters, unreasonable default behavior, and behavior that surprises you if you come from other programming languages. It's all in a couple of massive packages instead of being broken up into manageable pieces.

Tidyverse is imperfect and it feels heavy-handed and awkward to replace all the major standard library functions, but Tidyverse stuff is way more ergonomic.
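
For concreteness, here are a few base R behaviors of the kind described above. These are illustrative picks, not an exhaustive list, and every name in the snippet (`df`, `doubled_l`, `one_col`, and so on) is made up for the example:

```r
df <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))

# Argument order differs between sibling functions:
doubled_l <- lapply(df$a, function(v) v * 2)   # data first, function second
doubled_m <- mapply(function(v) v * 2, df$a)   # function first, data second

# Selecting a single column silently drops the data frame to a plain vector ...
one_col <- df[, "a"]
# ... unless drop = FALSE is requested explicitly.
one_col_df <- df[, "a", drop = FALSE]

# sample() treats a single number n as the range 1:n, not as the value itself.
s <- sample(10, 3)
```

None of this is wrong as such, but each one has to be memorized separately.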

lottin 5 days ago | parent [-]

I think the R standard library is quite excellent. It pretty much follows the Unix philosophy of "doing one thing right". The only exception is `reshape`, which tries to do too many things, but it can usually be avoided. It isn't inconsistent. I think the problem is the lack of tutorials that explain how to use all the data manipulation tools effectively, because there are quite a lot of functions and it isn't easy to figure out how to use them together to accomplish practical things. Tidyverse may be consistent with itself, but it's inconsistent with everything else. Either you only use tidyverse, or your program looks like an inconsistent mess.

ngriffiths 4 days ago | parent [-]

Honestly, it might partly be that I've used R somewhat irregularly and I put a lot of value in design choices that "make sense" and are easier to remember. I'm sure once you are intimately familiar with the whole base language you can be really happy and productive with it.

> I think the problem is the lack of tutorials that explain how to use all the data manipulation tools effectively, because there are quite a lot of functions and it isn't easy to figure out how to use them together to accomplish practical things.

Most languages solve this problem by not cramming quite a lot of functions in one package and using shared design concepts to make it easier to fit them together. I don't think tutorials would solve these problems effectively but I guess it makes sense that they affect newer users the most.

> Tidyverse may be consistent with itself, but it's inconsistent with everything else.

Yeah, totally agree and I really dislike this part.

gjf 5 days ago | parent | prev | next [-]

Author here; I think I understand where you might be coming from. I find the functional nature of R combined with pipes incredibly powerful and elegant to work with.

OTOH in a pipeline, you're mutating/summarising/joining a data frame, and it's really difficult to look at it and keep track of what state the data is in. I try my best to write in a way that you understand the state of the data (hence the tables I spread throughout the post), but I do acknowledge it can be inscrutable.

lottin 5 days ago | parent [-]

A "pipe" is simply a composition of functions. Tidyverse adds a different syntax for doing function composition, using the pipe operator, which I don't particularly like. My general objection to Tidyverse is that it tries to reinvent everything but the end result is a language that is less practical and less transparent than standard R.
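
As a rough sketch of what that means, here is the same computation written as nested calls and as a pipeline. It uses base R's native `|>` (R 4.1 or later) so it runs without any packages; magrittr's `%>%` behaves the same way for a simple chain like this one. The helper `square` and the vector `x` are made up for the example.

```r
x <- c(3, 4, 12)

square <- function(v) v^2

# Ordinary function composition, read inside-out
nested <- sqrt(mean(square(x)))

# The same composition with the pipe, read left to right
piped <- x |> square() |> mean() |> sqrt()

# Both forms evaluate the same calls, so the results match exactly
stopifnot(identical(nested, piped))
```

Which of the two reads better is exactly the matter of taste being argued about here.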

mi_lk 5 days ago | parent [-]

Can you rewrite some of those snippets in standard R w/o Tidyverse? Curious what it would look like

lottin 5 days ago | parent | next [-]

I didn't rewrite the whole thing. But here's the first part. It uses the `histogram` function from the lattice package.

    library(lattice)  # provides histogram()

    population_data <- data.frame(
        uniform = runif(10000, min = -20, max = 20),
        normal = rnorm(10000, mean = 0, sd = 4),
        binomial = rbinom(10000, size = 1, prob = .5),
        beta = rbeta(10000, shape1 = .9, shape2 = .5),
        exponential = rexp(10000, .4),
        chisquare = rchisq(10000, df = 2)
    )
    
    histogram(~ values|ind, stack(population_data),
              layout = c(6, 1),
              scales = list(x = list(relation="free")),
              breaks = NULL)
    
    take_random_sample_mean <- function(data, sample_size) {
        x <- sample(data, sample_size)
        c(mean = mean(x), sd = sqrt(var(x)))
    }
    
    sample_statistics <- replicate(20000, sapply(population_data, take_random_sample_mean, 60))
    
    sample_mean <- as.data.frame(t(sample_statistics["mean", , ]))
    sample_sd <- as.data.frame(t(sample_statistics["sd", , ]))
    
    histogram(sample_mean[["uniform"]])
    histogram(sample_mean[["binomial"]])
    
    histogram(~values|ind, stack(sample_mean), layout = c(6, 1),
              scales = list(x = list(relation="free")),
              breaks = NULL)

kgwgk 5 days ago | parent | prev | next [-]

The following code essentially redoes what the code up to the first conf_interval block does there. Which one is more clear may be debatable but it's shorter by a factor of two and faster by a factor of ten (45 seconds vs 4 for me).

    ## population_dataB is not defined in this snippet; it is assumed to be a
    ## data frame with one column per population (like population_data above).
    sample_size <- 60
    sample_meansB <- lapply(population_dataB, function(x) {
        t(apply(replicate(20000, sample(x, sample_size)), 2,
                function(x) c(sample_mean = mean(x), sample_sd = sd(x))))
    })
    lapply(sample_meansB, head) ## check first rows

    population_data_statsB <- lapply(population_dataB, function(x) {
        c(population_mean = mean(x), population_sd = sd(x), n = length(x))
    })
    do.call(rbind, population_data_statsB) ## stats table

    cltB <- mapply(function(s, p) {
        (s[, "sample_mean"] - p["population_mean"]) / (p["population_sd"] / sqrt(sample_size))
    }, sample_meansB, population_data_statsB)
    head(cltB) ## check first rows

    small_sample_size <- 6
    repeated_samplesB <- lapply(population_dataB, function(x) {
        t(apply(replicate(10000, sample(x, small_sample_size)), 2,
                function(x) c(sample_mean = mean(x), sample_sd = sd(x))))
    })

    conf_intervalsB <- lapply(repeated_samplesB, function(x) {
        sapply(c(lower = 0.025, upper = 0.975), function(q) {
            x[, "sample_mean"] + qnorm(q) * x[, "sample_sd"] / sqrt(small_sample_size)
        })
    })

    within_ci <- mapply(function(ci, p) {
        p["population_mean"] > ci[, "lower"] & p["population_mean"] < ci[, "upper"]
    }, conf_intervalsB, population_data_statsB)
    apply(within_ci, 2, mean) ## coverage

One can do simple plots similar to the ones in that page as follows:

    par(mfrow=c(2,3), mex=0.8)
    for (d in colnames(population_dataB)) plot(density(population_dataB[,d], bw="SJ"), main=d, ylab="", xlab="", las=1, bty="n")
    for (d in colnames(cltB)) plot(density(cltB[,d], bw="SJ"), main=d, ylab="", xlab="", las=1, bty="n")
    for (d in colnames(cltB)) { qqnorm(cltB[,d], main=d, ylab="", xlab="", las=1, bty="n"); qqline(cltB[,d], col="red") }

apwheele 5 days ago | parent | prev [-]

I mean, for the main simulation I would do it like this:

    set.seed(10)
    n <- 10000; samp_size <- 60
    df <- data.frame(
        uniform = runif(n, min = -20, max = 20),
        normal = rnorm(n, mean = 0, sd = 4),
        binomial = rbinom(n, size = 1, prob = .5),
        beta = rbeta(n, shape1 = .9, shape2 = .5),
        exponential = rexp(n, .4),
        chisquare = rchisq(n, df = 2)
    )
    
    sf <- function(df,samp_size){
        sdf <- df[sample.int(nrow(df),samp_size),]
        colMeans(sdf)
    }
    
    sim <- t(replicate(20000,sf(df,samp_size)))

I am old, so I do not like tidyverse either, though I can concede it is a matter of personal preference. (Personally, I do not agree with the lattice vs ggplot comment, for example.)

pks016 5 days ago | parent | prev | next [-]

Somehow, tidyverse didn't click with me. I still use it sometimes. But now I primarily use base R and data.table

RA_Fisher 5 days ago | parent | prev | next [-]

Why? The tidyverse is so readable, elegant, compositional, functional and declarative. It allows me to produce a lot more and higher quality than I could without it. ggplot2 is the best visualization software hands down, and dplyr leverages Unix’s famous point free programming style (that reduces the surface area for errors).

lottin 5 days ago | parent [-]

I disagree. In this example tidyverse looks convoluted compared to just using an array and apply. ggplot2 is okay but we already had lattice. Lattice does everything ggplot2 does and produces much better-looking plots IMO.
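
A minimal sketch of the array-and-apply approach, assuming a single exponential population and fewer replications than the article to keep it quick. The names `population`, `samples` and `sample_means` are hypothetical:

```r
set.seed(1)

# One population of 10000 draws (the exponential case, rate 0.4)
population <- rexp(10000, rate = 0.4)

# 60 x 2000 matrix: each column is one random sample of size 60
samples <- replicate(2000, sample(population, 60))

# One mean per column, i.e. one sample mean per replication
sample_means <- apply(samples, 2, mean)

hist(sample_means, main = "Sample means (n = 60)", xlab = "")
```

For plain means, `colMeans(samples)` gives the same numbers and is faster; `apply` is the general form for any statistic.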

RA_Fisher 5 days ago | parent [-]

I like simplicity and I love a good base R idiom, but there's a lot less consistency in base R compared to the tidyverse (and that comes with a productivity penalty).

Lattice is really low-level. It's like doing vis with matplotlib (requires a lot of time and hair-pulling). Higher level interfaces boost productivity.

ekianjo 5 days ago | parent | prev [-]

the equivalent in any other language would be an ugly, unreadable, inconsistent mess.