freehorse | 11 hours ago
I do not understand what the first tests are supposed to do. The author says:

> Your hypothesis is: layout influences signup behavior.

I would expect the null hypothesis to then be that *layout does not influence signup behavior*. An ANOVA (or an equivalent linear model) is what tests this hypothesis, with the 4 layouts (or the 4 new layouts plus a control?) as levels of one factor. If you get a significant p-value (no multiple tests required), you go on with post-hoc tests comparing the individual layouts (for 4 layouts, that is 6 comparisons). For those you can use corrections for multiple comparisons that are less strict than just dividing your threshold by the number of comparisons, e.g. Tukey's test. But here I assume there is a control (as in, some users are still shown the old layout?) and each layout is compared to that control?

If I saw that distribution of p-values I would intuitively think the experiment is underpowered. P-values from null tests should be distributed uniformly between 0 and 1, while these cluster around 0.05. It rather looks like a situation where it is hard to make inferences because of issues in the experimental design itself. For example, I would rather have fewer layouts, driven by some expert design knowledge, than a lot of random-ish layouts. That increases statistical power, because the fewer tests you run, the less you have to adjust your p-values. And the fewer layouts you have, the more users you have per group (since the test is between groups), which also increases statistical power.

The article is not wrong overall about how to control p-values etc., but this knowledge is important not just for "doing the right analysis" but, even more importantly, for understanding the limitations of an experimental design and structuring it so that it may actually tell you something. To this end, G*Power [0] is a useful tool that can, e.g., calculate the required sample size in advance from a predicted effect size and the desired power.

[0] https://www.psychologie.hhu.de/arbeitsgruppen/allgemeine-psy...
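A rough sketch of the workflow described above (not code from the article): an omnibus one-way ANOVA across the layout groups, Tukey's HSD only if the omnibus test is significant, and an up-front sample-size calculation of the kind G*Power performs. The layouts, signup rates, group size, and target effect size are all invented for illustration.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.power import NormalIndPower

rng = np.random.default_rng(42)

# Simulated 0/1 signup indicators for four hypothetical layouts.
# (For binary outcomes a proportions test or logistic model would also work;
# ANOVA is used here to mirror the "one factor, four levels" framing.)
layouts = ["A", "B", "C", "D"]
true_rates = [0.10, 0.10, 0.12, 0.11]          # assumed, for the simulation
n_per_group = 2000
samples = {name: rng.binomial(1, p, n_per_group)
           for name, p in zip(layouts, true_rates)}

# Omnibus test first: does layout influence signup behavior at all?
f_stat, p_omnibus = stats.f_oneway(*samples.values())
print(f"ANOVA: F={f_stat:.2f}, p={p_omnibus:.4f}")

# Only if the omnibus test is significant, inspect the 6 pairwise contrasts
# with Tukey's HSD, which controls the family-wise error rate less harshly
# than a plain Bonferroni split of the threshold.
if p_omnibus < 0.05:
    values = np.concatenate(list(samples.values()))
    groups = np.repeat(layouts, n_per_group)
    print(pairwise_tukeyhsd(values, groups, alpha=0.05))

# Power analysis up front: users per group needed to detect a small effect
# (Cohen's h of about 0.06, i.e. roughly a 10% -> 12% lift) at 80% power.
n_needed = NormalIndPower().solve_power(effect_size=0.06, power=0.8, alpha=0.05)
print(f"~{n_needed:.0f} users per group for 80% power")
```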
majormajor | 11 hours ago | parent
The fraction of A/B tests I've seen personally that mentioned ANOVA at all, or thought that critically about experiment design, is very small. Understanding of p-values is also generally poor; prob/stat education in engineering and business degrees seems to be the least-covered and least-respected kind of math.

Even at places that want to ruthlessly prioritize velocity over rigor, I think it would be better to at least switch things up and worry more about effect size than p-value. Don't bother waiting to see whether marginal effects are "significant" statistically if they aren't significant from the POV of "we need to do things that can 10x our revenue since we're a young startup."
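A minimal sketch of the effect-size point, with made-up numbers: at large sample sizes a tiny lift can clear p < 0.05 while being irrelevant to a 10x goal.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical A/B result: variant vs control signups out of 1M visitors each.
conversions = [10_300, 10_000]
visitors = [1_000_000, 1_000_000]

z, p = proportions_ztest(conversions, visitors)
lift = (conversions[0] / visitors[0]) / (conversions[1] / visitors[1]) - 1
# p comes out around 0.03 -- "significant" -- but the relative lift is only ~3%.
print(f"p = {p:.3f}, relative lift = {lift:.1%}")
```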