freehorse | 11 hours ago
I do not understand what the first tests are supposed to do. The author says:

> Your hypothesis is: layout influences signup behavior.

I would expect the null hypothesis to then be that *layout does not influence signup behavior*. An ANOVA (or an equivalent linear model) is what tests this hypothesis, with the 4 layouts (or the 4 new layouts plus a control?) as levels of one factor. If you get a significant p-value (no multiple tests required), you go on with post-hoc tests comparing the individual layouts (for 4 layouts, that is 6 comparisons). For those you can use corrections for multiple comparisons that are less strict than just dividing your threshold by the number of comparisons, e.g. Tukey's test. But here I assume there is a control (as in, some users are still shown the old layout?) and each layout is compared to that control?

If I saw that distribution of p-values I would intuitively think the experiment is underpowered. P-values from null tests should be distributed uniformly between 0 and 1, while these cluster around 0.05. It rather looks like a situation where it is hard to make inferences because of issues in the experimental design itself. For example, I would rather have fewer layouts, driven by some expert design knowledge, than a lot of random-ish layouts. That increases statistical power, because the fewer tests you run, the less you have to adjust your p-values. And the fewer layouts you have, the more users you have per group (since the test is between groups), which also increases statistical power.

The article is not wrong overall about how to control p-values etc., but this knowledge is important not just for "doing the right analysis" but, even more importantly, for understanding the limitations of an experimental design and structuring it so that it may actually tell you something. To this end, G*Power [0] is a useful tool that can, e.g., calculate the required sample size in advance from a predicted effect size and the desired power.

[0] https://www.psychologie.hhu.de/arbeitsgruppen/allgemeine-psy...
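A rough sketch of the workflow described above (not code from the article): an omnibus one-way ANOVA across the layout groups, Tukey's HSD only if the omnibus test is significant, and an up-front sample-size calculation of the kind G*Power performs. The layouts, signup rates, group size, and target effect size are all invented for illustration.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.power import NormalIndPower

rng = np.random.default_rng(42)

# Simulated 0/1 signup indicators for four hypothetical layouts.
# (For binary outcomes a proportions test or logistic model would also work;
# ANOVA is used here to mirror the "one factor, four levels" framing.)
layouts = ["A", "B", "C", "D"]
true_rates = [0.10, 0.10, 0.12, 0.11]          # assumed, for the simulation
n_per_group = 2000
samples = {name: rng.binomial(1, p, n_per_group)
           for name, p in zip(layouts, true_rates)}

# Omnibus test first: does layout influence signup behavior at all?
f_stat, p_omnibus = stats.f_oneway(*samples.values())
print(f"ANOVA: F={f_stat:.2f}, p={p_omnibus:.4f}")

# Only if the omnibus test is significant, inspect the 6 pairwise contrasts
# with Tukey's HSD, which controls the family-wise error rate less harshly
# than a plain Bonferroni split of the threshold.
if p_omnibus < 0.05:
    values = np.concatenate(list(samples.values()))
    groups = np.repeat(layouts, n_per_group)
    print(pairwise_tukeyhsd(values, groups, alpha=0.05))

# Power analysis up front: users per group needed to detect a small effect
# (Cohen's h of about 0.06, i.e. roughly a 10% -> 12% lift) at 80% power.
n_needed = NormalIndPower().solve_power(effect_size=0.06, power=0.8, alpha=0.05)
print(f"~{n_needed:.0f} users per group for 80% power")
```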
majormajor | 11 hours ago | parent
The fraction of A/B tests I've seen personally that mentioned ANOVA at all, or thought that critically about experiment design, is very small. Understanding of p-values is also generally poor; prob/stat education in engineering and business degrees seems to be the least-covered and least-respected kind of math.

Even at places that want to ruthlessly prioritize velocity over rigor, I think it would be better to at least switch things up and worry more about effect size than p-value. Don't bother waiting to see whether marginal effects are "significant" statistically if they aren't significant from the POV of "we need to do things that can 10x our revenue since we're a young startup."
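A minimal sketch of the effect-size point, with made-up numbers: at large sample sizes a tiny lift can clear p < 0.05 while being irrelevant to a 10x goal.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical A/B result: variant vs control signups out of 1M visitors each.
conversions = [10_300, 10_000]
visitors = [1_000_000, 1_000_000]

z, p = proportions_ztest(conversions, visitors)
lift = (conversions[0] / visitors[0]) / (conversions[1] / visitors[1]) - 1
# p comes out around 0.03 -- "significant" -- but the relative lift is only ~3%.
print(f"p = {p:.3f}, relative lift = {lift:.1%}")
```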