travisjungroth 17 hours ago
Completely agree. The sign-up flow for your startup does not need the same rigor as medical research. You don’t need transportation engineering standards for your product packaging, either. They’re just totally different levels of risk.

I could write pages on this (I’ve certainly spoken for hours), but the adoption of a scientific research mindset is very limiting for A/B testing. You don’t need all the status quo bias of null hypothesis testing. At the same time, it’s quite impressive how people are able to adapt. An organization experienced with A/B testing will start doing things like multivariate correction in their heads.

For anyone spinning this stuff up, go Bayesian from the start. You’ll end up there whether you realize it or not (people will look at p-values in consideration of prior evidence).

0.05 (or any Bayesian equivalent) is not a magic number. It’s really quite high for a default. Harder sciences (the ones not in a replication crisis) use much stricter values by default. Adjust the confidence required to the cost of the change and the risk of harm. If you’re at the point of testing, the cost of change may be zero (as with content changes). It may be really high; it may be net negative! But in most cases, at a startup, you should be going after wins that are way more impactful and end up having p-values lower than 0.05 anyway.

This is easy to say, but don’t waste your time coming up with methods to squeeze out more signal. Just (just, lol) make better changes to your product so that the methods don’t matter. If p = 0.00001, that’s going to be a better signal than p = 0.05 with every correction in this article.

If you’re going to pick any fanciness from the start (besides Bayes), make it anytime-valid methods. You’re certainly already going to be peeking (as you should), so have your data reflect that.
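As a rough illustration of the "go Bayesian from the start" suggestion, here is a minimal conversion-rate comparison using a Beta-Binomial model. The prior, the counts, and the decision threshold are illustrative assumptions, not anything prescribed above; the point is that you report P(variant beats control) and pick the cutoff based on the cost of the change.

    # Minimal Bayesian A/B sketch for conversion rates (Beta-Binomial model).
    # All numbers and names are illustrative, not from the comment above.
    import numpy as np

    rng = np.random.default_rng(0)

    def prob_b_beats_a(conv_a, n_a, conv_b, n_b, prior_alpha=1, prior_beta=1, draws=100_000):
        """Monte Carlo estimate of P(rate_B > rate_A) under independent Beta posteriors."""
        post_a = rng.beta(prior_alpha + conv_a, prior_beta + n_a - conv_a, draws)
        post_b = rng.beta(prior_alpha + conv_b, prior_beta + n_b - conv_b, draws)
        return (post_b > post_a).mean()

    # Example: 120/2000 conversions on control, 145/2000 on the variant.
    p = prob_b_beats_a(conv_a=120, n_a=2000, conv_b=145, n_b=2000)
    print(f"P(B > A) ≈ {p:.3f}")  # compare to a threshold chosen from the cost of the change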
yorwba 14 hours ago
> You don’t need all the status quo bias of null hypothesis testing.

You don't have to make the status quo be the null hypothesis. If you make a change, you probably already think that your change is better or at least neutral, so make that the null. If you get a strong signal that your change is actually worse, rejecting that null, revert the change. Not "only keep changes that are clearly good" but "don't keep changes that are clearly bad."
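A minimal sketch of what flipping the null could look like in practice, assuming a one-sided two-proportion z-test via statsmodels; the counts and the 0.05 cutoff are made up for illustration.

    # Null: the change is not worse (variant rate >= control rate).
    # Only revert on strong evidence of harm. Library choice and numbers are
    # illustrative assumptions, not from the comment above.
    from statsmodels.stats.proportion import proportions_ztest

    variant_conversions, variant_n = 110, 2000
    control_conversions, control_n = 130, 2000

    # H0: variant rate >= control rate   H1: variant rate < control rate
    stat, p_value = proportions_ztest(
        count=[variant_conversions, control_conversions],
        nobs=[variant_n, control_n],
        alternative="smaller",
    )

    if p_value < 0.05:  # strong signal the change is actually worse
        print(f"p = {p_value:.3f}: revert the change")
    else:
        print(f"p = {p_value:.3f}: keep the change (no clear evidence of harm)")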
parpfish 14 hours ago
Especially for startups with a small user base. Not many users means that getting to statistical significance will take longer, if it happens at all. Sometimes you just need to trust your design/product sense, assert that some change you’re making is better, and push it without an experiment. Too often people use experimentation for CYA reasons, so they can never be blamed for making a misstep.
bigfudge 13 hours ago
The idea that you should be going after wins bigger than p = 0.05 misses the point. The p-value is a function of both the effect size and the sample size: if you have a big effect, you’ll see it even with small data. Completely agree on the Bayesian point, though, and on the importance of defining the loss function. Getting people used to talking about the strength of the evidence rather than statistical significance is a massive win most of the time.
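A quick illustration of that relationship, assuming a two-proportion z-test via statsmodels; the rates and sample sizes are invented to show that a big effect clears a small p-value even with little data, while a small effect needs far more.

    # p depends on effect size AND sample size; all numbers are illustrative.
    from statsmodels.stats.proportion import proportions_ztest

    scenarios = {
        "big effect, small n   (10% vs 20%, n=200/arm)": ([20, 40], [200, 200]),
        "small effect, small n (10% vs 11%, n=200/arm)": ([20, 22], [200, 200]),
        "small effect, big n   (10% vs 11%, n=20k/arm)": ([2000, 2200], [20000, 20000]),
    }

    for name, (counts, nobs) in scenarios.items():
        stat, p = proportions_ztest(count=counts, nobs=nobs)
        print(f"{name}: p = {p:.4f}")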