Jemaclus 18 hours ago

> This isn't academic nit-picking. It's how medical research works when lives are on the line. Your startup's growth deserves the same rigor.

But does it, really? A lot of companies sell... well, let's say "not important" stuff. Most companies don't cost people's lives when you get it wrong. If you A/B test user signups for a startup that sells widgets, people aren't living or dying based on the results. The consequences of getting it wrong are... you sell fewer widgets?

While I understand the overall point of the post -- and agree with it! -- I do take issue with this particular point. A lot of companies are, arguably, _too rigorous_ when it comes to testing.

At my last company, we spent 6 weeks waiting for stat sig. But within 48 hours, we had a positive signal. Conversion was up! Not statistically significant, but trending in the direction we wanted. But to "maintain rigor," we waited the full 6 weeks before turning it on... and the final numbers were virtually the same as the 48-hour numbers.

Note: I'm not advocating stopping tests as soon as something is trending in the right direction. The third scenario in the post calls this out as a flaw! I do like their proposal for "peeking" and subsequent testing.

But, really, let's just be realistic about what level of "rigor" is required to make decisions. We aren't shooting rockets into space. We're shipping software. We can change things if we get them wrong. It's okay. The world won't end.

IMO, the right framing here is: your startup deserves to be as rigorous as is necessary to achieve its goals. If its goals are "stat sig on every test," then sure, treat it like someone might die if you're wrong. (I would argue that you have the wrong goals, in this case, but I digress...)

But if your goals are "do no harm, see if we're heading in the right vector, and trust that you can pivot if it turns out you got a false positive," then you kind of explicitly don't need to treat it with the same rigor as a medical test.

travisjungroth 16 hours ago | parent | next [-]

Completely agree. The sign up flow for your startup does not need the same rigor as medical research. You don’t need transportation engineering standards for your product packaging, either. They’re just totally different levels of risk.

I could write pages on this (I’ve certainly spoken for hours) but the adoption of a scientific research mindset is very limiting for A/B testing. You don’t need all the status quo bias of null hypothesis testing.

At the same time, it’s quite impressive how people are able to adapt. An organization experienced with A/B testing will start doing things like multivariate corrections in their heads.

For anyone spinning this stuff up, go Bayesian from the start. You’ll end up there whether you realize it or not (people already interpret p-values in light of prior evidence).
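A minimal sketch of what that can look like for a conversion test: a Beta-Binomial model with flat priors and Monte Carlo over the posteriors. The counts here are made up purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical conversion counts, for illustration only.
    conv_a, n_a = 120, 2400   # control
    conv_b, n_b = 138, 2400   # variant

    # Beta(1, 1) prior updated with the observed successes / failures.
    post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
    post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)

    print("P(variant beats control):", (post_b > post_a).mean())
    print("expected absolute lift:  ", (post_b - post_a).mean())

Instead of a binary significant/not-significant call, you get "the variant is better with probability X", which is what people end up reading into p-values anyway.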

0.05 (or any Bayesian equivalent) is not a magic number. It’s really quite high for a default. Harder sciences (the ones not in a replication crisis) use much stricter values by default.

Adjust the confidence required to the cost of the change and the risk of harm. If you’re at the point of testing, the cost of change may be zero (e.g., content changes). It may be really high, or it may even be net negative!

But in most cases, at a startup, you should be going after wins that are way more impactful and end up having p-values lower than 0.05, anyway. This is easy to say, but don’t waste your time coming up with methods to squeeze out more signal. Just (just lol) make better changes to your product so that the methods don’t matter. If p=0.00001, that’s going to be a better signal than p=0.05 with every correction in this article.

If you’re going to pick any fanciness from the start (besides Bayes), make it anytime-valid methods. You’re certainly already going to be peeking (as you should), so have your analysis reflect that.
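To make the peeking point concrete, here's a toy A/A simulation (my own made-up setup, nothing from the article): both arms share the same true rate, and we apply a naive 0.05 test at five interim looks.

    import numpy as np

    rng = np.random.default_rng(1)
    n_sims, looks, per_look = 2000, 5, 1000

    false_positives = 0
    for _ in range(n_sims):
        # A/A test: both arms have the same true conversion rate (5%).
        a = rng.binomial(1, 0.05, size=looks * per_look)
        b = rng.binomial(1, 0.05, size=looks * per_look)
        for k in range(1, looks + 1):
            n = k * per_look
            # Two-proportion z-test on the data seen so far.
            p_pool = (a[:n].sum() + b[:n].sum()) / (2 * n)
            se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
            z = (a[:n].mean() - b[:n].mean()) / se
            if abs(z) > 1.96:   # naive "significant at 0.05" on each peek
                false_positives += 1
                break

    print("false positive rate with 5 peeks:", false_positives / n_sims)
    # Comes out around 0.14 rather than the nominal 0.05.

That inflation is exactly what anytime-valid (or otherwise corrected) methods are for.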

yorwba 13 hours ago | parent | next [-]

> You don’t need all the status quo bias of null hypothesis testing.

You don't have to make the status quo be the null hypothesis. If you make a change, you probably already think that your change is better or at least neutral, so make that the null. If you get a strong signal that your change is actually worse, rejecting the null, revert the change.

Not "only keep changes that are clearly good" but "don't keep changes that are clearly bad."

scott_w 13 hours ago | parent [-]

This is a reasonable approach, particularly when you’re looking at moving towards a bigger redesign that might not pay off right away. I’ve seen it called a “non-inferiority test,” if you’re curious.
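For anyone curious, a rough sketch of one, with hypothetical numbers and a margin you would have to choose yourself: a one-sided test of whether the redesign is worse than the control by more than an acceptable amount.

    import numpy as np
    from scipy import stats

    # Hypothetical data: conversions / visitors.
    conv_old, n_old = 510, 10_000   # current design
    conv_new, n_new = 495, 10_000   # redesign

    p_old, p_new = conv_old / n_old, conv_new / n_new
    margin = 0.01   # "worse by less than 1 percentage point" still counts as acceptable

    # H0: the redesign is worse than the control by at least `margin`.
    # A small one-sided p-value rejects that, i.e. concludes non-inferiority.
    se = np.sqrt(p_old * (1 - p_old) / n_old + p_new * (1 - p_new) / n_new)
    z = (p_new - p_old + margin) / se
    p_value = 1 - stats.norm.cdf(z)

    print(f"z = {z:.2f}, one-sided p = {p_value:.3f}")

With these numbers the redesign is kept even though its raw conversion is slightly lower, because the data rules out a drop bigger than the margin.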

parpfish 13 hours ago | parent | prev | next [-]

Especially for startups with a small user base.

Not many users means that getting to stat sig will take longer (if it happens at all).

Sometimes you just need to trust your design/product sense, assert that some change you’re making is better, and push it without an experiment. Too often, people use experimentation for CYA reasons, so they can never be blamed for making a misstep.

scott_w 13 hours ago | parent [-]

100% this. I’ve seen people get too excited to A/B test everything even when it’s not appropriate. For us, changing prices was a common A/B test when the relatively low number of conversions meant the tests took 3 months to run! I believe we’ve moved away from that, now.

The company has a large user base; it’s just that SaaS doesn’t have the same conversion numbers as, say, e-commerce.

bigfudge 12 hours ago | parent | prev [-]

The idea that you should be going after wins bigger than p=0.05 misses the point. The p-value is a function of the effect size and the sample size. If you have a big effect you’ll see it even with small data.

Completely agree on the Bayesian point though, and the importance of defining the loss function. Getting people used to talking about the strength of the evidence rather than statistical significance is a massive win most of the time.

travisjungroth 2 hours ago | parent [-]

> If you have a big effect you’ll see it even with small data.

That’s in line with what I was saying so I’m not sure where I missed the point.

The p-value is a function of effect size, variance and sample size. Bigger wins would be those with a larger and more consistent effect, scaled to the number of users (or just get more users).
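A toy illustration with made-up numbers: the same sample size, two different effect sizes, run through a two-proportion z-test.

    from statsmodels.stats.proportion import proportions_ztest

    n = 2000  # visitors per arm, hypothetical

    # Small lift: 5.0% -> 5.5% conversion.
    _, p_small = proportions_ztest([100, 110], [n, n])

    # Big lift: 5.0% -> 8.0% conversion.
    _, p_big = proportions_ztest([100, 160], [n, n])

    print(f"small effect: p = {p_small:.3f}")   # around 0.48, nowhere near 0.05
    print(f"big effect:   p = {p_big:.6f}")     # far below 0.05 from the same n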

epgui 18 hours ago | parent | prev | next [-]

It does, if you assume you care about the validity of the results or about making changes that improve your outcomes.

The degree of care can be different in less critical contexts, but then you shouldn’t lie to yourself about how much you care.

renjimen 18 hours ago | parent [-]

But there’s an opportunity cost that needs to be factored in when waiting for a stronger signal.

Nevermark 17 hours ago | parent | next [-]

One solution is to gradually move traffic to your most likely winner.

But continue a percentage of A/B/n testing as well.

This allows you to balance speed vs. certainty.

imachine1980_ 17 hours ago | parent [-]

Do you use any tool for this, or simply crank the dial up slightly each day?

hruk 3 hours ago | parent | next [-]

We've used this Python package to do this: https://github.com/bayesianbandits/bayesianbandits

travisjungroth 17 hours ago | parent | prev [-]

There are multi armed bandit algorithms for this. I don’t know the names of the public tools.

This is especially useful for something where the value of the choice is front loaded, like headlines.
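For a flavor of the idea, a minimal Thompson sampling sketch with Bernoulli rewards; the arm names and rates are made up, and the bayesianbandits package linked in the sibling comment is a fuller implementation of this family of methods.

    import numpy as np

    rng = np.random.default_rng(7)

    # Hypothetical true conversion rates for three variants (unknown in practice).
    true_rates = {"A": 0.04, "B": 0.06, "C": 0.05}
    wins = {k: 0 for k in true_rates}
    losses = {k: 0 for k in true_rates}

    for _ in range(20_000):                      # one loop iteration = one visitor
        # Sample a plausible rate for each arm from its Beta posterior...
        sampled = {k: rng.beta(1 + wins[k], 1 + losses[k]) for k in true_rates}
        # ...and send this visitor to whichever arm sampled highest.
        arm = max(sampled, key=sampled.get)
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1

    traffic = {k: wins[k] + losses[k] for k in true_rates}
    print("visitors per arm:", traffic)   # B typically ends up with most of the traffic

The better arm soaks up more traffic as evidence accumulates, but the others are never shut off completely, which is the speed-vs-certainty balance described upthread.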

scott_w 12 hours ago | parent | prev | next [-]

There is, but you can decide that up front. There are tools that will show you how long it’ll take to reach statistical significance. You can then decide if you want to wait that long or accept a softer p-value threshold.
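The up-front calculation those tools do is essentially a power analysis. A rough sketch with statsmodels, where the baseline rate and the smallest lift you care about are inputs you have to supply yourself:

    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    baseline = 0.05    # current conversion rate (hypothetical)
    target = 0.055     # smallest lift you'd bother acting on: 5.0% -> 5.5%

    effect = proportion_effectsize(target, baseline)
    n_per_arm = NormalIndPower().solve_power(
        effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
    )
    print(f"visitors needed per arm: {n_per_arm:,.0f}")   # comes out around 15-16k here

Divide that by your weekly eligible traffic per arm and you have "weeks until a readable result" before the test even starts.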

epgui 15 hours ago | parent | prev [-]

Even if you have to be honest with yourself about how much you care about being right, there’s still a place for balancing priorities. Two things can be true at once.

Sometimes someone just has to make imperfect decisions based on incomplete information, or make arbitrary judgment calls. And that’s totally fine… But it shouldn’t be confused with data-driven decisions.

The two kinds of decisions need to happen. They can both happen honestly.

sweezyjeezy 3 hours ago | parent | prev | next [-]

Yes, 100% this. If you're comparing two layouts there's no great reason to treat one as a 'treatment' and one as a 'control' as in medicine - the likelihood is they are both equally justified. If you run an experiment and get p=0.93 on a new treatment, are you really going to put money on that result being negative and not update the layout?

The reason we have this stuff in medicine is that it is genuinely important, and because a treatment often has bad side effects, it's worse to give someone a bad treatment than to give them nothing; that's the point of the Hippocratic oath. You don't need this for your dumb B2C app.

jjmarr 15 hours ago | parent | prev | next [-]

Can this be solved by setting your threshold to p=0.50?

Make your expectations explicit instead of implicit. 0.05 is completely arbitrary. If you are comfortable with a 50/50 chance of being right, make your threshold less rigorous.

scott_w 12 hours ago | parent [-]

I think at that point you may as well skip the test and just make the change you clearly want to make!

bigfudge 11 hours ago | parent [-]

Or collect some data and see if the net effect is positive. It’s possibly worth collecting that data just to rule out negative effects, though.

scott_w 9 hours ago | parent [-]

Absolutely, you can still analyse the outcomes and try to draw conclusions. This is true even for A/B testing.

yusina 10 hours ago | parent | prev | next [-]

I see where you are coming from, and overtesting is a thing, but I really believe that the baseline quality of software out there is terrible. We are just so used to it that it's been normalized. But hardly a day goes by when I'm not annoyed by a bug that somebody with more attention to quality wouldn't have let through.

It's not about space rocket type of rigor, but it's about a higher bar than the current state.

(Besides, Elon's rockets are failing left and right, in contrast to what NASA achieved in the 60s, so there are some lessons there too.)

scott_w 13 hours ago | parent | prev | next [-]

> The consequences of getting it wrong are... you sell fewer widgets?

If that’s the difference between success and failure then that is pretty important to you as a business owner.

> do no harm, see if we're heading in the right vector, and trust that you can pivot if it turns out you got a false positive

That’s a reasonable, and in plenty of contexts the absolute best, approach to take. But don’t call it A/B testing, because it’s not.

brian-armstrong 17 hours ago | parent | prev | next [-]

The thing is, though, you're just as likely to not be improving things.

I think we can realize another reason to just ship it. Startups need to be always moving. You need to keep turning the wheel to help keep everyone busy and keep them from fretting about your slow growth or high churn metrics. Startups need lots of fighting spirit. So it's still probably better to ship it rather than admit defeat and suffer bad vibes.

scott_w 13 hours ago | parent [-]

Allow me to rephrase what I think you’re saying:

Startups need to ship because they need to have a habit of moving constantly to survive. Stasis is death for a startup.

psychoslave 13 hours ago | parent | prev | next [-]

> We aren't shooting rockets into space.

Most of us don't, indeed. So, still aligned with your perspective: it's good to take into consideration what we are currently working on and what the possible implications will be. Sometimes the line is not so obvious, though. If we design a library or framework that isn't specific to some inconsequential outcome, it's no longer obvious which policy makes the most sense.

BrenBarn 15 hours ago | parent | prev | next [-]

The other thing is that in those medical contexts, the choice is often between "use this specific treatment under consideration, or do nothing (i.e., use existing known treatments)". Is anyone planning to fold their startup if they can't get a statistically significant read on which website layout is best? Another way to phrase "do no harm" is to say that a null result just means "there is no reason to change what you're doing".

bobbruno 6 hours ago | parent | prev | next [-]

It's not a matter of life and death, I agree - to some extent. Startups have very limited resources, and ignoring inconclusive results in the long term means you're spending these resources without achieving any bottom line results. If you do that too much/too long, you'll run out of funding and the startup will die.

The author didn't go into why companies do this (ignoring or misreading test results). Putting lack of understanding aside, my anecdotal experience from the time I worked as a data scientist boils down to a few major reasons:

- Wanting to be right. Being a founder requires high self-confidence, that feeling of "I know I'm right". But feeling right doesn't make one right, and there's plenty of evidence around that people will ignore evidence against their beliefs, even rationalize the denial (and yes, the irony of that statement is not lost on me).

- Pressure to show work: doing the umpteenth UI redesign is better than just saying "it's irrelevant" in your performance evaluation. If the result is inconclusive, the harm is smaller than not having anything to show - you are stalling the conclusion that your work is irrelevant by doing whatever. So you keep pushing redesigns and reframing the results into some BS interpretation just to buy some more time.

Another thing that is not discussed enough is what all these inconclusive results would mean if properly interpreted. A long sequence of inconclusive UI redesign experiments should trigger a hypothesis like "does the UI matter?" But again, those are existentially threatening questions for the people in the best position to come up with them. If any company out there were serious about being data-driven and scientific, they'd require tests everywhere, have external controls on the quality and rigour of those tests, and use them to make strategic decisions on where to invest and divest. At the very least, they'd take the results as a serious part of their strategy input.

I'm not saying you can do everything based on tests, nor that you should - there are bets on the future, hypothesis making on new scenarios and things that are just too costly, ethically or physically impossible to test. But consistently testing and analysing test results could save a lot of work and money.

syntacticsalt 12 hours ago | parent | prev | next [-]

> Most companies don't cost people's lives when you get it wrong.

True, but it usually costs money to fix it. I think the themes of "this only matters if lives are on the line" or "it's too rigorous" are straw men.

We have limited resources -- time, money, people. We'd like to avoid deploying those resources badly. Statistical inference can be one way to give us more information so we avoid using our resources badly, but as you note, statistical inference also has costs: we have to spend resources to get the data we need to do the inference, plus other costs. We can estimate the costs of getting sufficient data using sample size estimation methods. For go/no-go decision-making, if the cost of getting the decision wrong isn't something like at least 10x the cost of doing the statistical inference, I don't think it's worth doing the inference. It may be worth doing the inference for _other_ reasons, but those reasons are out of scope.
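To put made-up numbers on that heuristic: price out the experiment and the downside before deciding to run it.

    # All figures hypothetical, just to make the comparison explicit.
    weeks_to_run = 6
    eng_cost_per_week = 4_000          # building and babysitting the test
    holdback_opportunity_cost = 3_000  # revenue forgone by holding traffic on the control

    cost_of_testing = weeks_to_run * eng_cost_per_week + holdback_opportunity_cost
    cost_of_wrong_call = 400_000       # e.g. a pricing change that quietly hurts conversion

    print("cost of testing:", cost_of_testing)                            # 27,000
    print("worth testing?  ", cost_of_wrong_call >= 10 * cost_of_testing) # True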

As an example, a common use of statistical inference in medical research is to compare the efficacy of a treatment with a placebo. Some of the motivation is to decide whether to invest more resources in developing the treatment, not because people will die if they get a false positive stating that the treatment is effective when it isn't.

> A lot of companies are, arguably, _too rigorous_ when it comes to testing.

My experience in industry has been the opposite. Companies like the idea of data-driven decision-making, but then they discover pain points. They should have some idea of how much of a change they're looking to detect (i.e., an effect size). They should estimate how much data they're likely to need to run their tests (i.e., sample size estimation). They have to consider other issues like model misfit, calibration, multiple-testing corrections, and so on. Then they also have to rig up the infra to be able to _do_ the testing, collect the data, analyze the results, and communicate the results to their internal stakeholders. These pain points are why companies like Eppo and StatSig exist -- A/B testing ends up being more high-touch than developers expect.

Messing up any one of these issues can yield "flaky tests," which developers hate. Failing to gather a sufficiently large sample size for a given effect size is a pretty common failure mode.

> But to "maintain rigor," we waited the full 6 weeks before turning it on... and the final numbers were virtually the same as the 48-hour numbers.

It's difficult to tell precisely what you mean by "maintain rigor" here. The only context I can gather is that whatever procedure you were using needed more data in order to satisfy the preconditions of the test needed for the nominal design criteria of the test -- usually, its nominal false positive rate. I don't think this is an issue of rigor -- it's an issue of statistical modeling and correctness.

Sometimes, it's possible to use different methods that may require less data at the cost of more (or different) modeling assumptions. Failing to satisfy the assumptions of a test can increase its false positive rate. Whether that matters is really up to you.

> I do like their proposal for "peeking" and subsequent testing.

What the post is suggesting is not a proposal, but a standard class of frequentist statistical inference methods called sequential testing. Daniël Lakens has a good online textbook (https://lakens.github.io/statistical_inferences/) that briefly discusses these methods in Chapter 10 and provides further references.
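As a toy illustration of what those methods buy you (my own simulation, not anything from the book or the article): an A/A setup with five planned looks, where the naive 1.96 per-look cutoff is replaced by the approximate Pocock boundary for five equally spaced looks.

    import numpy as np

    rng = np.random.default_rng(3)
    n_sims, looks, per_look = 2000, 5, 1000
    POCOCK_Z = 2.413   # approx. two-sided boundary for 5 equally spaced looks, alpha = 0.05

    rejections = 0
    for _ in range(n_sims):
        # A/A test: no true difference between the arms.
        a = rng.binomial(1, 0.05, size=looks * per_look)
        b = rng.binomial(1, 0.05, size=looks * per_look)
        for k in range(1, looks + 1):
            n = k * per_look
            p_pool = (a[:n].sum() + b[:n].sum()) / (2 * n)
            se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
            z = (a[:n].mean() - b[:n].mean()) / se
            if abs(z) > POCOCK_Z:   # stop early only past the corrected boundary
                rejections += 1
                break

    print("overall false positive rate:", rejections / n_sims)
    # Stays close to the nominal 0.05 despite a look at every interim point.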

> We're shipping software. We can change things if we get them wrong.

That's usually true -- as long as you have the resources needed to make those changes, and are willing to spend them that way.

> IMO, the right framing here is: your startup deserves to be as rigorous as is necessary to achieve its goals.

While I don't disagree with the sentiment, I think you're conflating rigor with correctness here.

> If its goals are "stat sig on every test", then sure, treat it like someone might die if you're wrong.

I think that's a false equivalence. Even the American Statistical Association has issued a statement on p-values (see https://www.amstat.org/asa/files/pdfs/p-valuestatement.pdf) that includes "Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold."

> But if your goals are "do no harm, see if we're heading in the right vector, and trust that you can pivot if it turns out you got a false positive," then you kind of explicitly don't need to treat it with the same rigor as a medical test.

If those are your goals, just ship it; I don't think it makes sense to justify the effort to test in this situation, especially if, as you argue, it's financially feasible to roll back the change or pivot if it doesn't work.
