ec109685 15 hours ago

Keep in mind that frequent A/B tests burn statistical “credit.” Any time you ship a winner at p = 0.05 you’ve spent 5 % of your false-positive budget. Do that five times in a quarter and the chance that at least one is noise is 1 – 0.95⁵ ≈ 23 %.
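
A quick way to check that number, assuming five independent tests each run at α = 0.05:

    # Chance that at least one of k independent null tests clears p < alpha by luck alone.
    alpha = 0.05
    k = 5
    family_wise_error = 1 - (1 - alpha) ** k
    print(f"P(at least one false positive in {k} tests) = {family_wise_error:.3f}")  # ~0.226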

There are several approaches you can take to reduce that source of error:

Quarterly alpha ledger

Decide how much total risk you want this quarter (say 10 %). Divide the remaining α by the number of experiments left and make that the threshold for the next launch. Forces the “is this button-color test worth 3 % of our credibility?” conversation. More info: “Sequential Testing in Practice: Why Peeking Is a Problem and How to Fix It” (https://medium.com/@aisagescribe/sequential-testing-in-pract...).
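
A minimal sketch of what such a ledger could look like (the 10 % budget and the experiment count are the illustrative numbers from above; exactly how you split the remaining budget is a judgment call):

    # Toy quarterly alpha ledger: a fixed false-positive budget for the quarter,
    # split over the experiments still left in the plan.
    def alpha_ledger(total_alpha: float, planned_experiments: int):
        """Yield the p-value threshold to use for each successive experiment."""
        remaining_alpha = total_alpha
        remaining = planned_experiments
        while remaining > 0:
            threshold = remaining_alpha / remaining
            yield threshold
            remaining_alpha -= threshold  # that slice of the budget is now spent
            remaining -= 1

    thresholds = list(alpha_ledger(total_alpha=0.10, planned_experiments=5))
    print(thresholds)  # [0.02, 0.02, 0.02, 0.02, 0.02]

With a fixed plan this collapses to an even split; the ledger earns its keep when the number of planned experiments changes mid-quarter and you re-divide whatever budget is left.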

Benjamini–Hochberg (BH) for metric sprawl

Once you’re watching a dozen KPIs, Bonferroni buries real lifts. BH ranks all the p-values at the end, then sets the cutoff so that, in expectation, only about 5 % of declared winners are false positives. You keep power, and you can run the same BH step on the primary metric from every experiment each quarter to catch lucky launches. More info: “Controlling False Discoveries: A Guide to BH Correction in Experimentation” (https://www.statsig.com/perspectives/controlling-false-disco...).
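
A minimal sketch of the BH step itself (the p-values are invented; in practice you’d feed in the end-of-quarter p-values for each KPI or each experiment’s primary metric):

    import numpy as np

    def benjamini_hochberg(p_values, fdr=0.05):
        """Boolean mask of 'declared winners' controlling the false discovery rate at fdr."""
        p = np.asarray(p_values)
        order = np.argsort(p)
        ranked = p[order]
        m = len(p)
        # Find the largest rank k with p_(k) <= (k / m) * fdr; declare everything up to that rank.
        below = ranked <= (np.arange(1, m + 1) / m) * fdr
        cutoff = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
        declared = np.zeros(m, dtype=bool)
        declared[order[:cutoff]] = True
        return declared

    # Invented end-of-quarter p-values, one primary metric per experiment.
    p_vals = [0.001, 0.008, 0.039, 0.041, 0.30, 0.62]
    print(benjamini_hochberg(p_vals, fdr=0.05))  # [ True  True False False False False]

(statsmodels’ multipletests(p_vals, method='fdr_bh') gives the same answer if you’d rather not hand-roll it.)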

Bayesian shrinkage + 5 % “ghost” control for big fleets

FAANG-scale labs run hundreds of tests and care about 0.1 % lifts. They pool everything in a simple hierarchical model; noisy effects get pulled toward the global mean, so only sturdy gains stay above water. Before launch, they sanity-check against a small slice of traffic that never saw any test. This cuts winner’s-curse inflation by ~30 %. Clear explainers: “How We Avoid A/B Testing Errors with Shrinkage” (https://eng.wealthfront.com/2015/10/29/how-we-avoid-ab-testi...) and (https://www.statsig.com/perspectives/informed-bayesian-ab-te...).
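
Roughly, the shrinkage step can be sketched like this (a toy empirical-Bayes version with a normal prior; the hierarchical models in the linked posts are richer, and every number below is invented):

    import numpy as np

    # Observed lift estimates and their standard errors, one per experiment (invented).
    lifts = np.array([0.004, 0.012, -0.003, 0.021, 0.001])
    ses   = np.array([0.005, 0.006,  0.004, 0.007, 0.005])

    # Toy empirical-Bayes shrinkage: assume true lifts ~ Normal(mu, tau^2),
    # estimate mu and tau^2 from the whole fleet, then pull each noisy estimate
    # toward the global mean in proportion to how noisy it is.
    mu = np.average(lifts, weights=1 / ses**2)
    tau2 = max(np.var(lifts) - np.mean(ses**2), 1e-8)  # crude between-experiment variance
    shrinkage = tau2 / (tau2 + ses**2)                 # 0 = fully pooled, 1 = no shrinkage
    posterior_means = mu + shrinkage * (lifts - mu)

    for raw, post in zip(lifts, posterior_means):
        print(f"raw {raw:+.3f} -> shrunk {post:+.3f}")

    # The "ghost" control is a slice of traffic excluded from every experiment:
    # if the shrunk winners don't beat that untouched baseline, be suspicious.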

<10 tests a quarter: alpha ledger or yolo; dozens of tests and KPIs: BH; hundreds of live tests: shrinkage + ghost control.

akoboldfrying 14 hours ago | parent [-]

> the chance at least one is noise is 1 – 0.95⁵ ≈ 23 %

Yes, but that's not really the big deal you're making it out to be, since it's (usually) not an all-or-nothing thing. Usually the wins are additive. The chance of each winner being genuine is still 95% (assuming no p-hacking), and so the expected number of genuine wins out of those 5 will be 0.95 * 5 = 4.75 (by linearity of expectation), which is a solid win rate.

kgwgk 6 hours ago | parent | next [-]

>> the chance at least one is noise is 1 – 0.95⁵ ≈ 23 %

> The chance of each winner being genuine is still 95%

Not really. It depends on the unknown (but, in a frequentist analysis like this one, fixed) difference between the options, or the absence thereof.

If there is no real difference, every apparent winner is noise and each winner is genuine with probability 0%. If the difference is huge, the chance a winner is noise is close to 0% and the chance it is genuine is close to 100%.
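
A toy simulation makes the dependence on the true difference concrete (two-sample setup; effect size, sample size and α are invented):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    def win_probability(true_lift, n=2000, alpha=0.05, sims=2000):
        """Fraction of simulated tests where B 'wins' (p < alpha and B ahead)."""
        wins = 0
        for _ in range(sims):
            a = rng.normal(0.0, 1.0, n)
            b = rng.normal(true_lift, 1.0, n)
            t, p = stats.ttest_ind(b, a)
            if p < alpha and t > 0:
                wins += 1
        return wins / sims

    # Under the null every such "win" is noise; with a big true lift essentially every win is real.
    print("true lift = 0.00:", win_probability(0.00))  # ~0.025, all of it noise
    print("true lift = 0.15:", win_probability(0.15))  # ~1.0, essentially all genuine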

ec109685 13 hours ago | parent | prev [-]

Good point. The 23% in the example refers to the worst case where all 5 tests are null throughout the period.