cornel_io | 13 hours ago
Even though this post says exactly the thing that most Proper Analysts will say, and write long LinkedIn posts about where other Proper Analysts congratulate them for standing up for Proper Analysis in the face of Evil And Stupid Business Dummies who just want to make bad decisions on too little data, it's wrong. The Jedi Bell Curve meme is in full effect on this topic, and I say this as someone who took years to get over the midwit hump and correct my own mistaken beliefs.

The business reality is, you aren't Google. You can't collect a hundred million data points per experiment so that you can reliably pick out 0.1% effects. Most experiments will run for a much shorter window than any analyst wants, with far too few users, and with no option to let them run longer. You still have to make a damned decision, now, and move on to the next feature (which will also be tested in a heavily underpowered manner).

Posts like this say that you should be really, REALLY careful about all this: apply Bonferroni corrections, make sure you're not "peeking" (or if you do peek, apply corrections that are even more conservative), preregister, etc. All the math is fine, sure. But if you take it very seriously and are in the situation most startups are in, where the data is extremely thin and you need to move extremely fast, the end result is that you should reject almost every experiment (and, if you're leaning on tests to gate releases, almost every feature). That's the "correct" decision, academically, because most features have sub-5% impact on almost any metric you care about, and with a small number of users you'll never have enough power to pick out effects that small. Typically you'd want maybe 100k users, depending on the metric you're looking at, and YOU probably have a fraction of that many; the back-of-envelope calculation below makes this concrete.

But obviously the right move is not to just never change the product because you can't prove the changes are good: that's effectively applying a very strong prior in favor of the control group, and that's hard to justify. Nor should you just roll out whatever crap your product people throw at the wall: the bias in most experiments in favor of the variant is very slight, so your feature designers are probably building harmful stuff about half the time. You should apply some filter to make sure they're helping the product and not just doing a random walk through design space.

The best simple strategy in a real world where most effect sizes are small and you never have the option to gather more data really is to do the dumb thing: run each experiment for as long as you can, pick whichever variant seems to be winning, rinse and repeat. Yes, you're going to pick the wrong variant way more often than your analysts would prefer, but that's way better than never changing the product or holding out for the very few hugely impactful changes that you are properly powered for. On average, over the long run, blindly picking the bigger number will stack small changes: a lot of them will turn out to be negative, but the selection biases somewhat in favor of positive ones, and the gains add up over time. And this strategy will provably beat one that does Proper Statistics and demands 95% confidence or whatever equivalent Bayesian criterion you use, because it leaves room to accept the small improvements that make up the vast majority of feature space; the second sketch below simulates exactly this comparison.
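To make the "maybe 100k users" claim concrete, here's a rough back-of-envelope sketch. The 5% baseline conversion rate and the 5% relative lift are my illustrative assumptions, not numbers from anywhere in particular; the standard normal-approximation sample-size formula for a two-proportion z-test does the rest:

    # Rough sanity check: users per arm needed for a two-sided
    # two-proportion z-test. Baseline and lift are illustrative assumptions.
    from scipy.stats import norm

    def users_per_arm(p_control, p_variant, alpha=0.05, power=0.80):
        """Approximate sample size per arm, normal approximation."""
        z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
        z_beta = norm.ppf(power)           # 0.84 for 80% power
        variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
        return (z_alpha + z_beta) ** 2 * variance / (p_variant - p_control) ** 2

    base = 0.05
    print(f"{users_per_arm(base, base * 1.05):,.0f} users per arm")
    # ~120k per arm (~240k total) for a 5% relative lift on a 5% baseline

And a 5% relative lift is already an unusually good feature; most are smaller, and the required sample size scales with the inverse square of the effect.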
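Here's a minimal simulation of "pick the bigger number" against a strategy that demands significance. Every parameter below (effect-size spread, users per arm, feature count) is an assumption I picked to mimic a thin-data startup, so treat it as a sketch of the argument, not a measurement:

    # Sketch: compound many underpowered experiments under two strategies.
    # All parameters are assumptions chosen to mimic a thin-data startup.
    import numpy as np

    rng = np.random.default_rng(0)
    BASE = 0.05      # baseline conversion rate
    N = 2_000        # users per arm -- hopelessly underpowered for the lifts below
    FEATURES = 200   # features tested over the product's life
    RUNS = 300       # simulated product lifetimes to average over

    def final_rate(demand_significance):
        rate = BASE
        for _ in range(FEATURES):
            true_lift = rng.normal(0.0, 0.001)   # most true effects are tiny
            p_variant = min(max(rate + true_lift, 0.0), 1.0)
            c = rng.binomial(N, rate)            # control conversions
            v = rng.binomial(N, p_variant)       # variant conversions
            if demand_significance:              # ship only if z > 1.96
                pooled = (c + v) / (2 * N)
                se = np.sqrt(2 * pooled * (1 - pooled) / N)
                ship = se > 0 and (v - c) / (N * se) > 1.96
            else:                                # just pick the bigger number
                ship = v > c
            if ship:
                rate = p_variant
        return rate

    for label, strict in [("pick the bigger number", False),
                          ("demand p < 0.05", True)]:
        mean = np.mean([final_rate(strict) for _ in range(RUNS)])
        print(f"{label:>22}: mean final conversion rate {mean:.4f}")

On these assumptions, winner-picking compounds to a visibly higher final rate, while the p < 0.05 gate almost never ships and stays pinned near the baseline.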
There's an equivalent and perhaps simpler way to justify this, which is to throw out the group labels: if we didn't know which arm was the control and had to pick the better option, then quite obviously, regardless of how much data we have, we'd just pick the one that shows better results in the sample we have. Including if there's just a single user in each group! For an early product this is TOTALLY REASONABLE, because your current product sucks and you have no reason to treat the status quo as sacred. Late-lifecycle products probably have some Chesterton's Fence stuff going on, so maybe there's more of an argument for privileging the control, but those types of products should have enough users to run properly powered tests anyway.
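The label-symmetry argument is easy to check numerically, too. Assuming true effects drawn symmetrically around zero (the spread below is made up), picking whichever arm looks better in the sample beats a coin flip at any sample size, including one user per group:

    # Sketch of the label-symmetry argument; effect spread is an assumption.
    import numpy as np

    rng = np.random.default_rng(1)
    TRIALS = 200_000
    for n in (1, 10, 100):
        lift = rng.normal(0.0, 0.02, TRIALS)   # symmetric true effects
        control = rng.binomial(n, 0.5, TRIALS) / n
        variant = rng.binomial(n, np.clip(0.5 + lift, 0.0, 1.0), TRIALS) / n
        picked_variant = variant > control     # ties go to control
        correct = np.where(picked_variant, lift > 0, lift <= 0)
        print(f"n={n:>3}: right {correct.mean():.1%} of the time")
        # prints a number above 50% for every n, barely so at n = 1

The edge at n = 1 is tiny, but it's an edge; the prior-free argument only says you beat a coin flip, not that you ship good features often.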