I asked Opus 4.6 how to administer an A/B test when data is sparse. My options are to look at conversion rate, look at revenue per customer, or something else. I will get about 10-20k samples; fewer than that will add to cart, fewer still will begin checkout, and even fewer will convert. Opus says I should look at revenue per customer. I don't know the right answer, but I know it is not to look at revenue per customer -- that will have high variance due to outlier customers who put in a large order. To be fair, I do use Opus frequently, and it often gives good enough answers. But you do have to be suspicious of its responses for important decisions.

Edit: Ha, and the report claims it's relatively good at business and finance...

Edit 2: After discussion in this thread, I went back to Opus and asked it to link to articles about how to handle non-normally distributed data, and it actually did link to some useful articles, and an online calculator that I believe works for my data. So I'll eat some humble pie and say my initial take was at least partially wrong here. At the same time, it was important to know the correct question to ask, and honestly if it weren't for this thread I'm not sure I would have gotten there.

onion2k 20 hours ago | parent [-]

A/B tests are a statistical tool, and outliers will mess with any statistical measure. If your data is especially prone to that, you should be using something that accounts for them, and your prompt to Opus should tell it to account for that.

A good way to use AI is to treat it like a brilliant junior. It knows a lot about how things work in general but very little about your specific domain. If your data has a particular shape (e.g., lots of small orders with a few large orders as outliers), you have to tell it that to improve the results you get back.
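
To make the "lots of orders with a few large orders" shape concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the 2% conversion rate, the Pareto order values (`alpha=1.5`, `min_order=20.0`), and the 95th-percentile cap are made-up numbers, not figures from this thread. It compares the standard error of mean revenue per visitor before and after capping large orders.

```python
import random
import statistics

def simulate_revenue(n_visitors, conv_rate, rng, alpha=1.5, min_order=20.0):
    """Per-visitor revenue: 0.0 for non-buyers, a Pareto-distributed order otherwise.

    All parameters are hypothetical; they only produce a heavy-tailed shape.
    """
    return [
        min_order * rng.paretovariate(alpha) if rng.random() < conv_rate else 0.0
        for _ in range(n_visitors)
    ]

def standard_error(values):
    """Standard error of the sample mean."""
    return statistics.stdev(values) / len(values) ** 0.5

rng = random.Random(0)
revenue = simulate_revenue(15_000, 0.02, rng)

# Cap every order at the 95th percentile of non-zero orders (winsorizing).
orders = sorted(v for v in revenue if v > 0)
cap = orders[int(0.95 * (len(orders) - 1))]
capped = [min(v, cap) for v in revenue]

print(f"largest order: {max(revenue):.0f}, cap: {cap:.0f}")
print(f"SE of mean revenue per visitor, raw:    {standard_error(revenue):.3f}")
print(f"SE of mean revenue per visitor, capped: {standard_error(capped):.3f}")
```

Capping trades some bias for lower variance: the capped mean understates true revenue, so it is better suited to comparing two arms treated the same way than to forecasting revenue.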

pinkmuffinere 20 hours ago | parent [-]

I did tell it that I expect to see something like a power-law distribution in order value, so I think I pretty much followed your instructions here. Btw, if you do know the right thing to do in my scenario, I'd love to figure it out. This is not my area of expertise, and I'm just figuring it out through articles so far.

Karrot_Kream 19 hours ago | parent [-]

I recommend reading Wikipedia and talking to LLMs to get this one. Order values do follow power-law distributions (you're probably looking for a Pareto or a Zipf distribution; an exponential has a lighter tail and is not a power law). You want to ask how to perform a statistical test using these distributions. I'm a fan of Bayesian techniques here, but it's up to you if you want to use a frequentist approach. If you can follow some basic calculus you can follow the math for constructing these statistical tests; if not, some searching will help you find the formulas you need.
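
As a rough illustration of the Bayesian route, here is a minimal sketch for the simpler conversion-rate metric rather than order value. It assumes a uniform Beta(1, 1) prior on each arm's conversion rate and estimates the probability that variant B converts better than A by sampling from the two Beta posteriors. The function name `prob_b_beats_a` and the visitor and conversion counts are hypothetical.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors.

    With a Beta(1, 1) prior and binomial data, each posterior is
    Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical counts: 7,500 visitors per arm.
p_same = prob_b_beats_a(150, 7_500, 150, 7_500)
p_better = prob_b_beats_a(100, 7_500, 200, 7_500)
print(f"P(B > A), identical counts: {p_same:.3f}")
print(f"P(B > A), B doubles conversions: {p_better:.3f}")
```

The same idea extends to order values by swapping in a likelihood suited to heavy-tailed data, but that model choice needs more care than this sketch shows.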

pinkmuffinere 19 hours ago | parent [-]

Thanks for the suggestions! I didn't want to do the math myself, but I did take your suggestion and found some articles discussing ways to make it work even with a non-normal distribution:

- https://cxl.com/blog/outliers/

- https://www.blastx.com/insights/the-best-revenue-significanc...

- (online tool to calculate significance) https://www.blastx.com/rpv-calculator

I'm not checking their math, but the articles make sense to me, and I trust they did implement it correctly. In the end the LLM did get me to the correct answer by suggesting the articles, so I guess I should eat some humble pie and say it _did_ help me. At the same time, if I didn't have the intuition that using rpv as-is in a t-test would be noisy, and the suggestions from this comment thread, I think I could have gone down the wrong path. So I'm not sure what my conclusion is -- maybe something like LLMs are helpful once you ask the right question.
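
For readers who want to see one way of comparing revenue per visitor without assuming normality, here is a minimal percentile-bootstrap sketch. It is not necessarily the method the linked articles or calculator use; it is one common distribution-free approach. The function name `bootstrap_diff_ci`, the synthetic data, and the resample count are assumptions chosen to keep the example small.

```python
import random
import statistics

def bootstrap_diff_ci(control, variant, resamples=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(variant) - mean(control).

    Each arm is resampled with replacement, so no normality is assumed.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(resamples):
        c = rng.choices(control, k=len(control))
        v = rng.choices(variant, k=len(variant))
        diffs.append(statistics.fmean(v) - statistics.fmean(c))
    diffs.sort()
    lower = diffs[int((alpha / 2) * resamples)]
    upper = diffs[int((1 - alpha / 2) * resamples) - 1]
    return lower, upper

# Synthetic per-visitor revenue: 95% of visitors spend nothing, 5% spend 50.
control = ([0.0] * 95 + [50.0] * 5) * 10
# A deliberately exaggerated variant: every visitor is worth 10 more.
variant = [v + 10.0 for v in control]

shifted_ci = bootstrap_diff_ci(control, variant)
null_ci = bootstrap_diff_ci(control, control)
print("CI for variant - control:", shifted_ci)
print("CI for control - control:", null_ci)
```

If the interval excludes zero, the difference is unlikely to be resampling noise. With very heavy tails and few orders the bootstrap can still be unstable, so it is worth checking how much the interval moves when the largest orders are capped.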

Karrot_Kream 18 hours ago | parent [-]

One heuristic I like to use when thinking about this question (and I honestly wish the answer space here were less emotionally charged, so we could all learn from each other) is that LLMs need a human who understands the shape of the solution to check the LLM's work. In fields where I have confirmed expertise, I can easily nudge and steer the LLM and only skim its output quickly to know if it's right or wrong. In fields where I don't, I first ask the LLM for resources (papers, textbooks, articles, etc.) and familiarize myself with some initial literature. I then work with the LLM slowly to build a solution. I've found that to work well so far.

(I also just love statistics and think it's some of the most applicable math to everyday life in everything from bus arrival times to road traffic to order values to financial markets.)

pinkmuffinere 9 hours ago | parent [-]

I think this is a _really_ insightful answer about effectively working with LLMs. And you’re winning me over on statistics too :)