meander_water 9 hours ago

Overall really interesting read, but I'm having trouble processing this:

> OpenRouter performs internal categorization on a random sample comprising approximately 0.25% of all prompts

How can you arrive at any conclusion with such a small random sample size?

hoppoli 9 hours ago | parent | next [-]

Statistical significance comes mostly from N (number of samples) and the variance on the dimension you're trying to measure [1]. If the variance is high, you'll need a higher N. If the variance is low, you can get away with a lower N. The percentage of the population is not relevant (N = 1000 might be significant, and it doesn't matter whether that's 1% or 30% of the population).

[1] This is a simplification. I should say that it depends on the standard error of your statistic, i.e., the thing you're trying to measure (if you're estimating the max of a population, that's going to require more samples than if you're estimating the mean). This standard error, in turn, will depend on the standard deviation of the dimension you're measuring. For example, if you're estimating the mean height, the relevant quantity is the standard deviation of height in the population.
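A quick simulation can make this concrete. The sketch below is an illustration under assumed numbers, not anything from the OpenRouter report: the height population (normal, mean 170 cm, standard deviation 10 cm), the population sizes, and the helper name `spread_of_sample_mean` are all hypothetical. It draws repeated samples of N = 1000 from a small and a large population and measures how much the sample mean moves around.

```python
import random
import statistics

def spread_of_sample_mean(population_size, sample_size, trials=300, seed=0):
    """Standard deviation of the sample mean across repeated random samples."""
    rng = random.Random(seed)
    # Hypothetical population: heights in cm, drawn from normal(170, 10).
    population = [rng.gauss(170, 10) for _ in range(population_size)]
    means = [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(trials)
    ]
    return statistics.stdev(means)

# Same N = 1000, very different fractions of the population.
se_ten_percent = spread_of_sample_mean(10_000, 1000)       # 10% sample
se_quarter_percent = spread_of_sample_mean(400_000, 1000)  # 0.25% sample
print(f"10% sample:   {se_ten_percent:.3f}")
print(f"0.25% sample: {se_quarter_percent:.3f}")
```

Both figures should come out close to the textbook value of 10 / sqrt(1000), roughly 0.32 cm, even though one sample covers 40 times more of its population than the other.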

piskov 9 hours ago | parent | prev | next [-]

https://en.wikipedia.org/wiki/Central_limit_theorem

For example, even 300 truly random people are enough to correctly ascertain the distribution of some measurement in a population (say, some personality feature).

That’s the basis of all polls and what have you

gerdesj 9 hours ago | parent [-]

I think you might be thrashing around 30 samples for a normal distribution and the Central Limit Theorem and accidentally added a zero!

(OK, on rereading, you did link to a WP article about CLT, so 30 it is!)

piskov 9 hours ago | parent [-]

You’re absolutely right! (c)

I had 300 in memory as a safe bet for skewed stuff like log-normal, exponential, etc.
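The 30 versus 300 point can be checked with a small experiment. This is a minimal sketch, assuming an exponential distribution with rate 1 as the skewed source. The helper names `skewness` and `skew_of_sample_means` are made up for illustration. It measures how lopsided the distribution of sample means still is at each sample size (a perfectly normal distribution would have skewness 0).

```python
import random
import statistics

def skewness(values):
    """Population skewness: mean of the cubed z-scores."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

def skew_of_sample_means(sample_size, trials=5000, seed=0):
    """Skewness of the sample mean when sampling from an exponential(1)."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(trials)
    ]
    return skewness(means)

skew_30 = skew_of_sample_means(30)
skew_300 = skew_of_sample_means(300)
print(f"n = 30:  skewness of sample means = {skew_30:.2f}")
print(f"n = 300: skewness of sample means = {skew_300:.2f}")
```

For an exponential source the theoretical skewness of the sample mean is 2 / sqrt(n), so about 0.37 at n = 30 and about 0.12 at n = 300. The normal approximation is noticeably better at 300, which fits the "safe bet" intuition for skewed data.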

abdullahkhalids 9 hours ago | parent | prev | next [-]

Because the accuracy of an estimated quantity mostly depends on the size of the sample, not on the size of the population [1]. This does require assumptions like a somewhat homogeneous population, normal distributions, etc. However, these assumptions often hold.

[1] https://stats.stackexchange.com/questions/166/how-do-you-dec...
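As a rough worked example, here is the usual margin-of-error formula for an estimated proportion, with the optional finite population correction that accounts for population size. The numbers are hypothetical (a sample of 1000 standing in for 0.25% of 400,000 items), the function name `margin_of_error` is invented for this sketch, and it assumes simple random sampling and a 95% confidence level (z = 1.96).

```python
import math

def margin_of_error(sample_size, population_size=None, p=0.5, z=1.96):
    """Approximate margin of error for a proportion under simple random sampling."""
    se = math.sqrt(p * (1 - p) / sample_size)
    if population_size is not None:
        # Finite population correction: shrinks the error when the sample
        # is a large share of the population, and is near 1 otherwise.
        se *= math.sqrt((population_size - sample_size) / (population_size - 1))
    return z * se

moe_infinite = margin_of_error(1000)                        # population size ignored
moe_finite = margin_of_error(1000, population_size=400_000) # 0.25% sample
print(f"ignoring population size: +/- {moe_infinite:.4f}")
print(f"0.25% of 400,000:         +/- {moe_finite:.4f}")
```

Both come out at about 3 percentage points. The population size only enters through the correction factor, which is almost exactly 1 when the sample is a tiny fraction of the population.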

jfrbfbreudh 9 hours ago | parent | prev [-]

with enough samples