runarberg an hour ago
The interpretation of the p-value is also alarming. One of the first things they teach you in statistics class is: “an absence of evidence is not evidence of absence”. This analysis showed that there is indeed an absence of evidence, but it concludes there is evidence of absence. Traditional p-hacking is done by oversampling and overtesting. If you run 20 analyses, on average one will show p < 0.05 by random chance. This analysis is doing the inverse of that: under-sampling, and concluding from p > 0.05 that there is no effect.
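The "20 analyses" arithmetic can be checked directly. This is a minimal sketch, assuming the tests are independent and that every null hypothesis is actually true; the function names are made up for illustration.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of tests that cross the threshold by chance alone."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent null tests shows p < alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    # Assumption: 20 independent tests, all with a true null hypothesis.
    print(expected_false_positives(20))
    print(round(prob_at_least_one_false_positive(20), 3))
```

So with 20 tests you expect one spurious "significant" result on average, and the chance of seeing at least one is roughly 64%.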
xmddmx 24 minutes ago
The concept you need here is "statistical power". The ELI5 version is that there are two mistakes you can make when looking at a p-value:

Type I error (a false positive), where your p-value is falsely low. In the experiment being discussed here, it would lead one to conclude that AI code is worse.

Type II error (a false negative), where your p-value is falsely high, leading you to conclude that AI code is no different.

https://en.wikipedia.org/wiki/Power_(statistics)

One can calculate statistical power for a given experimental protocol. My hunch is that if you did this, you would find this experiment is grossly under-powered. This means you can't turn the "absence of evidence" into a claim of evidence of absence.
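To make "calculate statistical power" concrete, here is a rough sketch using a normal approximation for a two-sided, two-sample comparison. The effect size (Cohen's d = 0.5) and the sample sizes are assumptions picked for illustration, not values taken from the analysis under discussion, and the normal approximation is optimistic for very small samples.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is the standardized mean difference (Cohen's d).
    This is a normal approximation, not an exact t-test power calculation.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)          # two-sided critical value
    shift = effect_size * sqrt(n_per_group / 2.0)  # expected test statistic
    # Probability the statistic lands in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Hypothetical sample sizes per group, with an assumed effect of d = 0.5.
    for n in (2, 10, 64):
        print(n, round(approx_power(0.5, n), 3))
```

Under these assumptions, a handful of observations per group gives power well below the conventional 80% target, so a Type II error is the likely outcome even when a real difference exists.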
logicprog an hour ago
> This analysis showed that there is indeed an absence of evidence, but it concludes there is evidence of absence.

I tried pretty hard to avoid saying that; can you point me to how I should rephrase it? The point I'm trying to make is just that there is absolutely no evidence at all for what people are saying with such absolutism and claimed objectivity (that Claude made rsync worse), and thus it doesn't justify the outrage.

> Under-sampling, and concluding with p > 0.05

How would I avoid under-sampling here? And if you're going to say it's because I only have 2 data points, well, the side making the positive claim (that Claude made rsync worse) only had two as well, and unremarkable ones at that, as I've tried very hard to show.