st-keller 4 days ago

"This renders the meaning of significance-testing unclear; it is calculating precisely the odds of the data under scenarios known a priori to be false."

I cannot see the problem in that. To get to meaningful results we often calculate with simplified models - which are known to be false in a strict sense. We use Newton's laws - we analyze electric networks based on simplifications - a bank year used to be 360 days! Works well.

What did I miss?

bjornsing 4 days ago | parent | next [-]

The problem is basically that you can always buy a significant result with money (a large enough N always leads to a "significant" result). That's a serious issue if you see research as the pursuit of truth.
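
To make that concrete, here is a rough sketch (the numbers and library calls are my own illustration, not from the article): give a zero-null t-test a trivially small but nonzero true effect, and it turns "significant" once N is large enough.

    # Rough illustration: a negligible true effect becomes "significant" with enough N.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    true_effect = 0.01                      # trivially small difference in means
    for n in (100, 10_000, 1_000_000):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(true_effect, 1.0, n)
        _, p = stats.ttest_ind(a, b)
        print(f"n={n:>9}  p={p:.3g}")
    # Typically p is nowhere near 0.05 at n=100, but falls far below it at the
    # largest n, even though the effect size (Cohen's d ~ 0.01) is negligible.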

syntacticsalt 4 days ago | parent [-]

Reporting effect size mitigates this problem. If the observed effect size is too small, its statistical significance isn't viewed as meaningful.

bjornsing 4 days ago | parent [-]

Sure (and of course). But did you see the effect size histogram in the OP?

syntacticsalt 3 days ago | parent [-]

Are you referring to the first figure, from Smith et al., 2007? If so, I couldn't evaluate whether Gwern's claim makes sense without reading that paper to get an idea of, e.g., sample size and how they control for false positives. I don't think it's self-evident from that figure alone.

One rule of thumb for interpreting (presumably Pearson) correlation coefficients is given in [0] and states that correlations with magnitude 0.3 or less are negligible, in which case most of the bins in that histogram correspond to cases that aren't considered meaningful.

[0]: https://pmc.ncbi.nlm.nih.gov/articles/PMC3576830/table/T1/

bjornsing 2 days ago | parent [-]

I’m not arguing that there’s something fundamentally wrong with mathematics or the scientific method. I’m arguing that the social norms around how we do science in practice have some serious flaws. Gwern points out one of them. One that IMHO is quite interesting.

EDIT: I also get the feeling that you think it’s okay to do an incorrect hypothesis test (c > 0), as long as you also look at the effect size. I don’t think it is. You need to test the c > 0.3 hypothesis to get a mathematically sound hypothesis test. How many papers do that?
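
And such a test isn't hard to write down. A rough sketch (my own, using the Fisher z-transform; the 0.3 cutoff is the rule-of-thumb threshold mentioned above):

    # Sketch: one-sided test of H0: rho <= 0.3 vs H1: rho > 0.3 via Fisher's z.
    import numpy as np
    from scipy import stats

    def corr_greater_than(x, y, rho0=0.3):
        r = np.corrcoef(x, y)[0, 1]
        z = (np.arctanh(r) - np.arctanh(rho0)) * np.sqrt(len(x) - 3)
        return r, stats.norm.sf(z)          # (estimate, one-sided p-value)

    rng = np.random.default_rng(1)
    x = rng.normal(size=5_000)
    y = 0.2 * x + rng.normal(size=5_000)    # true correlation ~0.2
    r, p = corr_greater_than(x, y)
    print(f"r={r:.3f}  p={p:.3f}")          # r ~ 0.2, so H0 (rho <= 0.3) is not rejected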

syntacticsalt a day ago | parent [-]

My opinion of Gwern's piece is that some of the arguments he makes don't require correlations. For example, A/B tests of differences in means using a zero difference null hypothesis will reject the null, given enough data.

In that A/B testing scenario, I think if someone wants to test whether the difference is zero, that's fine, but if the effect size is small, they shouldn't claim that there's any meaningful difference. I believe the pharma literature calls this scenario equivalence testing.

Assuming a positive difference in means is desirable, I think testing for a null hypothesis of a change of at least some positive value (e.g., +5% of control) is a better idea. I believe the pharma literature calls this scenario superiority testing.
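
As a rough sketch of what I mean (a toy example of my own; the margin, the numbers, and the choice of a shifted one-sided Welch t-test are just illustrative, not a claim about how pharma analyses are actually run):

    # Toy "superiority"-style test: H0: mean(treatment) - mean(control) <= margin
    # vs H1: the difference exceeds the margin, via a shifted one-sided Welch t-test.
    import numpy as np
    from scipy import stats

    def superiority_test(treatment, control, margin):
        return stats.ttest_ind(treatment, control + margin,
                               equal_var=False, alternative="greater")

    rng = np.random.default_rng(2)
    control = rng.normal(100.0, 10.0, 2_000)
    treatment = rng.normal(103.0, 10.0, 2_000)   # +3% true lift
    margin = 0.05 * control.mean()               # "+5% of control" margin
    t, p = superiority_test(treatment, control, margin)
    print(f"p={p:.3f}")                          # large p: +3% does not clear the +5% bar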

I believe superiority testing is preferable to equivalence testing, and in professional settings, I have made this case to managers. I have not succeeded in persuading them, and thus do the equivalence testing they request.

I don't think the idea of a zero null hypothesis is necessarily mathematically unsound. In cases like the difference in means, a zero null hypothesis is well-posed. However, I agree with you that there are better practices, like a null hypothesis incorporating a nonzero effect.

I don't entirely agree with the arguments Gwern puts forth in the Implications section because some of them seem at odds with one another. Betting on sparsity would imply neglecting some of the correlations he's arguing are so essential to capture. The bit about algorithmic bias strikes me as a bizarre proposition to include with little supporting evidence, especially when there are empirical examples of algorithmic bias.

What I find lacking about Gwern's piece is that it's a bit like lighting a match to widespread statistical practice, and then walking away. Yes, I think null hypothesis statistical testing is widely overused, and that statistical significance alone is not a good determinant of what constitutes a "discovery". I agree that modeling is hard, and that "everything is correlated" is, to an extent, true because the correlations are not literally or exactly zero.

But if you're going to take the strong stance that null hypothesis statistical testing is meaningless, I believe you need to provide some kind of concrete alternative. I don't think Gwern's piece explicitly advocates an alternative, and it only hints that the alternative might be causal inference. Asking people who may not have much statistics training to leap from frequentist concepts taught in high school to causal inference would be a big ask. If Gwern isn't asking that, then I'd want to know what the suggested alternative would be. Notably, Gwern does not mention testing for nonzero positive effects (e.g., in the vein of the "c > 0.3" case above). If there isn't an alternative, I'm not sure what the argument is. Don't use statistics, perhaps? It's tough to say.

PeterStuer 4 days ago | parent | prev | next [-]

Back when I wrote a loan repayment calculator, there were 47 different common ways to 'day count' (used in calculating payments for incomplete repayment periods, e.g. for monthly payments: what is the 1st-13th of Aug 2025 as a fraction of Aug 2025?).
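
As one concrete example, the 30/360 family treats every month as 30 days. A simplified sketch (the function name and the exact variant chosen are mine; real implementations add end-of-month and February special cases):

    # Simplified 30/360-style day count vs. an actual-day count for the same period.
    from datetime import date

    def days_30_360(start: date, end: date) -> int:
        d1 = min(start.day, 30)
        d2 = min(end.day, 30) if d1 == 30 else end.day
        return (end.year - start.year) * 360 + (end.month - start.month) * 30 + (d2 - d1)

    start, end = date(2025, 8, 1), date(2025, 8, 13)
    print(days_30_360(start, end) / 30)   # 30/360: 12/30 = 0.400 of the month
    print((end - start).days / 31)        # actual days: 12/31 ~ 0.387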

thyristan 4 days ago | parent | prev | next [-]

There is a known maximum error introduced by those simplifications. Put another way, Einstein is a refinement of Newton: special relativity converges to Newtonian mechanics at low speeds.
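
(To spell out the standard textbook version of that convergence: the relativistic kinetic energy expands as

    T = (\gamma - 1)\, m c^2
      = \tfrac{1}{2} m v^2 + \tfrac{3}{8}\, m \frac{v^4}{c^2} + \dots,
    \qquad \gamma = \frac{1}{\sqrt{1 - v^2/c^2}},

so at low speeds the leading correction to the Newtonian ½mv² term is smaller by a factor of order (v/c)².)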

You didn't really miss anything. The article is incomplete, and wrongly suggests that something like "false" even exists in statistics. Really, something is only false "with an x% probability of it actually being true nonetheless", meaning that you have to "statistic harder" if you want to get x down. Usually the best way to do that is to increase the number of tries/samples N. What the article completely misses is that for a sufficiently large N you don't have to care anymore, and might as well treat false/true as absolutes, because you pass the threshold of "will happen once within the lifetime of a bazillion universes" or something.

The problem is, of course, that lots and lots of statistics are done with a low N. The social sciences, medicine, and economics are necessarily always in the very-low-N range, and therefore always have problematic statistics. They then try to "statistic harder" without being able to increase N, just massaging their numbers enough to "prove" a desired conclusion, or they increase N a little and claim to have escaped the low-N problem.

syntacticsalt 4 days ago | parent [-]

A frequentist interpretation of inference assumes parameters have fixed, but unknown values. In this paradigm, it is sensible to speak of the statement "this parameter's value is zero" as either true or false.

I do not think it is accurate to portray the author as someone who does not understand asymptotic statistics.

thyristan 4 days ago | parent [-]

> it is sensible to speak of the statement "this parameter's value is zero" as either true or false.

Nope. The correct way is rather something like "the measurements/polls/statistics x ± ε are consistent with this parameter's true value being zero", where x is your measured value and ε is some measurement error, accuracy, or statistical deviation. x will never really be zero, but zero can lie within the interval [x - ε; x + ε].
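
In code that looks something like this (a toy sketch of mine: a plain 95% t-interval for a mean whose true value is exactly zero):

    # Toy example: the point estimate is essentially never exactly zero,
    # but zero usually lies inside the interval [x - eps, x + eps].
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    sample = rng.normal(0.0, 1.0, 200)      # true mean is exactly 0
    x = sample.mean()
    eps = stats.t.ppf(0.975, len(sample) - 1) * sample.std(ddof=1) / np.sqrt(len(sample))
    print(f"x = {x:+.3f}, interval = [{x - eps:.3f}, {x + eps:.3f}]")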

syntacticsalt 3 days ago | parent [-]

As you yourself point out, a consistent estimator of a parameter converges to that parameter's value in the infinite sample limit. That limit is zero or it's not.

whyever 4 days ago | parent | prev [-]

It's a quantitative problem. How big is the error introduced by the simplification?