tgv | 3 days ago
> The bet is that the effect size (if any) will be large enough to be informative despite the noise.

But you have no grounds to ascribe it to the posited difference. Finding no effect might yield more information, but that's hard: given the amount of noise, you're bound to find a great many effects.

> Have you seen this done?

Not in LLMs, but there have been experiments with regularizing languages and getting people to learn them in Second Language Acquisition (L2) studies. What I've seen, though, is inconclusive and sometimes outright contradictory. I think people have also looked at this via information theory, probably using Markov models (see the sketch below).

> Fedorenko's own comparison to "early LLMs" suggests she thinks the analogy has some merit.

I don't think she can seriously entertain that thought. We simply know practically nothing about language processing in the brain, and what we do know about the hardware is very different from LLMs, early or not. Just to give an indication of how much we don't know: the Stroop effect (https://en.wikipedia.org/wiki/Stroop_effect) is almost 100 years old, and we still have no idea what causes it. There's no working model of word recognition, only vague suggestions about the origin of the delay. We have no clue how the visual signals for the color and the letters are separated, where they join again, or how that relates to linguistic knowledge. And that's after almost a century of intensive research. If you go to Google Scholar and type "Stroop task", you get 197,000 (!) hits: nearly 200k articles resulting in no mechanistic knowledge whatsoever about a very simple, artificial task.
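A minimal sketch of what that information-theoretic comparison might look like, assuming a character-level n-gram Markov model; the corpus paths, model order, and `entropy_rate` helper are hypothetical, and this plug-in estimate is biased low on small corpora:

```python
from collections import Counter, defaultdict
import math

def entropy_rate(text: str, order: int = 3) -> float:
    """Plug-in estimate of conditional entropy (bits/char) from an
    order-n character Markov model. Lower values suggest a more
    predictable ("more regular") sequence at this model order."""
    context_counts = defaultdict(Counter)
    for i in range(len(text) - order):
        context_counts[text[i:i + order]][text[i + order]] += 1

    total = 0
    bits = 0.0
    for nexts in context_counts.values():
        ctx_total = sum(nexts.values())
        for count in nexts.values():
            # each occurrence contributes -log2 p(next | context)
            bits += count * -math.log2(count / ctx_total)
            total += count
    return bits / total

# Hypothetical usage: corpora should be matched in size and domain,
# or the comparison mostly measures the corpora, not the languages.
# fr = open("french_corpus.txt", encoding="utf-8").read()
# en = open("english_corpus.txt", encoding="utf-8").read()
# print(entropy_rate(fr), entropy_rate(en))
```

The number this returns depends heavily on orthography and corpus size, which is one reason such comparisons come out contradictory.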
adamzwasserman | 2 days ago
On effect size: my primary goal at this stage is falsification. If French and English models show no meaningful differences at matched compute (sketched below), that's informative: it would support the scaling hypothesis. If they do differ, I'll need to be careful about causal claims, but it would at least challenge the "transformers are magic" framing that treats architecture as the main story.

The pointers to the L2 language-regularization studies and the information-theoretic work are helpful; they'll go on my reading list. If you have favorites, I'll start there.

On the "we know nothing" point: I'm sympathetic. The Stroop example is exactly why I'm skeptical of strong claims in either direction. 197k papers and no mechanism suggests language processing has properties we don't yet have frameworks to describe. That's not mysticism; it's just acknowledging the gap between phenomenon and explanation.
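A minimal sketch of one way "matched compute" could be operationalized, assuming the common C ≈ 6·N·D approximation for dense-transformer training FLOPs (N parameters, D tokens); the parameter count and token budget are placeholders, not the actual experimental setup:

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the common C ~= 6*N*D rule
    of thumb (forward plus backward pass of a dense transformer)."""
    return 6.0 * n_params * n_tokens

def matched_tokens(n_params: float, flop_budget: float) -> float:
    """Token budget that lands a model of n_params on a given FLOP budget."""
    return flop_budget / (6.0 * n_params)

# Hypothetical setup: same architecture and size for both languages, so
# matching compute reduces to matching token counts. One subtlety: different
# tokenizers turn the same amount of text into different token counts, so
# "matched tokens" is not the same thing as "matched text".
N = 125e6                        # placeholder parameter count
budget = train_flops(N, 2.5e9)   # budget fixed by, say, the English run
print(f"tokens for the French run: {matched_tokens(N, budget):.3e}")
```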