adamzwasserman 2 days ago

On effect size: my primary goal at this stage is falsification. If French and English models show no meaningful differences at matched compute, that's informative: it would support the scaling hypothesis. If they do differ, I'll need to be careful about causal claims, but it would at least challenge the "transformers are magic" framing that treats architecture as the main story.
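To make "no meaningful differences at matched compute" concrete, here is a minimal sketch of how such a comparison could be quantified, using a standardized effect size (Cohen's d) with a bootstrap interval. The per-task scores below are purely hypothetical placeholders, not results from any actual French/English runs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-task benchmark scores for two models trained at matched
# compute; in practice these would come from the real evaluation runs.
scores_en = np.array([0.71, 0.64, 0.58, 0.69, 0.73, 0.61, 0.66, 0.70])
scores_fr = np.array([0.68, 0.62, 0.60, 0.67, 0.70, 0.63, 0.64, 0.69])

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Point estimate plus a bootstrap interval, so "no meaningful difference"
# becomes a claim about the whole interval, not just the point estimate.
d = cohens_d(scores_en, scores_fr)
boot = [
    cohens_d(rng.choice(scores_en, len(scores_en), replace=True),
             rng.choice(scores_fr, len(scores_fr), replace=True))
    for _ in range(10_000)
]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"Cohen's d = {d:.2f}, 95% bootstrap CI [{lo:.2f}, {hi:.2f}]")
```

An interval that stays near zero would be the falsification outcome described above; a clearly nonzero one would be the starting point for the more careful causal analysis.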

The L2 regularization and information theory pointers are helpful; they'll go on my reading list. If you have favorites, I'll start there.

On the "we know nothing" point: I'm sympathetic. The Stroop example is exactly why I'm skeptical of strong claims in either direction. 197k papers and no mechanism suggests language processing has properties we don't yet have frameworks to describe. That's not mysticism. It's just acknowledging the gap between phenomenon and explanation.