hodgehog11 | 3 days ago
I keep wondering whether people have actually examined how this work draws its conclusions before citing it. This is science at its worst, where you start at an inflammatory conclusion and work backwards. There is nothing particularly novel presented here, especially not in the mathematics; obviously performance will degrade on out-of-distribution tasks (and it will do so for humans under the same formulation), but the real question is how out-of-distribution a lot of tasks actually are if they can still be solved with CoT. Yes, if you restrict the dataset, then it will perform poorly. But humans already have a pretty large visual dataset to pull from, so what are we comparing to here? How do tiny language models trained on small amounts of data demonstrate fundamental limitations? I'm eager to see more work showing the limitations of LLM reasoning, both at small and large scale, but this ain't it. Others have already supplied similar critiques, so let's please stop sharing this one around without a grain of salt.
ipaddr | 3 days ago | parent
"This is science at its worst, where you start at an inflammatory conclusion and work backwards" Science starts with a guess and you run experiments to test. | ||||||||||||||||||||||||||||||||||||||