hodgehog11 3 days ago
True, but the experiments are engineered to give the results they want. It's a mathematical certainty that performance will drop off here, but that is not an accurate assessment of what happens at scale. If you present an appropriately large, well-trained model with in-context patterns, it often does a decent job even when it wasn't trained on them. By nerfing the model (4 layers), the conclusion is foregone. I honestly wish this paper showed what it claims, since understanding CoT reasoning relative to the underlying training set is a significant open problem.
lossolo 3 days ago
Without a provable holdout, the claim that "large models do fine on unseen patterns" is unfalsifiable. In controlled from-scratch training, CoT performance collapses under even modest distribution shift, even when the chains look plausible. If you have results where the transformation family is provably excluded from training and a large model still shows robust CoT, please share them. Otherwise this paper's claim stands for the regime it tests.
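For concreteness, here is a minimal sketch (Python, with hypothetical names like FAMILIES and make_example) of what a provable holdout means here: transformation families are partitioned before any data is generated, so the eval families cannot leak into the training set.

    # Minimal sketch of a "provable holdout" protocol. All names here
    # (FAMILIES, make_example) are hypothetical placeholders.
    import random

    FAMILIES = ["rotate", "shift", "swap", "negate", "reverse", "scale"]

    def split_families(families, n_holdout, seed=0):
        """Disjoint train/eval split at the family level, fixed by seed."""
        rng = random.Random(seed)
        shuffled = families[:]
        rng.shuffle(shuffled)
        return shuffled[n_holdout:], shuffled[:n_holdout]

    train_families, eval_families = split_families(FAMILIES, n_holdout=2)
    assert not set(train_families) & set(eval_families)  # provable exclusion

    # Training data is generated only from train_families, so any CoT
    # accuracy on eval_families is attributable to generalization, not
    # memorization of the transformation family.
    # train_set = [make_example(f) for f in train_families for _ in range(10_000)]
    # eval_set  = [make_example(f) for f in eval_families  for _ in range(1_000)]

With web-scale pretraining you can't run this kind of split, which is exactly why the "large models do fine" claim can't be checked.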