CamperBob2 3 days ago
"It's like taking a student who wins a gold in IMO math, but can't solve easier math problems"

I've tried to think of specific follow-up questions that would help me understand your point of view, but other than "Cite some examples of easier problems that a successful IMO-level model will fail at," I've got nothing. Overfitting is always a risk, but if a model can overfit to problems it hasn't seen before, that's the fault of the test administrators for reusing old problem forms or otherwise not including enough variety.

GPT itself suggests[1] that problems involving heavy arithmetic would qualify, and I can see that being the case if the model isn't allowed to use tools. However, arithmetic doesn't require much in the way of reasoning, and in any case the best reasoning models are now quite decent at unaided arithmetic. The same goes for the tried-and-true 'strawberry' example GPT cites, which involves introspection of its own tokens; reasoning models are much better at that than base models. Unit conversions were another past weakness that no longer seems to crop up much.

So what would some present-day examples be, where models that can perform complex CoT tasks fail on simpler ones in ways that reveal they aren't really "thinking"?

1: https://chatgpt.com/share/695be256-6024-800b-bbde-fd1a44f281...
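For concreteness, the kind of "easy" probes I have in mind all have deterministic answers you can compute directly and check a model's reply against. A minimal sketch (the probe wording and numbers here are made up purely for illustration):

    # Deterministic "easy" probes of the kind discussed above: letter counting,
    # plain arithmetic, and a unit conversion. Expected answers are computed
    # directly, so a model's reply can be checked against ground truth.
    # (Illustrative only; the specific questions and values are invented.)

    probes = {
        "How many times does the letter 'r' appear in 'strawberry'?":
            str("strawberry".count("r")),        # 3
        "What is 7919 * 6733?":
            str(7919 * 6733),                    # 53318627
        "Convert 65 miles per hour to metres per second (2 decimals).":
            f"{65 * 1609.344 / 3600:.2f}",       # 29.06
    }

    for question, expected in probes.items():
        print(f"Q: {question}\n   expected: {expected}")

A model that clears IMO-level problems but flubs unaided checks like these would be the sort of evidence I'm asking for.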
lazarus01 2 days ago
In response to your direct question: https://gail.wharton.upenn.edu/research-and-insights/tech-re...

"This indicates that while CoT can improve performance on difficult questions, it can also introduce variability that causes errors on 'easy' questions the model would otherwise answer correctly."

In response to the strawberry example: there are 25,000 people employed globally who repair broken responses and create training data, a big whack-a-mole effort to remediate embarrassing errors.