dcre 3 days ago

The other commenter is more articulate, but you simply cannot draw the conclusion from this paper that reasoning models don't work well. They trained tiny models and showed they don't work. Big surprise! Meanwhile, every other piece of available evidence shows that reasoning models are more reliable on sophisticated problems. A few examples:

- https://arcprize.org/leaderboard

- https://aider.chat/docs/leaderboards/

- https://arstechnica.com/ai/2025/07/google-deepmind-earns-gol...

Surely the IMO problems weren't "within the bounds" of Gemini's training data.

robrenaud 3 days ago | parent

The Gemini IMO result used a model specifically fine-tuned for math.

Certainly they weren't training on the unreleased problems, but defining "out of distribution" gets tricky.

simianwords 3 days ago | parent | next

>The Gemini IMO result used a model specifically fine-tuned for math.

This is false.

https://x.com/YiTayML/status/1947350087941951596

This is false even for the OpenAI model:

https://x.com/polynoamial/status/1946478250974200272

"Typically for these AI results, like in Go/Dota/Poker/Diplomacy, researchers spend years making an AI that masters one narrow domain and does little else. But this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques."

Workaccount2 3 days ago | parent | prev

Every human taking that exam has fine-tuned for math, specifically on IMO problems.