aszen a day ago

Yeah that's one example, but I suspect they train the model on entire sequences of tool calls, so unless you prompt the model exactly the way they do, you won't get the same results.

There's a reason they won the agent race: their models are trained to use their own tools.

libraryofbabel a day ago | parent

Agreed, the RLVR tasks at this point are probably long series of tool calls carrying out complex tasks in some simulated dev environment.
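
Roughly, I picture each rollout looking something like this toy sketch: the policy emits tool calls against a sandboxed repo, and the reward is verifiable (a hidden test either passes or it doesn't). All the names here are made up for illustration, not anything from Anthropic's actual pipeline:

    import random

    TOOLS = ["read_file", "write_file", "run_tests", "submit"]

    class SimulatedRepo:
        def __init__(self):
            # One deliberately buggy file; the "task" is to make the hidden test pass.
            self.files = {"add.py": "def add(a, b):\n    return a - b\n"}

        def execute(self, tool, arg):
            if tool == "read_file":
                return self.files.get(arg, "<missing>")
            if tool == "write_file":
                self.files["add.py"] = arg
                return "ok"
            if tool == "run_tests":
                return "pass" if self.tests_pass() else "fail"
            return "submitted"

        def tests_pass(self):
            ns = {}
            exec(self.files["add.py"], ns)  # hidden, verifiable check
            return ns["add"](2, 3) == 5

    def dummy_policy(transcript):
        # A real model would condition on the whole transcript; here we just guess.
        tool = random.choice(TOOLS)
        arg = "def add(a, b):\n    return a + b\n" if tool == "write_file" else "add.py"
        return tool, arg

    def rollout(max_steps=10):
        env, transcript = SimulatedRepo(), []
        for _ in range(max_steps):
            tool, arg = dummy_policy(transcript)
            obs = env.execute(tool, arg)
            transcript.append((tool, arg, obs))
            if tool == "submit":
                break
        reward = 1.0 if env.tests_pass() else 0.0  # verifiable reward for the RL update
        return transcript, reward

    print(rollout()[1])

The RL update would then reinforce whole trajectories that earned the reward, which is presumably why the exact tool formats the model saw in training matter so much.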

That said, I think it's hard to say how much difference it really makes in making Claude Code specifically better than other coding agents using the same LLM (versus just making the LLM better for any coding agent with roughly similar tools). There is probably some difference, but you'd need to run a lot of benchmarks to find out.

aszen a day ago | parent

Agreed, it probably contributes to the model improving for all agents, but crucially it is verifiably better with their own agent, so they get a good feedback loop that improves both.