Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning

	Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning (arxiv.org)
		1 point by krackers 11 hours ago | 1 comment

	krackers 11 hours ago
		That CoTs often "hallucinate" was known anecdotally, but the authors study it more systematically here and propose ways to mitigate it. Apparently SFT on "meaningful" reasoning traces provides enough of a scaffold that later RL produces meaningful, "truthful" traces rather than just the appearance of reasoning. See also the author's summary at https://x.com/qinan_yu/status/2049865788304380239