Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning (arxiv.org)
1 point by krackers 11 hours ago | 1 comment
krackers 11 hours ago

It was known anecdotally that CoTs often "hallucinate", but they study it more systematically here and propose ways to mitigate it. Apparently SFT'ing on "meaningful" reasoning traces first provides enough of a scaffold that later RL produces meaningful/"truthful" traces rather than just the appearance of reasoning. See also the author's summary at https://x.com/qinan_yu/status/2049865788304380239