Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning (arxiv.org)
1 point by krackers 11 hours ago | 1 comment
krackers 11 hours ago
The fact that CoTs often "hallucinate" was known anecdotally, but the authors study it more systematically here and propose ways to mitigate it. Apparently SFT'ing on "meaningful" reasoning traces provides enough of a scaffold that later RL produces meaningful/"truthful" traces rather than just the appearance of reasoning. See also the author's summary at https://x.com/qinan_yu/status/2049865788304380239