zahlman 2 days ago
I thought I'd read a lot of these threads this year, and I've also discussed coding agents and the technology behind them off-site, but this is genuinely the first time I've seen the term "RLVR".
HarHarVeryFunny 2 days ago
RLVR ("reinforcement learning with verifiable rewards") refers to RL used to encourage reasoning toward long-horizon goals in areas such as math and programming, where the correctness or desirability of a generated response (or perhaps an individual reasoning step) can be verified in some way. For example, generated code can be verified by compiling and running it, and math results can be verified by comparing them to known correct results. The difficulty with using RL more generally to promote reasoning is that, in the general case, it's hard to define correctness and therefore to quantify a reward for the RL training to use.
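
To make that concrete, here is a minimal Python sketch of what a verifiable reward might look like. The names `code_reward` and `math_reward` are made up for illustration and don't come from any particular RL library. It also assumes the candidate code is safe to run in-process; a real training setup would presumably use a sandbox with a timeout rather than calling `exec` directly.

```python
def code_reward(generated_code: str, test_code: str) -> float:
    """Hypothetical reward: 1.0 if the generated code compiles and passes the tests, else 0.0."""
    namespace: dict = {}  # fresh namespace so candidates cannot see each other's definitions
    try:
        # "Compile" step: a syntax error in the candidate raises SyntaxError here.
        compiled = compile(generated_code + "\n" + test_code, "<candidate>", "exec")
        # "Run" step: a failing assert or any runtime error raises an exception here.
        exec(compiled, namespace)
    except Exception:
        return 0.0
    return 1.0


def math_reward(model_answer: str, reference_answer: str) -> float:
    """Hypothetical reward: 1.0 if the final answer matches the known correct result."""
    # Assumption: answers are compared as whitespace-trimmed strings.
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0
```

With this sketch, `code_reward("def add(a, b):\n    return a + b", "assert add(2, 3) == 5")` gives 1.0, while a version that returns `a - b` gives 0.0. The point is that the reward comes from a mechanical check, not from a human or a learned judge, which is exactly what is hard to come by outside domains like code and math.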