HarHarVeryFunny 2 days ago
RLVR "reinforcement learning for verifiable rewards" refers to RL used to encourage reasoning towards achieving long-horizon goals in areas such as math and programming, where the correctness/desirability of a generated response (or perhaps an individual reasoning step) can be verified in some way. For example generated code can be verified by compiling and running it, or math results verified by comparing to known correct results. The difficulty of using RL more generally to promote reasoning is that in the general case it's hard to define correctness and therefore quantify a reward for the RL training to use. | ||||||||
zahlman 2 days ago
> The difficulty of using RL more generally to promote reasoning is that, in the general case, it's hard to define correctness and therefore to quantify a reward for the RL training to use.

Ah, hence the "HF" angle.
somewhereoutth a day ago
> generated code can be verified by compiling and running it

I think this gets to the crux of the issue with LLMs for coding (and indeed "test-oriented development"). For anything beyond the most basic level of complexity (i.e. anything actually useful), code cannot be verified by compiling and running it. It can only be verified, and only up to a point, by skilled human inspection and comprehension.

That is the essence of code, really: a definition of action, given by humans, to a machine, for running with *a priori* unenumerated inputs. Otherwise it is just a fancy lookup table. By definition, then, not all inputs and expected outputs can be tabulated, tested for, or rewarded for.
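A toy illustration of that point (mine, not anything from the thread): a literal lookup table over the test inputs passes every check a compile-and-run verifier can perform, yet defines no behavior at all for the unenumerated inputs that real code exists to handle.

    # Spec: "return the absolute value of x". Any test suite can only ever
    # enumerate finitely many inputs:
    TESTS = {-3: 3, 0: 0, 7: 7}

    # The "fancy lookup table": passes every test above while implementing
    # nothing for any other input.
    def fake_abs(x: int) -> int:
        return TESTS[x]  # raises KeyError on any untested input

    # The real thing, whose correctness over all ints a human verifies by
    # comprehension rather than by exhaustive execution:
    def real_abs(x: int) -> int:
        return x if x >= 0 else -x

    assert all(fake_abs(k) == v for k, v in TESTS.items())  # verifier: pass
    assert all(real_abs(k) == v for k, v in TESTS.items())  # verifier: pass
    # Only real_abs is correct on the inputs nobody tabulated:
    # fake_abs(-100) raises KeyError; real_abs(-100) == 100.

A test-based verifier scores both functions identically, which is exactly the gap between what running code can reward and what correctness actually means.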