SEGyges 2 days ago
My short explanation would be that even for RL, you are training on a next-token objective; but the next token is something that has been selected very carefully for solving the problem, and was generated by the model itself. So you're amplifying existing trajectories in the model by feeding the model's outputs back to itself, but only when those outputs solve a problem. This explanation elides the KL penalty and the group-relative scoring, which leave the objective the same in the limit but make it vastly more efficient in practice.
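
To make that concrete, here is a minimal sketch of the "filtered self-training" reading of RL described above: sample completions from the model, keep only the ones that solve the problem, and train on them with the ordinary next-token cross-entropy plus a KL penalty against a frozen reference copy. It deliberately omits the group-relative scoring mentioned above, and names like solves_problem, the sampling settings, and kl_coef are hypothetical placeholders, not anyone's actual implementation; it assumes a Hugging Face causal LM and PyTorch.

    import torch
    import torch.nn.functional as F

    def filtered_next_token_step(model, ref_model, tokenizer, prompt,
                                 solves_problem, optimizer,
                                 num_samples=8, kl_coef=0.1):
        """One step of reward-filtered fine-tuning on the model's own outputs.

        ref_model is a frozen copy of the model (made before training starts)
        used only for the KL penalty.
        """
        device = next(model.parameters()).device
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

        # 1. Let the model generate its own candidate trajectories.
        with torch.no_grad():
            samples = model.generate(prompt_ids, do_sample=True,
                                     max_new_tokens=256,
                                     num_return_sequences=num_samples)

        # 2. Keep only the outputs that actually solve the problem
        #    (the "very carefully selected" next tokens).
        kept = [s for s in samples
                if solves_problem(tokenizer.decode(s, skip_special_tokens=True))]
        if not kept:
            return None  # nothing to reinforce this step

        # 3. Train on the kept outputs with a plain next-token objective
        #    plus a KL penalty toward the frozen reference model.
        total_loss = 0.0
        for seq in kept:
            seq = seq.unsqueeze(0)
            logits = model(seq).logits[:, :-1]      # predict token t+1 from tokens <= t
            targets = seq[:, 1:]
            nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                  targets.reshape(-1))

            with torch.no_grad():
                ref_logits = ref_model(seq).logits[:, :-1]
            kl = F.kl_div(F.log_softmax(logits, dim=-1),
                          F.log_softmax(ref_logits, dim=-1),
                          log_target=True, reduction="batchmean")

            total_loss = total_loss + nll + kl_coef * kl

        loss = total_loss / len(kept)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

The point of the sketch is just the shape of the loop: the gradient update itself is ordinary next-token training, and all the "RL" lives in where the training tokens come from (the model's own sampled outputs) and which of them get kept (only the ones that solve the problem).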