antonvs an hour ago

Gemini didn't really say that exactly, did it? Because it's oversimplified to the point of being wrong.

“Policy” here refers to a probability distribution, i.e. a function that, given some context, assigns probabilities to possible next tokens. It's what a model’s behavior looks like when viewed through an RL lens.
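To make that concrete, here's a minimal sketch (not from the paper; the logits are made up) of a policy as a softmax over a model's next-token logits:

```python
import math

def policy(logits):
    # A "policy" in the RL view of a language model: given some context,
    # map the model's raw logits to a probability distribution over
    # possible next tokens (a numerically stable softmax).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-token vocabulary given some context:
probs = policy([2.0, 1.0, 0.5, -1.0])
print(probs)       # a probability distribution: non-negative, sums to 1
```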

The paper discusses “on-policy” and “off-policy” training, which is central to its idea.

Off-policy training is what happens in standard supervised fine-tuning (SFT): the model is trained on examples that were produced independently of the model, so the training data has a different distribution than the model's own outputs. Pulling the model toward that external distribution can degrade previously learned capabilities.
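As a toy illustration (my own sketch, with made-up logits and a hypothetical target): the SFT objective is just the negative log-probability the model assigns to a dataset-supplied token, regardless of what the model itself would have generated:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sft_loss(logits, target_idx):
    # Off-policy / SFT objective: cross-entropy against a target token
    # drawn from an external dataset, not from the model's own samples.
    probs = softmax(logits)
    return -math.log(probs[target_idx])

# The dataset dictates token 3 as the target, even though this model
# strongly prefers token 0 -- the gradient pushes probability mass
# toward the external distribution, away from the model's own.
print(sft_loss([3.0, 0.0, 0.0, -2.0], target_idx=3))
```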

On-policy training (in this context) uses data generated by the model itself: it samples the model's own outputs, scores them against whatever objective is being trained for, and updates the model based on those scores. Because this reinforces aspects of the model's own pretrained behavior rather than imposing an external distribution, it's a "gentler" way to change the model's behavior. The authors claim this reduces "catastrophic forgetting" and other negative consequences of SFT.
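That sample-score-update loop can be sketched as a one-step REINFORCE-style toy (again my own illustration, not the paper's algorithm; the reward table and learning rate are made up):

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy single-step "model": logits over a 3-token vocabulary.
logits = [0.0, 0.0, 0.0]
reward = [0.0, 1.0, 0.0]   # hypothetical scorer: token 1 is "good"
lr = 0.5
random.seed(0)

for _ in range(200):
    probs = softmax(logits)
    # On-policy: sample from the model's *own* distribution...
    a = random.choices(range(3), weights=probs)[0]
    r = reward[a]
    # ...then nudge the log-probability of the sampled token in
    # proportion to its score (REINFORCE gradient for a softmax policy).
    for i in range(3):
        grad = (1.0 if i == a else 0.0) - probs[i]
        logits[i] += lr * r * grad

# Probability mass concentrates on the rewarded token, reinforcing
# behavior the model already produces rather than imposing new data.
print(softmax(logits))
```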