Wtf is a policy? Is this some sort of RL thing that I'm too ML to understand?

Gemini tells me it's the probability of the next token for an LLM. Okay then.

Ifkaluva 10 minutes ago | parent | next [-]

It’s quite common these days to treat an LLM as a policy in the sense that it takes as a “state” the previous context, and its task is to choose a continuation, as an “action”. It gets a “reward” from a reward model that was trained on human preferences, or from a verifiable source, such as passing test cases.

This framing has been in use for several years, as it’s the framing that enables RLHF and RLVR. RLHF itself is quite old; I think it has been used for LLMs since at least the original ChatGPT.
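
To make the state/action/reward framing concrete, here is a minimal sketch in Python. Everything in it is a toy assumption for illustration: `toy_policy`, `sample_action`, and `verifiable_reward` are hypothetical names, the fixed probabilities stand in for a real model, and the "test case" is a single arithmetic check.

```python
import random

# Hypothetical toy policy: maps a state (the prompt/context) to a
# probability distribution over possible actions (continuations).
# A real LLM would compute these probabilities; here they are hard-coded.
def toy_policy(state: str) -> dict:
    return {"a + b": 0.6, "a - b": 0.3, "a * b": 0.1}

def sample_action(state: str, rng: random.Random) -> str:
    """Choose a continuation by sampling from the policy's distribution."""
    dist = toy_policy(state)
    actions = list(dist)
    weights = [dist[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

def verifiable_reward(action: str) -> float:
    """Reward from a verifiable source: does the expression pass a test case?"""
    result = eval(action, {}, {"a": 2, "b": 3})  # toy check: want a + b
    return 1.0 if result == 5 else 0.0

rng = random.Random(0)
state = "Write an expression that adds a and b:"
action = sample_action(state, rng)
print(action, verifiable_reward(action))
```

In RLVR the reward comes from a check like this one; in RLHF it would instead come from a learned reward model.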

mountainriver an hour ago | parent | prev | next [-]

What is this comment? It’s an RL paper; these are standard RL terms.

greesil an hour ago | parent [-]

It's a comment. On Hacker News. Not the RL subreddit, or whatever. I'm just amazed at the jargon. I'm sure it's useful, but one could just call it model output.

antonvs 3 minutes ago | parent [-]

> one could just call it model output.

That would be incorrect. My other reply attempts to address this.

antonvs 5 minutes ago | parent | prev [-]

Gemini didn't really say that exactly, did it? Because it's oversimplified to the point of being wrong.

“Policy” here refers to a probability distribution, i.e. a function that, given some context, assigns probabilities to possible next tokens. It's what a model’s behavior looks like when viewed through an RL lens.
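
As a rough illustration of "a function that, given some context, assigns probabilities to possible next tokens", here is a small Python sketch. The three-token vocabulary and the `toy_logits` scores are made up; a real model would compute logits over its whole vocabulary.

```python
import math

def softmax(logits: dict) -> dict:
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical stand-in for the model's forward pass.
def toy_logits(context: str) -> dict:
    if context.endswith("the cat sat on the"):
        return {"mat": 2.0, "sofa": 1.0, "moon": -1.0}
    return {"mat": 0.0, "sofa": 0.0, "moon": 0.0}

def policy(context: str) -> dict:
    """The policy: context in, distribution over next tokens out."""
    return softmax(toy_logits(context))

print(policy("the cat sat on the"))
```

The point is that the policy is the whole mapping from context to distribution, not any single sampled output.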

The paper discusses “on-policy” and “off-policy” training, which is central to its idea.

Off-policy training is what happens in standard supervised fine-tuning (SFT): the model is trained on examples that were produced independently of the model. This means that the examples have a different distribution than what the model produces. This can have a negative effect on previously learned capabilities.

On-policy training (in this context) uses data generated by the model itself. It samples the model’s own outputs, scores them against whatever results are being trained for, and updates the model based on those scores. This reinforces certain aspects of the model's own pretrained behavior, so it is a "gentler" way to change the model's behavior. The authors claim that this reduces "catastrophic forgetting" and other negative consequences of SFT.
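
A toy Python sketch of the difference, assuming a three-action softmax policy with one logit per action. This is not the paper's method: the on-policy step is a plain REINFORCE-style update without a baseline, and `sft_step`, `on_policy_step`, and the reward function are hypothetical names chosen for illustration.

```python
import math
import random

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(theta, action):
    # d/d theta_i of log softmax(theta)[action] = 1[i == action] - p_i
    probs = softmax(theta)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def sft_step(theta, example_action, lr=0.5):
    """Off-policy: push up the likelihood of an externally supplied example."""
    g = grad_log_prob(theta, example_action)
    return [t + lr * gi for t, gi in zip(theta, g)]

def on_policy_step(theta, reward_fn, rng, lr=0.5):
    """On-policy: sample from the current policy, score it, reinforce it."""
    probs = softmax(theta)
    action = rng.choices(range(len(theta)), weights=probs, k=1)[0]
    reward = reward_fn(action)
    g = grad_log_prob(theta, action)
    return [t + lr * reward * gi for t, gi in zip(theta, g)]

# The model strongly prefers action 0; the SFT example is action 2,
# which the model would rarely produce on its own.
theta = [2.0, 0.0, 0.0]
print(softmax(sft_step(theta, example_action=2)))

# On-policy: only actions the model actually samples can be reinforced.
rng = random.Random(0)
theta_rl = [0.0, 0.0, 0.0]
for _ in range(50):
    theta_rl = on_policy_step(theta_rl, lambda a: 1.0 if a == 0 else 0.0, rng)
print(softmax(theta_rl))
```

Note that the SFT step moves probability toward an example regardless of how likely the model was to produce it, while the on-policy step can only shift weight among outputs the model already samples.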