Self-Distillation Enables Continual Learning [pdf]

From Jan 2026.

This is very interesting:

"Empirical Validation. While we cannot verify these theoretically, we evaluate each empirically. We use the Qwen-2.5-7B-Instruct model (Hui et al., 2024) as the base policy and the ToolAlpaca dataset (Tang et al., 2023). In this benchmark, the model receives a tool-API specification and a user request, and must identify the correct tool call. Without demonstrations, the base model solves only 42% of examples. When provided with the appropriate demonstration c for each prompt x , the teacher achieves a 100% success rate. To further test reward proximity, we manually inspected 50 teacher reasoning traces. In all cases, not only were the final tool calls correct, but the intermediate chain-of-thought was valid and semantically grounded. This suggests that the teacher is reconstructing a correct reasoning process rather than merely copying the expert output. These observations provide evidence for the first requirement, that the demonstration-conditioned model behaves as an optimal policy."

airstrike an hour ago

Both title and abstract feel a little too confident, which ironically makes me more skeptical rather than less.

I find the choice of the words "enable" in the title and "establishing" at the end of the abstract to be particularly jarring.

greesil 32 minutes ago

Wtf is a policy? Is this some sort of RL thing that I'm too ML to understand?

Gemini tells me it's the probability of the next token for an LLM. Okay then.

mountainriver 24 minutes ago

What is this comment? It’s an RL paper; these are standard RL terms.

greesil 19 minutes ago

It's a comment. On Hacker News. Not the RL subreddit, or whatever. I'm just amazed at the jargon. I'm sure it's useful, but one could just call it model output.