Self-Distillation Enables Continual Learning [pdf] (arxiv.org)
18 points by teleforce 3 hours ago | 5 comments
ArchieScrivener an hour ago
From Jan 2026. This is very interesting: "Empirical Validation. While we cannot verify these theoretically, we evaluate each empirically. We use the Qwen-2.5-7B-Instruct model (Hui et al., 2024) as the base policy and the ToolAlpaca dataset (Tang et al., 2023). In this benchmark, the model receives a tool-API specification and a user request, and must identify the correct tool call. Without demonstrations, the base model solves only 42% of examples. When provided with the appropriate demonstration c for each prompt x, the teacher achieves a 100% success rate. To further test reward proximity, we manually inspected 50 teacher reasoning traces. In all cases, not only were the final tool calls correct, but the intermediate chain-of-thought was valid and semantically grounded. This suggests that the teacher is reconstructing a correct reasoning process rather than merely copying the expert output. These observations provide evidence for the first requirement, that the demonstration-conditioned model behaves as an optimal policy."
airstrike an hour ago
Both title and abstract feel a little too confident, which ironically makes me more skeptical rather than less. I find the choice of the words "enable" in the title and "establishing" at the end of the abstract to be particularly jarring.
greesil 32 minutes ago
Wtf is a policy? Is this some sort of RL thing that I'm too ML to understand? Gemini tells me it's the probability of the next token for an LLM. Okay then.