| ▲ | teleforce 13 hours ago | |
Please check the recent self-distillation work by MIT-ETH, UCLA and Apple [1],[2],[3],[4],[5]. Given the release timelines, I suspect all the 4.x models after Opus 4 are self-distillation-based fine-tuned models. The latest paper, by Apple, focuses on code generation using a simple technique, hence the name simple self-distillation (SSD) [4],[5]. I've got a strong feeling that self-distillation is the second-best thing to happen to LLMs after the transformer breakthrough. [1] Self-Distillation Enables Continual Learning [pdf] (25 comments): https://news.ycombinator.com/item?id=48165265 [2] Self-Distillation Enables Continual Learning: https://arxiv.org/abs/2601.19897 [3] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models: https://arxiv.org/abs/2601.18734 [4] Embarrassingly simple self-distillation improves code generation (201 comments): https://news.ycombinator.com/item?id=47637757 [5] Embarrassingly Simple Self-Distillation Improves Code Generation: | ||
| ▲ | rao-v 9 hours ago | parent [-] | |
So first, these are terrific papers, and I'd not seen some of them before. Having said that, I don't think these are classic student-teacher distillation from random (which was my point). In fact, the "Embarrassingly Simple Self-Distillation" paper is using exactly what I was talking about: "fine-tune on those samples with standard supervised fine-tuning". | ||
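
For readers who want the quoted recipe in concrete form, here is a toy sketch of "sample from the model, then fine-tune on those samples with standard supervised fine-tuning". It is not the paper's implementation: the "model" is just a softmax over three tokens, and every name and setting (`self_distill`, `sft_step`, the sampling temperature of 0.5, the learning rate, the sample count) is an assumption chosen for illustration.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional sampling temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_from_model(logits, n, temperature, rng):
    """Draw n token ids from the model's own distribution (no teacher involved)."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=n)


def sft_step(logits, samples, lr):
    """One gradient-descent step on the mean cross-entropy of the samples."""
    probs = softmax(logits)
    counts = [0] * len(logits)
    for s in samples:
        counts[s] += 1
    n = len(samples)
    # For a softmax model, d(mean CE)/d(logit_i) = p_i - empirical_frequency_i.
    return [x - lr * (p - c / n) for x, p, c in zip(logits, probs, counts)]


def self_distill(logits, n_samples=2000, temperature=0.5, lr=1.0, steps=200, seed=0):
    """Sample from the current model, then fine-tune the same model on those samples.

    All hyperparameters here are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    samples = sample_from_model(logits, n_samples, temperature, rng)
    for _ in range(steps):
        logits = sft_step(logits, samples, lr)
    return logits


if __name__ == "__main__":
    start = [1.0, 0.5, 0.0]
    tuned = self_distill(start)
    print("before:", [round(p, 3) for p in softmax(start)])
    print("after: ", [round(p, 3) for p in softmax(tuned)])
```

In this toy, sampling at a temperature below 1 over-represents the model's preferred token, so fine-tuning on its own samples sharpens the distribution. The point is only that the student and the teacher are the same model and the training step is plain cross-entropy on self-generated data.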