One mind-bending thing is that self-distillation, i.e., distilling a model into a student with the exact same architecture and parameter count, also often works! https://arxiv.org/abs/2206.08491
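A minimal sketch of the idea, using a toy linear softmax classifier on synthetic 2D data (all names and the dataset here are hypothetical, not from the paper): train a "teacher", then train a freshly initialized "student" of the identical architecture on the teacher's soft predictions instead of the hard labels.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy synthetic data: 2D points, 2 linearly separable classes.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
Y = np.eye(2)[y]  # one-hot hard labels

def train(X, targets, steps=500, lr=0.5):
    """Train a linear softmax classifier on (possibly soft) targets."""
    W = rng.normal(scale=0.1, size=(2, 2))
    for _ in range(steps):
        P = softmax(X @ W)
        # Gradient of cross-entropy w.r.t. W for softmax outputs.
        W -= lr * X.T @ (P - targets) / len(X)
    return W

# Teacher: trained on the hard labels.
W_teacher = train(X, Y)

# Self-distillation: the student has the SAME architecture and is
# trained on the teacher's soft predictions, not the hard labels.
soft_targets = softmax(X @ W_teacher)
W_student = train(X, soft_targets)

def acc(W):
    return (softmax(X @ W).argmax(1) == y).mean()

print(f"teacher acc: {acc(W_teacher):.2f}, student acc: {acc(W_student):.2f}")
```

The surprising empirical finding the note points at is that this kind of same-architecture student often matches or even beats its teacher on real tasks, despite no capacity gap between the two.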