| ▲ | rao-v 3 hours ago | |
It's really worth distinguishing between old-fashioned student-teacher distillation (i.e. at the level of layers, weights and distributions) and large-scale synthetic dataset creation. The latter is much better, since you can clean up, review, and update responses and filter your datasets. I suspect nobody is doing real student-teacher distillation; it's just easier to do a bunch of training on the same giant corpus, then post-train on the synthetic corpus with its reasoning traces etc. (which might have been generated by a bigger, better LLM). | ||
| ▲ | ACCount37 8 minutes ago | parent | next [-] | |
A reason to do student-teacher distillation is that soft target logits are, in general, a richer medium than text that tokenizes to hard targets: you get more steering signal per teacher token. And running ultra-large, 10T-tier models in autoregressive generation mode can get expensive. So there are reasons not to reduce everything to text-only synthetics. | ||
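
To make the soft vs. hard target point concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not anyone's actual training code: the function names, the toy 4-token vocabulary, the logit values, and the temperature of 2.0 are all made up for the example. The hard-target loss only sees which token the teacher emitted, while the soft-target loss (a temperature-scaled KL divergence, as in classic distillation) sees the teacher's whole distribution at that position.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_target_loss(student_logits, target_index):
    """Cross-entropy against one emitted token (text-only synthetic data)."""
    probs = softmax(student_logits)
    return -math.log(probs[target_index])


def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over the full vocabulary, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Hypothetical logits for a single position over a toy 4-token vocabulary.
teacher_logits = [4.0, 3.5, 0.5, -2.0]  # teacher thinks tokens 0 and 1 are both plausible
student_logits = [1.0, 1.0, 1.0, 1.0]   # untrained student: uniform

# Text-only synthetics keep just the emitted token (greedy pick here).
sampled_token = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)

hard = hard_target_loss(student_logits, sampled_token)
soft = soft_target_loss(student_logits, teacher_logits)

print(f"hard-target loss (token {sampled_token} only): {hard:.4f}")
print(f"soft-target loss (full distribution):  {soft:.4f}")
```

In this toy setup the hard target tells the student only that token 0 was chosen; the information that token 1 was nearly as likely is lost once the teacher's output is reduced to text. The soft-target loss keeps that signal, at the cost of needing the teacher's logits (and a compatible vocabulary) at training time.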