ACCount37 an hour ago
A reason to do student-teacher distillation is that soft target logits are, in general, a richer medium than text, which tokenizes to hard targets: you get more steering signal per teacher token. And running ultra-large, 10T-tier models in autoregressive generation mode can get expensive. So there are reasons not to reduce everything to text-only synthetics.
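To make the "more signal per token" point concrete, here is a minimal sketch comparing the two training signals at a single token position. Everything in it is an illustrative assumption: the 4-token vocabulary, the logit values, the temperature of 2.0, and the function names are made up, and this is not any particular lab's distillation recipe.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_target_loss(student_logits, target_id):
    """Cross-entropy against one token id (the text-only synthetic data case)."""
    probs = softmax(student_logits)
    return -math.log(probs[target_id])

def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over the whole vocabulary at one position."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over a 4-token vocabulary (values are made up).
teacher_logits = [2.0, 1.5, -1.0, -3.0]
student_logits = [0.5, 0.0, 0.0, 0.0]

# Text-only synthetics keep just the token the teacher emitted
# (greedy pick here, for simplicity).
teacher_probs = softmax(teacher_logits)
sampled_id = max(range(len(teacher_probs)), key=teacher_probs.__getitem__)

print("teacher distribution:", [round(p, 3) for p in teacher_probs])
print("hard-target CE:", hard_target_loss(student_logits, sampled_id))
print("soft-target KL:", soft_target_loss(student_logits, teacher_logits))
```

The hard target only tells the student that token 0 was picked. The soft target also tells it that token 1 was nearly as plausible and that tokens 2 and 3 were not, which is the extra steering signal. It also comes out of one teacher forward pass over an existing sequence, with no token-by-token sampling needed.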