girvo 16 hours ago
> I suspect nobody is doing real student teacher distillation

It gets used for quantisation, basically recovering accuracy for lower quants (Nvidia calls it QAD). Can't speak to how widespread it is, though.
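
For readers who haven't seen it, here is a rough sketch of the kind of objective distillation usually refers to: the student (here, imagine the low-precision model) is trained to match the teacher's output distribution via a KL divergence on softened logits. This is only an illustration and not Nvidia's actual QAD recipe; the function names, the temperature parameter, and the example logits are all made up for the sketch.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over the softened output distributions.

    Hypothetical helper: in a real setup the student logits would come
    from the quantised model and the teacher logits from the
    full-precision one, evaluated on the same input.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Made-up logits for a single token position.
teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.9, -0.7]  # stand-in for the quantised model's output
loss = distillation_loss(teacher, student, temperature=2.0)
```

The loss is zero when the two sets of logits agree and grows as the student's distribution drifts from the teacher's, which is the signal used to pull the lower-precision model back toward the original.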
rao-v 15 hours ago
Yes, absolutely! I should have been more specific: I don't believe people are using it to train 30B models from 300B models (and I'd love to learn that I'm off about this).