MobiusHorizons 7 hours ago
I am by no means an expert, but I think it is a process that allows training LLMs from other LLMs without needing as much compute or nearly as much data as training from scratch. I think this was the thing DeepSeek pioneered. Don’t quote me on any of that though.
tensor 3 hours ago
No, distillation is far older than DeepSeek. DeepSeek was impressive because of algorithmic improvements that allowed them to train a model of that size with vastly less compute than anyone expected, even using distillation. I also haven’t seen any hard data on how much they used distillation-like techniques. They for sure used a bunch of synthetically generated data to get better at reasoning, something that is now commonplace.
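For context, a minimal sketch of the classic distillation setup (Hinton et al., 2015): the student is trained to match the teacher's softened output distribution rather than hard labels. All model and variable names here are illustrative, not a description of DeepSeek's actual training code.

    # Classic logit-based knowledge distillation: the student mimics the
    # teacher's softened probability distribution over the vocabulary/classes.
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        t = temperature
        # Soften both distributions, then penalize their KL divergence.
        student_log_probs = F.log_softmax(student_logits / t, dim=-1)
        teacher_probs = F.softmax(teacher_logits / t, dim=-1)
        # Scale by t^2 so gradient magnitudes stay comparable across temperatures.
        return F.kl_div(student_log_probs, teacher_probs,
                        reduction="batchmean") * (t * t)

    # Typical training step (teacher frozen, student trainable):
    #   with torch.no_grad():
    #       teacher_logits = teacher(input_ids).logits
    #   student_logits = student(input_ids).logits
    #   loss = distillation_loss(student_logits, teacher_logits)
    #   loss.backward(); optimizer.step()

Because the student learns from full probability distributions rather than one-hot labels, it generally needs far less data and compute than training from scratch, which is the point the parent comment is gesturing at.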
tickerticker 3 hours ago
Yes. They bounced millions of queries off of ChatGPT to teach/form/train their DeepSeek model. This bot-like querying was the "distillation."
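That flavor of distillation doesn't need the teacher's logits at all: you sample completions from the teacher and fine-tune the student on the resulting prompt/response pairs (sequence-level distillation). A rough sketch using the OpenAI Python SDK as the teacher interface; the prompts, teacher model name, and file path are placeholders, and none of this is a claim about DeepSeek's actual pipeline.

    # Sample responses from a teacher model over the API and save them as
    # supervised fine-tuning data for a smaller student model.
    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    prompts = [
        "Explain why the sky is blue.",       # placeholder prompts
        "Prove that sqrt(2) is irrational.",
    ]

    with open("teacher_outputs.jsonl", "w") as f:
        for prompt in prompts:
            resp = client.chat.completions.create(
                model="gpt-4o-mini",  # teacher model (placeholder)
                messages=[{"role": "user", "content": prompt}],
                temperature=0.7,
            )
            answer = resp.choices[0].message.content
            # Each line becomes one fine-tuning example for the student.
            f.write(json.dumps({"prompt": prompt, "response": answer}) + "\n")

    # The JSONL is then used as ordinary supervised fine-tuning data for the
    # student (e.g. a standard causal-LM training loop over prompt + response).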