_fizz_buzz_ 7 hours ago

> their main trick for model improvement is distilling the SOTA models

Could you elaborate? How is this done and what does this mean?

MobiusHorizons 7 hours ago | parent

I am by no means an expert, but I think it is a process that allows training LLMs from other LLMs without needing as much compute or nearly as much data as training from scratch. I think this was the thing DeepSeek pioneered. Don’t quote me on any of that though.

tensor 3 hours ago | parent | next

No, distillation is far older than DeepSeek. DeepSeek was impressive because of algorithmic improvements that allowed them to train a model of that size with vastly less compute than anyone expected, even using distillation.

I also haven’t seen any hard data on how much they actually use distillation-like techniques. They for sure used a lot of synthetically generated data to get better at reasoning, something that is now commonplace.
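
For readers unfamiliar with the term: classic knowledge distillation trains a smaller "student" model to match a larger "teacher" model's output distribution rather than only the hard labels (Hinton et al., 2015). A minimal sketch in PyTorch, assuming you already have logits from both models; the temperature value here is illustrative, not anything specific to DeepSeek:

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # Soften both distributions with a temperature, then push the student's
        # distribution toward the teacher's via KL divergence.
        soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
        student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
        # The T^2 factor keeps gradient magnitudes comparable to a hard-label loss.
        return F.kl_div(student_log_probs, soft_targets,
                        reduction="batchmean") * temperature ** 2

In practice this term is usually mixed with an ordinary cross-entropy loss on the ground-truth labels.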

MobiusHorizons 4 minutes ago | parent

Thanks, it seems I conflated the two.

tickerticker 3 hours ago | parent | prev

Yes. They bounced millions of queries off of ChatGPT to teach/form/train their DeepSeek model. This bot-like querying was the "distillation."
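
Purely as an illustration of what that kind of API-based data collection looks like in general (the model name, prompts, and file path below are placeholders, not details from any documented DeepSeek pipeline):

    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def collect_pairs(prompts, model="gpt-4o"):
        # Query the stronger model and keep prompt/response pairs
        # as synthetic training data.
        records = []
        for prompt in prompts:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            records.append({"prompt": prompt,
                            "response": resp.choices[0].message.content})
        return records

    with open("synthetic_data.jsonl", "w") as f:
        for record in collect_pairs(["Explain knowledge distillation briefly."]):
            f.write(json.dumps(record) + "\n")

Strictly speaking this yields synthetic fine-tuning data rather than logit-level distillation, since a public API does not expose the teacher's full output distribution.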

orbital-decay 23 minutes ago | parent | next

They definitely didn't. They demonstrated their stuff long before OAI and the models were nothing like each other.

SirMaster 2 hours ago | parent | prev

Why would OpenAI allow someone to do that?

MadnessASAP an hour ago | parent

They didn’t, but how do you stop it, given the scale OpenAI is running at?