_fizz_buzz_ 7 hours ago

> their main trick for model improvement is distilling the SOTA models

Could you elaborate? How is this done and what does this mean?

MobiusHorizons 7 hours ago | parent

I am by no means an expert, but I think it is a process that allows training LLMs from other LLMs without needing as much compute or nearly as much data as training from scratch. I think this was the thing DeepSeek pioneered. Don’t quote me on any of that though.

tensor 3 hours ago | parent | next

No, distillation is far older than DeepSeek. DeepSeek was impressive because of algorithmic improvements that allowed them to train a model of that size with vastly less compute than anyone expected, even using distillation.

I also haven’t seen any hard data on how much they actually use distillation-like techniques. They for sure used a lot of synthetically generated data to get better at reasoning, something that is now commonplace.
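
For readers unfamiliar with the term: classic knowledge distillation trains a smaller "student" model to match a larger "teacher" model's output distribution rather than only the hard labels (Hinton et al., 2015). A minimal sketch in PyTorch, assuming you already have logits from both models; the temperature value here is illustrative, not anything specific to DeepSeek:

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # Soften both distributions with a temperature, then push the student's
        # distribution toward the teacher's via KL divergence.
        soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
        student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
        # The T^2 factor keeps gradient magnitudes comparable to a hard-label loss.
        return F.kl_div(student_log_probs, soft_targets,
                        reduction="batchmean") * temperature ** 2

In practice this term is usually mixed with an ordinary cross-entropy loss on the ground-truth labels.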

MobiusHorizons 4 minutes ago | parent

Thanks, it seems I conflated the two.

tickerticker 3 hours ago | parent | prev

Yes. They bounced millions of queries off of ChatGPT to teach/form/train their DeepSeek model. This bot-like querying was the "distillation."
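
Purely as an illustration of what that kind of API-based data collection looks like in general (the model name, prompts, and file path below are placeholders, not details from any documented DeepSeek pipeline):

    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def collect_pairs(prompts, model="gpt-4o"):
        # Query the stronger model and keep prompt/response pairs
        # as synthetic training data.
        records = []
        for prompt in prompts:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            records.append({"prompt": prompt,
                            "response": resp.choices[0].message.content})
        return records

    with open("synthetic_data.jsonl", "w") as f:
        for record in collect_pairs(["Explain knowledge distillation briefly."]):
            f.write(json.dumps(record) + "\n")

Strictly speaking this yields synthetic fine-tuning data rather than logit-level distillation, since a public API does not expose the teacher's full output distribution.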

orbital-decay 23 minutes ago | parent | next

They definitely didn't. They demonstrated their stuff long before OAI and the models were nothing like each other.

SirMaster 2 hours ago | parent | prev

Why would OpenAI allow someone to do that?

MadnessASAP an hour ago | parent

They didn’t, but how do you stop it, given the scale OpenAI is running at?