| ▲ | supern0va 6 hours ago |
| >It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years. I don't disagree, but how much of this ends up being distillation? I can't help but imagine that 4.8 was probably trained in part by leveraging Mythos. If the very large models turn out to be very expensive to run relative to the benefits, it's possible that they could end up still being trained, but ultimately used as a tool to create smaller models that are nearly as effective. I'm curious if someone here with a stronger background in the space has a similar intuition or not. |
|
| ▲ | ACCount37 34 minutes ago | parent | next [-] |
| Scale is always desirable, and there are always gains from scale. It's a matter of whether you can afford training and inference at increased scale. There is a real trend of smaller models becoming more "capability-dense" - i.e. the best 8Bs of today beat the best 32Bs of 2 years ago. This is in part a product of distillation being used to train the smaller models. But people consistently underestimate how "capability hungry" the world is. There are diminishing returns on model capabilities in "summarize the search results"-type applications, but as capabilities improve, LLMs enter new niches, find their footing and begin to dominate them. At times, expensive, highly desirable niches. I do not expect anyone at the frontier to pop up and say "no reason to train a new model" within the next decade. There will always be a demand for an LLM that's 5-10% more capable and more reliable at some advanced task, and generational upgrades will keep delivering those 5-10%, from increased scale and improved training both. |
|
| ▲ | rao-v 4 hours ago | parent | prev | next [-] |
| It’s really worth distinguishing between old-fashioned student teacher distillation (i.e. at the level of layers, weights and distributions) and large-scale synthetic dataset creation. The latter is much better (since you can clean up, review, update responses and filter your datasets). I suspect nobody is doing real student teacher distillation; it’s just easier to do a bunch of training on the same giant corpus, then post-train on the synthetic corpus with its reasoning traces etc. (which might have been generated by a bigger, better LLM). |
| |
| ▲ | girvo 19 minutes ago | parent | next [-] | | > I suspect nobody is doing real student teacher distillation It gets used for quantisation, basically recovering accuracy for lower quants (Nvidia calls it QAD). Can’t speak to how widespread it is though | |
| ▲ | ACCount37 an hour ago | parent | prev | next [-] | | A reason to do student-teacher distillation is that soft target logits in general are a richer medium than text that tokenizes to hard targets. More steering signal per teacher token. And running ultra large 10T tier models in autoregressive generation mode can get expensive. So there are reasons not to reduce to text only synthetics. | |
| ▲ | thisisaman408 an hour ago | parent | prev [-] | | [dead] |
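A minimal sketch of the "soft target logits are a richer medium" point above, assuming a toy 4-token vocabulary with made-up logits. The function names are hypothetical, not taken from any lab's pipeline. It contrasts a hard-target cross-entropy (what training on teacher-generated text gives you) with the temperature-scaled KL loss from classic student-teacher distillation.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_target_loss(student_logits, target_index):
    """Cross-entropy against one sampled token (text-only synthetic data)."""
    probs = softmax(student_logits)
    return -math.log(probs[target_index])

def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over the whole vocabulary at a shared temperature."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # Conventional T^2 factor keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2

# Hypothetical numbers: the teacher thinks tokens 0 and 1 are both plausible.
teacher_logits = [4.0, 3.5, 0.5, -2.0]
student_logits = [2.0, 0.0, 0.0, 0.0]
sampled_token = 0  # the only thing that survives if the teacher output is reduced to text

print("hard-target loss:", hard_target_loss(student_logits, sampled_token))
print("soft-target loss:", soft_target_loss(student_logits, teacher_logits))
```

The hard-target loss only tells the student that token 0 was chosen. The soft-target loss also penalises the student for ranking token 1 no higher than the implausible tokens, which is the extra steering signal per teacher token.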
|
|
| ▲ | spwa4 5 hours ago | parent | prev | next [-] |
| > I don't disagree, but how much of this ends up being distillation? A lot, so you can bet tens of millions are flowing to Congress to have distillation declared illegal before this happens. And then it'll happen anyway. |
| |
| ▲ | lambda 5 hours ago | parent [-] | | Distillation isn't only between different labs. A lab can train a large model, and then distill a smaller model from it that retains the majority of the useful capability. I don't know enough to say whether there's any benefit of that over just training the smaller model directly, but I'll bet there are some times where that is useful. I could easily see it being easier to do the initial pre-training on a larger model but be able to distill everything useful down into a smaller model, essentially filtering out a lot of noise in the process. | | |
| ▲ | spwa4 5 hours ago | parent [-] | | There used to be training methods like that but I think they've been phased out in favor of letting small models evolve by rewriting their own training material. Surprisingly that's actually cheaper. |
|
|
|
| ▲ | onlyrealcuzzo 6 hours ago | parent | prev [-] |
| > I don't disagree, but how much of this ends up being distillation? You don't need distillation. They already have the training sets. It's MLA + MoE + Medusa (a better version of Speculative Decoding) + 1.58b (possibly - maybe nothing) + GRAM (which will almost certainly not turn out to be a nothing burger, but no one has quickly turned this around yet to prove it). |
| |
| ▲ | semiquaver 4 hours ago | parent | next [-] | | The frontier labs distill their own base models all day long. It’s not just something done by nefarious Chinese copycats. The knowledge embodied by the internal base models that we never see is much more powerful and useful than the much sparser raw training data | | |
| ▲ | coldtea 4 hours ago | parent | next [-] | | >It’s not just something done by nefarious Chinese copycats And even that would be rich as an accusation from SOTA labs that depend on explicitly disregarding the intellectual property behind millions of pieces of training data. | |
| ▲ | manmal 4 hours ago | parent | prev | next [-] | | But how? The training data is the unadulterated content those models are based on? I genuinely don’t understand, no snark. | |
| ▲ | supern0va 4 hours ago | parent | prev [-] | | I think you replied to the wrong parent. |
| |
| ▲ | Philpax 5 hours ago | parent | prev | next [-] | | It wouldn't be data distillation: instead, it would be teacher-student distillation. The teacher model has stronger representations that the student can mimic, which would give it more capability over training on the data itself. | |
| ▲ | minimaltom 5 hours ago | parent | prev [-] | | Frontier labs have their own variants of MLA and certainly their own balance/scaling-laws for things like MoE vs FC vs Attn. MoE scales really well for inference with horizontal scaling + batching, which these guys luv. On the architectures side, I'm a lot more interested in attention residuals than anything else; it's one of those things that seems obvious in hindsight, and Kimi have proven it at scale. | | |
| ▲ | onlyrealcuzzo 5 hours ago | parent [-] | | > Frontier labs have their own variants of MLA Yes, variants typically 2-3x less good... Same with speculative decoding... They all do something, but there are known techniques that are substantially better - that just weren't known when they started development of the previous models. | | |
| ▲ | amluto 4 hours ago | parent [-] | | How useful is speculative decoding in a batched setting where you get paid for throughput (aggregated across users) and you mostly don’t get paid for latency or single-session throughput? | | |
| ▲ | onlyrealcuzzo 4 hours ago | parent [-] | | It's useful at the local level, where there will be SOTA models developed... | | |
| ▲ | zozbot234 3 hours ago | parent [-] | | Local models are moving towards batched inference too, if only for non-interactive use. An early experimental patchset for DS4 (running DeepSeek V4 Flash) seems to show 2x aggregate tok/s decode when processing 8 streams concurrently, and more than 3x when processing as many as 32 streams concurrently. Note that prefill (which is not helped significantly by this change) then becomes a larger fraction of total wall-clock time, so the overall gain is lower (i.e. prefill is akin to a 'serial' task wrt. Amdahl's law). MTP will still be highly valuable for interactive use of course. |
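A rough worked example of the Amdahl's law remark above, assuming prefill gets no speedup at all (the comment only says it is "not helped significantly"). The decode speedups are the 2x and 3x figures quoted for 8 and 32 streams, with 3x used as a lower bound. The prefill fractions are hypothetical, chosen only to show the shape of the effect.

```python
def overall_speedup(prefill_fraction, decode_speedup):
    """Amdahl's law with prefill as the unaccelerated part of baseline wall-clock time."""
    if not 0.0 <= prefill_fraction <= 1.0:
        raise ValueError("prefill_fraction must be between 0 and 1")
    if decode_speedup <= 0.0:
        raise ValueError("decode_speedup must be positive")
    decode_fraction = 1.0 - prefill_fraction
    return 1.0 / (prefill_fraction + decode_fraction / decode_speedup)

# Decode speedups from the comment above; prefill fractions are made up.
quoted_decode_speedups = {8: 2.0, 32: 3.0}
for prefill_fraction in (0.1, 0.3):
    for streams, speedup in quoted_decode_speedups.items():
        total = overall_speedup(prefill_fraction, speedup)
        print(f"prefill={prefill_fraction:.0%} streams={streams}: "
              f"decode {speedup:.1f}x -> overall {total:.2f}x")
```

The larger the share of baseline time spent in prefill, the further the overall gain falls below the decode-only figure, which is why long-prompt workloads see less benefit from batched decode.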
|
|
|
|
|