charcircuit 2 days ago

>the acceptance that there are emergent properties which appear out of nowhere is another way of saying our scaling laws don’t actually equip us to know what is coming.

Is this actually accepted? Ever since [0], I thought people recognized that they don't appear out of nowhere.

[0] https://arxiv.org/pdf/2304.15004

gwern 2 days ago | parent | next

> I thought people recognized that they don't appear out of nowhere.

I don't think that paper is widely accepted. Have you seen the authors of that paper, or anyone else, use it to successfully predict (rather than postdict) anything?

charcircuit a day ago | parent

I haven't paid attention, and the paper seems to be arguing against the existence of the phenomenon of emergent behavior and is not related to predicting what is possible with greater scale.

gwern 16 hours ago | parent

> is not related to predicting what is possible with greater scale.

If they can't predict new emergence, then 'explaining' old emergence by post hoc prediction with bizarre newly-invented metrics would seem to be irrelevant and just epicycles. You can always bend a line as you wish in curve-fitting by adding some parameters.
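
To make the curve-fitting point concrete, here is a toy sketch (made-up numbers, plain numpy; my own example, not from the paper): once the polynomial has as many parameters as there are points, it passes through them exactly, so a good retrospective fit by itself is not evidence of anything.

  import numpy as np

  rng = np.random.default_rng(0)
  x = np.linspace(0, 1, 8)            # eight "observed" scales (arbitrary)
  y = rng.normal(size=8)              # arbitrary "benchmark scores"

  for degree in (1, 3, 7):            # a degree-7 polynomial has 8 parameters
      coeffs = np.polyfit(x, y, degree)
      residual = float(np.sum((y - np.polyval(coeffs, x)) ** 2))
      print(degree, residual)         # residual falls to ~0 once params == points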

red75prime 2 days ago | parent | prev | next

"Appear out of nowhere" looks like a straw-man. Anyway, there are newer papers. For example "Emergent Abilities in Large Language Models: A Survey"[0]

[0] https://arxiv.org/abs/2503.05788

Zigurd 2 days ago | parent

I was struck by this in the abstract:

  The scaling of these models, accomplished by increasing the number of parameters and the magnitude of the training datasets, has been linked to various so-called emergent abilities that were previously unobserved. These emergent abilities, ranging from advanced reasoning and in-context learning to coding and problem-solving...

In my experience with agent-assisted coding, how well it works seems very closely tied to the quantity and quality of the training material. Coding also has some identifiable qualities, like verifiability, that make it a particularly good target for an LLM. I would not call that surprising or emergent.
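
To gesture at what I mean by verifiability, here is a hypothetical sketch (pytest and the patch helpers are assumptions, not any particular tool's API): the model's output can be checked mechanically, which gives an objective accept/reject signal that most tasks don't have.

  import subprocess

  def tests_pass() -> bool:
      # Assumes the repo has a pytest suite; purely illustrative.
      return subprocess.run(["pytest", "-q"]).returncode == 0

  def agent_loop(propose_patch, apply_patch, revert_patch, max_tries=5):
      # propose_patch / apply_patch / revert_patch are hypothetical stand-ins
      # for an LLM call and some version-control plumbing.
      for _ in range(max_tries):
          patch = propose_patch()
          apply_patch(patch)
          if tests_pass():            # cheap, objective signal to select on
              return patch
          revert_patch(patch)
      return None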

godelski a day ago | parent | prev

While I'm a big fan of that paper, as an ML researcher I can confidently state that it is not well accepted. I can also confidently state that it is not well known.

I think there is a critical flaw in the paper, though not from a technical standpoint so much as a reviewer's standpoint: they don't bridge the gap to the final step of transitioning to a hard loss. You can easily experiment with this yourself, even on smaller models and datasets, and it is pretty effective. The logic is straightforward enough that this step isn't actually necessary to prove their point, which I suspect is why they didn't do it. But most ML people are hyperfixated on benchmarks and empirical evidence. Hell, that's why we kinda killed small-scale research. It isn't technically wrong to ask for more scale and more datasets, but these questions are unbounded and can be leaned on too heavily as a crutch.
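
If you want to see the hard-vs-soft metric point without training anything, here is a toy numerical sketch (every number is made up): a per-token metric that improves smoothly with scale turns into an abrupt-looking jump once you score it with an all-or-nothing metric such as exact match.

  import numpy as np

  params = np.logspace(7, 11, 9)               # pretend model sizes
  per_token = np.linspace(0.85, 0.995, 9)      # smooth, made-up soft metric
  exact_match = per_token ** 20                # hard metric: all 20 answer tokens right

  for n, soft, hard in zip(params, per_token, exact_match):
      print(f"{n:9.0e}  per-token={soft:.3f}  exact-match={hard:.3f}")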

FWIW, I also think the original Emergent Abilities paper has a critical flaw. Look at their definition (emphasis my own):

  Specifically, we define emergent abilities of large language models as abilities that are not present in smaller-scale models but are present in large-scale models; ***thus they cannot be predicted by simply extrapolating the performance improvements on smaller-scale models.***
Certainly the mirage paper counters this. Most critiques I've heard are about hard loss vs. soft loss, but that isn't what's important. What I think most people don't realize is how the loss landscape actually works. The reason I like the mirage paper so much is that it is really saying the loss landscape is partially defined by the number of model parameters (something we already knew, btw).

But I also don't know why we've accepted this definition of emergent abilities. It isn't useful.

Without their explicit caveat about extrapolation, we'd call nearly every model emergent. Here's my proof: for any given model there is almost surely a smaller model that performs worse. Dumb, but that's the problem with the definition. And with their caveat, we run into the problem of concluding things are emergent simply because we didn't realize we were doing things a certain way.

And using the more classic definition of emergence [0,1], which distinguishes between strong and weak emergence, we should recognize that all neural nets are by definition weakly emergent. (Emphasis from [0].)

  A high-level phenomenon is *strongly emergent* with respect to a low-level domain when the high-level phenomenon arises from the low-level domain, but truths concerning that phenomenon are not *deducible* even in principle from truths in the low-level domain.

  A high-level phenomenon is *weakly emergent* with respect to a low-level domain when the high-level phenomenon arises from the low-level domain, but truths concerning that phenomenon are *unexpected* given the principles governing the low-level domain.
In physics we have plenty of examples of weakly emergent phenomena and no examples of strongly emergent phenomena, though we do have some suspects. Clearly neural nets (and arguably even GLMs and many other techniques) follow this, especially as we have no formal theory. But that's also why physics only has suspects rather than confirmed cases. Weak emergence is not surprising in a neural network setting, and I don't think discussion of it is generally productive.

But strong emergence requires a very difficult proof. We must show not merely that we don't know how to deduce the results, but that we cannot deduce them; there must be a process that causes an unrecoverable loss of information. I think everyone should be quite suspicious of any claims of strong emergence when it comes to AI. I mean... we have the weights... so the results are de facto deducible...
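
To be concrete about "deducible": a forward pass is nothing but explicit arithmetic over the weights. A toy two-layer net in numpy (made-up weights) makes the point; anything a real network does is just this, unrolled across billions of parameters, so the outputs are derivable in principle even when doing so is hopeless in practice.

  import numpy as np

  rng = np.random.default_rng(0)
  W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # made-up weights
  W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)

  def forward(x):
      h = np.maximum(x @ W1 + b1, 0.0)   # ReLU layer: explicit arithmetic
      return h @ W2 + b2                 # same weights + same input -> same output

  print(forward(np.ones(4)))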

So I don't know why we talk about emergence the way we do in ML. I frequently hear people say things are emergent phenomena because they didn't realize they were teaching the model certain capabilities, but that doesn't mean someone else wouldn't be able to predict them (and boy, are there many "emergent phenomena" that ML people "can't" predict but a mathematician could).

[0] https://consc.net/papers/emergence.pdf

[1] https://arxiv.org/abs/2410.15468