mohsen1 9 hours ago

It seems like we forget that LLMs are next token prediction systems. Using raw models without instruction following and chat completion bells and whistles will give you a better feeling of what LLMs are.

The current interfaces to LLMs are heavily biased towards "predict the next token in the context of a user with a helpful assistant", but LLMs are capable of other modes of next token prediction too.

Before the ChatGPT release, people often measured LLM performance by how well they could produce a coherent story or a poem. That's where Anthropic's model names originate, I am guessing.

post-it 9 hours ago | parent | next [-]

> It seems like we forget that LLMs are next token prediction systems.

It's pretty clear to me that above a certain size threshold, LLMs are more than a sum of their parts. The sheer amount of training data seems to embed a higher level of reasoning.

gmueckl 8 hours ago | parent [-]

There cannot be any reasoning embedded in the model. The algorithm is literally "predict the most likely next token". Anything beyond that is just patterns in the predictions fooling us humans into ascribing more to the system than it is actually producing.

thepasch 8 hours ago | parent | next [-]

> The algorithm is literally "predict the most likely next token".

That's confusing the training objective with the learned behavior. It's like saying "Stockfish's algorithm is literally 'minimize this number', and therefore, it can't actually play Chess."

gmueckl 8 hours ago | parent [-]

Not a valid comparison. Chess algorithms are built around the rules of chess, most notably the turn taking nature of the game (min/max with alpha/beta pruning based on lists of valid moves in any position).
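For readers who have not seen the search scheme named above, here is a minimal sketch of minimax with alpha-beta pruning over a list of valid moves. It uses single-pile Nim (take 1 to 3 stones, whoever takes the last stone wins) purely as an assumed toy game; it is not how Stockfish or any real chess engine is implemented, and the names `legal_moves`, `alphabeta` and `best_move` are hypothetical.

```python
import math
from typing import List


def legal_moves(stones: int) -> List[int]:
    """Valid moves in the toy game: take 1, 2 or 3 stones, never more than remain."""
    return [take for take in (1, 2, 3) if take <= stones]


def alphabeta(stones: int, maximizing: bool,
              alpha: float = -math.inf, beta: float = math.inf) -> float:
    """Return +1 if the maximizing player wins with best play, otherwise -1."""
    if stones == 0:
        # The player who just moved took the last stone and won,
        # so the side to move now has lost.
        return -1 if maximizing else 1

    if maximizing:
        best = -math.inf
        for take in legal_moves(stones):
            best = max(best, alphabeta(stones - take, False, alpha, beta))
            alpha = max(alpha, best)
            if alpha >= beta:
                break  # the minimizing player would never allow this line
        return best

    best = math.inf
    for take in legal_moves(stones):
        best = min(best, alphabeta(stones - take, True, alpha, beta))
        beta = min(beta, best)
        if alpha >= beta:
            break  # the maximizing player already has a better option elsewhere
    return best


def best_move(stones: int) -> int:
    """Pick the move whose resulting position is best for the player to move."""
    return max(legal_moves(stones),
               key=lambda take: alphabeta(stones - take, False))
```

The turn-taking structure shows up as the alternation of the `maximizing` flag, and the rules of the game enter only through `legal_moves` and the terminal check.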

etskinner 5 hours ago | parent [-]

Who's to say the "rules of reasoning" aren't just predicting the next thing that an intelligent person (you) would do? Emergent behavior isn't magic, it's just emergent.

linzhangrun 6 hours ago | parent | prev | next [-]

Quantity change leads to quality change. You can check out this Kurzgesagt video on emergence: https://www.youtube.com/watch?v=16W7c0mb-rE

aspenmartin 8 hours ago | parent | prev | next [-]

This is just a misconception of how LLMs work and also what reasoning is.

“There cannot be any reasoning embedded in the model” is a strong statement. What do you mean by reasoning? By any reasonable definition I’m aware of, they clearly are able to exhibit reasoning.

The fact that the pre-training objective is next token loss has nothing to do with capabilities or their ability to reason. To be highly successful at next token prediction you NEED to reason. I’m quite confused here.

gmueckl 8 hours ago | parent | next [-]

LLM output produces the illusion of reasoning. The underlying computation, however, is not reasoning.

aspenmartin 8 hours ago | parent | next [-]

If you don’t mind taking a few more words to be more specific, that would be helpful, because what you’re saying doesn’t really make sense at all. You don’t need to trust that the reasoning traces are all faithful representations of an internal reasoning trace. There are plenty of other ways to probe models (see Anthropic's work using circuit tracing).

gmueckl 8 hours ago | parent [-]

What else is there to say? LLMs can at most regurgitate approximations of human reasoning steps in the limited forms in which they may be expressed in the training data or interpolations thereof. That's the core essence of what they are. There is no proper reasoning to be found.

aspenmartin 8 hours ago | parent [-]

"at most" is wrong. RL with verifiable rewards takes you beyond the quality and skills represented in the training data. I'm not aware of meaningful fundamental limits here if you scale compute enough, even though right now it's highly sample inefficient.

Since you refuse to actually define what you consider to be reasoning let me at least put one out there: a system exhibits reasoning when an answer depends on nontrivial intermediate computation over the problem. If you find problems with this, fine, but just make an effort to contribute an alternative.

If you increase test-time compute you get better performance. If the model were just "interpolating", this wouldn't really work, would it? Models can do FrontierMath expert problems (unpublished, expert authored, peer reviewed math problems) that require an insane amount of compositional reasoning. If they were regurgitating training data, that wouldn't really work, would it? Chain of thought, while not always faithful to internal computation, improves performance. If the models were just regurgitating information, it wouldn't work that well, would it?

"regurgitating training data" is also of course misleading. Yeah, they can memorize parts of the training data, but they generalize very well.

applicative 3 hours ago | parent | next [-]

There is the obvious limit that human text output is limited. To this you can add the specific testable training that pertains to code, but this degrades the weights for more general communication. Somehow the hype over the successes with coding in the last year or so made everyone forget the intrinsic limit posed by the exhaustion of real human text output, which is absolutely inescapable.

thepasch 8 hours ago | parent | prev [-]

How do you define reasoning? What does a system have to functionally do in order to qualify for it?

gmueckl 8 hours ago | parent [-]

Reasoning includes things like proper use of logic. LLMs have been repeatedly shown to fail horribly at this.

They consistently fail at drawing basic logical conclusions because they cannot build a sufficiently abstract model of certain problems that allows them to grasp their true nature. Otherwise, the whole class of questions of the kind "how many r's in strawberry" or "do I take the car to the car wash?" would be answered correctly and reliably.

aspenmartin 8 hours ago | parent [-]

> Reasoning includes things like proper use of logic. LLMs have been repeatedly shown to fail horribly at this.

That models cannot do ALL logic problems does not mean that they cannot properly use logic... they can write Lean-verified theorems. How is that not logic?

> They consistently fail at drawing basic logical conclusions because they cannot build a sufficiently abstract model of certain problems that allows them to grasp their true nature.

What does their "grasp[ing] their true nature" have to do with what they can do?

> Otherwise, the whole class of questions of the kind "how many r's in strawberry" or "do I take the car to the car wash?" would be answered correctly and reliably.

Again, just because they have interesting failure modes or brittleness does not mean they do not reason.

gmueckl 7 hours ago | parent [-]

This is exactly backwards. The brittleness is because they emulate reasoning without actually algorithmically performing it.

Add.: I pointed to this class of problems specifically because they require the ability to abstract in a way that the question itself does not immediately suggest. Math problems are different in that they are described in terms of art that are closely related to certain patterns of manipulation (that is, the paper texts tend to contain both in close proximity to one another).

aspenmartin 6 hours ago | parent [-]

For you, a system needs to reason perfectly and flawlessly, all the time? So humans do not reason? Humans don't have brittle failure modes?

> they require the ability to abstract in a way that the question itself does not immediately suggest

Yes, yet there are multitudes of other measurements of the same kind where LLMs reason perfectly well, and in many cases better than a human could.

> Math problems are different in that they are described in terms of art that are closely related to certain patterns of manipulation (that is, the paper texts tend to contain both in close proximity to one another).

Is your logic really that math problems are actually easier to answer without reasoning and just by blending together closely related papers? I would definitely suggest reading the literature a bit more on this topic.

gmueckl 2 hours ago | parent [-]

Humans are not flawless, but they are much, much better at reasoning than LLMs are. LLMs can be made to fail quite reliably and easily because they cannot build proper manipulable/predictive models. This is related to the point that Yann LeCun makes when advocating for world models (for the physical world) with predictive power.

applicative 3 hours ago | parent | prev [-]

LLM output is a kind of dreaming, but with the whole of past human text output as dream material. It turns out to be useful if you can direct the hallucination.

wizzwizz4 8 hours ago | parent | prev [-]

LLMs are great big finite state machines. Finite state machines can perform mechanical reasoning. A priori, there can be reasoning embedded in the model. I agree that (these) LLMs don't generally reason (even when they're writing words like "I reason that, since X, we have Y, therefore Z"), but that's not because a model inherently cannot do that.

jhbadger 7 hours ago | parent | prev [-]

The problem with that argument is that it is trivial to write a Markov chain program that takes in text and then can generate the most probable series of words given a starting word. I myself wrote such a program in BASIC on a 64K 8-bit computer in the 1980s after reading one of A.K. Dewdney's columns. That wasn't at all an LLM though. There's a connection, sure, but it's like equating a paper airplane to a jet airliner.
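To make the comparison concrete, here is a rough sketch of the kind of program described above: a first-order (bigram) Markov chain that counts which word follows which, then greedily emits the most frequent follower. This is an illustrative Python reconstruction, not the original BASIC program; the function names `train_bigrams` and `most_probable_continuation` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def train_bigrams(text: str) -> Dict[str, Counter]:
    """Count, for every word, how often each other word directly follows it."""
    words = text.split()
    table: Dict[str, Counter] = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        table[prev][nxt] += 1
    return table


def most_probable_continuation(table: Dict[str, Counter],
                               start: str, length: int) -> List[str]:
    """Greedily extend `start` with the most frequent follower at each step."""
    out = [start]
    for _ in range(length - 1):
        followers = table.get(out[-1])
        if not followers:
            break  # the last word was never seen with a successor
        out.append(followers.most_common(1)[0][0])
    return out
```

Calling `most_probable_continuation(train_bigrams(corpus), "the", 10)` on any whitespace-separated `corpus` string yields up to ten words. The whole "model" is a lookup table keyed on the single previous word, which is the gap the paper airplane analogy points at.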

Charon77 7 hours ago | parent [-]

The issue with Markov chains is that you can't get good next token prediction on a long enough context: once you look at the last 1000 words instead of just 2, it's quite unlikely that your frequency table is populated for that exact combination. Markov chains also don't work on token embeddings, which allow some encoding of meaning.

AlecSchueler an hour ago | parent [-]

> Markov chains also don't work on token embeddings, which allow some encoding of meaning.

Working on an "encoding of meaning" sure sounds a lot like reasoning.
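The sparsity point raised a couple of comments up can be checked directly. The sketch below measures what fraction of contexts in held-out text were ever seen in the training text, for a given context length. It is a toy illustration under the assumption of simple whitespace-style tokens; the helper names `seen_contexts` and `context_coverage` are hypothetical.

```python
from typing import List, Set, Tuple


def seen_contexts(tokens: List[str], order: int) -> Set[Tuple[str, ...]]:
    """Every run of `order` tokens that is followed by at least one more token."""
    return {tuple(tokens[i - order:i]) for i in range(order, len(tokens))}


def context_coverage(train: List[str], test: List[str], order: int) -> float:
    """Fraction of prediction points in `test` whose context appears in `train`."""
    known = seen_contexts(train, order)
    contexts = [tuple(test[i - order:i]) for i in range(order, len(test))]
    if not contexts:
        return 0.0  # test text is too short for this context length
    return sum(context in known for context in contexts) / len(contexts)
```

Running `context_coverage` with growing `order` on a real corpus typically shows coverage falling off quickly, because a table keyed on exact token sequences has no way to treat two similar contexts as related. That is the role embeddings play in an LLM.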