| ▲ | lkeskull 3 hours ago |
This example worked in 2021; it's 2026. Wake up. These models are not just "finding the most likely next word based on what they've seen on the internet".
|
| ▲ | strix_varius 3 hours ago | parent | next [-] |
Well, yes, definitionally they are doing exactly that. It just turns out that there's quite a bit of knowledge and understanding baked into the relationships of words to one another. LLMs are heavily influenced by preceding words, and it's very hard for them to backtrack on an earlier branch. This is why all the reasoning models use "stop phrases" like "wait", "however", "hold on...": it's literally just text injected to make the autocomplete more likely to revise previous bad branches.
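A minimal sketch of that idea, assuming the Hugging Face transformers API and gpt2 as a small stand-in (real reasoning models bake this into training and serving scaffolding that isn't public): generate a draft, then append a stop phrase so the continuation is conditioned on text that usually precedes a self-correction.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")          # illustrative stand-in model
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "Q: Is 91 prime? Think step by step.\nA:"
    ids = tok(prompt, return_tensors="pt").input_ids

    # First pass: let the model autocomplete a (possibly flawed) chain of thought.
    draft = model.generate(ids, max_new_tokens=60, do_sample=False)

    # Inject a stop phrase: the continuation is now conditioned on " Wait,", which in
    # ordinary text tends to precede a revision of whatever came before.
    wait = tok(" Wait,", return_tensors="pt").input_ids
    revised = model.generate(torch.cat([draft, wait], dim=-1), max_new_tokens=60, do_sample=False)

    print(tok.decode(revised[0], skip_special_tokens=True))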
|
| ▲ | jaccola 2 hours ago | parent | prev | next [-] |
The person above was being a bit pedantic and zealous in their anti-anthropomorphism. But the models are literally predicting the next token; they do nothing else. Also, if you think they were just predicting the next token in 2021, note that there has been no fundamental architecture change since then. All the gains have come from scale and efficiency optimisations (not to discount that; there's an awful lot of complexity in both).
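For concreteness, here is the entire "generation" loop, as a minimal sketch assuming the Hugging Face transformers library and gpt2 as a stand-in model; everything else is training and sampling detail layered on top of this.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")          # illustrative stand-in model
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ids = tok("The capital of France is", return_tensors="pt").input_ids
    with torch.no_grad():
        for _ in range(5):
            logits = model(ids).logits[:, -1, :]          # scores for the next token only
            next_id = logits.argmax(dim=-1, keepdim=True) # greedy: take the most likely token
            ids = torch.cat([ids, next_id], dim=-1)       # append it and repeat
    print(tok.decode(ids[0]))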
| |
| ▲ | nearbuy 2 hours ago | parent [-] |
That's not what they said. They said:

> It's evaluation function simply returned the word "Most" as being the most likely first word in similar sentences it was trained on.

Which is false under any reasonable interpretation. They do not just return the word most similar to what they would find in their training data. They apply reasoning and can choose words that are totally unlike anything in their training data. If you prompt it:

> Complete this sentence in an unexpected way: Mary had a little...

it won't say "lamb". And if you think whatever it says was in the training data, just change the constraints until you're confident it's original (e.g. tell it every word must start with a vowel and it should mention almonds).

"Predicting the next token" is also true but misleading. It's predicting tokens in the same sense that your brain is just minimizing prediction error under predictive coding theory.
| ▲ | hansmayer an hour ago | parent [-] |
You are actually proving my point with your example, if you think about it a bit more.
|
|
|
| ▲ | csomar 2 hours ago | parent | prev [-] |
Unless the LLM architecture has changed, that is exactly what they are doing. You might need to learn more about how LLMs work.
| |
| ▲ | andy12_ an hour ago | parent [-] |
Unless the LLM is a base model or just a finetuned base model, it definitely doesn't predict words based only on how likely they are in similar sentences it was trained on. Reinforcement learning is a thing, and all models nowadays are extensively trained with it. If anything, they predict words based on a heuristic ensemble of what word is most likely to come next in similar sentences and what word is most likely to lead to a higher final reward.
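Roughly, and as a deliberately oversimplified sketch (REINFORCE-style, no KL penalty or PPO/GRPO machinery; gpt2 and the reward are toy placeholders), the second term enters through updates like this:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")            # toy stand-in model
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

    prompt_ids = tok("Q: What is 2+2?\nA:", return_tensors="pt").input_ids

    def toy_reward(text: str) -> float:
        return 1.0 if "4" in text else -1.0                # placeholder reward signal

    for _ in range(3):                                      # a few sampled rollouts
        out = model.generate(prompt_ids, max_new_tokens=8, do_sample=True, top_k=50)
        reward = toy_reward(tok.decode(out[0, prompt_ids.shape[1]:]))

        # Log-probability the model assigns to the completion it just sampled.
        logprobs = torch.log_softmax(model(out).logits[:, :-1, :], dim=-1)
        token_lp = logprobs.gather(2, out[:, 1:].unsqueeze(-1)).squeeze(-1)
        completion_lp = token_lp[:, prompt_ids.shape[1] - 1:].sum()

        # Pretraining pushes up "the word that came next in the text";
        # this pushes up "the words that ended in a higher reward".
        loss = -reward * completion_lp
        opt.zero_grad()
        loss.backward()
        opt.step()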
| ▲ | csomar 6 minutes ago | parent | next [-] |
> If anything, they predict words based on a heuristic ensemble of what word is most likely to come next in similar sentences and what word is most likely to lead to a higher final reward.

So... "finding the most likely next word based on what they've seen on the internet"?
| ▲ | hansmayer 43 minutes ago | parent | prev [-] |
You know that when A. Karpathy released nanochat, he said it was mainly coded by hand, as the LLMs were not helpful because "the training dataset was way off". So yeah, your argument actually "reinforces" my point.
| ▲ | andy12_ 34 minutes ago | parent [-] |
No, your opinion is wrong, because the reason some models don't seem to have a "strong opinion" on anything is not that they predict words based on how similar they are to other sentences in the training data. It's most likely related to how the model was trained with reinforcement learning, and more specifically to recent efforts by OpenAI to reduce hallucination rates by penalizing guessing under uncertainty [1].

[1] https://cdn.openai.com/pdf/d04913be-3f6f-4d2b-b283-ff432ef4a...
| ▲ | hansmayer 21 minutes ago | parent [-] |
Well, you do understand that the "penalising", or as the ML community likes to call it, "adjusting the weights downwards", is part of setting up the evaluation functions for, gasp, calculating the next most likely tokens, or to be more precise, the tokens with the highest probability? You are effectively proving my point, perhaps in a somewhat hand-wavy fashion that can nevertheless still be translated into technical language.
| ▲ | andy12_ 2 minutes ago | parent [-] |
You do understand that the mechanism through which an auto-regressive transformer works (predicting one token at a time) is completely unrelated to how a model with that architecture behaves or how it's trained, right? You can have both:

- An LLM that works through completely different mechanisms, like predicting masked words, predicting the previous word, or predicting several words at a time.

- A normal, traditional program, like a calculator, encoded as an autoregressive transformer that calculates its output one word at a time [1][2].

So saying "it predicts the next word" is a nothing-burger. That a program calculates its output one token at a time tells you nothing about its behavior.

[1] https://arxiv.org/pdf/2106.06981
[2] https://wengsyx.github.io/NC/static/paper_iclr.pdf
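A toy illustration of that last point (my own sketch, not from the linked papers): an ordinary calculator whose only interface is emitting its answer one character at a time, in the same loop shape an LLM uses.

    # An ordinary calculator that happens to emit its answer "autoregressively",
    # one character at a time. The behaviour lives in eval(), not in the loop.
    def next_token(prompt: str) -> str:
        expr, _, partial = prompt.partition("=")
        answer = str(eval(expr))                 # toy only: eval() on a trusted expression
        return answer[len(partial)] if len(partial) < len(answer) else "<eos>"

    prompt = "12*(3+4)="
    while (ch := next_token(prompt)) != "<eos>":
        prompt += ch                             # same decode loop an LLM uses
    print(prompt)                                # 12*(3+4)=84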