HarHarVeryFunny 3 days ago

The article isn't about LLMs storing things - it's about why they hallucinate, which is in large part because they just deal in word statistics, not facts, but also (the point of the article) because they have no episodic memories, or any personal experience of any sort for that matter.

Humans can generally tell whether they know something or not, and I'd agree with the article that this is because we tend to remember how we know things, and also have different levels of confidence depending on the source. Personal experience trumps watching someone else, which trumps hearing or being taught it from a reliable source, which trumps having read something on Twitter or some graffiti on a bathroom stall. To the LLM all text is just statistics, and it has no personal experience to lean on to self-check and say "hmm, I can't recall ever learning that - I'm drawing a blank".

Frankly it's silly to compare LLMs (Transformers) and brains. An LLM was only ever meant to be a linguistics model, not a brain or cognitive architecture. I think people get confused because it spits out human text, so people anthropomorphize it and start thinking it's got some human-like capabilities under the hood when it is in fact - surprise surprise - just a pass-thru stack of Transformer layers. A language model.

ClaraForm 3 days ago | parent | next [-]

Hey, I know what the article wanted to say, see the last two-ish sentences of my previous response. My point is that the article might be misinterpreting the causes of, and solutions for, the problems it sees. Relying on the brain as an example of how to improve might be a mistaken premise, because maybe the brain isn't doing what the article thinks it's doing. So we're in agreement there, that the brain and LLMs are incomparable, but maybe the parts where they're comparable are more informative about the nature of hallucinations than the author may think.

n4r9 3 days ago | parent | next [-]

I think you can confidently say that brains do the following and LLMs don't:

* Continuously update their state based on sensory data

* Retrieve/gather information that correlates strongly with historic sensory input

* Are able to associate propositions with specific instances of historic sensory input

* Use the above two points to verify/validate their belief in said propositions

Describing how memories "feel" may confuse the matter, I agree. But I don't think we should be quick to dismiss the argument.
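As a toy illustration of the last two bullets (everything here, the EpisodicStore name and the confidence scaling included, is invented purely to make the idea concrete): a proposition gets checked against the specific remembered episodes that could back it up, and with no episode to point to, the honest answer is low confidence.

    from dataclasses import dataclass, field

    @dataclass
    class EpisodicStore:
        """Toy stand-in for episodic memory: propositions tagged with the
        specific episodes (sensory contexts) they were acquired in."""
        episodes: dict = field(default_factory=dict)   # proposition -> list of source episodes

        def record(self, proposition, episode):
            self.episodes.setdefault(proposition, []).append(episode)

        def confidence(self, proposition):
            # Belief is only as strong as the remembered episodes backing it;
            # no episode to point to means "I can't recall ever learning that".
            sources = self.episodes.get(proposition, [])
            return min(1.0, 0.5 * len(sources))        # arbitrary toy scaling

    memory = EpisodicStore()
    memory.record("the stove gets hot", "burned my hand on it as a kid")
    print(memory.confidence("the stove gets hot"))      # backed by an episode -> 0.5
    print(memory.confidence("quokkas are nocturnal"))   # no episode -> 0.0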

HarHarVeryFunny 3 days ago | parent | prev [-]

But the thing is that humans don't hallucinate as much as LLMs do, so it's the differences, not the similarities (such as they are), that matter for understanding why that is.

It's pretty obvious that an LLM not knowing what it does or does not know is a major part of it hallucinating, while humans do generally know the limits of their own knowledge.

DavidSJ 3 days ago | parent | prev [-]

> An LLM was only ever meant to be a linguistics model, not a brain or cognitive architecture.

See https://gwern.net/doc/cs/algorithm/information/compression/1... from 1999.

Answering questions in the Turing test (What are roses?) seems to require the same type of real-world knowledge that people use in predicting characters in a stream of natural language text (Roses are ___?), or equivalently, estimating L(x) [the probability of x when written by a human] for compression.
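To make the prediction/compression link concrete: under arithmetic coding, a model that assigns probability p to the character that actually comes next pays about -log2(p) bits for it, so a better predictor of "Roses are ___" literally compresses the text further. A toy sketch (the model and its probabilities are invented purely for illustration):

    import math

    def ideal_code_length(text, model):
        """Total bits an arithmetic coder would need if it coded each
        character with -log2 p(char | prefix) bits under `model`."""
        bits = 0.0
        for i, ch in enumerate(text):
            p = model(text[:i], ch)      # model's probability of the next character
            bits += -math.log2(p)
        return bits

    # Toy 'model': near-certain about the continuation of "Roses are ",
    # crude uniform fallback everywhere else.
    def toy_model(prefix, ch):
        if prefix.endswith("Roses are ") and ch == "r":   # expects "red"
            return 0.9
        return 1 / 27

    print(ideal_code_length("Roses are red", toy_model))  # fewer bits than a uniform coder would need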

HarHarVeryFunny 3 days ago | parent [-]

I'm not sure what your point is?

Perhaps in 1999 it seemed reasonable to think that passing the Turing Test, or maximally compressing/predicting human text, would make for a good AI/AGI test, but I'd say we now know better, and, more to the point, that does not appear to have been the motivation for designing the Transformer or the other language models that preceded it.

The recent history leading to the Transformer was the development of first RNN then LSTM-based language models, then the addition of attention, with the primary practical application being for machine translation (but more generally for any sequence-to-sequence mapping task). The motivation for the Transformer was to build a more efficient and scalable language model by using parallel processing, not sequential (RNN/LSTM), to take advantage of GPU/TPU acceleration.

The conceptual design of what would become the Transformer came from Google employee Jakob Uszkoreit, who has been interviewed about this - we don't need to guess the motivation. There were two key ideas, originating from the way linguists use syntax trees to represent the hierarchical/grammatical structure of a sentence.

1) Language is as much parallel as sequential, as can be seen by multiple independent branches of the syntax tree, which only join together at the next level up the tree

2) Language is hierarchical, as indicated by the multiple levels of a branching syntax tree

Put together, these two considerations suggest processing the entire sentence in parallel, taking advantage of GPU parallelism (not sequentially like an LSTM), and having multiple layers of such parallel processing to process the sentence hierarchically. This eventually led to the stack of parallel-processing Transformer layers, which did retain the successful idea of attention (hence the paper title "Attention is all you need [not RNNs/LSTMs]").
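Roughly, in code, the "parallel + hierarchical" idea looks like the following minimal sketch (it uses PyTorch's stock multi-head attention; the sizes are arbitrary and the feed-forward, normalization and positional parts of a real Transformer block are omitted): every token position is processed at once, and stacking layers gives the hierarchy.

    import torch
    import torch.nn as nn

    class ParallelHierarchicalEncoder(nn.Module):
        """Sketch of the two ideas above: (1) all positions are processed in
        parallel by self-attention, (2) stacked layers process the sentence
        hierarchically."""
        def __init__(self, d_model=64, n_heads=4, n_layers=3):
            super().__init__()
            self.layers = nn.ModuleList(
                nn.MultiheadAttention(d_model, n_heads, batch_first=True)
                for _ in range(n_layers)
            )

        def forward(self, x):
            # x: (batch, seq_len, d_model) - the whole sentence at once,
            # no step-by-step recurrence as in an LSTM.
            for attn in self.layers:
                out, _ = attn(x, x, x)   # self-attention: every token attends to every other
                x = x + out              # residual connection; feed-forward block omitted
            return x

    sentence = torch.randn(1, 7, 64)                       # 7 embedded tokens
    print(ParallelHierarchicalEncoder()(sentence).shape)   # torch.Size([1, 7, 64])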

As far as the functional capability of this new architecture, the initial goal was just to be as good as the LSTM + attention language models it aimed to replace (but more efficient to train and scale). The first realization of the "parallel + hierarchical" ideas by Uszkoreit was actually less capable than its predecessors, but then another Google employee, Noam Shazeer, got involved and eventually (after a process of experimentation and ablation) arrived at the Transformer design, which did perform well on the language modelling task.

Even at this stage, nobody was saying "if we scale this up it'll be AGI-like". It took multiple steps of scaling, from Google's early Muppet-themed BERT (following the LSTM-based ELMo), to OpenAI's GPT-1, GPT-2 and GPT-3, for there to be a growing realization of how good a next-word predictor, with corresponding capabilities, this architecture was when scaled up. You can read the early GPT papers and see the growing level of realization - they were not expecting it to be this capable.

Note also that when Shazeer left Google, disappointed that they were not making better use of his Transformer baby, he did not go off and form an AGI company - he went and created Character.ai, making fantasy-themed chatbots (similar to Google having experimented with chatbot use, then abandoning it, since, without OpenAI's innovation of RLHF, Transformer-based chatbots were unpredictable and a corporate liability).

DavidSJ 2 days ago | parent [-]

> I'm not sure what your point is?

I was just responding to this claim:

> An LLM was only ever meant to be a linguistics model, not a brain or cognitive architecture.

Plenty of people did in fact see a language model as a potential path towards intelligence, whatever might be said about the beliefs of Mr. Uszkoreit specifically.

There's some ambiguity as to whether you're talking about the transformer specifically, or language models generally. The "recent history" of RNNs and LSTMs you refer to dates back to before the paper I linked. I won't speak to the motivations or views of the specific authors of Vaswani et al, but there's a long history, both distant and recent, of drawing connections between information theory, compression, prediction, and intelligence, including in the context of language modeling.

HarHarVeryFunny 2 days ago | parent [-]

I was really talking about the Transformer specifically.

Maybe there was an implicit hope of a better/larger language model leading to new intelligent capabilities, but I've never seen the Transformer designers say they were targeting this, or expecting any significant new capabilities, even (to their credit) after it was already apparent how capable it was. Neither Google's initial fumbling of the tech nor Shazeer's entertainment chatbot foray seems to indicate that they had been targeting, and/or realized they had achieved, a more significant advance than the more efficient seq-2-seq model which had been their proximate goal.

To me it seems that the Transformer is really one of industry/science's great accidental discoveries. I don't think it's just the ability to scale that made it so powerful, but more the specifics of the architecture, including the emergent ability to learn "induction heads" which seem core to a lot of what they can do.
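What I mean by the "induction head" behavior is easy to sketch at the level of behavior rather than mechanism: given "... A B ... A", attend back to the earlier A and predict B. A toy, non-neural illustration (real induction heads learn this as an attention pattern inside the model; this just mimics the resulting prediction rule):

    def induction_head_prediction(tokens):
        """If the current (last) token occurred earlier, 'copy' whatever
        followed it the last time it appeared."""
        current = tokens[-1]
        # scan the prefix backwards for an earlier occurrence of the current token
        for i in range(len(tokens) - 2, -1, -1):
            if tokens[i] == current:
                return tokens[i + 1]   # predict the token that followed it before
        return None                    # no earlier occurrence -> no induction-style guess

    print(induction_head_prediction(["Mr", "Dursley", "was", "proud", "Mr"]))  # -> "Dursley"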

The Transformer precursors I had in mind were recent ones, in particular Sutskever et al.'s "Sequence to Sequence Learning with Neural Networks" (LSTM) from 2014, and Bahdanau et al.'s "jointly learning to align and translate" paper from 2014, then followed by the "Attention is all you need" Transformer paper in 2017.

DavidSJ a day ago | parent [-]

Circling back to the original topic: at the end of the day, whether it makes sense to expect more brain-like behavior out of transformers than "mere" token prediction doesn't depend much on what the transformer's original creators thought, but rather on the strength of the collective arguments and evidence that have been brought to bear on the question, regardless of who they come from.

I think there has been a strong case that the "stochastic parrot" model sells language models short, but to what extent still seems to me an open question.

HarHarVeryFunny 17 hours ago | parent [-]

I'd say that whether to expect more brain-like capabilities out of Transformers is more an objective matter of architecture - what's missing - and learning algorithms, not "collective arguments". If a Transformer simply can't do something - has no mechanism to support it (e.g. learn at run time), then it can't do it, regardless of whether Sam Altman tells you it can, or tries to spin it as unimportant!

A Transformer is just a fixed size stack of transformer layers, with one-way data flow through this stack. It has no internal looping, no internal memory, no way to incrementally learn at runtime, no autonomy/curiosity/etc to cause it to explore and actively expose itself to learning situations (assuming it could learn, which it anyways can't), etc!
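To put the same point in code (a schematic sketch, not any particular model's API; `model` here is just assumed to map token ids to per-position logits, and the dummy stand-in below is invented for illustration): at run time the stack is a frozen, one-way function, and the only thing that "accumulates" is the prompt it is fed.

    import torch
    import torch.nn as nn

    @torch.no_grad()                 # inference only: no gradients, so nothing can be learned
    def generate(model, tokens, steps):
        """One-way pass through the same frozen stack at every step.
        No weight updates, no persistent internal state between calls -
        only the growing token context it is handed back."""
        model.eval()                 # weights are frozen artifacts of pretraining
        for _ in range(steps):
            logits = model(tokens)                            # single forward pass, no looping inside
            next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
            tokens = torch.cat([tokens, next_token], dim=1)   # 'memory' lives only in the context window
        return tokens

    dummy = nn.Sequential(nn.Embedding(100, 16), nn.Linear(16, 100))  # stand-in for the frozen stack
    print(generate(dummy, torch.tensor([[1, 2, 3]]), steps=5))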

These are just some of the most obvious major gaps between the Transformer architecture and even the most stripped down cognitive architecture (vs language model) one might design, let alone an actual human brain which has a lot more moving parts and complexity to it.

The whole Transformer journey has been fascinating to watch, and highly informative as to how far language and auto-regressive prediction can take you, but without things like incremental learning and the drive to learn, all you have is a huge, but fixed, repository of "knowledge" (language stats), so you are in effect building a giant expert system. It may be highly capable and sufficient for some tasks, but this is not AGI - it's not something that could replace an intern and learn on the job, or make independent discoveries outside of what is already deducible from what is in the training data.

One of the really major gaps between an LLM and something capable of learning about the world isn't even the architecture with all its limitations, but just the way they are trained. A human (and other intelligent animals) also learns by prediction, but the feedback loop when the prediction is wrong is essential - this is how you learn, and WHAT you can learn from incorrect predictions is limited by the feedback you receive. In the case of a human/animal the feedback comes from the real world, so what you are able to learn critically includes things like how your own actions affect the world - you learn how to be able to DO things.

An LLM also learns by prediction, but what it is predicting isn't real-world responses to its own actions, but instead just input continuations. It is being trained to be a passive observer of other people's "actions" (limited to the word sequences they generate) - to predict what they will do (say) next, as opposed to being an active entity that learns not to predict someone else's actions but to predict its own actions and the real-world responses to them - how to DO things itself (learn on the job, etc, etc).
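A toy sketch of that contrast (all the names and numbers here are invented for illustration): the passive predictor's feedback signal is the next word in someone else's text, while the agent's feedback signal is what the world actually does in response to its own action.

    import random

    # (1) LLM-style learning: predict someone else's next word; the feedback
    #     is just the text itself (the correct continuation).
    def train_passive_predictor(corpus):
        counts = {}                                   # prefix word -> {next word: count}
        for prefix, nxt in corpus:
            counts.setdefault(prefix, {}).setdefault(nxt, 0)
            counts[prefix][nxt] += 1                  # update toward the observed continuation
        return counts

    # (2) Agent-style learning: act, then get feedback from the world about what
    #     the action actually did; what is learned is the effect of one's own actions.
    def train_active_agent(steps=100):
        effect_estimate = {"press_button": 0.0}       # believed chance the light turns on
        for _ in range(steps):
            action = "press_button"                   # the agent's own action
            world_response = random.random() < 0.7    # the (hidden) real dynamics answer back
            err = float(world_response) - effect_estimate[action]
            effect_estimate[action] += 0.1 * err      # learn from the prediction error
        return effect_estimate

    print(train_passive_predictor([("roses", "are"), ("are", "red"), ("roses", "are")]))
    print(train_active_agent())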