I was really talking about the Transformer specifically.

Maybe there was an implicit hope of a better/larger language model leading to new intelligent capabilities, but I've never seen the Transformer designers say they were targeting this or expecting any significant new capabilities even (to their credit) after it was already apparent how capable it was. Neither Google's initial fumbling of the tech or Shazeer's entertainment chatbot foray seem to indicate that they had been targeting, and/or realized they had achieved, a more significant advance than the more efficient seq-2-seq model which had been their proximate goal.

To me it seems that the Transformer is really one of industry/science's great accidental discoveries. I don't think it's just the ability to scale that made it so powerful, but more the specifics of the architecture, including the emergent ability to learn "induction heads" which seem core to a lot of what they can do.

The Transformer precursors I had in mind were recent ones, in particular Sutskever et als "Sequence to Sequence learning with Neural Networks [LSTM]" from 2014, and Bahdanau et als "Jointly learning to align & translate" from 2016, then followed by the "Attention is all you need" Transformer paper in 2017.

▲

DavidSJ a day ago | parent [-]

Circling back to the original topic: at the end of the day, whether it makes sense to expect more brain-like behavior out of transformers than "mere" token prediction does not depend much on what the transformer's original creators thought, but rather on the strength of the collective arguments and evidence that have been brought to bear on the question, regardless of who from.

I think there has been a strong case that the "stochastic parrot" model sells language models short, but to what extent still seems to me an open question.

	▲	HarHarVeryFunny 17 hours ago \| parent [-]
		I'd say that whether to expect more brain-like capabilities out of Transformers is more an objective matter of architecture - what's missing - and learning algorithms, not "collective arguments". If a Transformer simply can't do something - has no mechanism to support it (e.g. learn at run time), then it can't do it, regardless of whether Sam Altman tells you it can, or tries to spin it as unimportant! A Transformer is just a fixed size stack of transformer layers, with one-way data flow through this stack. It has no internal looping, no internal memory, no way to incrementally learn at runtime, no autonomy/curiosity/etc to cause it to explore and actively expose itself to learning situations (assuming it could learn, which it anyways can't), etc! These are just some of the most obvious major gaps between the Transformer architecture and even the most stripped down cognitive architecture (vs language model) one might design, let alone an actual human brain which has a lot more moving parts and complexity to it. The whole Transformer journey has been fascinating to watch, and highly informative as to how far language and auto-regressive prediction can take you, but without things like incremental learning and the drive to learn, all you have is a huge, but fixed, repository of "knowledge" (language stats), so you are in effect building a giant expert system. It may be highly capable and sufficient for some tasks, but this is not AGI - it's not something that could replace an intern and learn on the job, or make independent discoveries outside of what is already deducible from what is in the training data. One of the really major gaps between an LLM and something capable of learning about the world isn't even the architecture with all it's limitations, but just the way they are trained. A human (and other intelligent animals) also learns by prediction, but the feedback loop when the prediction is wrong is essential - this is how you learn, and WHAT you can learn from incorrect predictions is limited by the feedback you receive. In the case of a human/animal the feedback comes from the real world, so what you are able to learn critically includes things like how your own actions affect the world - you learn how to be able to DO things. An LLM also learns by prediction, but what it is predicting isn't real world responses to it's own actions, but instead just input continuations. It is being trained to be a passive observer of other people's "actions" (limited to the word sequences they generate) - to predict what they will do (say) next, as opposed to being an active entity that learns not to predict someone else's actions, but to predict it's own actions and real-world responses - how to DO things itself (learn on the job, etc, etc).