Remix.run Logo
CGMthrowaway 4 days ago

This I've read before and it was very helpful. It's probably where most of my understanding comes from.

If I'm interpreting it correctly, it sort of validates my intuition that attention heads are "multi-threaded markov chain models" , in other words if autocomplete just looks at level 1, a transformer looks at level 1 for every word in the input plus many layers deeper for every word (or token) in the input.. while bringing a huge pre-training dataset to bear.

If that's correct more or less, something that surprises me is how attention is often treated as some kind of "breakthrough" - it seems obvious to me that improving a markov chain recommendation would involve going deeper and dimensionalizing the context in a deeper way.. the technique appears the same just the amount of analysis is more. I'm not sure what I'm missing here. Perhaps adding those extra layers was a hard problem thta we hadnt figured out how to efficiently do yet (?)

photon_lines 3 days ago | parent [-]

So I posted this conversation between Ilya Sutskever (one of the creators of ChatGPT) and Lex Fridman within that blog post and I'll provide it again below because I think it does a good job of summarizing what exactly 'makes transformers work':

  Ilya Sutskever: Yeah, so the thing is the transformer is a combination of multiple ideas simultaneously of which attention is one.

  Lex Friedman: Do you think attention is the key?

  Ilya Sutskever: No, it's a key, but it's not the key. The transformer is successful because it is the simultaneous combination of multiple ideas. And if you were to remove either idea, it would be much less successful. So the transformer uses a lot of attention, but attention existed for a few years. So that can't be the main innovation. The transformer is designed in such a way that it runs really fast on the GPU. And that makes a huge amount of difference. This is one thing. The second thing is that transformer is not recurrent. And that is really important too, because it is more shallow and therefore much easier to optimize. So in other words, it uses attention, it is a really great fit to the GPU and it is not recurrent, so therefore less deep and easier to optimize. And the combination of those factors make it successful.
I'm not sure if the above answers your question, but I tend to think of transformers more-of as 'associative' networks (similar to humans) -- they miss many of the components which actually makes humans human (like imitation learning and consciousness (we still don't know what consciousness actually is)) but for the most part, the general architecture and the way they 'learn' I believe mimics a process similar to how regular humans learn: neurons the fire together, wire together (i.e. associative learning). This is what a huge large-language model is to me: a giant auto-associative network that can comprehend and organize information.