CGMthrowaway | 4 days ago
I've read this before and it was very helpful; it's probably where most of my understanding comes from. If I'm interpreting it correctly, it sort of validates my intuition that attention heads are "multi-threaded Markov chain models": where autocomplete only looks one level deep, a transformer looks at that level for every word (or token) in the input, plus many layers deeper for every token, while bringing a huge pre-training dataset to bear. If that's more or less correct, something that surprises me is how attention is often treated as some kind of "breakthrough." It seems obvious to me that improving a Markov chain recommender would involve going deeper and dimensionalizing the context more richly; the technique appears the same, just with more analysis. I'm not sure what I'm missing here. Perhaps adding those extra layers was a hard problem that we hadn't figured out how to do efficiently yet?
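To make the comparison concrete, here's a toy NumPy sketch (my own illustration, not from the post) of the two mechanisms. The Markov model picks its next-token distribution with a single table lookup keyed on the current token, while attention computes content-dependent weights over every position in the context:

    import numpy as np

    # Order-1 Markov chain: the next-token distribution is a fixed row
    # of a transition table, keyed only on the current token.
    def markov_next(transitions, current_token):
        return transitions[current_token]  # rest of the context is ignored

    # Scaled dot-product attention: every position mixes information from
    # ALL positions, with weights computed from the content itself.
    def attention(Q, K, V):
        scores = Q @ K.T / np.sqrt(K.shape[-1])         # pairwise similarities
        scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax per position
        return weights @ V                              # content-weighted average

The part that's easy to undersell is that the attention weights are learned functions of the input (via the Q and K projections), not a bigger lookup table, and the whole thing runs in parallel over the sequence, which is a large part of what made "going deeper" trainable at scale.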
photon_lines | 3 days ago
So I posted this conversation between Ilya Sutskever (one of the creators of ChatGPT) and Lex Fridman within that blog post, and I'll provide it again below because I think it does a good job of summarizing what exactly makes transformers work:
I'm not sure if the above answers your question, but I tend to think of transformers more as 'associative' networks, similar to humans. They miss many of the components that actually make humans human, like imitation learning and consciousness (we still don't know what consciousness actually is), but for the most part I believe the general architecture and the way they 'learn' mimics a process similar to how regular humans learn: neurons that fire together, wire together (i.e. associative learning). That's what a huge large-language model is to me: a giant auto-associative network that can comprehend and organize information.
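As a rough illustration of the "fire together, wire together" framing (my own toy sketch; actual transformers are trained with backprop/SGD, not a Hebbian rule):

    import numpy as np

    # Toy Hebbian update: the weight between two units grows in
    # proportion to how often they are co-active.
    def hebbian_step(W, x, lr=0.1):
        return W + lr * np.outer(x, x)  # strengthen co-active connections

    W = np.zeros((4, 4))
    for pattern in [np.array([1.0, 1.0, 0.0, 0.0]),
                    np.array([0.0, 0.0, 1.0, 1.0])]:
        W = hebbian_step(W, pattern)

    # Presenting a partial cue recalls its associated partner:
    # unit 0 alone now activates unit 1 (and vice versa).
    print(W @ np.array([1.0, 0.0, 0.0, 0.0]))  # -> [0.1, 0.1, 0.0, 0.0]

That's the sense in which I mean 'auto-associative': a partial pattern retrieves the rest of what it was stored with.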