photon_lines | 5 days ago
If you want my 'intuitive' explanation of how transformers work, you can find it here (if you're a visual learner, I think you'll like this one), though it is a bit long: https://photonlines.substack.com/p/intuitive-and-visual-guid...
CGMthrowaway | 4 days ago | parent
I've read this before and it was very helpful; it's probably where most of my understanding comes from. If I'm interpreting it correctly, it sort of validates my intuition that attention heads are "multi-threaded Markov chain models": where autocomplete only looks one level deep at the previous word, a transformer looks at every word (or token) in the input, and many layers deeper for each one, while bringing a huge pre-training dataset to bear.

If that's more or less correct, something that surprises me is how attention is often treated as some kind of "breakthrough." It seems obvious to me that improving a Markov chain recommendation would involve going deeper and dimensionalizing the context in a richer way; the technique appears the same, just with more analysis. I'm not sure what I'm missing here. Perhaps adding those extra layers was a hard problem that we hadn't yet figured out how to do efficiently?
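To make the contrast concrete, here is a minimal, purely illustrative sketch (all names and toy data are my own, not from the linked post): a bigram Markov model conditions on exactly one previous token via a lookup table, while scaled dot-product attention computes a learned, content-dependent weight over every token in the input. The difference isn't just "more levels"; attention's weights are computed from the content of the tokens themselves rather than read from a fixed table.

```python
import math

# Toy bigram Markov model: the next token depends ONLY on the previous token,
# via a static count table. No other context is consulted.
bigram_counts = {
    "the": {"cat": 3, "dog": 1},
    "cat": {"sat": 2, "ran": 1},
}

def markov_next(prev_token):
    """Most frequent continuation of the single previous token."""
    options = bigram_counts[prev_token]
    return max(options, key=options.get)

# Scaled dot-product attention: a weight is computed for EVERY position in the
# input, from the content of (learned) query/key vectors, then normalized with
# a softmax. The output would be a weighted mix of all the value vectors.
def attention_weights(query, keys):
    """Softmax over dot products of the query with every key, scaled by sqrt(d)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(markov_next("the"))  # consults exactly one token of context
print(attention_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]))
```

The Markov lookup is frozen at training time; the attention weights are recomputed for every input, which is why the same head can route information differently depending on what the surrounding tokens actually say.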