cztomsik 2 days ago
I have no idea how it actually works (at Google), but I wouldn't be surprised if it was just post-training, because the RWKV people recently did something similar: they replaced the whole attention mechanism with WKV (forward-only linear attention) and created such a Frankenstein just by post-training. The big wow moment is that it sort of implies that most of the useful knowledge is in the FFN, and attention itself is not that unique/important. https://substack.recursal.ai/p/qwerky-72b-and-32b-training-l...

BTW: It could also be interesting to reuse already-trained attention and see how long the FFN alone takes in the GPT-2 speedrun (it would be against the rules, but still very interesting IMHO - definitely something I'd like to read a paper about): https://github.com/KellerJordan/modded-nanogpt

Also, I read yesterday that at some point, the embeddings across all of the models are (very) comparable/similar, and a simple converter can be trained. If both of these statements are true, maybe we could train everything much faster just by sharing fixed embeddings and attention.
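To make the "swap the mixer, keep the FFN" idea concrete, here is a toy PyTorch sketch under my own assumptions - it is not the actual RWKV/Qwerky recipe, and the LinearAttention feature map, the module names, and the convert_block helper are all made up for illustration:

    # Toy sketch: replace a pretrained block's attention with a fresh
    # forward-only linear-attention mixer, then post-train only the new mixer.
    # Assumption-heavy: this is NOT the RWKV/Qwerky code, just the general shape.
    import torch
    import torch.nn as nn

    class LinearAttention(nn.Module):
        """Causal linear attention: a running (key x value) state instead of T x T scores."""
        def __init__(self, d_model):
            super().__init__()
            self.q = nn.Linear(d_model, d_model, bias=False)
            self.k = nn.Linear(d_model, d_model, bias=False)
            self.v = nn.Linear(d_model, d_model, bias=False)

        def forward(self, x):                      # x: (B, T, D)
            q = torch.relu(self.q(x))              # simple positive feature map
            k = torch.relu(self.k(x))
            v = self.v(x)
            # prefix sums play the role of the recurrent state (memory-hungry but clear)
            kv = torch.cumsum(k.unsqueeze(-1) * v.unsqueeze(-2), dim=1)  # (B, T, D, D)
            z = torch.cumsum(k, dim=1)                                   # (B, T, D)
            num = torch.einsum('btd,btde->bte', q, kv)
            den = (q * z).sum(-1, keepdim=True) + 1e-6
            return num / den

    def convert_block(block, d_model):
        """Swap the block's attention module, freeze everything else (FFN, norms, embeddings)."""
        block.attn = LinearAttention(d_model)      # the 'attn' attribute name is an assumption
        for name, p in block.named_parameters():
            p.requires_grad = name.startswith('attn')  # post-train only the new mixer
        return block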
spwa4 2 days ago
Ever notice that attention is (with the highest respect to the original researchers) "just" feeding the entire past of the network into a reverse-MoE neural network? (Meaning the expert selection picks parts of the input instead of parts of the network to execute.) In a way, everyone knew this would work. Nobody did it because it's so inefficient that even R and Python users thought it would be ridiculously slow (or they simply couldn't run it enough to train it to a reasonable extent).
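To make the analogy concrete, a single attention head really is a softmax "router" over the past tokens, mixing their value vectors; a minimal NumPy sketch (single head, causal mask, illustrative names):

    # Toy single-head causal attention, written to highlight the "routing" view:
    # for every position, a softmax over past tokens decides how much of each
    # token's value to mix in.
    import numpy as np

    def causal_attention(X, Wq, Wk, Wv):           # X: (T, d_in), W*: (d_in, d)
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)              # (T, T) token-vs-token scores
        scores = np.where(np.tri(len(X), dtype=bool), scores, -1e9)  # hide the future
        gates = np.exp(scores - scores.max(-1, keepdims=True))
        gates /= gates.sum(-1, keepdims=True)      # softmax "router" over past tokens
        return gates @ V                           # weighted mix of the selected inputs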
scotty79 2 days ago
Attention is just a completely arbitrary way to split the network so that learning can be parallelized. What contributed more towards the success, in my opinion, are the "shortcut connections" through layers, which give the early layers more influence during learning.
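For reference, the "shortcut connections" in question are the residual paths around each sublayer; a minimal PyTorch sketch of the pattern (illustrative names, not any particular model):

    # Minimal residual ("shortcut") sublayer: the identity path means gradients
    # reach early layers directly, without passing through every transformation.
    import torch.nn as nn

    class ResidualFFN(nn.Module):
        def __init__(self, d_model):
            super().__init__()
            self.norm = nn.LayerNorm(d_model)
            self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                    nn.Linear(4 * d_model, d_model))

        def forward(self, x):
            return x + self.ff(self.norm(x))       # x + f(x): the shortcut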
jonahx a day ago
So is the famous "Attention is all you need" wrong? | |||||||||||||||||
cubefox 2 days ago
> Also, I read yesterday that at some point, the embeddings across all of the models are (very) comparable/similar, and a simple converter can be trained

That was from here: https://news.ycombinator.com/item?id=44054425
slickytail 2 days ago
The relative unimportance of the exact SDPA attention used in modern transformers is already known: https://arxiv.org/abs/2111.11418

The FFN, normalization, and residual connections are absolutely irreplaceable -- but attention can be replaced with almost any other layer that shares information between tokens, such as pooling, convolution, or random mixing.
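A minimal sketch of that point, loosely in the spirit of the MetaFormer/PoolFormer paper linked above (not their exact implementation; names and dimensions are illustrative) - keep the block skeleton and swap the token mixer for plain average pooling:

    # Keep the skeleton (norm, residual, FFN) and make the token mixer something
    # dumb, here average pooling over neighbouring tokens.
    # Simplified and non-causal; not the paper's exact PoolFormer code.
    import torch.nn as nn

    class PoolMixer(nn.Module):
        def __init__(self, window=3):
            super().__init__()
            self.pool = nn.AvgPool1d(window, stride=1, padding=window // 2)

        def forward(self, x):                      # x: (B, T, D)
            mixed = self.pool(x.transpose(1, 2)).transpose(1, 2)
            return mixed - x                       # pooling minus identity, as in PoolFormer

    class MetaFormerBlock(nn.Module):
        def __init__(self, d_model):
            super().__init__()
            self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
            self.mixer = PoolMixer()               # could be SDPA, convolution, random mixing...
            self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                     nn.Linear(4 * d_model, d_model))

        def forward(self, x):
            x = x + self.mixer(self.norm1(x))      # token mixing
            return x + self.ffn(self.norm2(x))     # channel mixing (FFN)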