cztomsik 2 days ago

I have no idea how it actually works (in Google), but I wouldn't be surprised if it was just post-training, because the RWKV people recently did something similar: they replaced the whole attention mechanism with WKV (forward-only linear attention) and created such a Frankenstein just by post-training.

The big wow moment about that is that it sort of implies that most of the useful knowledge is in the FFN, and attention itself is not that unique/important.

https://substack.recursal.ai/p/qwerky-72b-and-32b-training-l...
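
For the curious, here is a minimal PyTorch sketch of that kind of Frankenstein surgery (my own illustration, not the actual QRWKV recipe): replace a pretrained block's attention module with a simple recurrent linear mixer, keep the FFN weights as they are, and post-train. The LinearMixer class and the .attn attribute name are assumptions, not taken from the RWKV code.

    import torch
    import torch.nn as nn

    class LinearMixer(nn.Module):
        """Hypothetical stand-in for WKV: a forward-only (recurrent) linear
        'attention' that mixes values with a learned per-channel decay."""
        def __init__(self, d_model):
            super().__init__()
            self.v = nn.Linear(d_model, d_model)
            self.decay = nn.Parameter(torch.zeros(d_model))  # per-channel decay

        def forward(self, x):                    # x: (batch, seq, d_model)
            v = self.v(x)
            w = torch.sigmoid(self.decay)        # decay in (0, 1)
            state = torch.zeros_like(v[:, 0])
            outs = []
            for t in range(x.size(1)):           # strictly causal, runs forward only
                state = w * state + (1 - w) * v[:, t]
                outs.append(state)
            return torch.stack(outs, dim=1)

    # Frankenstein step: keep the pretrained FFN, swap only the token mixer,
    # then fine-tune ("post-train") the result.
    # block.attn = LinearMixer(d_model)          # assumes a GPT-2-style block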

BTW: it could also be interesting to take already-trained attention weights and see how long the FFN alone takes in the GPT-2 speedrun (it would be against the rules, but still very interesting IMHO; definitely something I'd like to read a paper about) https://github.com/KellerJordan/modded-nanogpt
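
A rough sketch of how that experiment could look, assuming nanoGPT-style parameter names (attention sub-modules called "attn"); this is my illustration, not anything from the modded-nanogpt repo:

    def freeze_attention(model):
        """Freeze every parameter that belongs to an attention sub-module and
        return the rest, so only the FFN (plus norms/head) gets optimized."""
        trainable = []
        for name, p in model.named_parameters():
            if ".attn." in name:                 # naming assumed from nanoGPT
                p.requires_grad = False          # reuse already-trained attention
            else:
                trainable.append(p)              # FFN and friends stay trainable
        return trainable

    # optimizer = torch.optim.AdamW(freeze_attention(model), lr=3e-4)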

Also, I read yesterday that at some point the embeddings across all of the models become (very) comparable/similar, and that a simple converter can be trained between them. If both of these statements are true, maybe we could train everything much faster just by sharing fixed embeddings and attention.
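
If the second statement holds, the "simple converter" could be as small as a linear map fit by least squares between two models' token-embedding tables; a hedged sketch, assuming a shared vocabulary (the function name and that assumption are mine):

    import torch

    def fit_embedding_converter(emb_a, emb_b):
        """Least-squares linear map W with emb_a @ W ~= emb_b.
        emb_a: (vocab, d_a), emb_b: (vocab, d_b), rows aligned by token id."""
        return torch.linalg.lstsq(emb_a, emb_b).solution   # shape (d_a, d_b)

    # W = fit_embedding_converter(model_a_wte, model_b_wte)
    # converted = model_a_wte @ W    # model A's embeddings in model B's space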

spwa4 2 days ago | parent | next [-]

Ever notice that attention is (with the highest respect to the original researchers) "just" feeding the entire past of the network into a reverse-MoE neural network? (Meaning the expert selects parts of the input instead of parts of the neural network to execute.)
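
That reading falls out directly if you write causal scaled dot-product attention down: the softmax row for token t is a soft selection over every position up to t, and the output is the selected mix of their values. A small sketch of the standard formula (nothing model-specific):

    import torch
    import torch.nn.functional as F

    def causal_sdpa(q, k, v):                     # all: (batch, seq, d)
        d = q.size(-1)
        scores = q @ k.transpose(-2, -1) / d ** 0.5
        T = scores.size(-1)
        future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))  # only the past
        weights = F.softmax(scores, dim=-1)       # soft 'selection' per token
        return weights @ v                        # mix of the selected inputs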

In a way, everyone knew this would work. Nobody did it because it's so inefficient that even R and Python users thought it would be ridiculously slow (or they simply couldn't run it long enough to train it to a reasonable extent).

scotty79 2 days ago | parent | prev | next [-]

Attention is just a completely arbitrary way to split the network so that learning can be parallelized.

What contributed more towards the success, in my opinion, are "shortcut connections" through layers, which give the early layers more influence during learning.

grumbelbart2 2 days ago | parent [-]

> What contributed more towards the success, in my opinion, are "shortcut connections" through layers, which give the early layers more influence during learning.

For those who don't know, that is the idea behind ResNet (He et al., Deep Residual Learning for Image Recognition, https://arxiv.org/abs/1512.03385), one of the most influential papers in deep learning of all time.

Residual connections make it possible to train networks that are arbitrarily deep. Before ResNet, networks that were too deep were essentially not trainable due to vanishing or exploding gradients.
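
In code the idea is just y = x + F(x): the identity shortcut gives the gradient a direct path back to earlier layers. A minimal illustrative block (not the exact ResNet layer, which uses convolutions and batch norm):

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.f = nn.Sequential(
                nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
            )

        def forward(self, x):
            return x + self.f(x)   # shortcut: gradient flows through '+' untouched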

jonahx a day ago | parent | prev | next [-]

So is the famous "Attention Is All You Need" wrong?

cubefox 2 days ago | parent | prev | next [-]

> Also, I read yesterday that at some point, the embeddings across all of the models are (very) comparable/similar, and a simple converter can be trained

That was from here: https://news.ycombinator.com/item?id=44054425

slickytail 2 days ago | parent | prev [-]

The relative unimportance of the exact SDPA attention in use in modern transformers is already known: https://arxiv.org/abs/2111.11418

The FFN, normalization, and residual connections are absolutely irreplaceable -- but attention can be replaced with almost any other layer that shares information between tokens, such as pooling, convolution, random mixing, etc.
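
Roughly what the linked MetaFormer paper demonstrates; here is a sketch of a block that keeps the usual skeleton (norm, token mixer, residual, FFN) but mixes tokens with plain average pooling instead of attention. This is my simplification, not the paper's exact PoolFormer layer (which also subtracts the input inside the mixer):

    import torch.nn as nn

    class PoolMixerBlock(nn.Module):
        def __init__(self, dim, pool_size=3, mlp_ratio=4):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.pool = nn.AvgPool1d(pool_size, stride=1, padding=pool_size // 2)
            self.norm2 = nn.LayerNorm(dim)
            self.ffn = nn.Sequential(
                nn.Linear(dim, mlp_ratio * dim),
                nn.GELU(),
                nn.Linear(mlp_ratio * dim, dim),
            )

        def forward(self, x):                       # x: (batch, seq, dim)
            h = self.norm1(x).transpose(1, 2)       # AvgPool1d pools along seq
            x = x + self.pool(h).transpose(1, 2)    # token mixing, no attention
            x = x + self.ffn(self.norm2(x))         # usual FFN + residuals
            return x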

cztomsik a day ago | parent [-]

Hm, residual is the one I would not expect; can you elaborate on why?

simsla a day ago | parent [-]

Avoids vanishing gradients in deeper networks.

Also, most blocks with a residual approximate the identity function when initialised, so they tend to be well behaved.
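
One common way to see (and enforce) that second point: zero-initialize the last layer of the residual branch, so F(x) is exactly zero at init and x + F(x) is exactly the identity. A tiny sketch:

    import torch
    import torch.nn as nn

    dim = 16
    branch = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    nn.init.zeros_(branch[-1].weight)   # last layer outputs 0 at initialization
    nn.init.zeros_(branch[-1].bias)

    x = torch.randn(2, dim)
    assert torch.allclose(x + branch(x), x)   # the block starts as the identity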