spwa4 2 days ago

Ever notice that attention is (with the highest respect to the original researchers) "just" feeding the entire past of the sequence into a reverse-MoE network? By "reverse" I mean the expert selects parts of the input instead of selecting parts of the network to execute.
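
Concretely, here's a minimal NumPy sketch of that "soft selection over inputs" reading of attention. This is a toy single-query illustration, not anything from a real model; the names and dimensions are made up:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 4                             # toy embedding dimension
past = np.random.randn(10, d)     # representations of all previous tokens
query = np.random.randn(d)        # the current token's query vector

# Score the query against every past position; softmax turns the scores
# into a soft selection over the *input* (a gating distribution over
# past tokens, analogous to MoE gating over experts).
scores = past @ query / np.sqrt(d)
weights = softmax(scores)         # sums to 1, like MoE routing weights
output = weights @ past           # weighted mix of the "selected" inputs

print(weights.round(2))           # which parts of the past got picked
```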

In a way everyone knew this would work. Nobody did it because it's so inefficient (the cost grows quadratically with sequence length) that even R and Python users thought it would be ridiculously slow, or simply couldn't run it long enough to train to a reasonable extent.