killerstorm 5 days ago

No, there's a fundamental limitation of the Transformer architecture:

  * information from the entire context has to be squeezed into an information channel of a fixed size; the more information you try to squeeze in, the more noise you get
  * selection of which information passes through is done with just a dot product
Training data isn't the problem.
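
To make both points concrete, here's a minimal sketch of single-head scaled dot-product attention in NumPy, with made-up sizes: the selection weights are nothing more than a softmaxed dot product, and every query's view of the whole context gets squeezed into one fixed-size output vector, no matter how long the context is.

    import numpy as np

    def attention(Q, K, V):
        # Selection is just a dot product: scores[i, j] says how much
        # position i attends to position j.
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the context
        # The entire context is compressed into one d_head-sized vector
        # per query position, regardless of context length.
        return weights @ V

    seq_len, d_head = 1024, 64   # hypothetical sizes, for illustration only
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((seq_len, d_head))
    K = rng.standard_normal((seq_len, d_head))
    V = rng.standard_normal((seq_len, d_head))
    out = attention(Q, K, V)
    print(out.shape)             # (1024, 64): a fixed 64 numbers per position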

In principle, as you scale a transformer you get more heads and more dimensions in each vector, so the bandwidth of the attention data bus goes up, and with it the precision of recall.
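
Rough illustration of that scaling (the shapes below are hypothetical, only chosen to be in the ballpark of small vs. large published models): per-token attention bandwidth is heads times head dimension.

    # Hypothetical model shapes, chosen only to illustrate the scaling argument.
    configs = {
        "small model": {"n_heads": 12, "d_head": 64},    # 12 * 64  = 768 values/token
        "large model": {"n_heads": 96, "d_head": 128},   # 96 * 128 = 12288 values/token
    }
    for name, c in configs.items():
        bandwidth = c["n_heads"] * c["d_head"]
        print(f"{name}: {bandwidth} values per token through attention")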