killerstorm 5 days ago
No, there's a fundamental limitation of the Transformer architecture:

Training data isn't the problem. In principle, as you scale a transformer you get more heads and more dimensions in each vector, so the bandwidth of the attention "data bus" goes up, and thus the precision of recall goes up too.
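
A rough numerical sketch of that claim (my own illustration, not from the thread): treat one attention head as an associative memory over random keys. As the per-head dimension d grows, cross-key interference in the dot-product scores shrinks like 1/sqrt(d), so a noisy query picks out its own key more reliably. The key count, noise level, and trial count below are arbitrary choices for the demo.

    import numpy as np

    rng = np.random.default_rng(0)

    def recall_accuracy(d, n_keys=512, noise=0.5, trials=200):
        """Fraction of trials where dot-product attention over n_keys
        random d-dimensional keys retrieves the right key for a noisy query."""
        hits = 0
        for _ in range(trials):
            keys = rng.standard_normal((n_keys, d)) / np.sqrt(d)          # ~unit-norm random keys
            target = rng.integers(n_keys)
            query = keys[target] + noise * rng.standard_normal(d) / np.sqrt(d)
            hits += int(np.argmax(keys @ query) == target)                # attention score argmax
        return hits / trials

    for d in (8, 32, 128):  # stand-in for growing per-head dimension
        print(f"d_head={d:4d}  recall accuracy={recall_accuracy(d):.2f}")

Accuracy climbs toward 1.0 as d grows, which is the sense in which wider heads (or more of them) give a wider "data bus" and sharper recall.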