noosphr 2 hours ago
Yes, and it works in theory. Less so in practice. You saturate the memory of a B200 with a few dozen tokens once the attention order goes above 4. Training is even worse. To paraphrase Knuth: high-order polynomials are much more unimaginably large than mere infinity.
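To put rough numbers on that scaling, here is a back-of-the-envelope sketch. It rests on assumptions that are not from the comment: the order-k score tensor is materialized naively as n^k entries per head, entries are 2 bytes (bf16), and the memory budget is 192 GB, used here as a stand-in for a B200. The function names are made up for illustration. Real implementations may tile or fuse and never hold the full tensor, and training adds activations and gradients on top.

```python
# Hypothetical helpers, not from any library: naive memory estimate for
# order-k attention, assuming the full n**k score tensor is materialized.

def score_tensor_bytes(n_tokens: int, order: int,
                       bytes_per_entry: int = 2, n_heads: int = 1) -> int:
    """Bytes needed for one layer's score tensor: heads * n**order entries."""
    return n_heads * bytes_per_entry * n_tokens ** order


def max_tokens(order: int, budget_bytes: int,
               bytes_per_entry: int = 2, n_heads: int = 1) -> int:
    """Largest sequence length whose score tensor still fits in the budget."""
    n = 0
    while score_tensor_bytes(n + 1, order, bytes_per_entry, n_heads) <= budget_bytes:
        n += 1
    return n


if __name__ == "__main__":
    budget = 192 * 10**9  # assumed memory budget in bytes (192 GB)
    for order in range(2, 9):
        single = max_tokens(order, budget)
        multi = max_tokens(order, budget, n_heads=32)  # 32 heads is an arbitrary choice
        print(f"order {order}: {single} tokens (1 head), {multi} tokens (32 heads)")
```

Under these assumptions, the token limit drops fast as the order grows, and adding heads or a batch dimension cuts it further. That is the polynomial blow-up the comment is pointing at.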