refulgentis 2 hours ago
n^2 isn't a setting someone chose; it's a mathematical consequence of what attention is. Here's what attention does: every token looks at every other token to decide what's relevant. If you have n tokens and each one looks at n others, you get n * n = n^2 operations. Put another way: n^2 is what you get when every token looks at every other token. What would n^3 be? n^10? (The sibling comment has the same interpretation as you, then hand-waves that transformers can emulate more complex systems.)
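
For intuition, here's a minimal numpy sketch of single-head scaled dot-product attention (the names q, k, v, d are illustrative, not from any particular library); the scores matrix is exactly where the n x n shows up:

    # Minimal sketch of single-head attention, just to make the n^2 visible.
    import numpy as np

    def attention(q, k, v):
        # q, k, v: (n, d) arrays, one row per token
        n, d = q.shape
        scores = q @ k.T / np.sqrt(d)   # (n, n): every token scored against every other token
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
        return weights @ v              # (n, d): each token is a weighted mix of all tokens

    n, d = 1024, 64
    x = np.random.randn(n, d)
    out = attention(x, x, x)   # the scores matrix alone holds n * n = 1,048,576 entries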
measurablefunc 2 hours ago
There are lots of operations more complicated than comparing every token to every other token, & the complexity increases when you start comparing not just token pairs but token bigrams, trigrams, & so on. There is no obvious proof that all those comparisons would be equivalent to the standard attention mechanism of comparing every token to every other one.
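
To make the combinatorics concrete, here's a toy count (purely illustrative, not a claim about any real architecture): pairwise comparisons grow roughly like n^2, while comparing all triples grows like n^3:

    # Toy comparison counts for n = 100 stand-in tokens.
    from itertools import combinations

    tokens = range(100)
    pairs = sum(1 for _ in combinations(tokens, 2))    # C(100, 2) = 4,950   ~ n^2 / 2
    triples = sum(1 for _ in combinations(tokens, 3))  # C(100, 3) = 161,700 ~ n^3 / 6
    print(pairs, triples)   # higher-order comparisons blow up much faster than pairwise attention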