jsenn 3 hours ago
> Section 2.6 gives the hidden state size per token, which, on first read, is strictly larger than the hidden state in normal attention

This is where you’ve gone off track. The “hidden state” for their model is a fixed-size thing, like in an RNN, not per token. For a transformer, the analogous “hidden state” is the KV cache, and it grows with sequence length. This is why their method is linear rather than quadratic.

The Taylor series they derive isn’t just for softmax (after all, real implementations of softmax will likely already use a Taylor series internally!), it’s for the entire tensor-level softmax(QK) computation.
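To make the distinction concrete, here’s a minimal NumPy sketch (not the paper’s exact formulation) contrasting the two kinds of state: standard attention has to keep the full KV cache, which grows with sequence length, whereas replacing exp(q·k) with its first-order Taylor approximation 1 + q·k lets you fold everything into fixed-size running sums, like an RNN. The class and variable names here are illustrative, not from the paper.

    # Sketch: growing KV cache vs. fixed-size state from a 1st-order Taylor
    # approximation exp(q.k) ~= 1 + q.k (assumptions noted above).
    import numpy as np

    d = 4  # head dimension (illustrative)

    def softmax_attention_step(q, K_cache, V_cache):
        """Standard attention: the state is the KV cache, which grows with t."""
        scores = K_cache @ q                      # (t,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V_cache                  # (d,)

    class TaylorLinearAttention:
        """Linearized attention: exp(q.k) ~= 1 + q.k  =>  fixed-size state."""
        def __init__(self, d):
            self.S = np.zeros((d, d))  # running sum of outer(k, v)
            self.z = np.zeros(d)       # running sum of k
            self.v_sum = np.zeros(d)   # running sum of v (from the "1" term)
            self.n = 0                 # token count

        def step(self, k, v):
            # State update is O(d^2) per token, independent of sequence length.
            self.S += np.outer(k, v)
            self.z += k
            self.v_sum += v
            self.n += 1

        def read(self, q):
            numer = self.v_sum + self.S.T @ q     # sum_t (1 + q.k_t) * v_t
            denom = self.n + self.z @ q           # sum_t (1 + q.k_t)
            return numer / denom

    rng = np.random.default_rng(0)
    lin = TaylorLinearAttention(d)
    K_cache, V_cache = [], []
    for t in range(6):
        k, v = 0.1 * rng.normal(size=d), rng.normal(size=d)
        K_cache.append(k); V_cache.append(v)
        lin.step(k, v)

    q = 0.1 * rng.normal(size=d)
    print("softmax attention :", softmax_attention_step(q, np.array(K_cache), np.array(V_cache)))
    print("taylor (1st order):", lin.read(q))  # close when q.k is small

The point of the sketch is the memory profile: K_cache/V_cache grow with every token, while (S, z, v_sum, n) stay the same size no matter how long the sequence gets, which is what makes the linearized version constant-state and linear-time.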