yorwba 3 hours ago

> But the reason that attention is quadratic is that each token gets evaluated with respect to each other token. They haven't changed this at all. Section 2.5 seems like it's deferring this to an appendix.

They defer it to the appendix because it's a standard construction: (Q'K)V = Q'(KV) (with ' denoting transpose), where Q'K is an n×n matrix and takes O(n²) to compute, but KV has a constant d_k×d_v size and can be computed in O(n) time; the final multiplication by Q' is likewise O(n).
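
To make that concrete, here's a minimal numpy sketch of the associativity trick (toy dimensions, no softmax or feature map, just to show the two orderings agree while only one of them materializes an n×n matrix):

    import numpy as np

    n, d_k, d_v = 1000, 64, 64
    Q = np.random.randn(d_k, n)   # columns are tokens
    K = np.random.randn(d_k, n)
    V = np.random.randn(n, d_v)

    slow = (Q.T @ K) @ V   # materializes an n x n matrix: O(n^2)
    fast = Q.T @ (K @ V)   # only ever forms a d_k x d_v state: O(n)

    print(np.allclose(slow, fast))  # True, up to float error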

> Section 2.6 gives the hidden state size per token, which, on first read, is strictly larger than the hidden state in normal attention (in normal attention it's d_v * d_k -- I'm not sure where their +1 comes from).

Actually, their hidden state has a (large) constant size, so strike the words "per token" from section 2.6. In normal attention, the total state grows as n(d_v + d_k); theirs is basically (d_v + 1)·D_k, where D_k is much larger than d_k but independent of n. The +1 is there because they also need to accumulate the normalization factor for the softmax denominator.
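
In case it helps, here's a rough sketch of why the state is (d_v + 1)·D_k numbers. This is the generic linear-attention recurrence, not the paper's exact formulation; phi here is an illustrative stand-in for whatever feature map they actually use:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d_k, d_v, D_k = 16, 8, 8, 128     # D_k >> d_k, but independent of n
    W = rng.standard_normal((d_k, D_k))  # made-up feature map, for illustration
    phi = lambda x: np.exp(x @ W / d_k)  # placeholder kernel features (> 0)

    S = np.zeros((D_k, d_v))  # running sum of outer products phi(k_i) v_i': d_v * D_k numbers
    z = np.zeros(D_k)         # running sum of phi(k_i): the "+1" row, D_k more numbers

    for q, k, v in zip(rng.standard_normal((n, d_k)),
                       rng.standard_normal((n, d_k)),
                       rng.standard_normal((n, d_v))):
        S += np.outer(phi(k), v)
        z += phi(k)
        out = (phi(q) @ S) / (phi(q) @ z)  # normalized output for the current token

The state never grows with n; each new token just folds into S and z.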

It's true that a constant state size implies that you cannot use it to losslessly store arbitrarily large databases, but LLMs in practice cannot do this either, so there's no loss of capability in that sense. (In fact, if you use enough terms in the Taylor expansion to get the same result as standard attention to within machine precision, the resulting constant state size should give you an upper bound for the amount of data the LLM can effectively retrieve from its context.)
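
Back-of-the-envelope on that bound: a degree-p Taylor expansion of exp(q'k) needs one feature per monomial of degree ≤ p in the d_k query/key coordinates (assuming the usual symmetric-tensor feature construction; the paper's exact map may differ):

    from math import comb

    def taylor_feature_dim(d_k, p):
        # monomials of degree <= p in d_k variables
        return sum(comb(d_k + j - 1, j) for j in range(p + 1))

    # e.g. d_k = 64, degree 2: 1 + 64 + 2080 = 2145 features,
    # so the state is (d_v + 1) * 2145 numbers, regardless of context length
    print(taylor_feature_dim(64, 2))

Multiply that by (d_v + 1) and the bytes per number and you get a concrete cap on how much the model can effectively retrieve from its context.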