EGreg 11 hours ago
The predicted KV vector is the expected KV vector under the model's distribution over next tokens, i.e. a weighted average over the vocabulary, not an actual sampled token. So no forward pass with a sampled token is involved. Yes, the exact computation is expensive (one forward pass per vocabulary token), which the paper acknowledges, and the practical section covers top-k approximations that capture most of the probability mass cheaply. The entropy bound holds regardless of approximation scheme -- it's a statement about the theoretical floor. The residual is small whenever the model assigns high probability to the actual next token, which is exactly what low perplexity means.
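A minimal numpy sketch of that definition (names and shapes are illustrative, not from the paper; the `KV` table is a random stand-in for "the KV vector you'd get if token t were next," and in the real computation each row would cost one forward pass):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D = 1000, 64

# Stand-in for "KV vector if token t were the next token"; in the real
# setting each row would require a forward pass with t appended.
KV = rng.standard_normal((VOCAB, D))

# Model's next-token distribution (softmax over arbitrary logits here).
logits = rng.standard_normal(VOCAB)
p = np.exp(logits - logits.max())
p /= p.sum()

# Predicted KV vector: the expectation under p, i.e. a weighted average
# over the whole vocabulary -- no token is ever sampled.
predicted_kv = p @ KV  # shape (D,)
```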
magicalhippo 10 hours ago | parent | next
> the practical section covers top-k approximations that capture most of the probability mass cheaply.

You say cheaply, but top-k with k=20 still means 20 forward passes for each position in the predicted KV cache, no? So to compute the residual at position i+1 you need another 20 passes? It's late, perhaps I'm missing something.
aesthesia 11 hours ago | parent | prev
A top-k approximation still requires k forward passes -- k times the cost of just computing the exact KV for the actual next token. Unless you're doing a prefix-unconditional prediction, in which case you still likely need quite a large token -> vector dictionary, and, particularly for inner layers, a significant amount of information is left in the residual.
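The trade-off being argued about can be sketched numerically (numpy, with a random stand-in `KV` table and made-up distributions -- none of this comes from the paper). Each row of `KV` stands for one forward pass, so the exact expectation "costs" |V| rows while the top-k approximation touches only k:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D, K = 1000, 64, 20

# Stand-in per-token KV vectors; in the real setting each row would
# require a forward pass with that token appended to the prefix.
KV = rng.standard_normal((VOCAB, D))

def expected_kv(p, topk=None):
    """Exact expectation over the vocabulary, or a renormalized
    top-k approximation of it."""
    if topk is None:
        return p @ KV                    # |V| "forward passes"
    idx = np.argsort(p)[-topk:]          # the K most probable tokens
    w = p[idx] / p[idx].sum()            # renormalize the captured mass
    return w @ KV[idx]                   # only K "forward passes"

# Diffuse (high-perplexity) distribution: top-k misses a lot of mass.
logits = rng.standard_normal(VOCAB)
p = np.exp(logits - logits.max()); p /= p.sum()
err = np.linalg.norm(expected_kv(p) - expected_kv(p, K))
mass = np.sort(p)[-K:].sum()

# Peaked (low-perplexity) distribution: top-k captures almost everything.
peaked = np.zeros(VOCAB); peaked[0] = 10.0
q = np.exp(peaked - peaked.max()); q /= q.sum()
err_q = np.linalg.norm(expected_kv(q) - expected_kv(q, K))
mass_q = np.sort(q)[-K:].sum()
```

The peaked case is the parent's point: when the model is confident about the next token, top-k captures nearly all the probability mass and the approximation error shrinks accordingly -- but you still pay k passes per position either way.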