EGreg 11 hours ago

The predicted KV vector is the expected KV vector under the model's distribution over next tokens, i.e. a probability-weighted average over the vocabulary, not the KV vector of an actual sampled token. So no forward pass with a sampled token is involved. Yes, the exact computation is expensive (one forward pass per vocabulary token), which the paper acknowledges, and the practical section covers top-k approximations that capture most of the probability mass cheaply. The entropy bound holds regardless of approximation scheme -- it's a statement about the theoretical floor. The residual is small whenever the model assigns high probability to the actual next token, which is exactly what low perplexity means.

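To make the terms in this comment concrete, here is a minimal toy sketch of an expected KV vector, a top-k approximation of it, and the residual against the actual token's vector. Everything in it is an assumption for illustration: the function names, the four-token vocabulary, the two-dimensional vectors, and in particular the renormalization of the top-k probabilities, which the thread does not specify. The per-token KV vectors are given as a lookup table; in a real model each row would have to be computed by the network, which is the cost being debated.

```python
# Toy illustration only: kv_per_token is a hypothetical lookup table.
# In a real transformer each row would come from running the model
# on that candidate token.

def expected_kv(probs, kv_per_token):
    """Probability-weighted average of the per-token KV vectors."""
    dim = len(kv_per_token[0])
    out = [0.0] * dim
    for p, vec in zip(probs, kv_per_token):
        for j in range(dim):
            out[j] += p * vec[j]
    return out


def topk_expected_kv(probs, kv_per_token, k):
    """Approximate the expectation using only the k most probable tokens.

    Assumption: the kept probabilities are renormalized to sum to 1.
    Returns the approximate vector and the probability mass covered.
    """
    top = sorted(range(len(probs)), key=lambda t: probs[t], reverse=True)[:k]
    mass = sum(probs[t] for t in top)
    renormalized = [probs[t] / mass for t in top]
    return expected_kv(renormalized, [kv_per_token[t] for t in top]), mass


def residual(actual, predicted):
    """Element-wise difference between the actual and predicted vectors."""
    return [a - b for a, b in zip(actual, predicted)]


# Hypothetical next-token distribution and per-token KV vectors.
probs = [0.7, 0.2, 0.05, 0.05]
kv_per_token = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]

exact = expected_kv(probs, kv_per_token)
approx, mass = topk_expected_kv(probs, kv_per_token, k=2)
res = residual(kv_per_token[0], exact)  # suppose token 0 is the actual next token

print("exact expectation:", exact)
print("top-2 approximation:", approx, "covering mass", mass)
print("residual for token 0:", res)
```

In this toy setup the residual goes to zero as the probability on the actual token approaches 1, which is the sense in which a confident model leaves little to store.
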
magicalhippo 10 hours ago

> the practical section covers top-k approximations that capture most of the probability mass cheaply.

You say cheaply, but top-k with k=20 still means 20 forward passes for each position in the predicted KV cache vector, no? So to compute the residual at position i+1 you need another 20 passes? It's late, perhaps I'm missing something.

aesthesia 11 hours ago

A top-k approximation still requires k forward passes; that's k times as expensive as just computing the exact value. Unless you're doing a prefix-unconditional prediction, in which case you still likely need quite a large token -> vector dictionary, and, particularly for inner layers, a significant amount of information is left in the residual.

EGreg 10 hours ago

The k forward passes for different candidate tokens share all their prefix computation -- the KV cache up to position i-1 is identical for all candidates, so you run one pass through the shared layers and then k cheap single-token extensions. At long context lengths the shared prefix dominates the cost. This is also structurally what speculative decoding already does, so the infrastructure largely exists.
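
The shared-prefix argument can be put as a rough count of token positions processed. This is a back-of-the-envelope sketch, not a cost model from the paper: the function names are hypothetical, and it counts positions rather than FLOPs, so it ignores that each single-token extension still attends over the whole cached prefix.

```python
def naive_token_steps(prefix_len, k):
    """k independent forward passes, each re-processing the prefix plus one candidate."""
    return k * (prefix_len + 1)


def shared_prefix_token_steps(prefix_len, k):
    """One pass over the shared prefix, then k single-token extensions on its KV cache."""
    return prefix_len + k


prefix_len, k = 1000, 20  # k=20 as in the question above; prefix length is an assumed example
naive = naive_token_steps(prefix_len, k)
shared = shared_prefix_token_steps(prefix_len, k)
print(naive, shared)
```

Under this count, the k extensions add k token positions for each predicted position on top of the one-time prefix pass, so the saving comes entirely from not re-processing the prefix.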
|
|
|