EGreg 11 hours ago

The predicted KV vector is the expected KV vector under the model's distribution over next tokens, i.e. a weighted average over the vocabulary, not an actual sampled token. So no forward pass with a sampled token is involved. Yes, the exact computation is expensive (one forward pass per vocabulary token), which the paper acknowledges, and the practical section covers top-k approximations that capture most of the probability mass cheaply. The entropy bound holds regardless of approximation scheme -- it's a statement about the theoretical floor. The residual is small whenever the model assigns high probability to the actual next token, which is exactly what low perplexity means.
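The weighted-average construction above can be sketched in a few lines of numpy. This is a toy illustration of the definition only (the vocabulary, dimensions, and `kv_per_token` values are made up); in a real model each row of `kv_per_token` would come from a forward pass with that candidate token appended, which is exactly the expensive part being discussed.

```python
import numpy as np

def expected_kv(probs, kv_per_token):
    # probs: (V,) next-token distribution over the vocabulary
    # kv_per_token: (V, d) the KV vector each candidate token would produce
    # The predicted KV is the expectation under the model's distribution,
    # not the KV of any single sampled token.
    return probs @ kv_per_token

# toy 3-token vocabulary, 2-dim KV vectors (made-up numbers)
probs = np.array([0.7, 0.2, 0.1])
kv = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [2.0, 2.0]])
print(expected_kv(probs, kv))  # -> [0.9 0.4]
```

Note that a sampled token would give one of the rows of `kv`; the expectation is generally a vector no single token produces.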

magicalhippo 10 hours ago | parent | next [-]

> the practical section covers top-k approximations that capture most of the probability mass cheaply.

You say cheaply, but top-k with k=20 still means 20 forward passes for each position in the predicted KV cache vector, no? So to compute the residual at position i+1 you need another 20 passes?

It's late, perhaps I'm missing something.

aesthesia 11 hours ago | parent | prev [-]

A top-k approximation still requires k forward passes; that's k times as expensive as a single forward pass with the actual next token. Unless you're doing a prefix-unconditional prediction, in which case you still likely need quite a large token -> vector dictionary, and, particularly for inner layers, there's likely a significant amount of information left in the residual.

EGreg 10 hours ago | parent [-]

The k forward passes for different candidate tokens share all their prefix computation -- the KV cache up to position i-1 is identical for all candidates, so you run one pass through the shared layers and then k cheap single-token extensions. At long context lengths the shared prefix dominates the cost. This is also structurally what speculative decoding already does, so the infrastructure largely exists.
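The cost argument can be made concrete with a crude per-position cost model (a sketch, not a measurement: it counts attended positions per candidate and ignores constant factors and the quadratic attention inside the prefix pass, which both schemes pay once):

```python
def naive_cost(n, k):
    # k independent forward passes, each reprocessing the full n-token prefix
    return k * n

def shared_prefix_cost(n, k):
    # one pass over the shared prefix (its KV cache is reused), then
    # k single-token extensions, each attending over the cached prefix
    return n + k

n, k = 4096, 20
print(naive_cost(n, k))          # -> 81920
print(shared_prefix_cost(n, k))  # -> 4116
```

At n = 4096 and k = 20 the shared-prefix scheme is roughly 20x cheaper under this model, and the gap grows with context length, which is the "shared prefix dominates" point.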