The prediction being used is the model's prediction of the next token's KV vector, given all previous KV vectors. Because the model was trained on language, it has strong priors about what comes next. The residual, i.e the difference between the predicted next KV vector and the actual one -- is much smaller in entropy than the raw vector, for the same reason language model perplexity is low on fluent text.

▲

aesthesia 11 hours ago | parent [-]

What model is doing this prediction? The only way a transformer predicts the "next KV vector" is by sampling the next token and then running a forward pass with that token.

▲

EGreg 11 hours ago | parent [-]

The predicted KV vector is the expected KV vector under the model's distribution over next tokens, i.e. a weighted average over the vocabulary, not an actual sampled token. So no forward pass with a sampled token is involved. Yes, the exact computation is expensive (one forward pass per vocabulary token), which the paper acknowledges, and the practical section covers top-k approximations that capture most of the probability mass cheaply. The entropy bound holds regardless of approximation scheme -- it's a statement about the theoretical floor. The residual is small whenever the model assigns high probability to the actual next token, which is exactly what low perplexity means.

▲

magicalhippo 10 hours ago | parent | next [-]

> the practical section covers top-k approximations that capture most of the probability mass cheaply.

You say cheaply, but top-k with k=20 still means 20 forward passes for each position in the predicted KV cache vector, no? So to compute the residual at position i+1 you need another 20 passes?

It's late, perhaps I'm missing something.

▲

aesthesia 11 hours ago | parent | prev [-]

A top-k approximation still requires k forward passes; that's k times as expensive as just computing the exact value. Unless you're doing a prefix-unconditional prediction, in which case you still likely need quite a large token -> vector dictionary, and particularly for inner layers a significant amount of information left in the residual.

	▲	EGreg 10 hours ago \| parent [-]
		the k forward passes for different candidate tokens share all their prefix computation -- the KV cache up to position i-1 is identical for all candidates, so you run one pass through the shared layers and then k cheap single-token extensions. At long context lengths the shared prefix dominates the cost. This is also structurally what speculative decoding already does, so the infrastructure largely exists.