| ▲ | aesthesia 12 hours ago |
| > The second layer, predictive delta coding, stores only the residual of each new KV vector from the model's own prediction of it. I don't understand this. The key and value vectors for any given layer + token are created by the model; by definition, they are exactly equal to the model's prediction of them! Extreme KV cache compression is easy to get: you can get an infinite compression ratio by just regenerating the key and value vectors on every forward pass. But the point of a KV cache is to reduce the amount of repeated computation during generation, so compression only helps if you have an efficient decompression algorithm. |
|
| ▲ | binsquare 12 hours ago | parent | next [-] |
Incredible claims and an unreviewed paper. Attention really is all you need.
|
| ▲ | EGreg 11 hours ago | parent | prev [-] |
The prediction being used is the model's prediction of the next token's KV vector, given all previous KV vectors. Because the model was trained on language, it has strong priors about what comes next. The residual, i.e. the difference between the predicted next KV vector and the actual one, is much smaller in entropy than the raw vector, for the same reason language-model perplexity is low on fluent text.
| ▲ | aesthesia 11 hours ago | parent [-] |
What model is doing this prediction? The only way a transformer predicts the "next KV vector" is by sampling the next token and then running a forward pass with that token.
| ▲ | EGreg 11 hours ago | parent [-] |
The predicted KV vector is the expected KV vector under the model's distribution over next tokens, i.e. a weighted average over the vocabulary, not an actual sampled token. So no forward pass with a sampled token is involved. Yes, the exact computation is expensive (one forward pass per vocabulary token), which the paper acknowledges, and the practical section covers top-k approximations that capture most of the probability mass cheaply. The entropy bound holds regardless of approximation scheme; it's a statement about the theoretical floor. The residual is small whenever the model assigns high probability to the actual next token, which is exactly what low perplexity means.
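A minimal numpy sketch of that definition, with toy dimensions; the lookup table `kv_for_token` is an invented stand-in for the expensive per-candidate forward pass, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 8, 4

# Stand-in for running a forward pass with candidate token t appended;
# in reality each of these is a full transformer pass.
kv_table = rng.normal(size=(vocab_size, d))
def kv_for_token(t):
    return kv_table[t]

# Model's next-token distribution at this position (peaked, i.e. low perplexity).
logits = np.array([5.0, 0.0, -1.0, 0.5, -2.0, 0.0, 1.0, -0.5])
p = np.exp(logits - logits.max())
p /= p.sum()

# Predicted KV vector: the expectation over the vocabulary, no sampling involved.
predicted = sum(p[t] * kv_for_token(t) for t in range(vocab_size))

# The token that actually occurs next.
actual = kv_for_token(int(np.argmax(p)))

# Residual to be stored: small whenever the model was confident.
residual = actual - predicted
print(np.linalg.norm(residual), np.linalg.norm(actual))
```

With a peaked distribution the residual norm is a small fraction of the raw vector's norm; with a flat distribution it approaches the raw vector, which is the entropy argument in miniature.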
| ▲ | magicalhippo 11 hours ago | parent | next [-] |
> the practical section covers top-k approximations that capture most of the probability mass cheaply.
You say cheaply, but top-k with k=20 still means 20 forward passes for each position in the predicted KV cache vector, no? So to compute the residual at position i+1 you need another 20 passes? It's late, perhaps I'm missing something.
| ▲ | aesthesia 11 hours ago | parent | prev [-] |
A top-k approximation still requires k forward passes; that's k times as expensive as just computing the exact value. Unless you're doing a prefix-unconditional prediction, in which case you still likely need quite a large token -> vector dictionary, and, particularly for inner layers, a significant amount of information would be left in the residual.
| ▲ | EGreg 10 hours ago | parent [-] |
The k forward passes for different candidate tokens share all their prefix computation -- the KV cache up to position i-1 is identical for all candidates, so you run one pass through the shared layers and then k cheap single-token extensions. At long context lengths the shared prefix dominates the cost. This is also structurally what speculative decoding already does, so the infrastructure largely exists.
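A toy illustration of that sharing, with a single linear KV projection standing in for a full transformer layer (all names here are invented for the sketch, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def extend_cache(cache_k, cache_v, x):
    """One cheap single-token step: project x and append to a copy of the cache."""
    return np.vstack([cache_k, x @ Wk]), np.vstack([cache_v, x @ Wv])

# Shared prefix: its KV cache is computed once and reused by every candidate.
prefix = rng.normal(size=(16, d))
cache_k, cache_v = prefix @ Wk, prefix @ Wv

# k candidate next tokens: each costs one single-token extension, not a full pass
# over the 16-token prefix.
candidates = rng.normal(size=(4, d))
per_candidate = [extend_cache(cache_k, cache_v, c[None, :]) for c in candidates]
```

The point of the structure is that the per-candidate work is O(1) tokens while the shared work is O(context length), which is the same batching trick speculative decoding relies on.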
|