| ▲ | mbernstein 11 hours ago | |
This is a compute memory trade, not compression vs. turobquant? Lemma 1 is something like, "forward pass is deterministic because it's deterministic" which means the input tokens were always the lower bound...which isn't caching? Smells tautological. What am I missing? | ||
| ▲ | EGreg 11 hours ago | parent [-] | |
Well yeah, I just wrote it as a lemma, but it's basically close to tautological. Its only job is to formally ground the entropy argument that follows it. The interesting claim is what comes after: because KV vectors are deterministic functions of tokens, and because the model is a near-optimal predictor of its own distribution, the conditional entropy of each new KV vector given all previous ones is bounded by token-level perplexity. TurboQuant compresses against the marginal distribution of each vector in isolation -- that's the gap. And yes, it's a compute/memory tradeoff, all caching is. The claim is just that the memory floor is much lower than anyone had formally established. Whether the compute cost of getting there is worth it is a fair open question the paper doesn't settle. But what if it is? Caching is the thread running through most of my work, and I intend to find out. | ||