Remix.run Logo
himata4113 12 hours ago

The reasoning around the 900000x claim isn't sound and violates way too many information density principles.

I was incredibly curious since I had a pet theory in my mind about something extremely similar, but arrived at a conclusion that the time complexity of such cache would end up being extremely slow.

This is like saying that you've achieved single token compression when you're passing a single token into a model and letting it regenerate the entire output since at the end of the day models are probabilistic stateless devices. At that point you don't have a cache and are just replaying the tokens or have a caching algorithm with a complexity similar to that of a model defeating the purpose of such cache.

I've never considered that arXiv had a problem, now I do.

EGreg 12 hours ago | parent [-]

No, the 914,000x in the paper is talking about the ratio between two entropy floors, it's not a claim about practical compression. The point is that per-vector quantization has been chasing the wrong theoretical limit: the sequential entropy bound is just fundamentally lower, by that factor, because KV vectors aren't independent samples!

On complexity, that's fair concern, and the paper doesn't fully resolve it. But the analogy to "replaying tokens through the model" isn't exactly right. The delta coding layer uses the model's own next-token prediction, which is already happening during normal autoregressive inference. You're not adding a forward pass, you're using the one already running and storing only the residual, which is much smaller than the raw vector -- precisely because the model is a good predictor of its own next state.

The trie index lookup is O(sequence length), not O(model forward pass). Whether that's fast enough in practice at scale is actually a legitimate open question and I'd be the first to admit the paper doesn't settle it. But the contribution here is simply establishing that the bound exists and is dramatically lower than what the field has been targeting. That's what I wanted to put out. The engineering question of how close you can get is the natural next step.

Your pet theory about time complexity sounds interesting actually, did you write it up anywhere?