EGreg 12 hours ago
No, the 914,000x figure in the paper is the ratio between two entropy floors; it's not a claim about practical compression. The point is that per-vector quantization has been chasing the wrong theoretical limit: the sequential entropy bound is fundamentally lower, by that factor, because KV vectors aren't independent samples!

On complexity, that's a fair concern, and the paper doesn't fully resolve it. But the analogy to "replaying tokens through the model" isn't exactly right. The delta coding layer uses the model's own next-token prediction, which is already happening during normal autoregressive inference. You're not adding a forward pass; you're using the one already running and storing only the residual, which is much smaller than the raw vector precisely because the model is a good predictor of its own next state. The trie index lookup is O(sequence length), not O(model forward pass). Whether that's fast enough in practice at scale is a legitimate open question, and I'd be the first to admit the paper doesn't settle it.

But the contribution here is simply establishing that the bound exists and is dramatically lower than what the field has been targeting. That's what I wanted to put out. The engineering question of how close you can get is the natural next step.

Your pet theory about time complexity sounds interesting, actually. Did you write it up anywhere?
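To make the "store only the residual" idea concrete, here is a toy sketch in Python. It is not the paper's method: the predictor is just "the previous reconstructed vector" standing in for the model's own prediction, the data is a synthetic random walk rather than real KV vectors, and every name (`make_sequence`, `delta_encode`, `delta_decode`, `entropy_bits`) is made up for illustration. It only shows the general effect that quantized residuals of a predictable sequence tend to have lower empirical entropy than the quantized raw values.

```python
import math
import random
from collections import Counter


def quantize(x, step):
    """Uniform scalar quantizer: map a float to an integer bin index."""
    return round(x / step)


def entropy_bits(symbols):
    """Empirical Shannon entropy (bits per symbol) of a list of symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def make_sequence(n, dim, seed=0):
    """Synthetic, strongly correlated vectors (a random walk).

    Assumption: this is only a stand-in for a sequence whose next
    element is well predicted by its history.
    """
    rng = random.Random(seed)
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    seq = []
    for _ in range(n):
        vec = [v + rng.gauss(0.0, 0.05) for v in vec]
        seq.append(vec)
    return seq


def delta_encode(seq, step):
    """Closed-loop delta coding: quantize (actual - prediction).

    The prediction is the previous *reconstructed* vector, so the
    encoder and decoder stay in sync and quantization error never
    accumulates.
    """
    prev = [0.0] * len(seq[0])
    codes = []
    for vec in seq:
        q = [quantize(v - p, step) for v, p in zip(vec, prev)]
        codes.append(q)
        prev = [p + qi * step for p, qi in zip(prev, q)]
    return codes


def delta_decode(codes, step):
    """Rebuild the sequence by adding each residual to the prediction."""
    prev = [0.0] * len(codes[0])
    out = []
    for q in codes:
        prev = [p + qi * step for p, qi in zip(prev, q)]
        out.append(prev)
    return out


if __name__ == "__main__":
    step = 0.01
    seq = make_sequence(500, 8)
    codes = delta_encode(seq, step)
    raw_symbols = [quantize(v, step) for vec in seq for v in vec]
    residual_symbols = [q for vec in codes for q in vec]
    print("raw bits/symbol:     ", entropy_bits(raw_symbols))
    print("residual bits/symbol:", entropy_bits(residual_symbols))
```

The gap between the two printed numbers depends entirely on how good the predictor is. A real model predicting its own next state is a very different predictor from "copy the last vector", so this says nothing about the actual ratio in the paper.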