WithinReason 4 hours ago
> The tokens stored in a KV cache are not arbitrary floating-point data -- they are samples from the exact formal language the model was trained on, and the model is by construction a near-optimal predictor of that language.

You can compress the KV cache to 0 bytes by just recomputing it every token. This observation is not worth an arXiv paper, though.
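To make the "0 bytes" point concrete, here is a minimal sketch (toy single-head attention, hypothetical names, not any particular paper's method): since K and V are deterministic functions of the token prefix and the weights, a decoder can simply recompute them from the prefix at every step instead of storing them, trading compute for memory.

```python
import torch

d = 16
torch.manual_seed(0)
emb = torch.nn.Embedding(100, d)
Wq, Wk, Wv = (torch.nn.Linear(d, d, bias=False) for _ in range(3))

def attend_without_cache(token_ids):
    """Recompute K and V for the whole prefix; return attention output for the newest token."""
    x = emb(torch.tensor(token_ids))        # (T, d) embeddings for the full prefix
    q = Wq(x[-1:])                          # query for the newest token only
    k, v = Wk(x), Wv(x)                     # K/V rebuilt from scratch on every call -- nothing is cached
    attn = torch.softmax(q @ k.T / d**0.5, dim=-1)
    return attn @ v                         # (1, d)

# Each decoding step passes the full prefix, so no K/V tensors persist between steps:
# the "cache" occupies 0 bytes, and the price is re-running the projections over all prior tokens.
prefix = [3, 17, 42]
out = attend_without_cache(prefix)
print(out.shape)  # torch.Size([1, 16])
```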