Vetch | 3 days ago |
You are right, and the idea of LLMs as lossy compression has lots of problems in general (an LLM is a statistical model, a function approximating the data-generating process). Compression artifacts, which are deterministic distortions in reconstruction, are not the same as hallucinations, which are plausible samples from a generative model; even greedy decoding is still sampling from the conditional distribution.

A better analogy is super-resolution. With a generative model the result will be crisper than a normal blotchy resize, but many details of the image will have changed, because the model fills in its best guesses for the missing information. LLMs aren't meant to reconstruct a source, even though we can sample their distribution for snippets that are reasonable facsimiles of the original data.

What an LLM actually provides is a way to compute the probability of given strings. Paired with entropy coding and online learning on the target data, that gets us to the correct MDL-based lossless compression view of LLMs.
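To make that last point concrete, here's a minimal sketch of the MDL/lossless view, with a toy adaptive character model standing in for the LLM (the class and function names, and the add-one smoothing, are my own assumptions). The arithmetic-coding step is replaced by summing ideal code lengths, -log2 p(x_i | history), which a real entropy coder would achieve to within a couple of bits. Because the decoder runs the same online updates in lockstep, no model parameters need to be transmitted:

    import math
    from collections import defaultdict

    class AdaptiveUnigram:
        """Laplace-smoothed byte model, updated online on the target data."""
        def __init__(self, alphabet_size=256):
            self.counts = defaultdict(int)
            self.total = 0
            self.alphabet_size = alphabet_size

        def prob(self, symbol):
            # Add-one smoothing keeps every symbol's probability nonzero.
            return (self.counts[symbol] + 1) / (self.total + self.alphabet_size)

        def update(self, symbol):
            self.counts[symbol] += 1
            self.total += 1

    def ideal_code_length_bits(data: bytes) -> float:
        """Bits an entropy coder driven by this model would need for `data`."""
        model = AdaptiveUnigram()
        bits = 0.0
        for b in data:
            bits += -math.log2(model.prob(b))  # code length of this symbol
            model.update(b)                    # online learning: adapt after coding it
        return bits

    text = b"the quick brown fox jumps over the lazy dog " * 50
    print(f"raw: {len(text) * 8} bits, model-coded: {ideal_code_length_bits(text):.0f} bits")

Swap the toy model for an LLM's conditional distribution over the next token and the same accounting applies: a better model assigns higher probability to the data, hence a shorter code, which is exactly the MDL framing.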