apwell23 4 hours ago

You can recreate the original exactly with the right prompt.

sfink 2 hours ago

Yes. For example, you could always say "give me a jpeg image file that is encoded as the bytes 255, 216, 255, 224, 0, 16, 74, ...". But that's just pointing out that the input to your "LLM" function includes the prompt. It's f(model, prompt) = response.
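(Incidentally, those bytes really are the standard JPEG/JFIF header: FF D8 is the start-of-image marker, FF E0 opens the APP0 segment, 00 10 is its length, and 74 is the "J" of "JFIF". A quick Python sanity check, just for illustration:)

```python
# The byte sequence from the comment above, verified against the
# well-known JPEG/JFIF file header layout.
header = bytes([255, 216, 255, 224, 0, 16, 74])

assert header[:2] == b"\xff\xd8"   # SOI: start-of-image marker
assert header[2:4] == b"\xff\xe0"  # APP0 marker (JFIF segment)
assert header[4:6] == b"\x00\x10"  # APP0 segment length = 16
assert header[6:7] == b"J"         # first byte of the "JFIF" identifier
print(header.hex())
```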

It's not straightforward to prove that models have to be lossy. Sure, the training data is much larger than the model, but there is a huge amount of redundancy in the training data. You have to compare a hypothetically optimal compression of the training data to the size of the model to prove that it must be lossy. And yet, it's intuitively obvious that even the best lossless compression (measured in Kolmogorov complexity) of the training data is going to be vastly larger than the biggest models we have today.
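(To make the redundancy point concrete, here's a toy sketch. The repeated sentence is a hypothetical stand-in for redundant training data, not anything like a real corpus, but it shows why raw data size alone proves nothing: an off-the-shelf compressor collapses redundancy by orders of magnitude.)

```python
import zlib

# Stand-in for highly redundant "training data": one sentence
# repeated 10,000 times (~450 KB raw).
data = b"The quick brown fox jumps over the lazy dog. " * 10_000

# An off-the-shelf lossless compressor squeezes it dramatically,
# so "data is bigger than the model" doesn't by itself prove the
# model must be lossy.
compressed = zlib.compress(data, level=9)
ratio = len(data) / len(compressed)
print(f"raw={len(data)} compressed={len(compressed)} ratio={ratio:.0f}x")
```

The real question is how the model's size compares to the *compressed* (ideally Kolmogorov-optimal) size of the data, which is exactly the comparison made above.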

You can always construct toy examples where this isn't the case. For example, you could just store all of the training data in your model and train another part of the model to read it out. But that's not an LLM anymore. Similarly, you could train an LLM on synthetic redundant data and it could achieve perfect recall. (Though unless you're clever about how you generate that data, any off-the-shelf compression algorithm is likely to produce something much, much smaller than the model.)