apwell23 9 hours ago

> This compression is lossy

Is compression really lossy? What is an example of lost knowledge?

sqeaky 8 hours ago | parent | next [-]

Think about all the times an LLM gets it wrong: the fact that would have helped it get it right is something that was lost. I suppose this isn't proof that it's lossy; maybe we just don't know how to get the data out.

Or look at it another way: LLMs are just text prediction machines. Whatever information doesn't help them predict the next token, or conflicts with the likelihood of the next token, gets dropped.

Or look at it yet another way: these things are often trained on many terabytes of the internet, yet even a 200-billion-parameter network is only 100 or 200 GB in size. So something is missing, and that is a far better compression ratio than the best known lossless compression algorithms achieve.

Or, finally: these things were never built to be lossless compression systems. We can tell from how they are implemented that they don't retain everything they're trained on; they extract a bunch of statistics.
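The compression-ratio point above can be put in rough numbers. Every figure here is an assumption for illustration (training-set size, bytes per parameter), not a measurement of any particular model:

```python
# Back-of-envelope compression ratio, with assumed round numbers:
# ~15 TB of training text and a 200B-parameter model stored at
# roughly 1 byte per parameter (e.g. int8 quantization).
train_bytes = 15e12   # assumed training-set size
model_bytes = 200e9   # 200B params x 1 byte/param (assumption)

ratio = train_bytes / model_bytes
print(f"{ratio:.0f}:1")  # 75:1 -- general-purpose lossless text compressors typically manage ~3-5:1
```

Even under these generous assumptions, the ratio is an order of magnitude beyond what lossless compressors achieve on text, which is the intuition behind "something must be missing."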

visarga 7 hours ago | parent | next [-]

I think extraction from the model itself is a bad idea. But extraction from external sources, such as the deep-research reports LLMs generate, or from solving problems where we can validate correctness, is a good idea. The model should not validate its outputs by simply doing another inference; it should consult external sources or get feedback from code execution. Humans in chat rooms could also provide lots of learning signal, especially when actions are judged, with hindsight, against the outcomes they cause down the line.

So, in short, what works is a model plus a way to tell its good outputs from its bad ones.

shawntan 4 hours ago | parent | prev | next [-]

Not sure if you mean in general, but I'll answer both branches of the question.

In general: depending on the method of compression, you can have lossy or lossless compression. Using 7zip on a bunch of text files compresses that data losslessly. Briefly, you calculate the statistics of the data you want to compress (the dictionary), then make the commonly recurring chunks describable with fewer bits (the encoding). The compressed file basically contains the dictionary and the encoding.
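A minimal sketch of that lossless round-trip, using Python's built-in zlib (DEFLATE) as a stand-in for 7zip; both exploit recurring chunks in the data:

```python
import zlib

# Lossless round-trip: the decompressed bytes are identical to the
# original, bit for bit.
original = b"the quick brown fox jumps over the lazy dog " * 100
compressed = zlib.compress(original)

assert len(compressed) < len(original)          # redundancy squeezed out
assert zlib.decompress(compressed) == original  # nothing lost
print(len(original), "->", len(compressed))
```

The highly repetitive input makes the size drop dramatic, but the key property is the second assert: the original is always recoverable exactly.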

For LLMs: there are ways to use an LLM (or any statistical model of text) to compress text data. The techniques use a setup similar to the above, with a dictionary and an encoding, where the LLM takes the role of the dictionary. When "extracting" data from the dictionary alone, you're basically sampling from the dictionary's distribution.

Quantitatively, the "loss" in "lossy" being described is literally the number of bits used for the encoding.
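A tiny sketch of the "model as dictionary" idea: under a probabilistic model, an ideal entropy coder spends about -log2 p(token) bits per token, so a model that predicts the text well encodes it in fewer bits. The toy distribution below is made up for illustration:

```python
import math

def code_length_bits(tokens, model):
    # Ideal code length under the model: sum of -log2 p(token).
    # A better model assigns higher probability -> fewer bits.
    return sum(-math.log2(model[t]) for t in tokens)

toy_model = {"the": 0.5, "cat": 0.25, "sat": 0.25}
print(code_length_bits(["the", "cat", "sat"], toy_model))  # 1 + 2 + 2 = 5.0 bits
```

Real LLM-based compression (e.g. via arithmetic coding) approaches this bound; the model itself is not the compressed text, just the statistics used to encode it.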

I wrote a brief description here of techniques from an undergrad CS course that can be used: https://blog.wtf.sg/posts/2023-06-05-yes-its-just-doing-comp...

esafak 7 hours ago | parent | prev | next [-]

https://en.wikipedia.org/wiki/Lossy_compression

fxj 5 hours ago | parent | prev | next [-]

It is at least as lossy as JPEG compression: details get lost and artifacts are generated.

kiicia 8 hours ago | parent | prev [-]

It's exactly the same as JPEG images being lossy: while you can see the image as a whole (and that is enough for 99% of people), you are obviously missing some details.

And the more you rely on those details (professional photography, scientific data), the more obvious it is, to the point of the image being useless in some cases.

Same with LLMs: we are currently testing how far we can go before we see obvious issues.
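The JPEG analogy can be made concrete with a toy quantizer. Real JPEG applies a DCT before quantizing; this sketch keeps only the quantization step, which is where the loss happens:

```python
# Quantize 8-bit pixel values (0-255) down to 16 coarse levels and back.
# Nearby values collapse onto the same level, so fine detail is gone.
def lossy_roundtrip(pixels, step=16):
    return [(p // step) * step for p in pixels]

pixels = [7, 120, 121, 255]
print(lossy_roundtrip(pixels))  # [0, 112, 112, 240] -- 120 vs 121 is indistinguishable
```

For most viewers the coarse values are "close enough", but anyone who needed the 120-vs-121 distinction has lost it permanently, which is the professional-photography point above.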

apwell23 8 hours ago | parent [-]

What's an example of loss?

aewens 5 hours ago | parent | next [-]

Lossy versus lossless compression comes down to whether you can get a 1:1 copy of the original data after compressing and then decompressing it.

A simple example: suppose you have 4 bits of data and a compression algorithm that turns them into 2 bits. If your dataset only ever contains 0000, 0011, 1100, and 1111, then this can be considered lossless compression, because we can always reconstruct the exact original data (0011 compresses to 01 and decompresses back to 0011, 1100 compresses to 10 and decompresses back to 1100, etc.). However, if our dataset later included 1101 and it got compressed to 10, the scheme is now "lossy", because 10 decompresses to 1100; that last bit was "lost".
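The 4-bit scheme above can be sketched directly. The nearest-codeword fallback for inputs outside the codebook is an assumption added here to make the lossy case concrete:

```python
# Four codewords -> four 2-bit codes, exactly as in the example above.
ENCODE = {"0000": "00", "0011": "01", "1100": "10", "1111": "11"}
DECODE = {code: bits for bits, code in ENCODE.items()}

def compress(bits):
    # Inputs outside the codebook map to the nearest codeword by
    # Hamming distance (assumed tie-break: first match wins).
    nearest = min(ENCODE, key=lambda c: sum(a != b for a, b in zip(c, bits)))
    return ENCODE[nearest]

print(DECODE[compress("1100")])  # '1100' -- in the codebook, recovered exactly
print(DECODE[compress("1101")])  # '1100' -- the last bit was lost
```

With only 2 bits of code there are only four possible outputs, so any fifth input value must collide with one of them; that collision is the loss.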

An LLM is lossy compression because it lacks the capacity to 1:1 replicate all its input data 100% of the time. It can get quite close in some cases, sure, but it is not perfect every time. So it is considered “lossy”.

fxj 5 hours ago | parent | prev [-]

How well can you recreate an image described only in words? Obviously not bit by bit and pixel by pixel. You get something that resembles the original, but not an exact copy.

apwell23 4 hours ago | parent [-]

You can recreate the original exactly with the right prompt.

sfink 2 hours ago | parent [-]

Yes. For example, you could always say "give me a jpeg image file that is encoded as the bytes 255, 216, 255, 224, 0, 16, 74, ...". But that's just pointing out that the input to your "LLM" function includes the prompt. It's f(model, prompt) = response.

It's not straightforward to prove that models have to be lossy. Sure, the training data is much larger than the model, but there is a huge amount of redundancy in the training data. You have to compare a hypothetically optimal compression of the training data to the size of the model to prove that it must be lossy. And yet, it's intuitively obvious that even the best lossless compression (measured in Kolmogorov complexity) of the training data is going to be vastly larger than the biggest models we have today.

You can always construct toy examples where this isn't the case. For example, you could just store all of the training data in your model, and train another part of the model to read it out. But that's not an LLM anymore. Similarly, you could make an LLM out of synthetic redundant data and it could achieve perfect recall. (Unless you're clever with how you generate it, though, any off-the-shelf compression algorithm is likely to produce something much, much smaller.)