GuB-42 3 days ago

There are a lot of parallels between AI and compression.

In fact, the best compression algorithms and LLMs have in common that they work by predicting the next word. Compression algorithms then take an extra step called entropy coding to efficiently encode the difference between the prediction and the actual data, and the better the prediction, the better the compression ratio.
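Concretely (a toy illustration of mine, not anything from a real codec): with entropy coding, a symbol the model predicted with probability p costs about -log2(p) bits, so the compressed size is essentially the model's accumulated surprise over the data.

    import math

    def ideal_code_length_bits(predicted_probs):
        # predicted_probs: the probability the model assigned to each symbol
        # that actually occurred. An entropy coder (arithmetic coding, etc.)
        # can get within about one bit of this total for the whole stream.
        return sum(-math.log2(p) for p in predicted_probs)

    # A confident predictor vs. a flat one, over the same four symbols:
    print(ideal_code_length_bits([0.9, 0.8, 0.95, 0.7]))     # ~1.1 bits
    print(ideal_code_length_bits([0.25, 0.25, 0.25, 0.25]))  # 8.0 bits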

What makes an LLM "lossy" is that you don't have the "encode the difference" step.

And yes, it means you can turn an LLM into a (lossless) compression algorithm, and I think a really good one in terms of compression ratio on huge data sets. You can also turn a compression algorithm like gzip into a language model! A terrible one, but the output is better than a random stream of bytes.
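For the gzip direction, here is a rough sketch of the trick (my own toy code; zlib stands in for gzip's DEFLATE, and the scoring scheme is just one plausible choice): score each candidate next byte by how much it grows the compressed size of the context, and treat smaller growth as higher probability.

    import zlib

    def next_byte_distribution(context: bytes, candidates=range(32, 127)):
        # DEFLATE (the algorithm behind gzip) as a crude predictive model:
        # a candidate byte that makes the compressed output grow less is
        # treated as more likely to come next.
        base = len(zlib.compress(context, 9))
        cost = {b: len(zlib.compress(context + bytes([b]), 9)) - base
                for b in candidates}
        weight = {b: 2.0 ** (-8 * c) for b, c in cost.items()}
        total = sum(weight.values())
        return {b: w / total for b, w in weight.items()}

    dist = next_byte_distribution(b"the quick brown fox jumps over the lazy ")
    top = sorted(dist, key=dist.get, reverse=True)[:5]
    print([chr(b) for b in top])  # the bytes this "model" likes best next

As you'd expect, it's a terrible language model, but it beats drawing bytes uniformly at random.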

jparishy 3 days ago | parent | next [-]

I suspect this ends up being pretty important for the next advancements in AI, specifically LLM-based AI. To me, the transformer architecture is a sort of compression algorithm that is being exploited for emergent behavior at the margins. But I think this is more like stream of consciousness than premeditated thought. Eventually I think we figure out a way to "think" in latent space and have our existing AI models be just the mouthpiece.

In my experience as a human, the more you know about a subject, or even the more you have simply seen content about it, the easier it is to ramble on about it convincingly. It's like a mirroring skill, and it does not actually mean you understand what you're saying.

LLMs seem to do the same thing, I think. At scale this is widely useful, though; I'm not discounting it. I just think it's an order of magnitude below what's possible, and all this talk of existing stream-of-consciousness-like LLMs creating AGI seems like a miss.

layer8 3 days ago | parent | prev | next [-]

One difference is that compression gives you one and only one thing when decompressing. Decompression isn't a function taking arbitrary additional input and producing potentially arbitrary, nondeterministic output based on it.

We would have very different conversations if LLMs were things that merely exploded into a singular lossy-expanded version of Wikipedia, but where looking at the article for any topic X would give you the exact same article each time.

withinboredom 3 days ago | parent [-]

LLMs deliberately insert randomness. If you run a model locally (or sometimes via API), you can turn that off and get the same response for the same input every time.
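For example, with a local model through Hugging Face transformers (just one common setup, shown only as an illustration), greedy decoding removes the sampling step entirely:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "gpt2"  # any small local model; gpt2 is just an example
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    ids = tok("The capital of France is", return_tensors="pt").input_ids
    # do_sample=False means greedy decoding: always take the most likely
    # next token, so the same prompt produces the same output every run.
    out = model.generate(ids, max_new_tokens=20, do_sample=False)
    print(tok.decode(out[0]))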

layer8 3 days ago | parent [-]

True, but I'd argue that you can't get the definite knowledge of an LLM by turning off randomness, or fixing the seed. Otherwise that would be a routinely employed feature, to determine what an LLM "truly knows", removing any random noise distorting that knowledge, and instead randomness would only be turned on for tasks requiring creativity, not when merely asking factual questions. But it doesn’t work that way. Different seeds will uncover different "knowledge", and it's not the case that one is a truer representation of an LLM's knowledge than another.

Furthermore, even in the absence of randomness, asking an LLM the same question in different ways can yield different, potentially contradictory answers, even when the difference in prompting is perfectly benign.

withinboredom 3 days ago | parent [-]

This is because the knowledge is encoded in a multi-dimensional space, and a seed doesn’t change the knowledge, only the expression of it. If you ask me what E=mc^2 means, I’ll give you different answers depending on whether I think you are a curious lay-person vs. a physicist testing my response.

You see this with humans, who encode physical space into a kind of spatial map in the brain. When giving directions, people have to traverse that map until the route is memorized; after that the map isn't used any longer, only the rote data is referenced.

arjvik 3 days ago | parent | prev [-]

With a handy trick called arithmetic coding, you can actually turn an LLM into a lossless compression algorithm!
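A minimal sketch of what that looks like (my own toy version: exact rationals instead of a real streaming arithmetic coder, and a hard-coded stand-in for the LLM): the encoder narrows an interval according to the probability the model gave each actual token, and the decoder, running the same model, recovers the tokens exactly.

    from fractions import Fraction

    VOCAB = ["the", "cat", "sat", "."]

    def toy_model(prefix):
        # Stand-in for an LLM: a deterministic next-token distribution given
        # the prefix. Encoder and decoder must run the exact same model.
        if prefix and prefix[-1] == "cat":
            return {"sat": Fraction(7, 10), "the": Fraction(1, 10),
                    "cat": Fraction(1, 10), ".": Fraction(1, 10)}
        return {t: Fraction(1, len(VOCAB)) for t in VOCAB}

    def encode(tokens):
        low, high = Fraction(0), Fraction(1)
        for i, tok in enumerate(tokens):
            width, cum = high - low, Fraction(0)
            for t, p in toy_model(tokens[:i]).items():
                if t == tok:
                    low, high = low + width * cum, low + width * (cum + p)
                    break
                cum += p
        return (low + high) / 2  # any number inside the final interval

    def decode(code, n):
        out, low, high = [], Fraction(0), Fraction(1)
        for _ in range(n):
            width, cum = high - low, Fraction(0)
            for t, p in toy_model(out).items():
                lo, hi = low + width * cum, low + width * (cum + p)
                if lo <= code < hi:
                    out.append(t)
                    low, high = lo, hi
                    break
                cum += p
        return out

    msg = ["the", "cat", "sat", "."]
    assert decode(encode(msg), len(msg)) == msg

The better the model predicts the real tokens, the less each step shrinks the interval, and the fewer bits it takes to write the final number down.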

vbarrielle 3 days ago | parent [-]

Indeed, see https://bellard.org/nncp/ for an example.