Scene_Cast2 9 hours ago

Sweet! I love clever information theory things like that.

It goes the other way too. Given that LLMs are just lossless compression machines, I do sometimes wonder how much better they are at compressing plain text compared to zstd or similar. Should be easy to calculate...

EDIT: lossless when they're used as the probability estimator and paired with something like an arithmetic coder.

dchest 9 hours ago | parent | next [-]

https://bellard.org/ts_zip/

hxtk 3 hours ago | parent | prev | next [-]

I've actually been experimenting with that lately. I did a really naive version that tokenizes the input, feeds the max context window up to the token being encoded into an LLM, uses that to produce a distribution over likely next tokens, and then encodes the actual token with Huffman coding built from the LLM's estimated distribution. I could almost certainly get better results with arithmetic coding.
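
Roughly, the scheme looks something like this (a minimal sketch, assuming the HuggingFace transformers library and GPT-2, with the Huffman table restricted to the top-k tokens plus an escape symbol):

    import heapq
    import itertools
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def huffman_code(probs):
        # probs: {symbol: probability} -> {symbol: bitstring}
        tie = itertools.count()
        heap = [(p, next(tie), [s]) for s, p in probs.items()]
        heapq.heapify(heap)
        code = {s: "" for s in probs}
        while len(heap) > 1:
            p1, _, s1 = heapq.heappop(heap)
            p2, _, s2 = heapq.heappop(heap)
            for s in s1:
                code[s] = "0" + code[s]
            for s in s2:
                code[s] = "1" + code[s]
            heapq.heappush(heap, (p1 + p2, next(tie), s1 + s2))
        return code

    def compress(text, context=512, top_k=1024):
        ids = tokenizer.encode(text)
        bits = ""
        for i, tok in enumerate(ids):
            # condition on up to `context` previous tokens
            ctx = ids[max(0, i - context):i] or [tokenizer.eos_token_id]
            with torch.no_grad():
                logits = model(torch.tensor([ctx])).logits[0, -1]
            probs = torch.softmax(logits, dim=-1)
            top = torch.topk(probs, top_k)
            table = {int(t): float(p) for t, p in zip(top.indices, top.values)}
            table["ESC"] = max(1.0 - sum(table.values()), 1e-9)  # escape for everything else
            code = huffman_code(table)
            if tok in table:
                bits += code[tok]
            else:
                bits += code["ESC"] + format(tok, "016b")  # raw 16-bit token id after escape
        return bits

Decompression runs the same model over the already-decoded tokens, so it rebuilds the identical table at each step and can invert the bits losslessly.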

It outperforms zstd by a long shot on natural language like Wikipedia articles or markdown documents (I haven't dedicated the compute horsepower to figuring out what "a long shot" means quantitatively with reasonably small confidence intervals), but (using GPT-2) it's about as good as zstd, or worse, on things like files in the Kubernetes source repository.

You already get a significant amount of compression just from the tokenization in some cases ("The quick red fox jumps over the lazy brown dog." encodes to one token per word plus one token for the '.' with the GPT-2 tokenizer), whereas with code a lot of your tokens represent just a single character, so the entropy coding is doing all the work. That means your compression is only as good as the accuracy of your LLM plus the efficiency of your entropy coding.

I would need to encode multiple tokens per Huffman symbol to hit the entropy bounds, since Huffman coding spends at least one bit per symbol: if tokens are mostly just one byte, the best possible ratio is 1 bit out per 8 bits in, i.e. 12.5%, when encoding one token at a time. And grouping tokens gets computationally infeasible very fast. Arithmetic coding would do much better, especially on code, because it can encode a token with fractional bits.

I used Huffman coding for my first attempt because it's easier to implement, and most arithmetic coding libraries don't support dynamically updating the distribution throughout the process.

nl 4 hours ago | parent | prev | next [-]

> Given that LLMs are just lossless compression machines, I do sometimes wonder how much better they are at compressing plain text compared to zstd or similar. Should be easy to calculate...

The current leaders on the Hutter Prize (http://prize.hutter1.net/) are all LLM-based.

The top entry can (slowly!!) compress a 1 GB dump of Wikipedia to 106 MB.

By comparison, gzip compresses it to 321 MB.

See https://mattmahoney.net/dc/text.html for the current leaderboard

az09mugen 8 hours ago | parent | prev | next [-]

I don't agree with the "lossless" adjective. And even if it is lossless, it is certainly not deterministic.

For example, I would not want a zip of an encyclopedia that decompresses to unverified, approximate, and sometimes even wrong text. According to this site: https://www.wikiwand.com/en/articles/Size%20of%20Wikipedia a compressed Wikipedia without media, just text, is ~24 GB. What's the typical size of an LLM, 10 GB? 50 GB? 100 GB? Even if it's less, it's not an accurate and deterministic way to compress text.

Yeah, pretty easy to calculate...

nagaiaida 7 hours ago | parent | next [-]

(To be clear, this is not me arguing for any particular merits of LLM-based compression.) You appear to have conflated one particular nondeterministic LLM-based compression scheme that you imagined with all possible such schemes. Many of those schemes would easily fit any reasonable definition of lossless and deterministic, because they do deterministic things with the probability distributions the LLM outputs at each step along the input sequence being compressed.

notpushkin 6 hours ago | parent | prev [-]

With a temperature of zero, LLM output will always be the same. Then it becomes a matter of getting it to output the exact replica of the input: if we can do that, it will always produce it, and the fact it can also be used as a bullshit machine becomes irrelevant.

With the usual interface it’s probably inefficient: giving just a prompt alone might not produce the output we need, or it might be larger than the thing we’re trying to compress. However, if we also steer the decisions along the way, we can probably give a small prompt that gets the LLM going, and tweak its decision process to get the tokens we want. We can then store those changes alongside the prompt. (This is a very hand-wavy concept, I know.)

duskwuff 5 hours ago | parent | next [-]

There's an easier and more effective way of doing that - instead of trying to give the model an extrinsic prompt which makes it respond with your text, you use the text as input and, for each token, encode the rank of the actual token within the set of tokens that the model could have produced at that point. (Or an escape code for tokens which were completely unexpected.) If you're feeling really crafty, you can even use arithmetic coding based on the probabilities of each token, so that encoding high-probability tokens uses fewer bits.

From what I understand, this is essentially how ts_zip (linked elsewhere) works.
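
A minimal sketch of that rank-coding idea (assuming the HuggingFace transformers library and GPT-2; the ranks would still need a varint or entropy-coding pass to become actual compressed bytes, and float ties in the sort are ignored):

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def token_ranks(text, context=512):
        ids = tokenizer.encode(text)
        ranks = []
        for i, tok in enumerate(ids):
            ctx = ids[max(0, i - context):i] or [tokenizer.eos_token_id]
            with torch.no_grad():
                logits = model(torch.tensor([ctx])).logits[0, -1]
            order = torch.argsort(logits, descending=True)          # most likely first
            rank = (order == tok).nonzero(as_tuple=True)[0].item()  # 0 = model's top pick
            ranks.append(rank)
        return ranks  # mostly small numbers on text the model predicts well

    def reconstruct(ranks, context=512):
        ids = []
        for rank in ranks:
            ctx = ids[-context:] or [tokenizer.eos_token_id]
            with torch.no_grad():
                logits = model(torch.tensor([ctx])).logits[0, -1]
            order = torch.argsort(logits, descending=True)
            ids.append(int(order[rank]))  # same model, same ranks -> same text
        return tokenizer.decode(ids)

Because decoding replays the same deterministic model, rank 0 just means "the model's top guess was right", so a good model turns most of the text into a stream of small numbers that compresses extremely well.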

D-Machine 2 hours ago | parent | prev [-]

> With a temperature of zero, LLM output will always be the same

Ignoring GPU indeterminism, if you are running a local LLM and control batching, yes.

If you are computing via API / on the cloud, and so being batched with other computations, then no (https://thinkingmachines.ai/blog/defeating-nondeterminism-in...).

But, yes, there is a lot of potential for semantic compression via AI models here, if we just make the effort.

pornel 4 hours ago | parent | prev | next [-]

Compression can be generalized as probability modelling (prediction) + entropy coding of the difference between the prediction and the data. The entropy coding has known optimal solutions.

So yes, LLMs are nearly ideal text compressors, except for all the practical inconveniences of their size and speed (they can be reliably deterministic if you sacrifice parallel execution and some optimizations).
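
For the "known optimal" part: an ideal entropy coder spends -log2 p(token) bits per token, so summing that over the text gives the compressed size an arithmetic coder would approach. A minimal sketch, assuming the HuggingFace transformers library and GPT-2:

    import math
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def ideal_compressed_bits(text, context=512):
        ids = tokenizer.encode(text)
        total_bits = 0.0
        for i, tok in enumerate(ids):
            ctx = ids[max(0, i - context):i] or [tokenizer.eos_token_id]
            with torch.no_grad():
                logits = model(torch.tensor([ctx])).logits[0, -1]
            p = max(torch.softmax(logits, dim=-1)[tok].item(), 1e-12)
            total_bits += -math.log2(p)  # Shannon code length for this token
        return total_bits                # compare against 8 * len(text.encode())

In other words, the model's cross-entropy on the text *is* the compressed size, up to the small overhead of the entropy coder.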

fwip 9 hours ago | parent | prev [-]

Aren't LLMs lossy? You could make them lossless by also encoding a diff of the predicted output vs the actual text.

Edit to soften a claim I didn't mean to make.

thomasmg 8 hours ago | parent | next [-]

LLMs are good at predicting the next token. Basically, you use them to predict the probabilities of the next token being a, b, or c, and then use arithmetic coding to store which one actually occurred. The LLM is used during both compression and decompression.
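
A toy illustration of just the arithmetic-coding half (the model is replaced here by a fixed distribution, and exact fractions stand in for the finite-precision integer arithmetic a real coder uses):

    from fractions import Fraction

    def arithmetic_encode(symbols, probs_for):
        # probs_for(prefix) -> {symbol: Fraction probability}, e.g. from an LLM
        low, high = Fraction(0), Fraction(1)
        for i, sym in enumerate(symbols):
            dist = probs_for(symbols[:i])
            cum = Fraction(0)
            for s, p in dist.items():
                if s == sym:
                    # narrow the interval to this symbol's probability slice
                    low, high = low + cum * (high - low), low + (cum + p) * (high - low)
                    break
                cum += p
        return low, high  # any binary fraction in [low, high) identifies the message

    # Fixed distribution standing in for the LLM's per-step predictions:
    dist = {"a": Fraction(3, 4), "b": Fraction(1, 4)}
    low, high = arithmetic_encode("aab", lambda prefix: dist)

The final interval has width equal to the product of the symbol probabilities, so writing it down takes about -log2(high - low) bits, which is how high-probability tokens end up costing only a fraction of a bit each.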

D-Machine 2 hours ago | parent | prev [-]

Yes, LLMs are always lossy, unless their size/capacity is so huge they can memorize all their inputs. Even if LLMs were not resource-constrained, one would expect lossy compression due to batching and the math of the loss function: training makes it always better for the model to accurately approximate the majority of texts than to approximate any single text with maximum accuracy.