shawntan 4 hours ago
Not sure if you mean in general, but I'll answer both branches of the question.

In general: depending on the method, compression can be lossy or lossless. Running 7zip on a bunch of text files compresses that data losslessly. Briefly, you calculate the statistics of the data you want to compress (the dictionary), then describe the commonly recurring chunks with fewer bits (the encoding). The compressed file basically contains the dictionary and the encoding.

For LLMs: there are ways to use an LLM (or any statistical model of text) to compress text data, but the techniques use the same setup as above, with a dictionary and an encoding, the LLM taking the role of the dictionary. When "extracting" data from the dictionary alone, you're basically sampling from the dictionary's distribution. Quantitatively, the "loss" in "lossy" being described is literally the number of bits used for the encoding.

I wrote a brief description of techniques from an undergrad CS course that can be used for this: https://blog.wtf.sg/posts/2023-06-05-yes-its-just-doing-comp...
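To make the dictionary/encoding split concrete, here's a minimal sketch using Huffman coding (a standard undergrad technique; I'm using it instead of an LLM-based coder just to keep the example self-contained and runnable). The Counter plays the role of the dictionary, and the bit string is the encoding:

    import heapq
    from collections import Counter

    def build_code_table(text):
        # The "dictionary": symbol statistics for the data we want to compress.
        freqs = Counter(text)
        # Heap entries are (frequency, unique tie-breaker, node); a node is
        # either a symbol (leaf) or a (left, right) pair (internal node).
        heap = [(f, i, sym) for i, (sym, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        if len(heap) == 1:
            # Degenerate case: only one distinct symbol.
            return {heap[0][2]: "0"}
        counter = len(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, counter, (left, right)))
            counter += 1
        # Walk the tree: frequent symbols end up with shorter bit strings.
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:
                codes[node] = prefix
        walk(heap[0][2], "")
        return codes

    def encode(text, codes):
        # The "encoding": the original text rewritten as a bit string.
        return "".join(codes[c] for c in text)

    text = "the quick brown fox jumps over the lazy dog " * 10
    codes = build_code_table(text)
    bits = encode(text, codes)
    print(len(text) * 8, "bits raw vs", len(bits), "bits encoded")

Note the compressed file has to carry both parts: the code table (dictionary) and the bit string (encoding), since the decoder needs the table to get the text back. The LLM setup described above is, roughly, this with the frequency table swapped for the model's next-token probabilities and the Huffman step replaced by something like arithmetic coding; sampling from the model alone is like keeping the dictionary but throwing away the bit string.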