lambdaone 2 days ago

This is actually an extremely interesting question. 'Weak' files that are more easily compressible than others certainly exist, but with low probability.

For example, the all-zeros file is a member of the set of all random 3 megabyte files, and it would certainly be possible to compress it if you were lucky enough to receive it, although that is unlikely to ever happen within the possible lifetime of the universe, given current physical theories.
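To put a rough number on "unlikely", here is a back-of-the-envelope sketch. It assumes "3 megabyte" means 3 × 10^6 bytes and that every bit is an independent fair coin flip; both are assumptions for illustration only.

```python
import math

# Assumption: 3 megabytes = 3 * 10**6 bytes, each bit uniformly random.
n_bits = 3 * 10**6 * 8

# P(all zeros) = 2 ** -n_bits, which underflows a float,
# so work with the base-10 logarithm instead.
log10_p = -n_bits * math.log10(2)

print(f"P(all-zeros file) is about 10^{log10_p:.0f}")
```

The exponent comes out in the negative millions, which is why waiting for this file is not a practical strategy.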

Is it possible to quantify this idea of a 'weak' file more accurately?

ccleve 2 days ago | parent | next [-]

I know very little about this, but a little googling suggests that the measure you're looking for is entropy, which has a mathematical definition: https://en.wikipedia.org/wiki/Entropy_(information_theory)
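As a minimal sketch of that measure, the snippet below computes the empirical Shannon entropy of a byte string, in bits per byte. The function name `byte_entropy` is made up for this example, and treating bytes as independent symbols is a simplifying assumption.

```python
import math
import os
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Empirical Shannon entropy in bits per byte (between 0 and 8)."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    # H = sum over symbols of p * log2(1 / p), with p = count / n
    return sum((c / n) * math.log2(n / c) for c in counts.values())


print(byte_entropy(bytes(1_000_000)))       # all-zeros file
print(byte_entropy(os.urandom(1_000_000)))  # random file, close to 8
```

This only looks at byte frequencies, so it says nothing about ordering: a file that cycles 0, 1, ..., 255 over and over also scores 8 while being trivially compressible.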

pizza 16 hours ago | parent | prev | next [-]

Yes, you can think of it in terms of prefix-free codes (WLOG; the same reasoning applies to any uniquely decodable code). They're uniquely decodable; for codes that are not uniquely decodable, you could lay overlapping codewords over the same symbols. If you make a matrix like this, where the rows are the bitstrings of length b and the columns are individual bits:

  000 ... 000
  000 ... 001
  ...
  111 ... 110
  111 ... 111
then you have 2^b rows. Suppose you look at the sub-bitstrings of length k, k < b. They all appear the same number of times, if you count them wherever they appear, at any position, across the entire matrix.
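The equal-count claim is easy to check by brute force for small b. The sketch below (the function name `substring_counts` is just for this example) enumerates all 2^b rows and counts every length-k window at every position.

```python
from collections import Counter
from itertools import product


def substring_counts(b: int, k: int) -> Counter:
    """Count each k-bit pattern over every position of every b-bit row."""
    counts = Counter()
    for bits in product("01", repeat=b):
        row = "".join(bits)
        for i in range(b - k + 1):
            counts[row[i:i + k]] += 1
    return counts


counts = substring_counts(b=8, k=3)
# Each pattern fills a given window in 2**(b-k) rows, and there are
# b-k+1 windows, so every pattern should occur (b-k+1) * 2**(b-k) times.
print(len(counts), set(counts.values()))
```

With b = 8 and k = 3 there are 8 patterns, and each occurs (8 - 3 + 1) × 2^5 = 192 times.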

However, you also know for sure that if a prefix-free code appears in a particular row, nothing else can overlap with it across its span in that row. What does that imply? That the prefix-free codes have a greater 'occupancy percentage' of a single row than all other sub-bitstrings. That means that you must find fewer of them, on average, inside a single row.

But since we know that all sub-bitstrings appear the same number of times throughout the entire matrix, what else can we deduce? That the prefix-free codes must appear /over more rows/ on average, if they cannot appear as many times when scanning bit positions /along the columns/. That means they will occur as a sub-pattern in full bitstrings more often than typical random sub-patterns.

So weakness here corresponds to the presence of patterns (prefix-free codes) that are:

- non-overlapping within bitstrings

- widely distributed across bitstrings

- due to their wide distribution, there's a higher chance of encountering these patterns in any given random file

- therefore, weak files are more compressible because they contain widely-distributed, non-overlapping patterns that compression algorithms can take advantage of
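One way to test the "spread over more rows" claim is to count, for each pattern, how many rows contain it at least once. The sketch below assumes one particular reading of the argument: that the relevant property is whether a pattern can overlap a shifted copy of itself (as 000 can and 001 cannot). That reading, and the helper name `rows_containing`, are assumptions made for this example.

```python
from itertools import product


def rows_containing(b: int, k: int) -> dict:
    """For each k-bit pattern, count the b-bit rows that contain it."""
    rows = ["".join(bits) for bits in product("01", repeat=b)]
    patterns = ["".join(bits) for bits in product("01", repeat=k)]
    return {p: sum(p in row for row in rows) for p in patterns}


per_pattern = rows_containing(b=8, k=3)
for pattern, n_rows in sorted(per_pattern.items(), key=lambda kv: kv[1]):
    print(pattern, n_rows)
```

For b = 8, the self-overlapping pattern 000 shows up in fewer rows than 001, even though both have the same total number of occurrences; the occurrences of 000 cluster inside the same rows.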

l33t7332273 2 days ago | parent | prev [-]

One thing you can do, as the other commenter pointed out, is consider entropy of the file.

However, requiring low entropy for the whole file is too strong a restriction for the purposes of this challenge. We don’t actually need a file with low entropy; in fact, I claim that weak files exist with entropy 8 - epsilon (8 bits per byte being the maximum) for every epsilon > 0.

What we actually need is a sufficiently large chunk in a file to have low entropy. The largeness is in absolute terms, not relative terms.

A very simple example would be taking a very large file with maximum entropy and appending 200 0’s to the end. This would not decrease the entropy of the file much, but it would let a compression algorithm save ~100 bytes.
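Here is a toy sketch of that construction. It assumes the 200 0's are zero bytes, and it uses a made-up codec (`encode` / `decode`) that stores the number of trailing zero bytes in a 4-byte header; none of this comes from the challenge itself.

```python
import math
import os
import struct
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Empirical Shannon entropy in bits per byte."""
    n = len(data)
    return sum((c / n) * math.log2(n / c) for c in Counter(data).values())


def encode(data: bytes) -> bytes:
    """Strip trailing zero bytes and record how many there were."""
    stripped = data.rstrip(b"\x00")
    n_zeros = len(data) - len(stripped)
    return struct.pack(">I", n_zeros) + stripped  # 4-byte big-endian count


def decode(blob: bytes) -> bytes:
    """Inverse of encode: re-append the recorded zero bytes."""
    (n_zeros,) = struct.unpack(">I", blob[:4])
    return blob[4:] + bytes(n_zeros)


payload = os.urandom(1_000_000) + bytes(200)
packed = encode(payload)

assert decode(packed) == payload
print(f"entropy: {byte_entropy(payload):.4f} bits/byte")
print(f"bytes saved: {len(payload) - len(packed)}")
```

The entropy stays very close to 8 while the codec saves at least 196 bytes on this input. The catch is that the same codec makes any file without trailing zeros 4 bytes longer.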

kevinventullo a day ago | parent [-]

Note that if this large chunk occurs in the middle of the file, then you will need extra space to encode that position. For example, a random bit string of length 2^n is decently likely to have a run of n zeroes. But this doesn’t help you because you need n bits just to encode where that run happens.

l33t7332273 19 hours ago | parent [-]

But storing an index for a file of length 2^n takes only n bits, so you need that run of 0’s to be of length n+1 to win
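A quick simulation of this trade-off, as a sketch: it draws a pseudo-random bit string of length 2^n, finds its longest run of zeroes, and compares the run length against the n bits needed to store the run's position. The helper name `longest_zero_run`, the choice n = 16, and the assumption that the run length itself costs nothing to encode are all illustrative.

```python
import random


def longest_zero_run(bits: str) -> tuple[int, int]:
    """Return (start_index, length) of the first longest run of '0'."""
    best_start = best_len = 0
    cur_len = 0
    for i, bit in enumerate(bits):
        if bit == "0":
            cur_len += 1
            if cur_len > best_len:
                best_len = cur_len
                best_start = i - cur_len + 1
        else:
            cur_len = 0
    return best_start, best_len


n = 16
rng = random.Random(0)  # fixed seed so the run is repeatable
bits = "".join(rng.choice("01") for _ in range(2 ** n))

start, run = longest_zero_run(bits)
print(f"longest zero run: {run} bits, starting at index {start}")
print(f"index cost: {n} bits, net change: {run - n} bits")
```

A positive net change means the run is long enough to pay for its own index; a real scheme would also have to spend bits on the run length and on signalling that the trick was used at all.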