Remix.run Logo
pizza 15 hours ago

Yes, you can think of it in terms of (WLOG think of any uniquely-decodable code) prefix-free codes. They're uniquely decodable - for things that are not uniquely decodable, that implies that you could put overlapping codes over that symbol. If you make a matrix like this where the rows are the bitstrings of length b and columns are individual bits:

  000 ... 000
  000 ... 001
  ...
  111 ... 110
  111 ... 111
then you have 2^b rows. Suppose you look at the sub-bitstrings of length k, k < b. They all appear the same number of times, if you count them wherever they appear at any position in across the entire matrix.

However, you also know, for sure, that, if a prefix-free code appears in a particular row, that means since it's impossible to overlap with anything else in that row at its span. What does that imply? That the prefix-free codes have a greater 'occupancy percentage' of a single row than all other sub-bitstrings. That means that you must find fewer of them, on average, inside of a single row.

But since we know that all sub-bitstrings appear the same number of times throughout the entire matrix, what else can we deduce? That the prefix-free codes must appear /over more rows / on average, if they cannot appear as many times while looking at bit positions /along the columns/. That means they will occur as a sub-pattern in full-bitstrings more often than typical random sub-patterns.

So weakness here corresponds to the presence of patterns (prefix-free codes) that are:

- non-overlapping within bitstrings

- widely distributed across bitstrings

- due to their wide distribution, there's a higher chance of encountering these patterns in any given random file

- therefore, weak files are more compressible because they contain widely-distributed, non-overlapping patterns that compression algorithms can take advantage of