Panzer04 17 hours ago

I would have expected it to be possible to compress a single arbitrary random file with a small program. I would have thought an RNG could generate a file with some weakness that allows you to compress it, although said compressor would be worse at other inputs.

Likewise this challenge would have been stronger if the requirement was to compress two provided arbitrary files :P

codazoda 5 hours ago

I expected this too, so I asked an LLM to write a quick Go program that would count how many byte sequences repeated. I created a 1.44MB file from /dev/random and then ran it through the program. Here are the patterns that repeat most often.

  Top 10 repeated byte patterns and their counts:
  Bytes: 7eda16, Count: 3
  Bytes: 65b1a4, Count: 3
  Bytes: 71d745, Count: 3
  Bytes: b72808, Count: 2
  Bytes: 60e3ee, Count: 2
  Bytes: 6e9152, Count: 2
  Bytes: 26446b, Count: 2
  Bytes: e4a05a, Count: 2
  Bytes: 67f86a, Count: 2
  Bytes: 92c487, Count: 2
Since the most common three-byte sequence appears only 3 times, it seems like a non-starter. Nor did any longer byte sequence appear more than twice.

I generated the random data with this:

  dd if=/dev/random of=random.bin bs=1024 count=1440
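A minimal Go sketch of that kind of counter, assuming fixed three-byte windows (not the exact program the LLM wrote, just the same idea):

  package main

  import (
      "fmt"
      "os"
      "sort"
  )

  func main() {
      data, err := os.ReadFile("random.bin")
      if err != nil {
          panic(err)
      }

      // Count every overlapping three-byte window in the file.
      counts := make(map[[3]byte]int)
      for i := 0; i+3 <= len(data); i++ {
          var key [3]byte
          copy(key[:], data[i:i+3])
          counts[key]++
      }

      // Collect the patterns and sort by count, descending.
      type entry struct {
          key   [3]byte
          count int
      }
      entries := make([]entry, 0, len(counts))
      for k, c := range counts {
          entries = append(entries, entry{k, c})
      }
      sort.Slice(entries, func(a, b int) bool {
          return entries[a].count > entries[b].count
      })

      // Print the top ten (assumes at least ten distinct patterns).
      fmt.Println("Top 10 repeated byte patterns and their counts:")
      for _, e := range entries[:10] {
          fmt.Printf("Bytes: %x, Count: %d\n", e.key[:], e.count)
      }
  }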
bo1024 14 hours ago

Absolutely, but here's the catch: you're designing the compression and decompression programs around that particular file, so the decompression program is really part of the data used to reconstruct the original file. (For example, you could replace a certain 1MB stretch of random data with a short magic word, but the decompressor would have to contain that 1MB so it knows what to insert when it sees the magic word.) That's why the original challenge measured the length of the compressed data plus the decompressor program.
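
To make the accounting concrete, here's a toy Go sketch of that magic-word scheme (the file names and marker are made up): the bytes the compressor strips out have to be carried inside the decompressor, so the compressed file plus the program never ends up smaller than what you started with.

  package main

  import (
      "bytes"
      "os"
  )

  // The stretch of data the compressor removed. It has to live inside
  // the decompressor, so whatever the compressed file saves, the
  // program itself gains. (Stand-in for the 1MB stretch above.)
  var embedded = []byte{ /* ...the removed random bytes... */ }

  // The short marker the compressor left in place of that stretch.
  var magic = []byte("MAGICWORD")

  func main() {
      compressed, err := os.ReadFile("compressed.bin")
      if err != nil {
          panic(err)
      }
      // Splice the embedded bytes back in where the marker sits.
      restored := bytes.Replace(compressed, magic, embedded, 1)
      if err := os.WriteFile("original.bin", restored, 0o644); err != nil {
          panic(err)
      }
  }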

brody_hamer 16 hours ago

I thought the same thing. Surely a random binary sequence will have some (small) repeating sequences? So as long as we can find them and (efficiently) mark their locations, we should get some small size reduction.
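
As a rough back-of-envelope (using codazoda's 1.44MB file above): an offset into roughly 1.47 million bytes takes about 21 bits, so a back-reference to one extra copy of a three-byte (24-bit) repeat saves at most a few bits before you even pay for telling literals apart from references; with each pattern showing up only two or three times, there's essentially nothing to collect.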

Which is basically what he's done here. He's cheating by marking the locations of the repeated sequences with separate files, rather than with some other, genuinely more efficient indexing scheme.

This is a fun discussion! I think it helps to make compression feel more approachable for armchair enthusiasts.