ccleve 2 days ago

I wonder if it would have been possible to win the challenge legitimately?

If a randomly-generated file happened to contain some redundancy through sheer chance, you could hand-craft a compressor to take advantage of it. This compressor would not work in general for random data, but it could work for this one particular case.

It's a bet worth taking, because the payoff, 50:1 ($5,000 to $100), is pretty good. Play the game 50 times and you might get a file you could compress.

The challenge, then, would be for the person offering the bet to generate a really random file that contained no such redundancy. That might not be easy.

dmurray 2 days ago | parent | next [-]

I think you can make some argument about why this isn't possible at 50:1 odds.

A plausible "decompressor" is at least, say, 30 or 100 bytes, so the random file needs to have 30 bytes less entropy than you expected, which happens with probability X where X << 1/50. Sum over the whole domain of reasonable decompressors, and you still don't get there.

This argument could do with more rigor, but I think it's correct. Give me 100 million to 1 odds, though, and I'll take my chances trying to brute force a compressor.
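
For any one fixed decompressor, the first half of this is easy to bound; the hand-wavy part is really the sum over decompressors. A quick back-of-the-envelope in Python:

  # Fraction of n-byte files that a single fixed decompressor can reproduce
  # from an input at least k bytes shorter: there are at most
  # sum_{i=0}^{n-k} 256^i < 256^(n-k+1)/255 candidate inputs, out of 256^n files.
  def compressible_fraction_bound(k):
      return 256.0 ** (1 - k) / 255

  for k in (30, 100):
      print(f"k = {k:3d} bytes: fraction <= {compressible_fraction_bound(k):.3e}  (vs 1/50 = {1/50:.3e})")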

lambdaone 2 days ago | parent [-]

This is actually an extremely interesting question. 'Weak' files that are more easily compressible than others certainly exist, but with low probability.

For example, the all-zeros file is a member of the set of all random 3 megabyte files, and it would certainly be possible to compress that, if by great good fortune you were lucky enough to receive it - albeit something that is unlikely to ever happen in the possible lifetime of the universe, given current physical theories.

Is it possible to quantify this idea of a 'weak' file more accurately?

ccleve 2 days ago | parent | next [-]

I know very little about this, but a little googling suggests that the measure you're looking for is entropy, which has a mathematical definition: https://en.wikipedia.org/wiki/Entropy_(information_theory)
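
Right - and a first-order estimate is easy to compute directly from the byte histogram. This ignores correlations between bytes, so a file can be weaker than the number suggests, but it's a decent sanity check:

  import os
  from collections import Counter
  from math import log2

  def byte_entropy(data):
      # Shannon entropy of the byte histogram, in bits per byte (8.0 is the maximum).
      counts = Counter(data)
      n = len(data)
      return -sum(c / n * log2(c / n) for c in counts.values())

  # A 3 MB file of OS randomness comes out extremely close to 8.0.
  print(byte_entropy(os.urandom(3 * 1024 * 1024)))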

pizza 15 hours ago | parent | prev | next [-]

Yes. You can think of it in terms of prefix-free codes (WLOG any uniquely decodable code). They're uniquely decodable - if a set of codes is not uniquely decodable, that means you could lay overlapping codes over the same symbols. If you make a matrix like this, where the rows are the bitstrings of length b and the columns are individual bits:

  000 ... 000
  000 ... 001
  ...
  111 ... 110
  111 ... 111
then you have 2^b rows. Suppose you look at the sub-bitstrings of length k, k < b. They all appear the same number of times if you count them wherever they appear, at any position, across the entire matrix.

However, you also know for sure that if a prefix-free codeword appears in a particular row, nothing else can overlap it within its span in that row. What does that imply? That the prefix-free codewords have a greater 'occupancy percentage' of a single row than all other sub-bitstrings. That means you must find fewer of them, on average, inside a single row.

But since we know that all sub-bitstrings appear the same number of times throughout the entire matrix, what else can we deduce? That the prefix-free codes must appear /over more rows/ on average, if they cannot appear as many times at bit positions /along the columns/. That means they will occur as a sub-pattern of full bitstrings more often than typical random sub-patterns.

So weakness here corresponds to the presence of patterns (prefix-free codes) that are:

- non-overlapping within bitstrings

- widely distributed across bitstrings

- due to their wide distribution, there's a higher chance of encountering these patterns in any given random file

- therefore, weak files are more compressible because they contain widely-distributed, non-overlapping patterns that compression algorithms can take advantage of
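
A tiny brute-force check of the counting claims (all 3-bit patterns over all 10-bit rows, not tied to any particular code): total occurrences across the matrix come out identical for every pattern, but a self-overlapping pattern like 000 is packed into fewer rows than one that can't overlap itself.

  from itertools import product

  b, k = 10, 3
  total = {p: 0 for p in range(2 ** k)}   # occurrences over all positions, all rows
  rows = {p: 0 for p in range(2 ** k)}    # rows containing the pattern at least once

  for bits in product('01', repeat=b):
      row = ''.join(bits)
      seen = set()
      for i in range(b - k + 1):
          pat = int(row[i:i + k], 2)
          total[pat] += 1
          seen.add(pat)
      for pat in seen:
          rows[pat] += 1

  for pat in range(2 ** k):
      print(f"{pat:03b}: occurrences = {total[pat]}, rows containing it = {rows[pat]}")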

l33t7332273 2 days ago | parent | prev [-]

One thing you can do, as the other commenter pointed out, is consider entropy of the file.

However, this restriction is stronger than the challenge requires. We don't actually need a file with low entropy; in fact I claim that weak files exist with entropy 8 - epsilon (where 8 bits per byte is the maximum) for every epsilon > 0.

What we actually need is a sufficiently large chunk in a file to have low entropy. The largeness is in absolute terms, not relative terms.

A very simple example would be to take a very large file with maximum entropy and add 200 zero bytes to the end. This would not decrease the entropy of the file much, but it opens the way for a compression algorithm that should be able to save ~100 bytes net.
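
A minimal sketch of that construction and the hand-crafted compressor it invites (the 4-byte footer format is just made up for illustration):

  import os

  data = os.urandom(3 * 1024 * 1024) + bytes(200)   # max-entropy file + 200 trailing zero bytes

  # Hand-rolled "compressor" for exactly this weakness: strip trailing zero bytes
  # and record how many there were in a fixed-size footer.
  stripped = data.rstrip(b'\x00')
  compressed = stripped + (len(data) - len(stripped)).to_bytes(4, 'big')

  print(len(data) - len(compressed))   # ~196 bytes saved, before counting the decompressor's own size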

kevinventullo a day ago | parent [-]

Note that if this large chunk occurs in the middle of the file, then you will need extra space to encode that position. For example, a random bit string of length 2^n is decently likely to have a run of n zeroes. But this doesn’t help you because you need n bits just to encode where that run happens.

l33t7332273 18 hours ago | parent [-]

But storing an index for a file of length 2^n takes only n bits, so you need that run of 0’s to be of length n+1 to win
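
A quick empirical check of the break-even point (pure Python, illustration only): in a random string of 2^20 bits, the longest zero run tends to sit right around 20 bits, which is exactly what the index costs.

  import random

  n = 20
  bits = format(random.getrandbits(2 ** n), f'0{2 ** n}b')   # 2^n random bits as a string

  longest_zero_run = max(len(run) for run in bits.split('1'))
  print(f"longest zero run: {longest_zero_run} bits, index cost: {n} bits")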

kittoes 2 days ago | parent | prev | next [-]

What if we didn't even attempt to compress the file? Theoretically, there is a random number generator + seed combination that would output the same bytes as the source file.

changoplatanero 2 days ago | parent [-]

Neat idea but chances are the length of the seed is equal to the length of the original file.

guepe 2 days ago | parent | next [-]

There is a polynomial expression that will match the file. I wonder if expressing it would be compressible to a lower size than original file? [edit] I’m wrong - not all sequences can be expressed with a polynomial.

l33t7332273 2 days ago | parent [-]

This reminds me of a data compression scheme I came up with once:

Treat an n-bit file as a polynomial over GF(2), the field with two elements. Some polynomials over this field are irreducible, but many have factors of x and of (x + 1). Factor the polynomial as P(x) * x^j * (x+1)^k and just collect these terms, storing only P, j, and k.
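
A rough sketch of that scheme, with Python ints standing in for GF(2) polynomials (bit i = coefficient of x^i); the serialization of P, j, k is left out:

  def divide_by_x_plus_1(p):
      # Synthetic division of p by (x + 1) over GF(2); assumes it divides evenly.
      q, b = 0, 0
      for i in reversed(range(p.bit_length())):
          b ^= (p >> i) & 1
          if i:
              q |= b << (i - 1)
      return q

  def factor(p):
      # Split p into P(x) * x^j * (x+1)^k and return (P, j, k).
      j = (p & -p).bit_length() - 1 if p else 0    # trailing zero coefficients = factors of x
      p >>= j
      k = 0
      while p > 1 and bin(p).count('1') % 2 == 0:  # divisible by (x+1) iff p(1) = 0
          p = divide_by_x_plus_1(p)
          k += 1
      return p, j, k

  # x^5 + x^4 + x^2 + x  =  (x^2 + x + 1) * x * (x + 1)^2
  print(factor(0b110110))   # prints (7, 1, 2), i.e. P = 0b111, j = 1, k = 2

The catch is that for a uniformly random input each extra factor of x or (x+1) only shows up with probability about 1/2, so j and k are tiny on average and P is barely shorter than the original.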

crazygringo 20 hours ago | parent | prev [-]

That's not how seeds work. Seeds are tiny.

Actually this would work perfectly if you knew it was generated in a single pass by a known random number generator and you had tons of time to brute force it.

If the file were generated by a natural source of entropy then forget it.

Or even if modified in a trivial way like adding 1 to every byte.
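
For the known-PRNG case the brute force is at least conceptually simple. A sketch, assuming (purely for illustration) the file came from Python's Mersenne Twister with a small integer seed; a real challenge file almost certainly didn't:

  import random

  def find_seed(target, max_seed=2 ** 32):
      # Try every seed, assuming target == random.Random(seed).randbytes(len(target)).
      # Hopeless in practice for a 3 MB file and a full 32-bit search, but it shows the
      # idea: the "compressed" form is just (seed, length).
      for seed in range(max_seed):
          if random.Random(seed).randbytes(len(target)) == target:
              return seed
      return None

  original = random.Random(123456).randbytes(32)
  print(find_seed(original))   # 123456, eventually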

crazygringo 15 hours ago | parent [-]

What is with the many downvotes but no comments? Everything I said is factual.

Seeds are something like 32 bits, though it depends on the exact implementation. But not the length of files.

iforgotpassword 10 hours ago | parent [-]

When implementing a PRNG, you can make its seed as big as you want. There is no mathematical law that dictates or limits the size of a seed.

gus_massa 5 hours ago | parent | next [-]

But I assume the GGP assumed that the author was lazy and used a publicly available PRNG instead of a custom-made one. (A long time ago someone broke the login security check in HN using a trick like that. Obviously, it's already fixed.)

crazygringo 5 hours ago | parent | prev [-]

I mean sure you could in theory, but in practice that's not how common built-in random number generators work.

I was responding to:

> chances are the length of the seed is equal to the length of the original file

And why would the chances be that? You'd really have to go out of your way for that. I don't even know if there are libraries that can handle a seed and state length on the scale of megabytes. No, chances are 99.99+% it used a seed of a few bytes, because that's how common random number generators designed for efficiency work.

Retr0id 2 days ago | parent | prev [-]

Somewhere (discussed on HN) someone devised a "better-than-perfect" compressor. Most inputs get compressed (smaller than input), except for one input that does not. That one input is cryptographically impossible to find - or something along those lines.

Unfortunately I can't find the article I'm describing here, maybe someone else can? It was a long time ago so I might be misrepresenting it slightly.

phire 15 hours ago | parent | next [-]

It can't exist.

Presume this compressor guarantees the output will always be at least one byte smaller (with the exception of the impossible-to-find input).

So just keep running your data in a loop through the compressor. If you start with a 1MB file, it will take a maximum of a million iterations until the output shrinks down to zero bytes, which is the smallest possible file. Record how many iterations it took.

You can now extract your file by feeding a zero-byte file into the decompressor and running the same number of iterations. Which means you can now store every 1MB (or smaller) file in the world in just 20 bits... but that would mean there are only a million possible 1MB files?

Even if you put some minimum output size limitation on the compressor, say it can't produce any file less than 512 bits, the same argument applies. It's just that the numbers get bigger.
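
The pigeonhole counting behind this, with concrete numbers:

  n_bytes = 1_000_000

  files_of_exactly_n_bytes = 256 ** n_bytes
  files_strictly_shorter = (256 ** n_bytes - 1) // 255   # sum of 256^i for i < n_bytes

  # An always-shrinking compressor would have to map the first set injectively into
  # the second, but the first set is ~255x larger - so no such compressor exists,
  # and the 20-bit iteration count above can't be naming 2^8000000 different files either.
  print(files_of_exactly_n_bytes > files_strictly_shorter)    # True
  print(files_of_exactly_n_bytes // files_strictly_shorter)   # 255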

spencerflem 2 days ago | parent | prev | next [-]

That's how all compressors work, in that likely files (eg. ASCII, obvious patterns, etc) become smaller and unlikely files become bigger.

Retr0id an hour ago | parent | next [-]

Right, but the point was, the case where it became bigger was ~impossible to find.

Dylan16807 2 days ago | parent | prev | next [-]

> likely files (eg. ASCII, obvious patterns, etc) become smaller

Likely files for a real human workload are like that, but if "most inputs" is talking about the set of all possible files, then that's a whole different ball game and "most inputs" will compress very badly.

> unlikely files become bigger

Yes, but when compressors can't find useful patterns they generally only increase size by a small fraction of a percent. There aren't files that get significantly bigger.
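
Easy to see with any off-the-shelf compressor, e.g. zlib: random input comes out only slightly larger, while patterned input collapses.

  import os, zlib

  random_data = os.urandom(1_000_000)
  text_data = (b"the quick brown fox jumps over the lazy dog\n" * 23000)[:1_000_000]

  print(len(zlib.compress(random_data)))   # slightly more than 1,000,000
  print(len(zlib.compress(text_data)))     # a few thousand bytes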

PaulHoule 2 days ago | parent | prev [-]

In some cases it can be certain: ASCII encoded in the usual 8 bits per character has fat to trim even if the text is random within that space.
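
Concretely, if "random ASCII" means each byte is drawn uniformly from the 7-bit range (or from the 95 printable characters), there's a guaranteed fraction of a bit to trim from every 8-bit byte:

  from math import log2

  for alphabet_size in (128, 95):
      print(f"{alphabet_size} symbols: {log2(alphabet_size):.2f} bits of entropy per 8-bit byte")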

Dylan16807 2 days ago | parent | prev [-]

You could design the opposite, a system where ultra-rare files get compressed a lot but most don't compress.

But as stated, what you said is not possible. To compress by a single byte, only about 1 in 256 files can be compressible, or you'd have multiple different inputs compressing down to the same output.