eru 3 hours ago

> I'm going to be the nerd that points out that it has not been mathematically proven that pi contains every substring, so the pifs might not work even in theory (besides being utterly impractical, of course).

Well, either your program 'works', or you will have discovered a major new insight about Pi.
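To make the pifs idea concrete, here is a toy sketch in Python (my own illustration, not the actual pifs code, which works on pi's binary expansion): each byte is stored as the offset where its 3-digit decimal form first appears in the digits of pi. The mpmath digit generation and the 3-digit encoding are assumptions of the sketch; the parent's unproven-normality worry shows up as the find() call possibly coming back empty.

    from mpmath import mp

    mp.dps = 100_000                         # compute pi to 100k decimal places
    PI_DIGITS = str(mp.pi).replace(".", "")  # "314159265358979..."

    def encode(data: bytes) -> list[int]:
        # For each byte, record the first offset of its 3-digit decimal form.
        offsets = []
        for b in data:
            needle = f"{b:03d}"              # e.g. ord('h') == 104 -> "104"
            idx = PI_DIGITS.find(needle)
            if idx == -1:                    # the unproven part: nobody has shown
                raise ValueError(needle)     # that every string occurs in pi
            offsets.append(idx)
        return offsets

    def decode(offsets: list[int]) -> bytes:
        return bytes(int(PI_DIGITS[i:i + 3]) for i in offsets)

    assert decode(encode(b"hello")) == b"hello"

Each offset into 100k digits takes about 17 bits to write down, versus 8 bits for the raw byte, so the 'compressed' output is roughly twice the size of the input, which is part of the joke.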

> On a more serious note, as far as I understand these compression competitions require that static data is included in the size computation. So if you compress 1000 MB into 500 MB, but to decompress you need a 1 MB binary and a 100 MB initial dictionary, your score would be 500 + 100 + 1 = 601 MB, not 500 MB.

And that's the only way to do this fairly if you are running a competition with only a single static corpus to compress.
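The rule amounts to: count every byte you have to ship to reproduce the data. A trivial illustration of the numbers in the parent comment:

    def score_mb(compressed_mb, decompressor_mb, dictionary_mb=0.0):
        # Score = payload + decompressor binary + any static dictionary it needs.
        return compressed_mb + decompressor_mb + dictionary_mb

    assert score_mb(500, 1, 100) == 601   # 601 MB, not 500 MB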

It would be more interesting, and the results more useful, if the texts to be compressed were drawn from a wide probability distribution and entrants were scored on eg the average compressed length. Then you wouldn't necessarily need to include the size of the compressor and decompressor in the score.
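A sketch of what such a harness could look like (hypothetical, just to make the proposal concrete): draw fresh texts from the distribution, compress each one, and score the mean compressed size; the compressor's own size doesn't need to be charged, because the fresh samples can't already be baked into it.

    import zlib
    from statistics import mean

    def average_compressed_bytes(samples, compress=zlib.compress):
        # Score: mean compressed size over freshly drawn texts.
        return mean(len(compress(s)) for s in samples)

    # Stand-in "distribution"; a real benchmark would sample genuinely new text.
    samples = [b"the quick brown fox jumps over the lazy dog " * 200,
               b"pack my box with five dozen liquor jugs " * 200]
    print(average_compressed_bytes(samples))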

Of course, it would be utterly impractical to sample gigabytes of new text each time you run the benchmark: human writers are expensive. The only ways this could work would be to sample via an LLM, which is somewhat circular and wouldn't measure what the benchmark is actually supposed to measure, or to keep the benchmark text secret, which has its own problems.