Aachen a day ago

I've also noticed this: Zstandard misses some very common patterns.

For me it was an increasing number (think of Unix timestamps in a data logger that stores one entry per second, so you're just counting up until there's a gap in the data); in the article it's a fixed value every 60 bytes.

Of course, our brains are exceedingly good at finding patterns (to the point where we often find phantom ones). I was just expecting some basic checks like "does it make sense to store the difference instead of the absolute value for some of these bytes?". Since the difference between every 60th byte in the submitted article is 0, that check would fix both our issues.
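
For illustration, a minimal sketch of that delta idea in Python (assuming the python-zstandard package; the timestamp values are invented):

    # Delta-encode a counter before compressing. Assumes "pip install
    # zstandard"; the timestamps are made up for illustration.
    import struct
    import zstandard

    # One entry per second, like the data-logger case above.
    ts = list(range(1_700_000_000, 1_700_100_000))
    absolute = b"".join(struct.pack("<Q", t) for t in ts)

    # Store the difference to the previous value instead of the value itself.
    deltas = [ts[0]] + [b - a for a, b in zip(ts, ts[1:])]
    relative = b"".join(struct.pack("<Q", d) for d in deltas)

    cctx = zstandard.ZstdCompressor()
    print("absolute:", len(cctx.compress(absolute)))
    print("delta:   ", len(cctx.compress(relative)))  # stream is almost all 0x01/0x00 bytes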

Bzip2 performed much better for me, but it's also incredibly slow. If it were only the compressor, that might be fine for many applications, but decompression is also an exercise in patience, so I've moved to Zstandard as the standard thing to use.

pajko a day ago | parent

Bzip2 performs better here precisely because it rearranges the input so that matching patterns end up next to each other: https://en.m.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_tran...
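
A toy version of the transform makes the effect visible (naive rotation sort in Python; real implementations use suffix arrays):

    # Toy Burrows-Wheeler transform: sort all rotations of the input and
    # keep the last column. O(n^2 log n), illustration only; "$" is a
    # sentinel assumed not to occur in the input.
    def bwt(s: str) -> str:
        s = s + "$"
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(rot[-1] for rot in rotations)

    # Equal contexts sort next to each other, so the output clumps
    # identical characters into runs that the later stages compress well.
    print(bwt("bananabanana"))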

vintermann 6 hours ago | parent

A number of identical copies of a string, but with random mutations propagating through it like a word ladder puzzle, is pretty close to best-case for BWT-based compressors.
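
You can get a feel for that with the stdlib bz2 and zlib modules (the string and mutation scheme here are arbitrary, and the exact sizes will vary):

    # Word-ladder-style data: each copy is the previous copy with one
    # random character changed. All parameters are arbitrary.
    import bz2
    import random
    import zlib

    random.seed(0)
    chars = list("the quick brown fox jumps over the lazy dog / " * 4)
    copies = []
    for _ in range(2000):
        chars[random.randrange(len(chars))] = random.choice("abcdefgh")
        copies.append("".join(chars))
    data = "".join(copies).encode()

    print("raw: ", len(data))
    print("bz2: ", len(bz2.compress(data)))      # BWT-based
    print("zlib:", len(zlib.compress(data, 9)))  # LZ77-based, 32 kB window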

But Bzip2 is also a pretty bad BWT-based compressor. Not only does it use block sizes from a time when 8 MB of memory was a lot, it also does silly things that don't help compression at all.
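
The block-size point is easy to demonstrate (a sketch; sizes are illustrative): a repeat that sits farther apart than bzip2's maximum 900 kB block can never be matched, while Zstandard's larger window at high levels catches it.

    # bzip2 compresses in independent blocks of at most 900 kB
    # (compresslevel 9), so it cannot exploit a repeat 1 MB away.
    import bz2
    import os
    import zstandard

    chunk = os.urandom(1_000_000)   # 1 MB of incompressible bytes
    data = chunk + chunk            # exact repeat, 1 MB apart

    print("bz2: ", len(bz2.compress(data, 9)))  # ~2 MB: repeat not found
    print("zstd:", len(zstandard.ZstdCompressor(level=19).compress(data)))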