Overfitted a 900KB Transformer to Compress a 100MB CSV into 7MB

I built an experiment that uses an overfitted transformer and arithmetic coding to compress individual files.

Instead of training the model to generalize, I train a 900KB transformer to memorize a single file and predict the next byte. Those predictions are fed into an arithmetic coder to produce the compressed output.

On a 100MB NYC taxi CSV, it compresses to about 7MB (~0.5 bits/byte). On a 100MB slice of enwik9, it compresses to about 21MB (~1.68 bits/byte).

It's pretty slow right now (roughly 20–30 minutes of training and 45 minutes each for compression and decompression on my AMD 7800XT).

Checkout the repo - https://github.com/samyak112/pym-particles

▲

tae0086 a day ago | parent | next [-]

Neat approach. Since the 900KB model ships with the compressed file, is there a file size below which the model overhead just eats the gains? Curious where the crossover is.

	▲	spidy__ 18 hours ago \| parent [-]
		For the model overhead to become significant enough to eat into the gains, the file size would need to be fairly small, right? I assumed nobody would use this for compressing anything below 100 MB. I tested with 100 MB files because anything larger takes a long time to evaluate. The actual target was at least 1 GB, and in that case I would use a 100 MB model (Shannon entropy rules). I also tried it on a 100 MB Photoshop file and was able to compress it down to 45 MB, whereas ZIP could only get it down to 60 MB. So yeah still not losing gains.

▲

7373737373 2 days ago | parent | prev | next [-]

What does it compress the full 1GB file to? http://prize.hutter1.net/

	▲	purple-leafy 5 hours ago \| parent \| next [-]
		Thanks for the link!
	▲	spidy__ 2 days ago \| parent \| prev [-]
		I tried it on a enwik9 100 mb slice and was able to compress it to 20 mb + 900kb transformer so 21mb. I know the top submission was able to get it to 13 mb. Still trying some ideas to get better compression.

▲

purple-leafy 2 days ago | parent | prev [-]

That’s so awesome! I want to try something similar. I’ve been going crazy with compression work. I reckon I can beat that prize link

▲

spidy__ 18 hours ago | parent [-]

Reallly?? So have you published something so far? Can i read something? Sounds like you got some interesting ideas.

▲

purple-leafy 7 hours ago | parent [-]

I will be showcasing something on hackernews soon! Basically I found a way to “compress” a multiplayer game state from ~100KB+ to ~1KB

But it’s only for the game I’m building and it’s not pure compression work, I had to do some tricky things

▲

purple-leafy 5 hours ago | parent [-]

And just for comparison, my absolute best compression method managed to get down to 10s of KB, but the real unlock got to the ~1KB figures. Note these numbers are ALL post-compression numbers. This is not raw data vs compressed data. The ~100KB figure IS POST COMPRESSION.

For context these numbers are for a grid based game where players can perform 4 actions per second, and the numbers I’m sharing are for 30 minutes of gameplay with anywhere from 2-1024+ players (human players) playing simultaneously

So if you do the math, my compression feat is effectively ~99% compression on naive best case. And if you compare it to the raw data, it’s closing in on an even higher number than that I haven’t done the math but the raw data is another factor of 10 greater than ~100KB so the “compression” versus raw data is ~99.9%

It sounds absolutely bullshit I know :D

But I will be posting a blog post soon once I release the game.

I do compression in quotes because it’s not a pure compression feat, the 99%+ feat is effectively being clever about what actually requires compression to achieve the same outcome

	▲	spidy__ 15 minutes ago \| parent [-]
		Sounds interesting man, soo am a bit confused maybe but can you run this on enwik9?