Remix.run Logo
Overfitted a 900KB Transformer to Compress a 100MB CSV into 7MB
8 points by spidy__ 3 days ago | 10 comments

I built an experiment that uses an overfitted transformer and arithmetic coding to compress individual files.

Instead of training the model to generalize, I train a 900KB transformer to memorize a single file and predict the next byte. Those predictions are fed into an arithmetic coder to produce the compressed output.

On a 100MB NYC taxi CSV, it compresses to about 7MB (~0.5 bits/byte). On a 100MB slice of enwik9, it compresses to about 21MB (~1.68 bits/byte).

It's pretty slow right now (roughly 20–30 minutes of training and 45 minutes each for compression and decompression on my AMD 7800XT).

Checkout the repo - https://github.com/samyak112/pym-particles

tae0086 a day ago | parent | next [-]

Neat approach. Since the 900KB model ships with the compressed file, is there a file size below which the model overhead just eats the gains? Curious where the crossover is.

spidy__ 18 hours ago | parent [-]

For the model overhead to become significant enough to eat into the gains, the file size would need to be fairly small, right? I assumed nobody would use this for compressing anything below 100 MB.

I tested with 100 MB files because anything larger takes a long time to evaluate. The actual target was at least 1 GB, and in that case I would use a 100 MB model (Shannon entropy rules).

I also tried it on a 100 MB Photoshop file and was able to compress it down to 45 MB, whereas ZIP could only get it down to 60 MB. So yeah still not losing gains.

7373737373 2 days ago | parent | prev | next [-]

What does it compress the full 1GB file to? http://prize.hutter1.net/

purple-leafy 5 hours ago | parent | next [-]

Thanks for the link!

spidy__ 2 days ago | parent | prev [-]

I tried it on a enwik9 100 mb slice and was able to compress it to 20 mb + 900kb transformer so 21mb.

I know the top submission was able to get it to 13 mb.

Still trying some ideas to get better compression.

purple-leafy 2 days ago | parent | prev [-]

That’s so awesome! I want to try something similar. I’ve been going crazy with compression work. I reckon I can beat that prize link

spidy__ 18 hours ago | parent [-]

Reallly?? So have you published something so far? Can i read something? Sounds like you got some interesting ideas.

purple-leafy 7 hours ago | parent [-]

I will be showcasing something on hackernews soon! Basically I found a way to “compress” a multiplayer game state from ~100KB+ to ~1KB

But it’s only for the game I’m building and it’s not pure compression work, I had to do some tricky things

purple-leafy 5 hours ago | parent [-]

And just for comparison, my absolute best compression method managed to get down to 10s of KB, but the real unlock got to the ~1KB figures. Note these numbers are ALL post-compression numbers. This is not raw data vs compressed data. The ~100KB figure IS POST COMPRESSION.

For context these numbers are for a grid based game where players can perform 4 actions per second, and the numbers I’m sharing are for 30 minutes of gameplay with anywhere from 2-1024+ players (human players) playing simultaneously

So if you do the math, my compression feat is effectively ~99% compression on naive best case. And if you compare it to the raw data, it’s closing in on an even higher number than that I haven’t done the math but the raw data is another factor of 10 greater than ~100KB so the “compression” versus raw data is ~99.9%

It sounds absolutely bullshit I know :D

But I will be posting a blog post soon once I release the game.

I do compression in quotes because it’s not a pure compression feat, the 99%+ feat is effectively being clever about what actually requires compression to achieve the same outcome

spidy__ 15 minutes ago | parent [-]

Sounds interesting man, soo am a bit confused maybe but can you run this on enwik9?