notpushkin 3 hours ago

> The application of compressors for text statistics is fun, but it's a software equivalent of discovering that speakers and microphones are in principle the same device.

I think it makes sense to explore it from a practical standpoint, too. It’s in the Python stdlib and works reasonably well, so for some applications it might be good enough.
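For illustration, here is a minimal Python sketch of the idea. It makes one loud assumption: instead of a trained zstd dictionary, it uses zlib's preset-dictionary support (the `zdict` argument of `zlib.compressobj`) as a stand-in, so it runs on any recent Python 3. The helper names `compressed_size` and `classify` and the sample strings are made up for this example, not taken from any library.

```python
import zlib


def compressed_size(text: bytes, zdict: bytes) -> int:
    """Size of `text` after DEFLATE compression primed with `zdict`."""
    # Assumption: zlib's preset dictionary stands in for a trained zstd dictionary.
    comp = zlib.compressobj(level=9, zdict=zdict)
    return len(comp.compress(text) + comp.flush())


def classify(text: bytes, corpora: dict[str, bytes]) -> str:
    """Return the label whose corpus lets `text` compress to the fewest bytes."""
    return min(corpora, key=lambda label: compressed_size(text, corpora[label]))


# Hypothetical toy corpora, one short sample per class.
corpora = {
    "tacos": b"taco burrito tortilla salsa guacamole cilantro lime",
    "padel": b"racket court serve volley smash lob match game set",
}

print(classify(b"I ordered three tacos with extra guacamole", corpora))
```

The text shares "taco" and " guacamole" with the first corpus and nothing useful with the second, so the first label should win. With real data you would probably want far larger corpora per class than one line each.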

It’s also fairly easy to implement in other languages with zstd bindings, or even shell scripts:

  $ echo 'taco burrito tortilla salsa guacamole cilantro lime' > /tmp/tacos.txt
  $ zstd --train $(yes '/tmp/tacos.txt' | head -n 50) -o tacos.dict
  [...snip]

  $ echo 'racket court serve volley smash lob match game set' > /tmp/padel.txt
  $ zstd --train $(yes '/tmp/padel.txt' | head -n 50) -o padel.dict
  [...snip]

  $ echo 'I ordered three tacos with extra guacamole' | zstd -D tacos.dict | wc -c
        57
  $ echo 'I ordered three tacos with extra guacamole' | zstd -D padel.dict | wc -c
        60
notpushkin 2 hours ago

Or with the 20 Newsgroups dataset:

  curl http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz | tar -xzf -
  cd 20_newsgroups
  for f in *; do zstd --train "$f"/* -o "../$f.dict"; done
  cd ..
  for d in *.dict; do
    cat 20_newsgroups/misc.forsale/74150 | zstd -D "$d" | wc -c | tr -d '\n'; echo " $d";
  done | sort -n | head -n 3
Output:

     422 misc.forsale.dict
     462 rec.autos.dict
     463 comp.sys.mac.hardware.dict
Pretty neat IMO.