Remix clone Hacker News

	▲	notpushkin 3 hours ago
		> The application of compressors for text statistics is fun, but it's a software equivalent of discovering that speakers and microphones are in principle the same device.

		I think it makes sense to explore it from a practical standpoint, too. It’s in the Python stdlib and works reasonably well, so for some applications it might be good enough. It’s also fairly easy to implement in other languages with zstd bindings, or even in shell scripts:

		    $ echo 'taco burrito tortilla salsa guacamole cilantro lime' > /tmp/tacos.txt
		    $ zstd --train $(yes '/tmp/tacos.txt' | head -n 50) -o tacos.dict
		    [...snip]
		    $ echo 'racket court serve volley smash lob match game set' > /tmp/padel.txt
		    $ zstd --train $(yes '/tmp/padel.txt' | head -n 50) -o padel.dict
		    [...snip]
		    $ echo 'I ordered three tacos with extra guacamole' | zstd -D tacos.dict | wc -c
		    57
		    $ echo 'I ordered three tacos with extra guacamole' | zstd -D padel.dict | wc -c
		    60
	▲	notpushkin 2 hours ago
		Or with the 20 Newsgroups dataset:

		    curl http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz | tar -xzf -
		    cd 20_newsgroups
		    for f in *; do zstd --train "$f/"* -o "../$f.dict"; done
		    cd ..
		    for d in *.dict; do cat 20_newsgroups/misc.forsale/74150 | zstd -D "$d" | wc -c | tr -d '\n'; echo " $d"; done | sort | head -n 3

		Output:

		    422 misc.forsale.dict
		    462 rec.autos.dict
		    463 comp.sys.mac.hardware.dict

		Pretty neat IMO.
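For the Python side, here is a rough sketch of the same "compress with each dictionary, pick the smallest" loop. It is not the zstd approach from the comments: it swaps in zlib's preset dictionary (`zdict`) so that it runs on Python 3.9+ without extra packages, and it uses the raw sample text as the dictionary with no training step. The names `compressed_size` and `rank` are made up for illustration.

```python
import zlib


def compressed_size(text: str, dictionary: bytes) -> int:
    """Bytes needed to deflate `text` with `dictionary` as a preset dictionary."""
    # Assumption: zlib's zdict stands in for a trained zstd dictionary.
    compressor = zlib.compressobj(level=9, zdict=dictionary)
    return len(compressor.compress(text.encode("utf-8")) + compressor.flush())


def rank(text: str, corpora: dict[str, str]) -> list[tuple[int, str]]:
    """Return (size, label) pairs, smallest first, like `sort | head` above."""
    return sorted(
        (compressed_size(text, corpus.encode("utf-8")), label)
        for label, corpus in corpora.items()
    )


# Same toy corpora as the shell example.
corpora = {
    "tacos": "taco burrito tortilla salsa guacamole cilantro lime",
    "padel": "racket court serve volley smash lob match game set",
}

ranking = rank("I ordered three tacos with extra guacamole", corpora)
best_label = ranking[0][1]  # label whose dictionary compressed the text best
print(ranking)
```

The byte counts will not match the zstd numbers above, since the format and framing differ; only the ordering is comparable. zlib also uses at most the last 32 KB of a preset dictionary, so for something the size of 20 Newsgroups a trained zstd dictionary is the better fit.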