staplung 2 hours ago

This has been possible with the zlib module since 1997 [EDIT: zlib is from '97. The zdict param wasn't added until 2012]. You even get byte counts similar to the ones in the example, and on my machine zlib is about 10x faster.

  import zlib

  input_text = b"I ordered three tacos with extra guacamole"

  # Prime one compressor per topic with a preset dictionary (zdict),
  # then compare how small the same input gets under each.
  tacos = b"taco burrito tortilla salsa guacamole cilantro lime " * 50
  taco_comp = zlib.compressobj(zdict=tacos)
  print(len(taco_comp.compress(input_text) + taco_comp.flush()))
  # prints 41

  padel = b"racket court serve volley smash lob match game set " * 50
  padel_comp = zlib.compressobj(zdict=padel)
  print(len(padel_comp.compress(input_text) + padel_comp.flush()))
  # prints 54

notpushkin 2 hours ago

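True. The post calls out that “you have to recompress the training data for each test document” with zlib (otherwise input_text would taint it), but you can actually call Compress.copy() to snapshot the compressor once it has seen the training data, then compress each test document on a fresh copy.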

zdict was added in Python 3.3, though, so it’s 2012, not 1997. (It might have worked before, just not a part of the official API :-)
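
A rough sketch of the Compress.copy() idea, in case it helps. The `make_scorer` helper and its names are made up for illustration, and the Z_SYNC_FLUSH step is my own assumption about how to keep the training bytes out of the count; it is not something the post does.

  ```python
  import zlib

  def make_scorer(training: bytes):
      """Compress the training data once; score documents on copies (hypothetical helper)."""
      base = zlib.compressobj()
      base.compress(training)
      # Assumption: a sync flush emits all pending output for the training
      # data while keeping the compression window, so a copy's output
      # covers only the document compressed afterwards.
      base.flush(zlib.Z_SYNC_FLUSH)

      def score(doc: bytes) -> int:
          comp = base.copy()  # independent snapshot; base is left untouched
          return len(comp.compress(doc) + comp.flush())

      return score

  input_text = b"I ordered three tacos with extra guacamole"
  taco_score = make_scorer(b"taco burrito tortilla salsa guacamole cilantro lime " * 50)
  padel_score = make_scorer(b"racket court serve volley smash lob match game set " * 50)

  # A smaller number suggests the document is "closer" to that training text.
  print(taco_score(input_text), padel_score(input_text))
  ```

Since deflate only looks back 32 KB, only the tail of a large training set would actually influence the score here.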

staplung 2 hours ago

Ah, okay. Didn't realize that. I used either zlib or gzip long, long ago but never messed with the `zdict` param. Thanks for pointing that out.