felixhandte | a day ago
This is because Zstd's long-distance matcher looks for matching sequences of 64 bytes [0]. Because long matching sequences of the data will likely have newlines inserted at different offsets within each run, this totally breaks Zstd's ability to find the long-distance match. Ultimately, Zstd is a byte-oriented compressor that doesn't understand the semantics of the data it compresses. Improvements are certainly possible if you can recognize and separate that framing to recover a contiguous view of the underlying data. [0] https://github.com/facebook/zstd/blob/v1.5.7/lib/compress/zs... (I am one of the maintainers of Zstd.)
nerpderp82 | a day ago | parent
That is fascinating. I wonder if you could layer a Levenshtein state machine on the strings so you can apply n edits to the text to get longer matches. I absolutely adore Zstd; it has worked so well for me compressing JSON metadata for a knowledge engine.