felixhandte | a day ago
This is because Zstd's long-distance matcher looks for matching sequences of 64 bytes [0]. Because long matching sequences of the data will likely have newlines inserted at different offsets within each run, this totally breaks Zstd's ability to find the long-distance match. Ultimately, Zstd is a byte-oriented compressor that doesn't understand the semantics of the data it compresses. Improvements are certainly possible if you can recognize and separate that framing to recover a contiguous view of the underlying data. [0] https://github.com/facebook/zstd/blob/v1.5.7/lib/compress/zs... (I am one of the maintainers of Zstd.)
nerpderp82 | a day ago | parent
That is fascinating. I wonder if you could layer a Levenshtein state machine on the strings so you can apply n edits to the text to get longer matches. I absolutely adore Zstd; it has worked so well for me compressing JSON metadata for a knowledge engine.