mbreese 6 days ago
I’m trying to learn more about the seekable zstd format. I don’t know very much about zstd, aside from reading the spec a few weeks ago. But I thought this was part of the spec? IIRC, zstd files don’t have to have just one frame. Is the norm to have just one large frame per file, with the multi-frame version simply being less common? Gzip can also have multiple “frames” concatenated together and be seamlessly decompressed. Is this basically the same concept? As mentioned by others, bgzip uses this feature of gzip to great effect and is the standard compression in bioinformatics because of it (and is sadly hard-coded in a way that limits other potentially useful gzip extensions). My interest is to see if using zstd instead of gzip as the basis of a format would be beneficial. I expect there to be better compression, but I’m skeptical that it would be enough to make it worthwhile.
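For reference, the gzip behaviour described here is easy to check with Python's standard library. This is only a minimal sketch of plain gzip member concatenation; it does not reproduce bgzip's block layout or its index, and the payloads are made up for illustration.

```python
import gzip

# Two independently compressed gzip members ("frames" in the comment's wording).
# The payloads are illustrative only.
first = gzip.compress(b"hello ")
second = gzip.compress(b"world")

# Concatenating the members yields a valid gzip stream; the decompressor
# walks through each member in turn and returns the joined output.
combined = first + second
print(gzip.decompress(combined))
```

Each member can also be decompressed on its own, which is the property block-based formats build on.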
teraflop 6 days ago
The Zstd spec allows a stream to consist of multiple frames, but that alone isn't enough for efficient seeking. You would still need to read every frame header to determine which compressed frame corresponds to a particular byte offset in the uncompressed stream. "Seekable Zstd" is basically just a multi-frame Zstd stream, with the addition of a "seek table" at the end of the file that records the compressed and uncompressed sizes of each of the other frames. The seek table itself is stored as a skippable frame, so seekable Zstd is backward-compatible with normal Zstd decompressors (they treat the seek table as metadata and ignore it). https://github.com/facebook/zstd/blob/dev/contrib/seekable_f...
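To make the lookup concrete, here is a rough sketch of how a reader might use such a table once it has been loaded. It assumes the table is already available as a list of (compressed_size, uncompressed_size) pairs, one per frame in file order; it does not parse the actual on-disk seek table, and the names `build_index` and `locate` are hypothetical, not part of any zstd API.

```python
from bisect import bisect_right
from itertools import accumulate


def build_index(entries):
    """Turn per-frame sizes into cumulative start offsets.

    entries: list of (compressed_size, uncompressed_size) pairs,
    one per frame, in file order (assumed already read from the table).
    """
    comp_starts = [0] + list(accumulate(c for c, _ in entries))
    uncomp_starts = [0] + list(accumulate(u for _, u in entries))
    return comp_starts, uncomp_starts


def locate(index, offset):
    """Map an uncompressed byte offset to the frame that contains it.

    Returns (frame_number, compressed_start_of_frame, offset_within_frame).
    """
    comp_starts, uncomp_starts = index
    if not 0 <= offset < uncomp_starts[-1]:
        raise ValueError("offset outside the uncompressed stream")
    # Binary search for the last frame whose start is <= offset.
    i = bisect_right(uncomp_starts, offset) - 1
    return i, comp_starts[i], offset - uncomp_starts[i]


# Made-up sizes for three frames.
index = build_index([(40, 100), (35, 100), (50, 80)])
print(locate(index, 150))
```

With the table in hand, a reader only has to seek to the compressed start of one frame, decompress that frame, and skip to the offset inside it, instead of walking every frame header from the beginning of the file.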