▲ | teraflop 6 days ago
The Zstd spec allows a stream to consist of multiple frames, but that alone isn't enough for efficient seeking. You would still need to read every frame header to determine which compressed frame corresponds to a particular byte offset in the uncompressed stream. "Seekable Zstd" is basically just a multi-frame Zstd stream, with the addition of a "seek table" at the end of the file which records the compressed and uncompressed sizes of each of the other frames. The seek table itself is stored as a skippable frame, so seekable Zstd is backward-compatible with normal Zstd decompressors (the seek table is just treated as metadata and ignored). https://github.com/facebook/zstd/blob/dev/contrib/seekable_f...
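To make the lookup concrete, here is a rough sketch of how a reader might use such a table once it has been parsed. It assumes the table has already been decoded into a list of (compressed_size, uncompressed_size) pairs, one per data frame; the function names are made up for illustration and are not part of the zstd API.

```python
from bisect import bisect_right
from itertools import accumulate


def build_offsets(entries):
    """Turn per-frame sizes into cumulative start offsets.

    entries: list of (compressed_size, uncompressed_size), one per frame,
    in file order. This in-memory shape is an assumption for illustration.
    """
    comp_starts = [0] + list(accumulate(c for c, _ in entries))
    uncomp_starts = [0] + list(accumulate(u for _, u in entries))
    return comp_starts, uncomp_starts


def locate(entries, target):
    """Find the frame holding uncompressed byte offset `target`.

    Returns (frame_index, compressed_start, offset_inside_frame).
    """
    comp_starts, uncomp_starts = build_offsets(entries)
    if not 0 <= target < uncomp_starts[-1]:
        raise ValueError("offset is outside the uncompressed stream")
    # Last frame whose uncompressed start is <= target.
    i = bisect_right(uncomp_starts, target) - 1
    return i, comp_starts[i], target - uncomp_starts[i]


# Three hypothetical frames: sizes are invented for the example.
entries = [(100, 1000), (80, 1000), (120, 500)]
frame, comp_start, inner = locate(entries, 1500)
print(frame, comp_start, inner)
```

The reader would then seek to `comp_start` in the file, decompress only that one frame, and skip `inner` bytes into its output, instead of walking every frame header from the start.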
▲ | mbreese 6 days ago | parent
Got it. That’s incredibly helpful. Thank you! The way that’s handled in the bgzip/gzip world is with an external index file (.gzi) with compressed/uncompressed offsets. The index could be auto-computed, but would still require reading the header for each frame. I vastly prefer the idea of having the index as part of the file. Sadly, gzip doesn’t have the concept of a skippable frame, so that would break naive decompressors. I’m still not sure the file size savings would be big enough to switch over to zstd, but I like the approach.
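For comparison, a minimal sketch of reading such an external index. It assumes the layout I understand the bgzip documentation to describe: a little-endian uint64 entry count followed by that many (compressed offset, uncompressed offset) uint64 pairs. Treat the layout as an assumption and check it against your htslib version; `parse_gzi` is a made-up helper name.

```python
import struct


def parse_gzi(data: bytes):
    """Decode an index blob into (compressed_offset, uncompressed_offset) pairs.

    Assumed layout: uint64 count, then `count` pairs of uint64, little-endian.
    """
    (count,) = struct.unpack_from("<Q", data, 0)
    expected = 8 + 16 * count
    if len(data) < expected:
        raise ValueError("index is shorter than its entry count implies")
    return [struct.unpack_from("<QQ", data, 8 + 16 * i) for i in range(count)]


# Synthetic two-entry index built in memory; the offsets are invented.
blob = struct.pack("<Q", 2) + struct.pack("<QQ", 100, 65280) + struct.pack("<QQ", 210, 130560)
print(parse_gzi(blob))
```

Note that these are absolute offsets rather than per-frame sizes, so no running sum is needed before searching them.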