Remix.run Logo
mbreese 6 days ago

I’m trying to learn more about the seekable zstd format. I don’t know very much about zstd, aside from reading the spec a few weeks ago. But I thought this was part of the spec? IIRC, zstd files don’t have to have just one frame. Is the norm to have just one large frame for a file and the multiple frame version just isn’t as common?

Gzip can also have multiple “frames” concatenated together and be seamlessly decrypted. Is this basically the same concept? As mentioned by others bgzip uses this feature of gzip to great effect and is the standard compression in bioinformatics because of it (and is sadly hard coded to limit other potentially useful Gzip extensions).

My interest is to see if using zstd instead of gzip as a basis of a format would be beneficial. I expect for there to be better compression, but I’m skeptical if it would be enough to make it worthwhile.

teraflop 6 days ago | parent [-]

The Zstd spec allows a stream to consist of multiple frames, but that alone isn't enough for efficient seeking. You would still need to read every frame header to determine which compressed frame corresponds to a particular byte offset in the uncompressed stream.

"Seekable Zstd" is basically just a multi-frame Zstd stream, with the addition of a "seek table" at the end of the file which contains the compressed and uncompressed sizes of every other frame. The seek table itself is marked as a skippable frame, so that seekable Zstd is backward-compatible with normal Zstd decompressors (the seek table is just treated as metadata and ignored).

https://github.com/facebook/zstd/blob/dev/contrib/seekable_f...

mbreese 6 days ago | parent [-]

Got it. That’s incredibly helpful. Thank you!

The way that’s handled in the bgzip/gzip world is with an external index file (.gzi) with compressed/uncompressed offsets. The index could be auto-computed, but would still require reading the header for each frame.

I vastly prefer the idea of having the index as part of the file. Sadly, gzip doesn’t have the concept of a skippable frame, so that would break naive decompressors. I’m still not sure the file size savings would be big enough to switch over to zstd, but I like the approach.

adrianmonk 6 days ago | parent | next [-]

> having the index as part of the file. Sadly, gzip doesn’t have the concept of a skippable frame

Looking at the file format RFC (https://www.ietf.org/rfc/rfc1952.txt), the compressed frames are called "members" and each member's header has some optional fields: "extra", "name", and "comment".

The comment is meant to be displayed to users (and shouldn't affect compression) so assuming common decoder software is at least able to properly skip over it, it seems like you could put the index data there.

One way to do it would be to compress everything except the last byte of the input data, then create a separate member just for that last byte. That way you can look at the end of the file and pretty easily find the header because the compressed data that follows it will be very tiny.

mbreese 6 days ago | parent [-]

Oh, I’m pretty sure you could set a gzip header field with a full index and a zero-byte payload. You could even make it so that the size of that last block would be in a standard location in the file (at a known offset, still in the gzip header).

One issue with bgzip in particular is that it fixes the gzip header fields allowed, so you can only have one extra value (which is the size of the current block). Because of this, you can’t have new fields in the header for bgzip (the gzip flavor widely used in bioinformatics). One thing I wanted to do was to also add was a header field for sha1/sha256/etc for the current block. When you have files of sufficient size, it can be helpful to have chunk-level signatures to protect against bitrot. This is just one usecase for novel header elements (which is somewhat alleviated as gzip blocks all have their own crc32, but that’s just one idea).

rorosen 6 days ago | parent | prev [-]

Writing the seek table to an external file is also possible with zeekstd, the initial spec of the seekable format doesn't allow this.