Got it. That’s incredibly helpful. Thank you!

The way that’s handled in the bgzip/gzip world is with an external index file (.gzi) with compressed/uncompressed offsets. The index could be auto-computed, but would still require reading the header for each frame.
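A rough sketch of what auto-computing such an index involves, assuming BGZF-style blocks where each gzip member stores its own total size (BSIZE) in a "BC" extra subfield. The helper names `bgzf_block` and `build_index` are made up for illustration, and the result is an in-memory list of offset pairs, not the actual .gzi on-disk layout.

```python
import struct
import zlib


def bgzf_block(data: bytes) -> bytes:
    """Build one BGZF-style gzip member (whole block must fit in 64 KiB)."""
    c = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw DEFLATE, no zlib wrapper
    deflated = c.compress(data) + c.flush()
    # 12-byte fixed header incl. XLEN, 6-byte extra subfield, payload, 8-byte trailer
    total = 12 + 6 + len(deflated) + 8
    # ID1, ID2, CM=8, FLG=FEXTRA, MTIME=0, XFL=0, OS=unknown, XLEN=6
    header = struct.pack("<BBBBIBBH", 0x1F, 0x8B, 8, 0x04, 0, 0, 255, 6)
    # Subfield "BC": SLEN=2, BSIZE = total block size minus 1
    extra = b"BC" + struct.pack("<HH", 2, total - 1)
    trailer = struct.pack("<II", zlib.crc32(data), len(data))  # CRC32, ISIZE
    return header + extra + deflated + trailer


def build_index(blob: bytes) -> list[tuple[int, int]]:
    """Return (compressed_offset, uncompressed_offset) for every block."""
    entries, c_off, u_off = [], 0, 0
    while c_off < len(blob):
        # BSIZE sits at byte 16 of the block (assumes "BC" is the only subfield)
        (bsize,) = struct.unpack_from("<H", blob, c_off + 16)
        block_len = bsize + 1
        # ISIZE (uncompressed length) is the last 4 bytes of the block
        (isize,) = struct.unpack_from("<I", blob, c_off + block_len - 4)
        entries.append((c_off, u_off))
        c_off += block_len
        u_off += isize
    return entries
```

Nothing gets decompressed here, but the walk still has to touch the header and trailer of every block, which is the cost mentioned above.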

I vastly prefer the idea of having the index as part of the file. Sadly, gzip doesn’t have the concept of a skippable frame, so that would break naive decompressors. I’m still not sure the file size savings would be big enough to switch over to zstd, but I like the approach.

adrianmonk 6 months ago

> having the index as part of the file. Sadly, gzip doesn’t have the concept of a skippable frame

Looking at the file format RFC (https://www.ietf.org/rfc/rfc1952.txt), the compressed frames are called "members" and each member's header has some optional fields: "extra", "name", and "comment".

The comment is meant to be displayed to users (and shouldn't affect compression) so assuming common decoder software is at least able to properly skip over it, it seems like you could put the index data there.

One way to do it would be to compress everything except the last byte of the input data, then create a separate member just for that last byte. That way you can look at the end of the file and pretty easily find the header because the compressed data that follows it will be very tiny.

	mbreese 6 months ago
		Oh, I’m pretty sure you could set a gzip header field with a full index and a zero-byte payload. You could even put the size of that last block in a standard location in the file (at a known offset, still in the gzip header). One issue with bgzip (the gzip flavor widely used in bioinformatics) in particular is that it fixes which gzip header fields are allowed, so you can only have one extra value (the size of the current block) and can’t add new fields to the header. One thing I also wanted to add was a header field holding a sha1/sha256/etc. for the current block. When you have files of sufficient size, it can be helpful to have chunk-level signatures to protect against bitrot. That’s just one use case for novel header elements (and it’s somewhat alleviated because gzip blocks all have their own crc32).
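A minimal sketch of that idea in plain gzip (not bgzip), assuming a made-up extra subfield ID "IX" and a made-up layout in which the index bytes are followed by their own uint16 length. With a zero-byte payload, everything after the extra field has a fixed size, so that length lands at a known distance from the end of the file. `index_member` and `read_index` are hypothetical names.

```python
import struct
import zlib

# Raw DEFLATE stream for an empty payload; its size is constant for a given zlib.
_c = zlib.compressobj(wbits=-15)
EMPTY_DEFLATE = _c.compress(b"") + _c.flush()
TAIL_LEN = len(EMPTY_DEFLATE) + 8  # empty payload + CRC32 + ISIZE


def index_member(index: bytes) -> bytes:
    """Build an empty gzip member whose FEXTRA field carries `index`."""
    # 4 bytes of subfield header + 2 bytes of length suffix must fit in XLEN
    if len(index) + 6 > 0xFFFF:
        raise ValueError("index does not fit in a single FEXTRA field")
    # Hypothetical subfield "IX": index bytes, then their length as uint16
    sub_data = index + struct.pack("<H", len(index))
    subfield = b"IX" + struct.pack("<H", len(sub_data)) + sub_data
    # ID1, ID2, CM=8, FLG=FEXTRA, MTIME=0, XFL=0, OS=unknown
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, 0x04, 0, 0, 255)
    header += struct.pack("<H", len(subfield)) + subfield
    trailer = struct.pack("<II", 0, 0)  # CRC32 and ISIZE of empty data
    return header + EMPTY_DEFLATE + trailer


def read_index(blob: bytes) -> bytes:
    """Recover the index by looking only at the end of the file."""
    end = len(blob) - TAIL_LEN - 2  # position of the uint16 length suffix
    (n,) = struct.unpack_from("<H", blob, end)
    return blob[end - n:end]
```

Appending `index_member(...)` to an existing gzip file should leave the decompressed output unchanged for any decoder that skips FEXTRA and accepts multiple members. The catch is that XLEN is a 16-bit field, so an index stored this way tops out just under 64 KiB per member.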

rorosen 6 months ago

Writing the seek table to an external file is also possible with zeekstd, though the initial spec of the seekable format doesn't allow this.