adrianmonk 6 days ago
> having the index as part of the file. Sadly, gzip doesn’t have the concept of a skippable frame

Looking at the file format RFC (https://www.ietf.org/rfc/rfc1952.txt), the compressed frames are called "members", and each member's header has some optional fields: "extra", "name", and "comment". The comment is meant to be displayed to users (and shouldn't affect decompression), so assuming common decoder software can at least skip over it properly, it seems like you could put the index data there.

One way to do it would be to compress everything except the last byte of the input data, then create a separate member just for that last byte and put the index in its header. That way you can look at the end of the file and find the header pretty easily, because the compressed data that follows it will be very tiny.
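To make the idea concrete, here is a minimal Python sketch, assuming the header layout from RFC 1952. The helper name `gzip_member` and the index string are made up for illustration. Note that the comment field is zero-terminated, so a binary index would have to be text-encoded (hex, base64, etc.) before going in there.

```python
import gzip
import struct
import zlib

FCOMMENT = 0x10  # FLG bit 4 in RFC 1952


def gzip_member(payload: bytes, comment: bytes = b"") -> bytes:
    """Build one gzip member by hand, optionally with an FCOMMENT field."""
    if b"\x00" in comment:
        raise ValueError("FCOMMENT is zero-terminated; encode binary data first")
    flg = FCOMMENT if comment else 0
    # ID1, ID2, CM=8 (deflate), FLG, MTIME=0, XFL=0, OS=255 (unknown)
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, flg, 0, 0, 255)
    if comment:
        header += comment + b"\x00"
    deflater = zlib.compressobj(wbits=-15)  # raw deflate, no zlib wrapper
    body = deflater.compress(payload) + deflater.flush()
    # Trailer: CRC32 and length (mod 2**32) of the uncompressed payload
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload) & 0xFFFFFFFF)
    return header + body + trailer


data = b"hello gzip index " * 1000
index_text = b"idx:v0;blocks=1"  # hypothetical index encoding, not a real format

# Everything but the last byte in one member, the last byte in a second
# member whose header carries the index as its comment.
blob = gzip_member(data[:-1]) + gzip_member(data[-1:], comment=index_text)

# An ordinary multi-member decoder skips the comment and returns the data.
assert gzip.decompress(blob) == data
```

This only checks that Python's own `gzip` module tolerates the comment; whether every decoder in the wild does is exactly the assumption made above.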
mbreese 6 days ago
Oh, I’m pretty sure you could set a gzip header field with a full index and a zero-byte payload. You could even make it so that the size of that last block would be in a standard location in the file (at a known offset, still in the gzip header).

One issue with bgzip (the gzip flavor widely used in bioinformatics) in particular is that it fixes which gzip header fields are allowed, so you can only have one extra value (which is the size of the current block). Because of this, you can’t add new fields to the header for bgzip.

One thing I also wanted to add was a header field holding a sha1/sha256/etc. digest of the current block. When files get large enough, it can be helpful to have chunk-level signatures to protect against bitrot. This is just one use case for novel header elements (and it is somewhat alleviated because gzip blocks all have their own crc32, but that’s just one idea).
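A rough Python sketch of the zero-byte-payload variant, assuming the RFC 1952 "extra" field layout (a 16-bit XLEN followed by subfields of SI1, SI2, a 16-bit LEN, and data). The subfield ID `IX`, the function names, and the choice to store the member size as the last four bytes of the subfield are all made up for illustration. Since XLEN is 16 bits, one member can carry at most about 64 KiB of index this way.

```python
import gzip
import struct
import zlib

FEXTRA = 0x04        # FLG bit 2 in RFC 1952
SUBFIELD_ID = b"IX"  # hypothetical subfield ID, not a registered one


def _raw_deflate(payload: bytes) -> bytes:
    d = zlib.compressobj(wbits=-15)  # raw deflate, no zlib wrapper
    return d.compress(payload) + d.flush()


EMPTY_BODY = _raw_deflate(b"")
TAIL_LEN = len(EMPTY_BODY) + 8  # empty deflate stream + CRC32 + ISIZE


def index_member(index: bytes) -> bytes:
    """Zero-byte-payload gzip member carrying `index` in an extra subfield."""
    # fixed header (10) + XLEN (2) + subfield header (4) + index + size (4) + tail
    total = 10 + 2 + 4 + len(index) + 4 + TAIL_LEN
    sub_data = index + struct.pack("<I", total)  # member size at a known tail offset
    if len(sub_data) + 4 > 0xFFFF:
        raise ValueError("extra field is capped at 65535 bytes by the 16-bit XLEN")
    extra = SUBFIELD_ID + struct.pack("<H", len(sub_data)) + sub_data
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, FEXTRA, 0, 0, 255)
    header += struct.pack("<H", len(extra)) + extra
    trailer = struct.pack("<II", zlib.crc32(b""), 0)
    return header + EMPTY_BODY + trailer


def read_index(blob: bytes) -> bytes:
    """Recover the index by looking only at the end of the file."""
    size_at = len(blob) - TAIL_LEN - 4
    (total,) = struct.unpack_from("<I", blob, size_at)
    start = len(blob) - total
    if start < 0 or blob[start:start + 3] != b"\x1f\x8b\x08":
        raise ValueError("no gzip member where the index member should start")
    if not blob[start + 3] & FEXTRA or blob[start + 12:start + 14] != SUBFIELD_ID:
        raise ValueError("trailing member does not carry the index subfield")
    return blob[start + 16:size_at]


data = b"ACGT" * 5000
index = b"0:0"  # hypothetical index contents
blob = gzip.compress(data) + index_member(index)

assert gzip.decompress(blob) == data  # regular decoders see only the data
assert read_index(blob) == index      # index-aware readers seek from the end
```

The reader depends on the writer having produced exactly this tail layout, so a real format would want a version marker or a checksum over the index as well.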