mbreese 6 months ago
Got it. That’s incredibly helpful. Thank you! The way that’s handled in the bgzip/gzip world is with an external index file (.gzi) with compressed/uncompressed offsets. The index could be auto-computed, but would still require reading the header for each frame. I vastly prefer the idea of having the index as part of the file. Sadly, gzip doesn’t have the concept of a skippable frame, so that would break naive decompressors. I’m still not sure the file size savings would be big enough to switch over to zstd, but I like the approach.
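For anyone who hasn't looked inside a .gzi file, here is a minimal sketch of reading one and using it to find where to start decompressing. It assumes the layout described in the bgzip man page (a little-endian uint64 entry count followed by pairs of uint64 compressed and uncompressed offsets). The function names `parse_gzi` and `locate` are made up for illustration, and so is the assumption that the first block at offset 0 is left implicit.

```python
import bisect
import struct


def parse_gzi(raw):
    """Parse .gzi bytes into a list of (compressed_offset, uncompressed_offset).

    Assumed layout: uint64 count, then `count` pairs of uint64, little-endian.
    """
    (count,) = struct.unpack_from("<Q", raw, 0)
    return [struct.unpack_from("<QQ", raw, 8 + 16 * i) for i in range(count)]


def locate(pairs, target):
    """Map an uncompressed offset to (block start in the file, bytes to skip).

    The block starting at compressed offset 0 is treated as implicit here,
    which is an assumption of this sketch.
    """
    uncompressed = [u for _, u in pairs]
    i = bisect.bisect_right(uncompressed, target) - 1
    if i < 0:
        return 0, target
    block_start, block_uoff = pairs[i]
    return block_start, target - block_uoff
```

With the pairs loaded, `locate(pairs, n)` says which block to seek to and how many decompressed bytes to discard before reaching byte `n`.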
adrianmonk 6 months ago
> having the index as part of the file. Sadly, gzip doesn’t have the concept of a skippable frame

Looking at the file format RFC (https://www.ietf.org/rfc/rfc1952.txt), the compressed frames are called "members" and each member's header has some optional fields: "extra", "name", and "comment". The comment is meant to be displayed to users (and shouldn't affect decompression), so assuming common decoder software is at least able to properly skip over it, it seems like you could put the index data there.

One way to do it would be to compress everything except the last byte of the input data, then create a separate member just for that last byte. That way you can look at the end of the file and pretty easily find the header, because the compressed data that follows it will be very tiny.
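A rough sketch of that idea, to check that an ordinary decoder really does skip the comment. It builds each member by hand following the RFC 1952 header layout, because Python's `gzip` module has no way to write a comment field. The function names, the `coff:uoff,...` text encoding of the index, and the backwards search for the header are all assumptions of this sketch. The search in particular is a heuristic that could misfire if the same four bytes happened to appear in the final member's body or trailer.

```python
import gzip
import struct
import zlib

FCOMMENT = 0x10  # FLG bit 4 in RFC 1952


def gzip_member(payload, comment=""):
    """Build one gzip member by hand, optionally with a comment field."""
    flags = FCOMMENT if comment else 0
    # ID1, ID2, CM=8 (deflate), FLG, MTIME=0, XFL=0, OS=255 (unknown)
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, flags, 0, 0, 255)
    if comment:
        # The comment is zero-terminated Latin-1, so it must not contain NUL.
        header += comment.encode("latin-1") + b"\x00"
    deflater = zlib.compressobj(wbits=-15)  # raw deflate, no wrapper
    body = deflater.compress(payload) + deflater.flush()
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload) & 0xFFFFFFFF)
    return header + body + trailer


def build_indexed_gzip(chunks):
    """One member per chunk, plus a final one-byte member carrying the index.

    Assumes `chunks` is a non-empty list and the last chunk is non-empty.
    """
    parts = list(chunks[:-1]) + [chunks[-1][:-1]]
    out = b""
    entries = []
    uoff = 0
    for part in parts:
        entries.append(f"{len(out)}:{uoff}")  # compressed:uncompressed offset
        out += gzip_member(part)
        uoff += len(part)
    # The held-back last byte gets its own tiny member with the index attached.
    return out + gzip_member(chunks[-1][-1:], comment=",".join(entries))


def read_tail_index(blob, tail_size=4096):
    """Find the last member's header near the end of the file and parse it."""
    tail = blob[-tail_size:]
    start = tail.rfind(b"\x1f\x8b\x08" + bytes([FCOMMENT]))
    if start < 0:
        raise ValueError("no member with a comment found in the tail")
    end = tail.index(b"\x00", start + 10)  # comment follows the 10-byte header
    text = tail[start + 10:end].decode("latin-1")
    return [tuple(int(n) for n in entry.split(":")) for entry in text.split(",")]
```

`gzip.decompress(build_indexed_gzip(chunks))` should give back the concatenated chunks, since Python's decoder reads past the comment and handles multiple members. A reader that knows about the index can slice the file at one of the compressed offsets from `read_tail_index` and decompress only that member.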
rorosen 6 months ago
Writing the seek table to an external file is also possible with zeekstd, although the initial spec of the seekable format doesn't allow this.