phiresky 2 hours ago

I think you're looking at a different perspective than me. At _build time_ you need to process O(n), yes, and generate O(n) additional data. But I said "The amount of data you need to decompress is a constant". At _read time_, you need to do exactly three steps:

1. Load the file index - this one scales with the number of files unless you do something smarter and get it down to O(log(n)). This gives you an offset into the file. *That same offset/32 is an offset into your gzip index.*

2. Take that offset and load the 32kB window into memory (constant - it does not change with the number of files, the total size, or anything else apart from the actual file you are looking at)

3. Decompress a 1MB chunk (or more if necessary)

So yes, it's a constant.
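A minimal sketch of why the saved 32kB is enough, using Python's zlib and its `zdict` parameter to stand in for the stored window. (This primes a fresh raw-deflate stream per chunk rather than resuming mid-stream at a bit offset, which is what a real gzip index like zlib's examples/zran.c does; the sizes and names here are illustrative.)

```python
import zlib

data = bytes(range(256)) * 8192   # exactly 2 MiB of sample data
CHUNK = 1 << 20                   # 1 MiB chunks, as in step 3 above
WINDOW = 32 * 1024                # deflate's sliding-window size

chunks, windows = [], []
for i in range(0, len(data), CHUNK):
    win = data[max(0, i - WINDOW):i]   # the 32 kB preceding this chunk
    windows.append(win)
    c = zlib.compressobj(wbits=-15, zdict=win) if win else zlib.compressobj(wbits=-15)
    chunks.append(c.compress(data[i:i + CHUNK]) + c.flush())

# Read time: decompress chunk 1 alone, primed with its saved 32 kB window --
# no need to touch any compressed data before that chunk.
d = zlib.decompressobj(wbits=-15, zdict=windows[1])
assert d.decompress(chunks[1]) == data[CHUNK:2 * CHUNK]
```

Whatever the total archive size, the ancillary data loaded per read stays at one 32kB window plus the chunk itself.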

vlovich123 2 hours ago

My bad. Yes, from the decompression perspective you have O(1) ancillary data to initiate decompression at a given seek point.

This is how seeking can work in encrypted data without any ancillary data, btw - you increment the IV every N bytes, so there's a guaranteed mapping for deriving the IV of any block, and you're bounded by how much extra you have to encrypt/decrypt to do a random byte-range access of the plaintext.
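A toy sketch of that IV scheme. The keystream here is SHA-256 in counter mode as a stand-in for a real CTR-mode cipher - NOT secure, just to show the bounded random access; `SEG` and all names are made up for illustration:

```python
import hashlib

SEG = 4096  # derive a fresh IV every SEG plaintext bytes

def keystream(key: bytes, iv: int, length: int) -> bytes:
    # Toy keystream: SHA-256 in counter mode (illustration only, not secure).
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + iv.to_bytes(16, "big")
                              + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encrypt(key: bytes, base_iv: int, plaintext: bytes) -> bytes:
    # Segment i is encrypted under IV base_iv + i, so any segment's IV
    # can be derived without reading anything else.
    return b"".join(
        xor(plaintext[s:s + SEG],
            keystream(key, base_iv + s // SEG, len(plaintext[s:s + SEG])))
        for s in range(0, len(plaintext), SEG)
    )

def read_range(key: bytes, base_iv: int, ciphertext: bytes,
               start: int, end: int) -> bytes:
    # Random access: only decrypt segments overlapping [start, end);
    # the extra work is bounded by SEG bytes on each side.
    first = (start // SEG) * SEG
    last = min(len(ciphertext), -(-end // SEG) * SEG)  # round end up to SEG
    plain = b"".join(
        xor(ciphertext[s:s + SEG],
            keystream(key, base_iv + s // SEG, len(ciphertext[s:s + SEG])))
        for s in range(first, last, SEG)
    )
    return plain[start - first:end - first]

key, base_iv = b"secret", 1234
plaintext = bytes(range(256)) * 100           # 25,600 bytes of sample data
ciphertext = encrypt(key, base_iv, plaintext)
assert read_range(key, base_iv, ciphertext, 5000, 9000) == plaintext[5000:9000]
```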

But none of this is unique to gzip. You can always do this for any compression algorithm provided you can snapshot the state - the state at a seek point is always going to be fairly small.
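You can demo the state-snapshot idea directly with CPython's `zlib.decompressobj().copy()`: one O(n) build-time pass records the decompressor's state at regular checkpoints, and a read resumes from the nearest one. (A persistent index would serialize the ~32kB window and bit offset, as zlib's examples/zran.c does, instead of keeping live objects; the step size here is arbitrary.)

```python
import os
import zlib

data = os.urandom(1_000_000)   # 1 MB of sample (incompressible) data
comp = zlib.compress(data)

# Build time: one O(n) pass, snapshotting decompressor state every STEP
# bytes of compressed input via decompressobj().copy().
STEP = 256 * 1024
index = []                     # (compressed offset, uncompressed offset, state)
d = zlib.decompressobj()
produced = 0
for i in range(0, len(comp), STEP):
    index.append((i, produced, d.copy()))
    produced += len(d.decompress(comp[i:i + STEP]))

# Read time: resume from a mid-stream checkpoint instead of offset 0.
coff, uoff, state = index[len(index) // 2]
resumed = state.copy().decompress(comp[coff:])
assert resumed == data[uoff:]
```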

phiresky 30 minutes ago

I actually first thought this wasn't possible at all because I'm used to zstd which by default uses a 128MB window and I usually set it to the max (2GB window). 32kB is _really_ tiny in comparison. On the other hand though, zstd also compresses in parallel by default and has tools built in to handle these things, so seekable zstd archives are fairly common.