k__ 4 days ago

"... the last 4 bytes of the gzip format. These bytes are special since store the uncompressed size of the file!"

What's the reason for this?

I could imagine many tools would benefit from knowing the decompressed file size in advance.

philipwhiuk 4 days ago | parent | next [-]

It's straight from the GZIP spec if you assume there's a single GZIP "member": https://www.ietf.org/rfc/rfc1952.txt

> ISIZE (Input SIZE)

> This contains the size of the original (uncompressed) input data modulo 2^32.

So there are two big caveats:

1. Your data is a single GZIP member (I guess this means everything in a folder)

2. Your data is < 2^32 bytes.
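
To make the two caveats concrete, here is a minimal Python sketch of reading ISIZE from the last 4 bytes. The helper name `read_isize` is made up for illustration, and the result is only the true size under the two assumptions above (single member, input under 2^32 bytes).

```python
import gzip
import struct

def read_isize(gz_bytes):
    """Return ISIZE: the last 4 bytes of the stream, little-endian unsigned."""
    # A gzip member has at least a 10-byte header and an 8-byte trailer.
    if len(gz_bytes) < 18:
        raise ValueError("too short to be a gzip stream")
    (isize,) = struct.unpack("<I", gz_bytes[-4:])
    return isize

payload = b"hello gzip" * 1000
compressed = gzip.compress(payload)

# Only trustworthy for a single member whose input is smaller than 2**32 bytes.
assert read_isize(compressed) == len(payload) % 2**32
```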

jerf 3 days ago | parent | next [-]

A GZIP "member" is whatever the creating program wants it to be. I have not carefully verified this but I see no reason for the command line program "gzip" to ever generate more than one member (at least for smaller inputs), after a quick scan through the command line options. I'm sure it's the modal case by far. Since this is specifically about reading .tar.gz files as hosted on npm, this is probably reasonably safe.

However, because of the scale of what bun deals with it's on the edge of what I would consider safe and I hope in the real code there's a fallback for what happens if the file has multiple members in it, because sooner or later it'll happen.

It's not necessarily terribly well known that you can just slam gzip members (or files) together and it's still a legal gzip stream, but it's something I've made use of in real code, so I know it's happened. You can do some simple things with having indices into a compressed file so you can skip over portions of the compressed stream safely, without other programs having to "know" that's a feature of the file format.
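
A small Python sketch of that, assuming two independently compressed members glued together (the helper `trailing_isize` is hypothetical). The stream still decompresses as a whole, but the last 4 bytes only describe the final member, which is exactly the case a fallback would need to handle.

```python
import gzip
import struct

def trailing_isize(stream):
    """ISIZE of whatever member happens to end the stream."""
    (isize,) = struct.unpack("<I", stream[-4:])
    return isize

first = gzip.compress(b"a" * 100)
second = gzip.compress(b"b" * 7)
stream = first + second  # still a legal gzip stream

# gzip.decompress handles multi-member data and returns both payloads.
data = gzip.decompress(stream)
assert data == b"a" * 100 + b"b" * 7

# The trailer reports the last member only, not the total of 107 bytes.
assert trailing_isize(stream) == 7
```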

Although the whole thing is weird in general because you can stream gzip'd tars without ever having to allocate space for the whole thing anyhow. gzip can be streamed without having seen the footer yet and the tar format can be streamed out pretty easily. I've written code for this in Go a couple of times, where I can be quite sure there's no stream rewinding occurring by the nature of the io.Reader system. Reading the whole file into memory to unpack it was never necessary in the first place, not sure if they've got some other reason to do that.
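
The same idea can be sketched in Python rather than Go, using `tarfile`'s stream mode (`"r|gz"`), which reads forward only. `ForwardOnly`, `make_tar_gz` and `stream_entries` are illustrative names, not anything from bun or npm; the wrapper exposes nothing but `read()` so any attempt to rewind would fail.

```python
import io
import tarfile

class ForwardOnly:
    """Exposes only read(), so the consumer cannot seek or rewind."""
    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, n=-1):
        return self._buf.read(n)

def make_tar_gz(files):
    """Build a small .tar.gz in memory from a {name: bytes} mapping."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w:gz") as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return out.getvalue()

def stream_entries(source):
    """Read entries in order from a forward-only gzip'd tar source."""
    found = {}
    with tarfile.open(fileobj=source, mode="r|gz") as tar:
        for member in tar:
            if member.isfile():
                # In stream mode each entry must be read before moving on.
                found[member.name] = tar.extractfile(member).read()
    return found

archive = make_tar_gz({"package/index.js": b"module.exports = 1;\n"})
entries = stream_entries(ForwardOnly(archive))
```

Here each entry is collected into a dict only to keep the example short; a real unpacker would write each entry out as it arrives instead of holding everything in memory.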

k__ 4 days ago | parent | prev [-]

Yeah, I understood that.

I was just wondering why GZIP specified it that way.

ncruces 4 days ago | parent [-]

Because it allows streaming compression.

k__ 4 days ago | parent [-]

Ah, makes sense.

Thanks!

lkbm 4 days ago | parent | prev | next [-]

I believe it's because you get to stream-compress efficiently, at the cost of stream-decompress efficiency.
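
A rough Python sketch of the compress side, assuming `zlib.compressobj` with `wbits=31` to get gzip framing (the generator name `gzip_stream` is made up). The compressor emits output chunk by chunk without knowing the total length, and the size can only be written once the input ends, which is why it sits in the trailer.

```python
import struct
import zlib

def gzip_stream(chunks):
    """Compress an iterable of byte chunks into gzip output, piece by piece."""
    comp = zlib.compressobj(wbits=31)  # 31 selects gzip header and trailer
    for chunk in chunks:
        out = comp.compress(chunk)
        if out:
            yield out
    # Only at flush time are CRC32 and ISIZE known and appended.
    yield comp.flush()

stream = b"".join(gzip_stream(iter([b"abc", b"defg"])))

assert stream[:2] == b"\x1f\x8b"                  # gzip magic bytes
assert zlib.decompress(stream, wbits=31) == b"abcdefg"
(isize,) = struct.unpack("<I", stream[-4:])
assert isize == 7                                  # total input length
```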

8cvor6j844qw_d6 4 days ago | parent | prev [-]

gzip.py [1]

---

def _read_eof(self):
    # We've read to the end of the file, so we have to rewind in order
    # to reread the 8 bytes containing the CRC and the file size.
    # We check that the computed CRC and size of the
    # uncompressed data match the stored values. Note that the size
    # stored is the true file size mod 2**32.

---

[1]: https://stackoverflow.com/a/1704576
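
The check those comments describe can be sketched roughly like this. It is not the actual `gzip.py` code, just an illustration that assumes a single-member stream; `check_trailer` is a hypothetical helper, and zlib already performs the same verification internally when decompressing.

```python
import gzip
import struct
import zlib

def check_trailer(gz_bytes):
    """Compare the stored CRC32 and ISIZE with values computed from the data."""
    # Trailer layout: 4-byte CRC32, then 4-byte ISIZE, both little-endian.
    crc_stored, isize_stored = struct.unpack("<II", gz_bytes[-8:])
    data = zlib.decompress(gz_bytes, wbits=31)
    crc_ok = (zlib.crc32(data) & 0xFFFFFFFF) == crc_stored
    size_ok = (len(data) % 2**32) == isize_stored
    return crc_ok and size_ok

assert check_trailer(gzip.compress(b"hello world"))
```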