mrcarrot 4 days ago

The "Optimized Tarball Extraction" confuses me a bit. It begins by illustrating how other package managers have to repeatedly copy the received, compressed data into larger and larger buffers (not mentioning anything about the buffer where the decompressed data goes), and then says that:

> Bun takes a different approach by buffering the entire tarball before decompressing.

But it seems to sidestep _how_ it does this any differently from the "bad" snippet the section opened with (presumably it checks the Content-Length header when fetching the tarball, or something similar, and can assume the size it gets from there is correct). All it says about this is:

> Once Bun has the complete tarball in memory it can read the last 4 bytes of the gzip format.

Then it explains how it can pre-allocate a buffer for the decompressed data, but we never saw how this buffer allocation happens in the "bad" example!

> These bytes are special since they store the uncompressed size of the file! Instead of having to guess how large the uncompressed file will be, Bun can pre-allocate memory to eliminate buffer resizing entirely

Presumably the saving is that the slow package managers have to expand _both_ of the buffers involved, while Bun preallocates at least one of them?
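
For concreteness, the "last 4 bytes" are the gzip ISIZE trailer, which holds the uncompressed size modulo 2^32 in little-endian order. A rough sketch of reading it and pre-allocating the output, with made-up names rather than Bun's actual code:

    // Sketch only, not Bun's implementation: read the gzip ISIZE trailer
    // and size the decompressed-output buffer once, up front.
    function uncompressedSizeHint(tarball: Buffer): number {
      // Last 4 bytes of a gzip stream: uncompressed size mod 2^32, little-endian.
      return tarball.readUInt32LE(tarball.length - 4);
    }

    function preallocateOutput(tarball: Buffer): Buffer {
      // The output buffer is allocated once instead of grown repeatedly.
      return Buffer.allocUnsafe(uncompressedSizeHint(tarball));
    }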

Jarred 3 days ago

Here is the code:

https://github.com/oven-sh/bun/blob/7d5f5ad7728b4ede521906a4...

We trust the self-reported size by gzip up to 64 MB, try to allocate enough space for all the output, then run it through libdeflate.

This is instead of a loop that decompresses chunk by chunk, extracts chunk by chunk, and resizes a big tarball buffer many times over.
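
A minimal sketch of that whole-buffer path, using Node's zlib as a stand-in for libdeflate (the names and cap handling are illustrative, not Bun's actual code):

    import { gunzipSync } from "node:zlib";

    // Illustrative only; Bun allocates once and runs libdeflate, this uses
    // Node's zlib as a stand-in.
    function extractWholeTarball(tarball: Buffer): Buffer {
      // Self-reported uncompressed size from the gzip trailer.
      const isize = tarball.readUInt32LE(tarball.length - 4);
      // Trust it only up to a cap (64 MB, per the comment above).
      const hint = Math.min(isize, 64 * 1024 * 1024);
      // Sizing the decompressor's buffer to the hint avoids repeated
      // growth/copies in the common case: one call, one allocation.
      return gunzipSync(tarball, { chunkSize: Math.max(hint, 64) });
    }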

mrcarrot 3 days ago

Thanks - this does make sense in isolation.

I think my actual issue is that the "most package managers do something like this" example code snippet at the start of [1] doesn't seem to quite make sense - or doesn't match what I guess would actually happen in the decompress-in-a-loop scenario?

As in, it appears to illustrate building up a buffer holding the compressed data that's being received (since the "// ... decompress from buffer ..." comment at the end suggests what we're receiving in `chunk` is compressed). But I guess the problem with the decompress-as-the-data-arrives approach in reality is having to re-allocate the buffer for the decompressed data?
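
Something like the following is what I imagine the decompress-as-it-arrives loop looks like in practice (hypothetical code, not the blog's snippet) — the repeated cost lands on the decompressed output as much as on the compressed buffer:

    import { createGunzip } from "node:zlib";

    // Hypothetical streaming extraction, not the blog's snippet: the
    // decompressor emits output pieces whose total size isn't known until
    // the end, so they are collected and copied again into one buffer.
    // (Error handling and backpressure omitted for brevity.)
    async function extractStreaming(chunks: AsyncIterable<Uint8Array>): Promise<Buffer> {
      const gunzip = createGunzip();
      const pieces: Buffer[] = [];
      gunzip.on("data", (piece: Buffer) => pieces.push(piece));

      for await (const chunk of chunks) {
        gunzip.write(chunk); // compressed bytes fed in as they arrive
      }
      gunzip.end();
      await new Promise((resolve) => gunzip.once("end", resolve));

      // Only now is the total size known, so everything is copied once more.
      return Buffer.concat(pieces);
    }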

[1] https://bun.com/blog/behind-the-scenes-of-bun-install#optimi...