Remix.run Logo
mort96 2 hours ago

'zstd --version' reports: "** Zstandard CLI (64-bit) v1.5.7, by Yann Collet **". This is zstd installed through Homebrew on macOS 26 on an M1 Pro laptop. Also of interest, I was able to reproduce this with a random binary I had in /bin: https://floss.social/@mort/115940378643840495

I was completely unable to reproduce it on my Linux desktop though: https://floss.social/@mort/115940627269799738

terrelln 2 hours ago | parent [-]

I've figured out the issue. Use `wc -c` instead of `du`.

I can repro on my Mac with these steps with either `zstd` or `gzip`:

    $ rm -f ksh.zst
    $ zstd < /bin/ksh > ksh.zst
    $ du -h ksh.zst
    1.2M ksh.zst
    $ wc -c ksh.zst
     1240701 ksh.zst
    $ zstd < /bin/ksh > ksh.zst
    $ du -h ksh.zst
    2.0M ksh.zst
    $ wc -c ksh.zst
     1240701 ksh.zst
    
    $ rm -f ksh.gz
    $ gzip < /bin/ksh > ksh.gz
    $ du -h ksh.gz
    1.2M ksh.gz
    $ wc -c ksh.gz
     1246815 ksh.gz
    $ gzip < /bin/ksh > ksh.gz
    $ du -h ksh.gz
    2.1M ksh.gz
    $ wc -c ksh.gz
     1246815 ksh.gz
When a file is overwritten, the on-disk size is bigger. I don't know why. But you must have ran zstd's benchmark twice, and every other compressor's benchmark once.

I'm a zstd developer, so I have a vested interest in accurate benchmarks, and finding & fixing issues :)

mort96 an hour ago | parent [-]

Interesting!

It doesn't seem to be only about overwriting, I can be in a directory without any .zst files and run the command to compress 55 files in parallel and it's still 45M according to 'du -h'. But you're right, 'wc -c' shows 38809999 bytes regardless of whether 'du -h' shows 45M after a parallel compression or 38M after a sequential compression.

My mental model of 'du' was basically that it gives a size accurate to the nearest 4k block, which is usually accurate enough. Seems I have to reconsider. Too bad there's no standard alternative which has the interface of 'du' but with byte-accurate file sizes...

terrelln an hour ago | parent [-]

Yeah, it isn't quite that simple. E.g. `/bin/ksh` reports 1.4MB, but it is actually 2.4MB. Initially, I thought it was because the file was sparse, but there are only 493KB of zeros. So something else is going on. Perhaps some filesystem-level blocks are deduped from other files? Or APFS has transparent compression? I'm not sure.

It does still seem odd that APFS is reporting a significantly larger disk-size for these files. I'm not sure why that would ever be the case, unless there is something like deferred cleanup work.