jmillikin 3 hours ago

I use CBZ to archive both physical and digital comic books, so I was interested in the idea of an improved container format, but the claimed improvements here don't make sense.

---

For example, they make a big deal about each archive entry being aligned to a 4 KiB boundary "allowing for DirectStorage transfers directly from disk to GPU memory", but the pages within a CBZ are going to be encoded (JPEG/PNG/etc.) rather than raw bitmaps. They need to be decoded first; the GPU isn't going to let you create a texture directly from JPEG data.
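
A rough illustration of that decode step, using Python's zipfile and Pillow rather than anything GPU-specific ("example.cbz" is just a placeholder name): the compressed bytes are useless to a texture upload until a decoder has turned them into raw pixels.

  import io
  import zipfile
  from PIL import Image

  # Read the compressed page straight out of the CBZ (it's just a zip).
  with zipfile.ZipFile("example.cbz") as cbz:      # hypothetical archive
      jpeg_bytes = cbz.read(cbz.namelist()[0])     # first page, still JPEG-encoded

  # The decode is unavoidable: a texture wants raw pixels, not JPEG.
  image = Image.open(io.BytesIO(jpeg_bytes)).convert("RGBA")
  width, height = image.size
  pixels = image.tobytes()   # this RGBA buffer is what a texture upload actually takes

  print(f"{len(jpeg_bytes)} compressed bytes -> {len(pixels)} bytes of pixels")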

Furthermore, the README says "While folders allow memory mapping, individual images within them are rarely sector-aligned for optimized DirectStorage throughput" which ... what? If an image file needs to be sector-aligned (!?), then a BBF file would also need to be, or else the 4 KiB alignment within the file doesn't work. So what is special about the format that causes the OS to place its files differently on disk?

Also, the official DirectStorage docs (https://github.com/microsoft/DirectStorage/blob/main/Docs/De...) say this:

  > Don't worry about 4-KiB alignment restrictions
  > * Win32 has a restriction that asynchronous requests be aligned on a
  >   4-KiB boundary and be a multiple of 4-KiB in size.
  > * DirectStorage does not have a 4-KiB alignment or size restriction. This
  >   means you don't need to pad your data which just adds extra size to your
  >   package and internal buffers.
Where is the supposed 4 KiB alignment restriction even coming from?

There are zip-based formats that align files so they can be mmap'd as executable pages, but that's not what's happening here, and I've never heard of a JPEG/PNG/etc image decoder that requires aligned buffers for the input data.

Is the entire 4 KiB alignment requirement fictitious?

---

The README also talks about using xxhash instead of CRC32 for integrity checking (the OP calls it "verification"), claiming this is more performant for large collections, but this is insane:

  > ZIP/RAR use CRC32, which is aging, collision-prone, and significantly slower
  > to verify than XXH3 for large archival collections.  
  > [...]  
  > On multi-core systems, the verifier splits the asset table into chunks and
  > validates multiple pages simultaneously. This makes BBF verification up to
  > 10x faster than ZIP/RAR CRC checks.
CRC32 is limited by memory bandwidth if you're using a normal (i.e. SIMD) implementation. Assuming 100 GiB/s of throughput, a typical comic book page (a few megabytes) takes a few tens of microseconds. And there's no data dependency between file content checksums in the zip format, so for a CBZ you can run the CRC32 calculations in parallel for each page, just like BBF says it does.
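
To show that nothing in zip prevents the same per-entry parallel verification, here's a rough stdlib-only sketch (the archive path and worker count are arbitrary); each entry's expected CRC32 sits independently in the central directory, so entries can be checked in any order:

  import zipfile
  import zlib
  from concurrent.futures import ThreadPoolExecutor

  ARCHIVE = "example.cbz"  # hypothetical path

  def crc_ok(name: str) -> bool:
      # Open a separate handle per task so threads don't share file state.
      with zipfile.ZipFile(ARCHIVE) as zf:
          expected = zf.getinfo(name).CRC
          actual = zlib.crc32(zf.read(name))
          return actual == expected

  with zipfile.ZipFile(ARCHIVE) as zf:
      names = zf.namelist()

  with ThreadPoolExecutor(max_workers=8) as pool:
      results = dict(zip(names, pool.map(crc_ok, names)))

  bad = [n for n, ok in results.items() if not ok]
  print("all pages OK" if not bad else f"corrupt entries: {bad}")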

But that doesn't matter because to actually check the integrity of archived files you want to use something like sha256, not CRC32 or xxhash. Checksum each archive (not each page), store that checksum as a `.sha256` file (or whatever), and now you can (1) use normal tools to check that your archives are intact, and (2) record those checksums as metadata in the blob storage service you're using.
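
Something as simple as this does the whole-archive version, and emits output that plain `sha256sum -c` understands (file names here are placeholders):

  import hashlib
  from pathlib import Path

  def write_sha256(archive: Path) -> Path:
      # Hash the archive file itself in chunks, not the pages inside it.
      h = hashlib.sha256()
      with archive.open("rb") as f:
          for chunk in iter(lambda: f.read(1 << 20), b""):
              h.update(chunk)
      # "<hex>  <name>" is the line format sha256sum -c expects.
      sidecar = archive.parent / (archive.name + ".sha256")
      sidecar.write_text(f"{h.hexdigest()}  {archive.name}\n")
      return sidecar

  write_sha256(Path("example.cbz"))  # hypothetical archive
  # later: run `sha256sum -c example.cbz.sha256` in the same directory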

---

The Reddit thread has more comments from people who have noticed other sorts of discrepancies, and the author is having a really difficult time responding to them in a coherent way. The most charitable interpretation is that this whole project (supposed problems with CBZ, the readme, the code) is the output of an LLM.

creata 2 hours ago | parent

> The most charitable interpretation is that this whole project (supposed problems with CBZ, the readme, the code) is the output of an LLM.

Do LLMs perform de/serialization by casting C structs to char pointers? I would've expected that to have been trained out of them. (Which is to say: lots of it is clearly LLM-generated, but at least some of the code might be human.)

Anyway, I hope the person who published this can take all the responses constructively. I know I'd feel awful if I were getting so much negative feedback.