I got into an argument on Discord about how inefficient CBR/CBZ is, so I wrote a new file format
53 points by pushcx
The comments have interesting discussion that call into question some of the purported performance wins, but having a file format announcement start with "I was arguing on the r/yuri_manga Discord" is objectively hilarious and I am in favor of seeing more of these :D
Reminds me of the /sci/ anon who proved a result about superpermutations to find an optimal way of watching Haruhi: https://en.wikipedia.org/wiki/Superpermutation#Lower_bounds,_or_the_Haruhi_problem
There's quite a bit of amazing tech that has come from anime/manga communities ^^
It's just unfortunate that this doesn't look like one of those, but rather seems likely to be LLM slop...
Is there any particular reason you believe this to be slop? I spent a couple minutes looking at the code and nothing obvious jumps out at me (then again, I'm more familiar with how LLM-generated prose feels).
Okay, I'm still not sure to what extent LLMs were involved, but at the very least the performance claims are heavily suspect.
spite-driven development
For years & years now I have been using "CFRO" - Caffeine-Fueled Rage Optimization because the picture it paints amuses me. ;-) TFA does not opine on involvement of stimulants. So, hard to judge.
I use CBZ to archive both physical and digital comic books so I was interested in the idea of an improved container format, but the claimed improvements here don't make sense.
For example they make a big deal about each archive entry being aligned to a 4 KiB boundary "allowing for DirectStorage transfers directly from disk to GPU memory", but the pages within a CBZ are going to be encoded (JPEG/PNG/etc) rather than just being bitmaps. They need to be decoded first; the GPU isn't going to let you create a texture directly from JPEG data.
Furthermore the README says "While folders allow memory mapping, individual images within them are rarely sector-aligned for optimized DirectStorage throughput" which ... what? If an image file needs to be sector-aligned (!?) then a BBF file would also need to be, else the 4 KiB alignment within the file doesn't work, so what is special about the format that causes the OS to place its files differently on disk?
Also in the official DirectStorage docs (https://github.com/microsoft/DirectStorage/blob/main/Docs/DeveloperGuidance.md) it says this:
Don't worry about 4-KiB alignment restrictions
- Win32 has a restriction that asynchronous requests be aligned on a 4-KiB boundary and be a multiple of 4-KiB in size.
- DirectStorage does not have a 4-KiB alignment or size restriction. This means you don't need to pad your data which just adds extra size to your package and internal buffers.
Where is the supposed 4 KiB alignment restriction even coming from?
There are zip-based formats that align files so they can be mmap'd as executable pages, but that's not what's happening here, and I've never heard of a JPEG/PNG/etc image decoder that requires aligned buffers for the input data.
Is the entire 4 KiB alignment requirement fictitious?
The README also talks about using xxhash instead of CRC32 for integrity checking (the OP calls it "verification"), claiming this is more performant for large collections, but this is insane:
ZIP/RAR use CRC32, which is aging, collision-prone, and significantly slower to verify than XXH3 for large archival collections.
[...]
On multi-core systems, the verifier splits the asset table into chunks and validates multiple pages simultaneously. This makes BBF verification up to 10x faster than ZIP/RAR CRC checks.
CRC32 is limited by memory bandwidth if you're using a normal (i.e. SIMD) implementation. Assuming 100 GiB/s throughput, a typical comic book page (a few megabytes) will take like ... tens of microseconds? And there's no data dependency between file content checksums in the zip format, so for a CBZ you can run the CRC32 calculations in parallel for each page just like BBF says it does.
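To put that in concrete terms, here's a rough stdlib-Python sketch (not from the project; "book.cbz" is a made-up name) that verifies every entry of a CBZ against the CRC-32s already stored in the central directory, one worker per page:

    # Rough sketch: parallel CRC-32 verification of every entry in a CBZ (plain ZIP).
    # zipfile checks the stored CRC-32 while decompressing and raises BadZipFile on
    # a mismatch. Entries are independent, so they parallelize trivially.
    import zipfile
    from concurrent.futures import ThreadPoolExecutor

    ARCHIVE = "book.cbz"  # hypothetical

    def verify(name):
        with zipfile.ZipFile(ARCHIVE) as zf:  # one handle per worker
            try:
                zf.read(name)
                return name, True
            except zipfile.BadZipFile:
                return name, False

    with zipfile.ZipFile(ARCHIVE) as zf:
        names = zf.namelist()

    with ThreadPoolExecutor() as pool:
        bad = [name for name, ok in pool.map(verify, names) if not ok]
    print(bad or "all pages OK")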
But that doesn't matter because to actually check the integrity of archived files you want to use something like sha256, not CRC32 or xxhash. Checksum each archive (not each page), store that checksum as a .sha256 file (or whatever), and now you can (1) use normal tools to check that your archives are intact, and (2) record those checksums as metadata in the blob storage service you're using.
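Something along these lines (stdlib Python; "comics/" is a made-up directory) writes a manifest that plain sha256sum -c can check later:

    # Rough sketch: one SHA-256 per archive, in the format `sha256sum -c` understands.
    import hashlib
    from pathlib import Path

    def sha256_of(path, chunk_size=1 << 20):
        h = hashlib.sha256()
        with path.open("rb") as f:
            while chunk := f.read(chunk_size):
                h.update(chunk)
        return h.hexdigest()

    lines = [f"{sha256_of(p)}  {p.name}" for p in sorted(Path("comics").glob("*.cbz"))]
    Path("comics/checksums.sha256").write_text("\n".join(lines) + "\n")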
The Reddit thread has more comments from people who have noticed other sorts of discrepancies, and the author is having a really difficult time responding to them in a coherent way. The most charitable interpretation is that this whole project (supposed problems with CBZ, the readme, the code) is the output of an LLM.
Yeah...
No Random Access. CBZ spikes CPU usage when scrubbing through pages.
Not true. ZIP files allow random access to individual files.
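Any ZIP reader can seek straight to one page via the central directory without touching the rest of the archive. For example (Python stdlib, file and page names made up):

    # Rough sketch: random access into a CBZ. zipfile reads the central directory
    # at the end of the file, then seeks directly to the single entry requested.
    import zipfile

    with zipfile.ZipFile("book.cbz") as zf:
        page = zf.read("page_042.jpg")  # only this entry is read and decompressed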
Footer-Based Index. BBF doesn't have to parse a central directory, it only has to read the footer to know where every page is.
That's... what a ZIP file central directory is. A footer. That tells you where every page is.
Per-Asset Hashes. Every asset (and the footer) has an associated XXH3 hash with it, so you can quickly verify the entire book or just a single page nearly instantly.
... which again is exactly how ZIP checksums work.
Also, CRC is parallelizable, even within a single file. It's a completely linear function. You can split a file into pieces, compute the CRCs in parallel, and merge the result into a CRC for the whole file (with a bit of math).
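The "bit of math" is small enough to show. Here's the simple O(n) zero-feeding version in stdlib Python (the zlib C library's crc32_combine() does the same length shift in O(log n) with GF(2) matrix tricks):

    # Rough sketch: merge CRC-32s computed independently on two chunks.
    import os
    import zlib

    a, b = os.urandom(1 << 20), os.urandom(1 << 20)  # two chunks checksummed "in parallel"
    crc_a, crc_b = zlib.crc32(a), zlib.crc32(b)

    # Shift crc_a forward by len(b) bytes (as if 'a' were followed by zeros), then
    # XOR in crc_b; XOR-ing in the CRC of the same zero run cancels the extra
    # init/final-XOR conditioning that crc_b picked up by starting from scratch.
    zeros = b"\x00" * len(b)
    combined = zlib.crc32(zeros, crc_a) ^ crc_b ^ zlib.crc32(zeros)

    assert combined == zlib.crc32(a + b)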
This does not pass the LLM smell test. I flagged it as "spam" but I wish we had a flag for "slop"...
This does not pass the LLM smell test.
The classic LLM "bullet points/numbered list with short summary at the start of each item" style does ring alarm bells for me as well...
(this post is more of an aside than replying or arguing)
CRC32 is limited by memory bandwidth if you're using a normal (i.e. SIMD) implementation.
One thing to keep in mind is that the easiest way to do that on Intel, using the crc32 instructions from SSE4.2, computes CRC-32C, which is a different polynomial than the one used in Zip etc. There are the CLMUL (carry-less multiply) and similar instructions to do it the hard way, though.
[second level quote] CRC32, which is aging, collision-prone
That's a pet peeve of mine: Cyclic Redundancy Checks aren't "aging", they're still the best-in-class solution, but for an entirely different problem than checking the integrity of files.
CRCs and checksums have a property of guaranteed detection of errors up to a certain Hamming Distance (i.e. the number of bits flipped compared to the original; for CRCs the maximum HD depends on the chosen polynomial and the length of the input), and CRCs in addition can detect all burst errors up to the CRC's length, if there aren't any bit flips anywhere else.
Hash functions do not have this property, and it doesn't even seem to be desirable, at least for cryptographic hash functions (https://crypto.stackexchange.com/a/89716)
I'm not entirely sure, but I think it should be safe to assume that a file read from a hard drive won't have any uncorrelated bit flips (since it's already above a layer of error correction), and detecting whole bad bytes or worse (so HD>=8) is only guaranteed for absolutely pitiful sizes (see https://users.ece.cmu.edu/~koopman/crc/crc32.html )
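The burst guarantee, at least, is easy to sanity-check empirically (rough stdlib-Python demo, purely illustrative):

    # Flip a random nonzero pattern confined to 32 consecutive bits and confirm
    # CRC-32 always notices. This is the guaranteed burst-detection property,
    # so it holds for every trial, not just with high probability.
    import os
    import random
    import zlib

    data = bytearray(os.urandom(1 << 16))
    baseline = zlib.crc32(data)

    for _ in range(10_000):
        corrupted = bytearray(data)
        start = random.randrange(len(data) * 8 - 32)  # burst position, in bits
        pattern = random.getrandbits(32) | 1          # nonzero, at most 32 bits wide
        for i in range(32):
            if (pattern >> i) & 1:
                bit = start + i
                corrupted[bit // 8] ^= 1 << (bit % 8)
        assert zlib.crc32(corrupted) != baseline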
So using a fast hash function instead of a CRC or checksum might be ok, as long as the designer knows what they're trying to achieve, which I suspect the OP does not.
For disk storage, corruption primarily occurs either during the transfer to/from memory or cold bit rot. Both are already protected by CRC but demonstrably fail to detect many instances of corruption. PCIe v6 is adding a CRC64 to mitigate transfer corruption issues, which will be nice.
Detecting bit rot is not an ideal use case for CRC. The use of 128-bit (or larger) hashes in modern storage engines addresses an issue with the existing CRC16/32 failing to reliably detect bit rot. A hash does not have the same theoretical properties as a CRC but is more robust for detecting bit rot if you use a high-quality hash algorithm.
That last condition is doing a lot of heavy lifting. Many people use algorithms like 'xxh3' designed for hash tables but which are demonstrably low-quality algorithms for data integrity applications. In these cases, these hash algorithms may in fact be worse than using a CRC despite the increased bit width.
I find that most devs are very confused about the nuances of algorithm selection for these kinds of things. They look for the one-size-fits-all algorithm when in fact you need 3-4 different algorithms depending on the specific application.
Doesn't zip already allow you to access random files in the archive?
Yes, that sounds like a limitation of the readers rather than the file format.
Plus the entries are independently compressed (not that archive compression should matter much for image files) so the corruption of one entry has no reason to affect the next one(s) at all.
Edit: looks like the author has corrected themselves in the repository.
There's a file format that has been standard (like, as in ISO) since 2008, has efficient indexing, TOC management, fast page previewing and lossless image round-tripping. It can even hold text layers.
It's called PDF. But you've probably never heard of it.
I'd assume simplicity is a big factor in why people use CBZ over PDFs. Writing a reasonably complete CBZ reader is probably going to be much easier than writing a PDF reader, and you can inspect/edit it with simple tools.
Can't say the same about this format, though.
It may also be simplicity of production (going from scan to distribution) as well. Or, like a lot of things in the filesharing community, a bad or dubious decision became the "done thing" at some point (like the use of rar or HEVC) and stuck around long past the point there was an argument to be made in favor of it.
It's okay, but PDF is a bit mediocre in that all of those things are optional features that might not be present in a given file, and there's no trivial way to check for them.
(Also ISO is terrible because they charge a substantial fee for viewing the standards, unlike say ECMA.)
The overhead of the container format is dwarfed by the efficiency of the image compression. Work on replacing jpeg/png with a modern format and you'll benefit the scene.
So it says
Zero-Copy Architecture. The file is 4KB-aligned. We map the file directly from disk to memory/GPU. No buffers, no copying. BBF is DirectStorage ready.
I'm unclear why 4k alignment? In memory, page alignment isn't relevant: for small data it's wasteful, and for large data it's moot. If you map the entire file the OS will just fetch (and even prefetch) data, and absent memory pressure won't page out the data after access. At the same time 4k alignment gains you nothing on systems that don't use 4k pages (the various 16k, 64k, and larger platforms).
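For what it's worth, mapping the whole archive and reading at whatever offset the data happens to sit at is already trivial, and nothing in the file needs to be 4 KiB-aligned for it (Python sketch, name and offset made up):

    # Rough sketch: map a whole CBZ and slice it at a deliberately unaligned offset.
    # The kernel faults pages in (and reads ahead) on its own; the file's internal
    # layout doesn't need any particular alignment for this to work.
    import mmap

    with open("book.cbz", "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            chunk = mm[123_457 : 123_457 + 4096]  # unaligned slice; still fine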
I pushed the author on this in this issue. They did not know what "zero-copy" means, or they did not know how GPU rendering works, or they did not know how image formats work. I'm glad that they're learning something but it's disappointing that they could only do it by sharing slop with a community in violation of their guidelines.
I’d assume that has more to do with making disk reads faster than memory speed. Are you saying that’s obviated by memory mapping?
Interesting - maybe? I think modern SSDs are built on >4k NAND pages (8 or 16k?), but at the same time I think their presented sector size is still 4k? But I have no idea how the sector vs NAND page size difference actually impacts how much is read at once - i.e. while the NAND itself operates on Xk pages, is only 4k actually sent over the bus?
Dunno about pages, but I think block size is determined by the filesystem configuration. You may have a point in that the best block size may ultimately be dictated by drive design. We’re getting way outside my area of expertise here, haha. Source: Have formatted a few drives.
Edit: Seems like the whole point is moot, it’s more likely the project is the product of a series of non-sentient dice rolls and therefore none of these decisions were made for any particular reason.
NAND is physically broken into pages it can be accessed in, and SSD interfaces don’t give the OS arbitrary control of the minimal addressable granularity, so I’m trying to understand if any “page size” choice you make in software impacts the real performance at all? In principle spinning platters benefit from exact placement, but for NAND is it sufficient to be in the same page, or is it close to irrelevant once you’re operating at some fixed block size? E.g. reading single bytes scattered randomly vs reading the minimum loadable block randomly.
But from other comments it sounds like I might have been assuming good rationales where none were present :)