Bab: a family of hashing functions for p2p networks
26 points by dustyweb
26 points by dustyweb
The spec is a little bit more clear about why a new hash function was needed:
The key feature of Bab is streaming verification of a string: as a program reads a string, it can verify at regular intervals that the prefix read so far is indeed part of a string hashing to some target digest.
Which of course is important if you have P2P programs streaming large blobs of data to each other.
Aside: the pixel font on the website is unnecessarily painful and difficult to read, especially the small italicized text.
Seems similar to IPFS or BitTorrent DHT -- send a tree of hashes, then the content, and the client can incrementally verify the content against the hash tree nodes.
As a minor nitpick of the terminology in their spec, this is not "content-addressed storage" as it's usually understood because the same content can have different checksums if the chunk size, hash algorithm, etc is parameterized. "The chunk_size value must be a natural number greater than zero" means a 4 GiB file has > ~4 billion possible checksum variants even if everything else is hardcoded.
If you can't compute a single storage address of some content using only the content itself then it's not "content-addressed" (it might still be useful, it's just called something else).
I'm more used to seeing this as a two-layer send where the tree is itself hashed and sent all at once, so you end up with a protocol like:
GET sha256:00002000:123...DEF
<obtain the tree that serializes to 8192 bytes with SHA256 checksum 123...DEF>
GET sha256:00100000:<hash of 1 MiB chunk 1...>
GET sha256:00100000:<hash of 1 MiB chunk 2...>
...
The advantage being that a structured encoded tree is a bit more flexible than just sending a list of checksums in some order, for example you can have chunks be different lengths and then apply content-defined chunking. And maybe some simple built-in decompression if the content has long runs of predictable bytes (i.e. object code).
Haven't read the spec, but I find it weird that they mention it being content-addressed. These functions are ones powering the Willow protocol, which explicitly mentions in their comparison that it is not content-addressable https://willowprotocol.org/more/willow_compared/index.html#willow_compared in commonly used meaning. Willow is way more higher level than these primitives but the addressing part, as far I understand, is left for the user.