Introducing go-cdc-chunkers: chunk and deduplicate everything

10 points by poolpOrg


xfbs

Can I just point out that content-defined chunking is amazing. There are great Rust libraries for it too.

If you don’t understand it, let me illustrate: let’s say you chunk content into fixed 8 KB chunks and send the hashes over. If you insert a byte anywhere, every chunk boundary after that point shifts, so all of the hashes from there on change. Which sucks if you use them for data deduplication or storage.
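A toy Go sketch of that failure mode (the function name and sizes here are mine, not any library’s API): hash fixed-size chunks, insert one byte at the front, and every chunk hash after the edit differs.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// fixedChunks splits data into fixed-size pieces and hashes each one.
func fixedChunks(data []byte, size int) [][32]byte {
	var hashes [][32]byte
	for i := 0; i < len(data); i += size {
		end := i + size
		if end > len(data) {
			end = len(data)
		}
		hashes = append(hashes, sha256.Sum256(data[i:end]))
	}
	return hashes
}

func main() {
	orig := []byte("aaaabbbbccccdddd")
	edited := append([]byte{'x'}, orig...) // insert one byte at the front

	a := fixedChunks(orig, 4)
	b := fixedChunks(edited, 4)
	for i := range a {
		// every comparison prints false: one insert shifted everything
		fmt.Printf("chunk %d unchanged: %v\n", i, a[i] == b[i])
	}
}
```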

With CDC, you use a probabilistic method (a rolling hash over a sliding window) to determine where to split. So in the best case, you just have a single chunk with a different hash that is one byte longer.
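For contrast, here is a minimal content-defined chunker in Go using a Gear-style rolling hash, roughly in the spirit of the FastCDC family: shift the hash, mix in a per-byte random table entry, and cut wherever the low bits are zero. The names, table seed, and size bounds are illustrative assumptions, not go-cdc-chunkers’ actual API.

```go
package main

import (
	"fmt"
	"math/rand"
)

// gear maps each byte value to a fixed random 64-bit word; both sides of
// a sync must share the same table for boundaries to line up.
var gear = func() [256]uint64 {
	rng := rand.New(rand.NewSource(1))
	var t [256]uint64
	for i := range t {
		t[i] = rng.Uint64()
	}
	return t
}()

// chunk splits data with a Gear-style rolling hash: shift left, add the
// next byte's table entry, cut when the low bits are all zero. The shift
// ages old bytes out of the hash exactly after 64 steps, so the hash is
// an implicit 64-byte sliding window. mask = (1<<13)-1 targets ~8 KiB
// average chunks; min/max bound the degenerate cases.
func chunk(data []byte, min, max int, mask uint64) [][]byte {
	var chunks [][]byte
	start := 0
	var h uint64
	for i := 0; i < len(data); i++ {
		h = (h << 1) + gear[data[i]]
		n := i + 1 - start
		if (n >= min && h&mask == 0) || n >= max {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
			h = 0
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

func main() {
	data := make([]byte, 1<<20)
	rand.Read(data) // demo data; math/rand is fine for a sketch
	cs := chunk(data, 2<<10, 64<<10, (1<<13)-1)
	fmt.Printf("%d chunks from %d bytes\n", len(cs), len(data))
}
```

Because each cut depends only on the few dozen bytes just before it, an insertion far away leaves later boundaries, and therefore later chunk hashes, intact.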

cblake

For its early history, this was referred to as the “rsync algorithm”. Or, I don’t know, maybe Tridge didn’t invent it either? A 72 LoC implementation in Nim is here. That is actually maybe 2X longer than it really needs to be, since it does N statistically independent framings (as a false-positive mitigation strategy) and includes CLI documentation and such. The super easy and pretty fast way to go here is Bob Uzgalis’ BuzHash, which I sadly now see via a quick web search is being widely confused with Bob Jenkins’ various hashes; the search results are probably all AI slop. Anyway, unless your CDC hash is so good that there is no need for a secondary cryptographic hash, the crypto hash is probably your bottleneck in most scenarios.
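Since BuzHash keeps coming up, here is a self-contained Go sketch of just the rolling part (the table seed and window size are arbitrary choices of mine): hashing a window is one rotate-and-XOR per byte, and sliding the window forward is a single rotate plus two XORs.

```go
package main

import (
	"fmt"
	"math/bits"
	"math/rand"
)

// A 256-entry table of fixed random words; both ends must share it.
var buz = func() [256]uint64 {
	rng := rand.New(rand.NewSource(7))
	var t [256]uint64
	for i := range t {
		t[i] = rng.Uint64()
	}
	return t
}()

// buzhash of a full window: each byte's table entry, rotated by its
// distance from the end of the window.
func buzhash(win []byte) uint64 {
	var h uint64
	for _, b := range win {
		h = bits.RotateLeft64(h, 1) ^ buz[b]
	}
	return h
}

// roll slides the window one byte forward: rotate the hash, cancel the
// byte that left (its entry has been rotated w times since it was mixed
// in), and mix in the byte that arrived.
func roll(h uint64, out, in byte, w int) uint64 {
	return bits.RotateLeft64(h, 1) ^ bits.RotateLeft64(buz[out], w) ^ buz[in]
}

func main() {
	data := []byte("the quick brown fox jumps over the lazy dog")
	const w = 16
	h := buzhash(data[:w])
	for i := w; i < len(data); i++ {
		h = roll(h, data[i-w], data[i], w)
		// the rolled hash always matches a from-scratch hash of the window
		fmt.Println(h == buzhash(data[i-w+1 : i+1]))
	}
}
```

A chunker built on this would cut wherever the rolled hash satisfies a mask condition, exactly as in the Gear sketch above; the rotate-based roll is just a different (and classic) way to keep the window.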