Lessons from Hash Table Merging
9 points by knl
This is roughly the same bug that the old Rust Robin Hood hash table suffered nearly a decade ago, where reinsertion was accidentally quadratic.
Hmm, wouldn’t that bug be avoided by just using “fastrange” rather than modulo, i.e. using the high bits of the hash to index into the bucket array, rather than the low bits?
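For anyone unfamiliar with the distinction, here's a minimal sketch of the two indexing strategies, assuming a 64-bit hash and a power-of-two bucket count (the fastrange here is Lemire's multiply-shift; function names are mine, not from the post):

```rust
// Two ways to map a 64-bit hash to a bucket index in [0, n).

// Low-bit masking (what power-of-two tables typically do):
fn index_low_bits(hash: u64, n: u64) -> u64 {
    debug_assert!(n.is_power_of_two());
    hash & (n - 1) // keeps only the low bits of the hash
}

// Lemire's "fastrange": scale the hash by n and take the high 64 bits,
// so the index is driven by the HIGH bits of the hash instead.
fn index_fastrange(hash: u64, n: u64) -> u64 {
    ((hash as u128 * n as u128) >> 64) as u64
}

fn main() {
    let h = 0x0123_4567_89ab_cdefu64;
    // With masking, the index in a small table is the index in a larger
    // table reduced mod the small size (the nesting the bug relies on):
    assert_eq!(index_low_bits(h, 8), index_low_bits(h, 16) & 7);
    println!("mask: {} / {}", index_low_bits(h, 8), index_low_bits(h, 16));
    println!("fastrange: {} / {}", index_fastrange(h, 8), index_fastrange(h, 16));
}
```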
If you're adding N elements to a hash map, you should just grow it first (i.e. solution 2 in the post). You should do that regardless of the other things you do (solutions 1 and 3), and you should do it always.
And once you do that, the problem will disappear and you may not need solutions 1 and 3.
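Concretely, in Rust that's just a `reserve` call before the insert loop. A sketch (the `h0`/`h1` names follow the thread; this is not the post's actual code):

```rust
use std::collections::HashMap;

// Pre-grow the destination before a bulk merge so the inserts never
// trigger a rehash mid-loop.
fn merge_pregrow(h0: &mut HashMap<u64, u64>, h1: &HashMap<u64, u64>) {
    // Worst case every key in h1 is new; reserving h1.len() up front means
    // at most one reallocation instead of repeated doublings during the loop.
    h0.reserve(h1.len());
    for (&k, &v) in h1 {
        h0.insert(k, v);
    }
}

fn main() {
    let mut h0: HashMap<u64, u64> = (0..4).map(|i| (i, i)).collect();
    let h1: HashMap<u64, u64> = (2..8).map(|i| (i, i * 10)).collect();
    merge_pregrow(&mut h0, &h1);
    assert_eq!(h0.len(), 8); // keys 0..8; overlapping keys 2 and 3 overwritten
}
```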
I think the reason the author considers not doing this is the case where there are actually few new elements added. Although compared to the baseline in the article, count/grow/insert could still be a good idea.
Yeah, reading the post again, you're right.
Though looking at their Rust code, it's a bit unclear to me how they've achieved a large number of duplicate keys in h0 (the original map) and h1 (the map being merged into the original) in this loop: https://github.com/attractivechaos/gistlog/blob/0633eccea938d6eaae18b9cf38d2bb9aef4843ce/src/ht-merge/rust/src/main.rs#L110-L115
Note that the RNG seed is not reset between the two loops. So when allocating h1 you don't necessarily generate the same N numbers as h0.
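To see why the seed matters, a sketch with a toy splitmix64 generator (an analogy only; this is not the generator the post uses):

```rust
// Toy RNG (splitmix64) to illustrate seeding behavior.
struct Rng(u64);

impl Rng {
    fn next(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9e37_79b9_7f4a_7c15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
        z ^ (z >> 31)
    }
}

fn main() {
    let n = 4;
    let mut rng = Rng(42);
    let a: Vec<u64> = (0..n).map(|_| rng.next()).collect(); // fills h0
    // Without resetting the seed, the next n draws continue the stream:
    let b: Vec<u64> = (0..n).map(|_| rng.next()).collect(); // fills h1
    assert_ne!(a, b); // h1 gets different keys, so any overlap is coincidental
    // Resetting the seed would reproduce the same keys:
    let mut rng2 = Rng(42);
    let c: Vec<u64> = (0..n).map(|_| rng2.next()).collect();
    assert_eq!(a, c);
}
```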
I have the impression that pros/cons are considered for more cases than are actually benchmarked.
Intuitively, inefficient approaches to growth and count-then-grow both have more overhead when the intersection is small, though, so benchmarking there seems reasonable. Although unconditional preallocation, which moves everything even when in reality h0 ≈ h1, would indeed be an interesting case to compare against the rest.
Because h0 and h1 use the same hash function (also known as "hasher"), the original bucket position of a key in h1 is identical to the position in h0
But h1 is twice the size of h0: it has 2N keys where h0 only has N. That doesn't necessarily mean it has twice as many buckets, but it almost certainly has more buckets.
The question is: do hashes that map to a low-numbered bucket in h1 also map to a low-numbered bucket in h0? Apparently they do in the tables tested [with the hash salt disabled!]. My guess is that h1's table is actually 2x the size of h0's, so keys in the first half of h1's table map to the same indices in h0's table.
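That nesting property is easy to check for power-of-two capacities with low-bit indexing (assumed here, as in the tables tested with the salt disabled; the helper name is mine):

```rust
// With power-of-two capacities and low-bit indexing, a key's bucket in the
// smaller table equals its bucket in the larger table reduced mod the
// smaller capacity.
fn bucket(hash: u64, capacity: u64) -> u64 {
    debug_assert!(capacity.is_power_of_two());
    hash & (capacity - 1)
}

fn main() {
    let (small, large) = (1u64 << 4, 1u64 << 5); // e.g. h0: 16 buckets, h1: 32
    for hash in [3u64, 0x1234, 0xdead_beef, u64::MAX] {
        let (b_small, b_large) = (bucket(hash, small), bucket(hash, large));
        // A key landing in the first half of the large table has the
        // literally identical index in the small table:
        if b_large < small {
            assert_eq!(b_small, b_large);
        }
        // And in general the small-table index is the large one mod small:
        assert_eq!(b_small, b_large % small);
    }
}
```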
Yeah, I was also confused by this part of the post. So they're merging two hash maps that have the same capacity, with the same/colliding keys, and with the same hasher. The setup seems a bit artificial.