Lessons from Hash Table Merging
9 points by knl
This is roughly the same bug that the old Rust Robin Hood hash table suffered nearly a decade ago, where reinsertion was accidentally quadratic.
Hmm, wouldn’t that bug be avoided by just using “fastrange” rather than modulo, i.e. using the high bits of the hash to index into the bucket array, rather than the low bits?
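For anyone unfamiliar with the distinction, here's a minimal sketch of the two indexing strategies, assuming a 64-bit hash and a power-of-two bucket count (the fastrange here is Lemire's multiply-shift; function names are mine, not from the post):

```rust
// Two ways to map a 64-bit hash to a bucket index in [0, n).

// Low-bit masking (what power-of-two tables typically do):
fn index_low_bits(hash: u64, n: u64) -> u64 {
    debug_assert!(n.is_power_of_two());
    hash & (n - 1) // keeps only the low bits of the hash
}

// Lemire's "fastrange": scale the hash by n and take the high 64 bits,
// so the index is driven by the HIGH bits of the hash instead.
fn index_fastrange(hash: u64, n: u64) -> u64 {
    ((hash as u128 * n as u128) >> 64) as u64
}

fn main() {
    let h = 0x0123_4567_89ab_cdefu64;
    // With masking, the index in a small table is the index in a larger
    // table reduced mod the small size (the nesting the bug relies on):
    assert_eq!(index_low_bits(h, 8), index_low_bits(h, 16) & 7);
    println!("mask: {} / {}", index_low_bits(h, 8), index_low_bits(h, 16));
    println!("fastrange: {} / {}", index_fastrange(h, 8), index_fastrange(h, 16));
}
```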
If you're adding N elements to a hash map, you should just grow it first (i.e. solution 2 in the post). You should do that regardless of the other things you do (solutions 1 and 3), and you should do it always.
And once you do that, the problem will disappear and you may not need solutions 1 and 3.
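Concretely, in Rust that's just a `reserve` call before the insert loop. A sketch (the `h0`/`h1` names follow the thread; this is not the post's actual code):

```rust
use std::collections::HashMap;

// Pre-grow the destination before a bulk merge so the inserts never
// trigger a rehash mid-loop.
fn merge_pregrow(h0: &mut HashMap<u64, u64>, h1: &HashMap<u64, u64>) {
    // Worst case every key in h1 is new; reserving h1.len() up front means
    // at most one reallocation instead of repeated doublings during the loop.
    h0.reserve(h1.len());
    for (&k, &v) in h1 {
        h0.insert(k, v);
    }
}

fn main() {
    let mut h0: HashMap<u64, u64> = (0..4).map(|i| (i, i)).collect();
    let h1: HashMap<u64, u64> = (2..8).map(|i| (i, i * 10)).collect();
    merge_pregrow(&mut h0, &h1);
    assert_eq!(h0.len(), 8); // keys 0..8; overlapping keys 2 and 3 overwritten
}
```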
I think the reason the author considers not doing this is the case where there are actually few new elements added. Although compared to the baseline in the article, count/grow/insert could still be a good idea.
Yeah, reading the post again, you're right.
Though looking at their Rust code, it's a bit unclear to me how they've achieved a large number of duplicate keys in h0 (the original map) and h1 (the map being merged into the original) in this loop: https://github.com/attractivechaos/gistlog/blob/0633eccea938d6eaae18b9cf38d2bb9aef4843ce/src/ht-merge/rust/src/main.rs#L110-L115
Note that the RNG seed is not reset between the two loops. So when allocating h1 you don't necessarily generate the same N numbers as h0.
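To see why the seed matters, a sketch with a toy splitmix64 generator (an analogy only; this is not the generator the post uses):

```rust
// Toy RNG (splitmix64) to illustrate seeding behavior.
struct Rng(u64);

impl Rng {
    fn next(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9e37_79b9_7f4a_7c15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
        z ^ (z >> 31)
    }
}

fn main() {
    let n = 4;
    let mut rng = Rng(42);
    let a: Vec<u64> = (0..n).map(|_| rng.next()).collect(); // fills h0
    // Without resetting the seed, the next n draws continue the stream:
    let b: Vec<u64> = (0..n).map(|_| rng.next()).collect(); // fills h1
    assert_ne!(a, b); // h1 gets different keys, so any overlap is coincidental
    // Resetting the seed would reproduce the same keys:
    let mut rng2 = Rng(42);
    let c: Vec<u64> = (0..n).map(|_| rng2.next()).collect();
    assert_eq!(a, c);
}
```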
I have the impression that pros/cons are considered for more cases than are actually benchmarked.
Intuitively, inefficient approaches to growth and count-then-grow both have more overhead when the intersection is small, though, so benchmarking there seems reasonable. Although unconditional preallocation, which moves everything even when in reality h0 ≈ h1, would indeed be an interesting case to compare against the rest.
Because h0 and h1 use the same hash function (also known as "hasher"), the original bucket position of a key in h1 is identical to the position in h0
But h1 is twice the size of h0: it has 2N keys where h0 only has N. That doesn't necessarily mean it has twice as many buckets, but it almost certainly has more buckets.
The question is: do hashes that map to a low-numbered bucket in h1 also map to a low-numbered bucket in h0? Apparently they do in the tables tested [with the hash salt disabled!]. My guess is that h1's table is actually 2x the size of h0's, so keys in the first half of h1's table map to the same indices in h0's table.
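That nesting property is easy to check for power-of-two capacities with low-bit indexing (assumed here, as in the tables tested with the salt disabled; the helper name is mine):

```rust
// With power-of-two capacities and low-bit indexing, a key's bucket in the
// smaller table equals its bucket in the larger table reduced mod the
// smaller capacity.
fn bucket(hash: u64, capacity: u64) -> u64 {
    debug_assert!(capacity.is_power_of_two());
    hash & (capacity - 1)
}

fn main() {
    let (small, large) = (1u64 << 4, 1u64 << 5); // e.g. h0: 16 buckets, h1: 32
    for hash in [3u64, 0x1234, 0xdead_beef, u64::MAX] {
        let (b_small, b_large) = (bucket(hash, small), bucket(hash, large));
        // A key landing in the first half of the large table has the
        // literally identical index in the small table:
        if b_large < small {
            assert_eq!(b_small, b_large);
        }
        // And in general the small-table index is the large one mod small:
        assert_eq!(b_small, b_large % small);
    }
}
```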
Yeah, I was also confused by this part of the post. So they're merging two hash maps that have the same capacity, with the same/colliding keys, and with the same hasher. The setup seems a bit artificial.