Experimenting with Robin Hood hashing

8 points by dryya

cpurdy

I don't mean to imitate an engineer, but there are a number of problems with this example:

It uses a fixed-size hash map. The number one way to avoid collisions is to have the ability to keep the bucket count significantly higher than the entry count.
The EntryT does not include the hash code. The fastest way to prove inequality is to be able to compare hash codes.
Linked entries is one possibility, but highly cache inefficient. A far better answer is a 3-way entry data structure: single value (including the entry-absent condition), bag of entries sharing a single hash code, and binary tree of hash code to entry. These data structures are only rarely allocated in real world usage, except when dealing with purposeful hash attacks, but they are extremely good at dealing with hash attacks. In C (or C++) you build a length-encoded struct for these, so in all but the worst test cases the entire struct fits in a cache line with no additional pointer chasing. Something like this, whipped up from memory:

    struct Entry {
        union Storage {
            SingleEntry single;
            EntryTree* tree;
            EntryList* list;
        }
        uint8_t tag; // single entry vs tree vs list
    }
    struct SingleEntry {
        uint64_t hash;
        ValueT value;
    }
    struct EntryTree {
        uint16_t count;
        Entry entry; // array; each is single or list (begins with uint64_t hash)
    }
    struct EntryList {
        uint64_t hash;
        uint16_t count;
        ValueT value; // array
    }

(Minor) The use of modulo on a known-range counter in a hot loop triggers me. Some compilers can remove this in a peephole, but to the best of my knowledge most can't. Since the length is not likely to be a compile time constant in any real world usage, this probably can't be optimized out with an FRDC peephole (see https://arxiv.org/abs/1902.01961).

Those minor nits aside, I did like the article, and it does a better job explaining the topic than most I've seen on it.

What I was really hoping to see in the article was a discussion of the impact of table fullness on the efficiency, i.e. efficiency with zero collisions, and a curve showing efficiency all the way to a fully saturated table. Mainly because I have no clue what those curves would look like, and I've always wondered.

justinhj

Nice explanation. Here is a db project in the wild that uses robin hood hashing https://github.com/spotify/sparkey