What a Difference a Faster Hash Makes
25 points by emschwartz
Can’t you configure the stdlib HashMap/HashSet with a non-default Hasher?
Yes. In fact that is exactly what Ahash provides: https://docs.rs/ahash/latest/ahash/#basic-usage
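For reference, the basic usage from those docs boils down to plugging a different BuildHasher into the map’s third type parameter. A minimal sketch, assuming the ahash crate is a dependency (it also ships an AHashMap convenience alias):

```rust
use std::collections::HashMap;
use ahash::RandomState; // ahash's BuildHasher; each map gets its own random keys

fn main() {
    // Same stdlib HashMap, just with a non-default hasher supplied
    // via the third type parameter.
    let mut map: HashMap<&str, u32, RandomState> = HashMap::default();
    map.insert("hello", 1);
    map.insert("world", 2);
    assert_eq!(map.get("hello"), Some(&1));
}
```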
This person did the same thing on a Lox interpreter written in Rust. They were “able to shave up to 45% of the running time of some of the benchmarks.”
Tangentially related, but it drives me nuts when people want to hash some data and so reach for a cryptographic hash like SHA-1 or MD5. Those algorithms are way, way, way slower than they need to be and are providing a security level that is generally not even understood, much less needed.
The thing is md5sum is very easy to run and for years was one of the few things you could reliably find anywhere. I guess xxhash has now gotten enough buy-in it’s pretty easy to find tools that support it.
The builtin hash procedure is slower because it attempts to provide some security. It wants to make it as difficult as possible for an attacker to predict hash values, and thus construct colliding keys, without knowing the seed.
Isn’t this also an explicit design goal of ahash? From their README:
Because AHash is a keyed hash, each map will produce completely different hashes, which cannot be predicted without knowing the keys. This prevents DOS attacks where an attacker sends a large number of items whose hashes collide that get used as keys in a hashmap.
It appears (from the README) the actual reason AHash is faster is because it uses hardware instructions intended for AES encryption.
The stdlib uses both keying and a pretty strong hash (SipHash) to make attacker-controlled collisions essentially impossible.
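To make the keying part concrete, here’s a small illustration (hash_one is on the BuildHasher trait in recent Rust): two RandomState instances carry different random SipHash keys, so the same input hashes differently from one map to the next.

```rust
use std::collections::hash_map::RandomState;
use std::hash::BuildHasher;

fn main() {
    // Each RandomState is seeded with its own random SipHash keys,
    // so two states (and hence two maps) hash the same key differently.
    let s1 = RandomState::new();
    let s2 = RandomState::new();
    println!("{:x}", s1.hash_one("some key"));
    println!("{:x}", s2.hash_one("some key")); // almost certainly different
}
```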
Also, idk if that’s the case with AHash, but several of the faster hashes (there are a few others commonly used to replace SipHash; e.g. the rustc compiler itself uses fxhash) favor short keys, but might not be as fast for long keys.
I ran into this issue during some micro-benchmarking of an internal tool - we noticed most of the time was actually spent in a hash map.
We ended up finding rustc-hash which claims to be used by the Rust compiler, but it looks like ahash is comparable. There are also other alternatives like foldhash (the default hasher for hashbrown) which might perform even better in some circumstances.
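If anyone wants to try the swap, it’s essentially a one-line change with any of these crates. A rough sketch with rustc-hash (worth double-checking the crate docs; FxHash trades away DoS resistance for speed, so it’s only appropriate when keys aren’t attacker-controlled):

```rust
use rustc_hash::FxHashMap; // a std HashMap with the Fx hasher plugged in

fn main() {
    // Drop-in replacement for the default HashMap.
    let mut counts: FxHashMap<String, usize> = FxHashMap::default();
    for word in "the quick brown fox the lazy dog the end".split_whitespace() {
        *counts.entry(word.to_string()).or_insert(0) += 1;
    }
    assert_eq!(counts["the"], 3);
}
```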
[OT] In which we learn the existence of a “Beeping Busy Beaver” problem, which sounds like a way to cuss at your Turing machine if it doesn’t halt.
Nice. In Julia I had toyed with the idea of having the hashing function be parameterizable for the dictionary, but never went anywhere with it. It’s common to provide ordering functions for order-based sets/dicts (B-trees etc.) but not hash functions for hash-based ones, and I’m not sure why?
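For what it’s worth, Rust’s HashMap does expose exactly this: the hasher is a third type parameter constrained by BuildHasher, so you can supply your own. A toy sketch with a hand-rolled FNV-1a, purely to show the mechanism (not a hasher you’d actually want in production):

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Toy FNV-1a hasher, just to show that the hash function is a type
// parameter of the map rather than something baked into the key type.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        const PRIME: u64 = 0x0000_0100_0000_01b3;
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(PRIME);
        }
    }
}

fn main() {
    let mut map: HashMap<String, u32, BuildHasherDefault<Fnv1a>> = HashMap::default();
    map.insert("answer".to_string(), 42);
    assert_eq!(map.get("answer"), Some(&42));
}
```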
As a small aside, I also much prefer Dict to Map. I would have guessed it was habit or initial learning, but I learned these things in C++, so I guess the “dictionary” concept just clicks more in my brain.
My guess is that one of the reasons hash functions are widespread is ’cause they tend not to need to be specialized for different types of input data. Though you can find ones that are and they can be extremely impressive. I suppose the logical end-point of that is perfect hash functions.
I guess there’s nothing to stop you from defining ordering on any type by just using the order of the bytes composing it. Never seen that done in practice though; people usually want their ordering to actually mean something. I wonder how well it would work? Maybe there’s some databases out there that do this sort of thing for their b-trees?
I actually do the universal ordering thing at $WORK. We have a sensible total order defined for f64, for example, but also for structs, sum types, arrays, etc. We have a static type system so you can only compare things of compatible types (you can’t compare an f64 directly with an i64 without first converting one value, for example).
I also have a serialization format where the binary encodings of floats, ints, strings, structs, arrays, etc. have raw byte sort order equivalent to the (sensible) total ordering above (for example by twiddling the bits of ints and floats appropriately). So you can just serialize values to bytes and insert them into a simple KV store and you get a sorted dictionary with range-based lookups for free.
I haven’t seen this approach elsewhere, which I find interesting since this part was a lot simpler than everything else I’ve had to implement…
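For the curious, the bit-twiddling for ints and floats usually looks roughly like this. This is my own sketch of the standard trick, not necessarily the exact encoding described above:

```rust
// Map i64/f64 to big-endian bytes whose lexicographic order matches
// the numeric total order.
fn encode_i64(x: i64) -> [u8; 8] {
    // Flip the sign bit so negative numbers sort before positive ones
    // when compared as unsigned bytes.
    ((x as u64) ^ (1 << 63)).to_be_bytes()
}

fn encode_f64(x: f64) -> [u8; 8] {
    let bits = x.to_bits();
    // Positive floats: flip only the sign bit. Negative floats: flip every
    // bit, which also reverses their magnitude order. The result sorts like
    // an IEEE-754 total order (-inf ... -0 < +0 ... +inf, NaNs at the ends).
    let ordered = if bits >> 63 == 1 { !bits } else { bits ^ (1 << 63) };
    ordered.to_be_bytes()
}

fn main() {
    let mut keys: Vec<[u8; 8]> = [-2.5f64, 3.0, -0.0, 0.0, f64::INFINITY]
        .iter()
        .map(|&x| encode_f64(x))
        .collect();
    keys.sort(); // raw byte sort...
    // ...now matches numeric order: -2.5, -0.0, 0.0, 3.0, inf
    assert_eq!(keys[0], encode_f64(-2.5));
    assert_eq!(keys[4], encode_f64(f64::INFINITY));
}
```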
That’s really cool, thanks for the info~ I guess lots of tasks just don’t do many range-based lookups, or are used to not being able to do so.
If you use a SQL database or even something like DynamoDB then any indexed lookup getting more than one item probably is!
At the application layer, not so much - but I wonder if that’s because we have arrays and don’t want/need to use it, or because we usually don’t have a good data structure for range-based lookups in our stdlibs and so never use that approach.
It’s curious, yeah. The classic solution for sparse ordered maps is a binary search tree, red-black or something like that, though nowadays in-core B-trees are more popular. But it’s expensive to have to call out to a comparison function so many times when traversing the tree!
With a radix tree or something similar such as my qp-trie, a traversal is like, swizzle the search key into lexicographic bytes (if it isn’t already), walk the tree, and (if the tree skips like my qp-trie or DJB’s crit-bit tree) compare with the leaf node.
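Roughly, the lookup ends up looking like this. It’s a minimal crit-bit-style sketch of my own, not the actual qp-trie layout (which branches on nibbles and packs nodes more tightly):

```rust
// Minimal crit-bit-style lookup: internal nodes only record which bit of the
// key to test; the full key lives at the leaf and is compared exactly once.
enum Node {
    Leaf { key: Vec<u8>, value: u64 },
    Internal { byte: usize, mask: u8, child: [Box<Node>; 2] },
}

fn lookup<'a>(mut node: &'a Node, key: &[u8]) -> Option<&'a u64> {
    loop {
        match node {
            Node::Leaf { key: leaf_key, value } => {
                // Because the tree skips non-discriminating bits, a single
                // full comparison at the leaf confirms (or rejects) the match.
                return if leaf_key.as_slice() == key { Some(value) } else { None };
            }
            Node::Internal { byte, mask, child } => {
                // Keys shorter than the critical byte behave as if zero-padded.
                let b = key.get(*byte).copied().unwrap_or(0);
                let dir = usize::from(b & *mask != 0);
                node = &*child[dir];
            }
        }
    }
}

fn main() {
    // Two keys differing in bit 0x02 of byte 3: "foo!" (0x21) vs "foo#" (0x23).
    let t = Node::Internal {
        byte: 3,
        mask: 0x02,
        child: [
            Box::new(Node::Leaf { key: b"foo!".to_vec(), value: 1 }),
            Box::new(Node::Leaf { key: b"foo#".to_vec(), value: 2 }),
        ],
    };
    assert_eq!(lookup(&t, b"foo#"), Some(&2));
    assert_eq!(lookup(&t, b"foo!"), Some(&1));
    assert_eq!(lookup(&t, b"bar!"), None);
}
```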