You can beat the binary search
60 points by gabeio
The author would do well to remove the hero image. It conveys nothing except the "quaternary search" text, is actively unhelpful if you actually look at it, implies a drastically different tone than the actual text, and most importantly it signals that this is going to be low quality slop when I actually found it quite interesting and well written.
100%, clicked off instantly (both because of the slop and because I'm already familiar with n-ary search). I might come back to it later, hearing that you found it interesting.
Yeah I clicked off and haven't read it since I'm not looking for something with the tone that image conveys.
Also, the path branches into 2... It depicts a slightly wonky binary tree with a list of elements per node, does it not? What aspect of the quaternary search algorithm is Daniel trying to illustrate here? What meaning does it have that some numbers (3, 4 and 5) are on both the "root" branch and the branch called Q1? What does it represent that Q2, Q3 and Q4 are all on the same branch?
The author is Daniel Lemire. He created simdjson so it makes sense that his article isn't low quality slop. Maybe your AI slop signals are giving too many false positives and you should find some better signals.
It's spring here in Montreal, and it's been very nice to go on walks lately, the parks are very beautiful.
I got past the AI image, but I stopped when he started conflating asymptotic complexity with overall performance and started talking about how changing the constant factor improved performance.
Unfortunately it really is obvious slop. The image at the top of the last article was too. That's sad if he used to do high-quality work.
This is a didactic article written by someone whose first language is likely not English (he's a university professor from Quebec).
It reads more like ESL and talking to students than AI. Even the title, "You can beat the binary search", reads as standard French-to-English awkwardness.
I ran it through various free AI "testers"/detectors and got 9% from Grammarly.
The C++ does seem vibecoded, but I think that's fine for a SIMD experiment.
(We're talking about the images. Though I guess if it has slop code too, two points make a line.)
I think Paul Khuong's treatment of the same subject was significantly more thorough, although he didn't include a SIMD implementation: Binary search is a pathological case for caches (2012).
Edit: turns out Paul also co-wrote a much more detailed paper on the subject a few years later covering alternative data layouts (heap/Eytzinger, B-tree, van Emde Boas).
On a side note, I've been disappointed for a while now by Daniel's gratuitous use of slop images irrelevant to the subject matter of the post. If I didn't know he generally produces high-quality content I'd be immediately put off.
(Nitpick: it would also be nice to mention that he's one of the creators of "the popular Roaring Bitmap format".)
Binary search is a pathological case for caches (2012).
This has been true ever since virtual memory became popular with the DEC VAX and DG MV/8000. I switched from binary trees to in-memory B-trees (or B+ or B* or whatever) for all my code starting in 1985.
Systems now tend to page a lot less (usually not at all) than they did in the 80s but TLB miss and refill is still significantly more expensive than a cache miss (it's potentially 4 or even 5 cache misses on current x86) so minimising the number of pages your code touches is still important.
And individual cache blocks are too small as a tree node. With 8 bytes for both keys and pointers a 64 byte cache block is only 3 keys and 4 pointers (and maybe a pointer to the parent or next sibling). Better than a binary tree, but not all that much. Modern CPUs will happily stream adjacent cache blocks with much less total latency than the same number of random cache blocks.
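The arithmetic, as a hypothetical sketch (assuming 8-byte keys and pointers and 64-byte lines; the struct names are made up, nothing from the article):

    #include <cstdint>

    // One node per 64-byte cache line: 3 keys + 4 child pointers + one
    // parent/sibling pointer is exactly 64 bytes.
    struct alignas(64) LineNode {
        uint64_t  keys[3];     //  24 bytes
        LineNode* child[4];    //  32 bytes
        LineNode* parent;      //   8 bytes
    };
    static_assert(sizeof(LineNode) == 64, "one cache line per node");

    // A wider node spanning a few adjacent lines; the prefetcher streams
    // contiguous lines far more cheaply than the same number of random misses.
    struct alignas(64) WideNode {
        uint64_t  keys[15];    // 120 bytes
        WideNode* child[16];   // 128 bytes
        WideNode* parent;      //   8 bytes
    };
    static_assert(sizeof(WideNode) == 256, "four contiguous cache lines");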
PCs and Macs didn't have to worry about page tables for a few years, but Connectix introduced add-on VM for MacOS in 1988, and Windows 3.0 and MacOS 7 in 1990 and 1991 brought it to everyone. Caches became a significant factor with the 68040 and 80486 generation.
There's a story (possibly even true!) that C++'s STL standardised on binary trees for std::map instead of something better-performing like B-trees specifically because the STL's authors used a computer whose CPU didn't have a substantial data cache, and on that machine the binary tree performed just as well.
Stepanov worked at HP and very likely developed STL on early 90s HP PA-RISC workstations, which are famous for having a very large (even today) 256 KB L1 data cache.
In that case I wonder where that rumour came from.
Also I wonder why he picked the poorer data structure.
B-trees are not objectively better than BSTs (or RB trees, in particular) in every scenario though. The constant factor depends on your search strategy within the nodes, and you cannot guarantee pointer stability while maintaining cache efficiency. (Of course, lots of people think pointer stability was a bad idea to begin with, but it was at least a conscious design choice.)
Whether this even matters at all also depends on your workload: is caching the bottleneck? Is the lookup time actually dominant and worth losing pointer stability for? Maybe! But if you're designing a lowest common denominator library it's not that clear that it's the right trade-off to make.
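As a concrete (made-up) illustration of the stability point, not any particular library's internals: std::map never moves existing nodes on insert, so references stay valid, while a flat, cache-friendly layout like a sorted vector can relocate everything.

    #include <map>
    #include <utility>
    #include <vector>

    int main() {
        std::map<int, int> node_based{{1, 10}, {3, 30}};
        const int* p = &node_based.at(3);
        node_based.emplace(2, 20);               // p is still valid: nodes never move

        std::vector<std::pair<int, int>> flat{{1, 10}, {3, 30}};
        const int* q = &flat[1].second;
        flat.insert(flat.begin() + 1, {2, 20});  // may reallocate: q is now dangling
        (void)p; (void)q;
    }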
Just as interesting was the link ~purplesyringa left in the blog post comments, which I hadn't seen before: Eytzinger Binary Search, a clever trick for speeding up binary search by improving cache locality.
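Roughly, the idea (my own rough sketch of the layout, not code from that post): store the sorted array in BFS order so the top levels of the implicit tree are packed into a few cache lines, then descend purely by index arithmetic.

    #include <vector>

    // t[1..n] holds the Eytzinger (BFS) layout of a sorted array a[0..n-1];
    // children of node k live at 2*k and 2*k + 1.
    int build(const std::vector<int>& a, std::vector<int>& t, int k = 1, int i = 0) {
        if (k < (int)t.size()) {
            i = build(a, t, 2 * k, i);
            t[k] = a[i++];
            i = build(a, t, 2 * k + 1, i);
        }
        return i;
    }

    // Returns the index in t of the first element >= x, or 0 if there is none.
    int lower_bound_eytzinger(const std::vector<int>& t, int x) {
        int k = 1;
        while (k < (int)t.size())
            k = 2 * k + (t[k] < x);    // go left or right by arithmetic, no branch
        k >>= __builtin_ffs(~k);       // undo the trailing run of right turns
        return k;
    }

(Assumes t has size n + 1 with t[0] unused; __builtin_ffs is the GCC/Clang intrinsic.)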
That's an awesome data structure! While I'm strong at "classical" algorithms, I would like to learn more about making data structures that are tuned to the hardware: cache-aware, memory-hierarchy-aware, SIMD-exploiting... how do I learn about this?
I'd like to understand both how to build such data structures and how to bench them properly
I also wrote recently [1] about Exponential Search [2], which is another algorithm for when you need to repeatedly binary search in an array and the elements you're searching for are themselves sorted. It allowed for an 8x speedup in our workload!
[1] https://lalitm.com/post/exponential-search/ [2] https://en.wikipedia.org/wiki/Exponential_search
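For the curious, the shape of the algorithm (a generic sketch going off the Wikipedia description, not the code from the post): gallop forward by doubling from wherever the previous search ended, then binary search the final bracket.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // First index at or after `start` whose element is >= x.
    size_t exp_lower_bound(const std::vector<int>& a, int x, size_t start = 0) {
        if (start >= a.size() || a[start] >= x)
            return start;
        size_t step = 1;
        while (start + step < a.size() && a[start + step] < x)
            step *= 2;                                    // gallop: 1, 2, 4, 8, ...
        size_t lo = start + step / 2;                     // known to be < x
        size_t hi = std::min(start + step + 1, a.size());
        return static_cast<size_t>(
            std::lower_bound(a.begin() + lo, a.begin() + hi, x) - a.begin());
    }

Since each query resumes from the previous hit, the cost per query depends on the gap to the next match rather than on the whole array, which I assume is where the 8x came from.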
That's nice, but when I use binary search, I'm not looking for a 16-bit integer in an array of such integers, I'm looking through an array of structures keying off a text value. And even then, it might be for only a few thousand elements, not millions (or billions) in the hot path.
A related thing I've heard, which is just anecdotal but may be worth trying, is that supposedly 3-way search sometimes performs better than 2-way or 4-way search on power-of-two-sized arrays. The theory is that the first few levels of binary search on a POT array tend to all look at indices which have lots of trailing zeroes, and that tends to make them all alias each other in L1d cache. This can leave the search able to take advantage of only N entries in an N-way associative cache, rather than the entire L1 cache.
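To make "3-way" concrete (a hypothetical sketch, not something I've benchmarked): split the range at two pivots near 1/3 and 2/3, so the probed indices don't all share the same trailing-zero pattern.

    #include <cstddef>
    #include <vector>

    // First index whose element is >= x, narrowing by thirds instead of halves.
    size_t ternary_lower_bound(const std::vector<int>& a, int x) {
        size_t lo = 0, hi = a.size();                 // answer lies in [lo, hi]
        while (hi - lo > 2) {
            size_t third = (hi - lo) / 3;
            size_t m1 = lo + third, m2 = lo + 2 * third;
            if (a[m1] >= x)           hi = m1;
            else if (a[m2] >= x)    { lo = m1 + 1; hi = m2; }
            else                      lo = m2 + 1;
        }
        while (lo < hi && a[lo] < x) ++lo;            // finish the last 1-2 slots
        return lo;
    }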
I've had similar experiences at a higher level where I had two sorted arrays in Python but simply appending + calling sort was just way more performant than doing a hand-rolled "merge two sorted lists" operation. This was with similar... 4-digit cardinalities I suppose.
There the "obvious" explanation was that sort is C code and my merge was Python functions, but I had really thought those costs would be outweighed by not churning a bunch of data around in the sort itself. Really wasn't, though.
Python's built in sorting algorithm Timsort is really sophisticated and it does things like detect runs of monotonically increasing or decreasing values in the input array and take advantage of them. (For instance I think if you pass it a nearly-sorted array with an O(1) number of out-of-place elements at the end, Timsort will sort it in O(array.length) time.) I would not be surprised if, when you concatenate two sorted lists and then call concatenated.sort(), Timsort could detect that the list has two already-sorted runs of elements in it and then sort the list with a single merge pass.
And I like the hero image. It's a short lightweight read on a well known topic with just 2 figures. Very much rhyming with the lighthearted image 🤷♂️ Thanks for sharing!