O(no) You Didn’t
54 points by knl
A good reminder that constants disappear in big-O notation. There is even a name for algorithms that, despite having lower asymptotic complexity, are never faster on real-world inputs: galactic algorithms.
Sometimes the difference is negligible. I’ve recently implemented two algorithms for a problem. One was O(n³) and the other O(n). When benchmarked on expected input (strings of length between 1 and 100 characters), the linear solution was only a teeny tiny bit faster (and not even always).
I learned about the galactic AKS sorting network earlier today, which apparently is only feasible for n > 2^78. Funnily enough, it’s a different AKS from the ones behind the galactic AKS primality-testing algorithm.
Interestingly, this isn’t because the AKS sorting network is unfathomably huge – it’s only about 100 times bigger than practical sorting networks for small n. But because the asymptotic complexity of sorting networks is logarithmic – O(log n) for AKS and O((log n)^2) for practical networks – constant factors in the performance become exponentially larger when looking at the crossover point.
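To make that crossover arithmetic concrete, here is a back-of-the-envelope sketch with made-up constants (not the real network sizes): if AKS depth grows like C_AKS * log2(n) and a practical network like C_PRAC * (log2 n)^2, the constant-factor ratio lands directly in the exponent of the crossover point.

    import math

    # Hypothetical constants, purely for illustration (not the real network sizes).
    C_AKS = 80.0    # assumed constant factor of the AKS depth, C_AKS * log2(n)
    C_PRAC = 1.0    # assumed constant factor of a practical depth, C_PRAC * log2(n)**2

    # AKS wins once C_AKS * log2(n) < C_PRAC * log2(n)**2,
    # i.e. log2(n) > C_AKS / C_PRAC, so the crossover sits near n = 2**(C_AKS / C_PRAC).
    ratio = C_AKS / C_PRAC
    print(f"crossover near n = 2**{ratio:.0f} = {2**ratio:.3g}")
    print("AKS depth at crossover:", C_AKS * ratio)
    print("practical depth at crossover:", C_PRAC * ratio**2)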
This triggers a pet peeve of mine.
Big-O notation (“f is O(g)”) states that there exist a constant c and a value n_0 such that f(n) <= c*g(n), for all n >= n_0.
The way the article applies this definition is wrong: there is no guarantee that either performance-describing function’s O-notation bound applies at n_0 = 4.
For a discrete, bounded function, the definition you give implies the one in the article, because there are only finitely many values of n smaller than n_0, so you can always enlarge c to cover them as well.
Implementing only the O(n²) algorithm is also a form of technical debt: you don’t know how your inputs will evolve in the future, or how the code will be reused under changed requirements, and maybe suddenly it will create a noticeable performance degradation. In contrast, the O(n) algorithm is robust: slower on small inputs, but it does not blow up. In code that is not performance-critical, I will take “does not blow up” over constant factors any day.
(We can get the best of both worlds by using a size cutoff to switch from the O(n²) to the O(n) version, but then the code has become more complex. Oh well.)
Yes, this is my philosophy. If I am only implementing one solution and I don’t know that it cannot be fed large inputs, I always pick the O(n) one. (If I know that large inputs are impossible, I put a comment on the implementation warning people.) Because it is almost always better to be fairly fast for all inputs than super fast on tiny inputs. Generally the performance on tiny inputs is so good as not to matter anyway. If your software is too slow on tiny inputs, that is basically guaranteed to be noticed and fixed (if performance has any sort of weight in prioritization), but software that is only slow for some, often unexpected, cases will often go unnoticed and some users suffer.
So if I want to keep it simple I just implement the one that performs better on large inputs. If it ever shows up in profiles, or I want to be performance-optimal, it is easy to come back and add an if n < 400: return small_impl(...) to handle all cases optimally (or at least as optimally as 400 can be determined).
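A minimal sketch of that dispatch pattern, assuming the two-sum-style problem from the linked post (the 400 cutoff and the function names are placeholders; the real cutoff would come from profiling):

    def two_sum_small(nums, target):
        # O(n^2) brute force: great constant factors, no extra allocation.
        for i in range(len(nums)):
            for j in range(i + 1, len(nums)):
                if nums[i] + nums[j] == target:
                    return i, j
        return None

    def two_sum_large(nums, target):
        # O(n) hashmap: pays for an allocation up front, but scales.
        seen = {}  # value -> index
        for i, v in enumerate(nums):
            if target - v in seen:
                return seen[target - v], i
            seen[v] = i
        return None

    def two_sum(nums, target, cutoff=400):
        # Dispatch on input size; below the cutoff the brute force wins on constants.
        if len(nums) < cutoff:
            return two_sum_small(nums, target)
        return two_sum_large(nums, target)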
I wrote an article about this: https://kevincox.ca/2023/05/09/less-than-quadratic/
Wow. What an impressive piece of work. I saw this problem at least a decade ago and never thought that brute force would be superior in the most common use cases.
Is it? At the end he mentions that the O(n) solution gets a lot faster if you preallocate memory for the hashmap, but he doesn’t compare the improved version to the brute-force solution. The different chart scales mean we can’t use them to figure out what the new cutoff is. Is it still n > 300?
I like a good dive into profiling. Am I reading the charts wrong, though, or does N only go up to 1,000? I feel like N = 100,000 is more “it”, especially for a linear-pass type algorithm.
I guess the question I wonder about is: if we write both algorithms, and adaptively switch between them, what’s a good crossover for N? (and how do we decide that? hmm)
Besides, if the domain is small, you could “just” do a bitset, right?
s/compliment/complement/g, compliments are criticisms in disguise (:
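On the bitset idea above, a minimal sketch (my own illustration; it assumes non-negative values below a known bound and reports the matching values rather than their indexes):

    def two_sum_bitset(nums, target, max_value=1 << 20):
        # Byte-per-value membership table (a real bitset would pack 8 values per
        # byte); only viable when the value domain is small and non-negative.
        # max_value is a made-up bound on the domain.
        seen = bytearray(max_value + 1)
        for v in nums:
            complement = target - v
            if 0 <= complement <= max_value and seen[complement]:
                return complement, v  # the values, not the original indexes
            if 0 <= v <= max_value:
                seen[v] = 1
        return None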
I guess the question I wonder about is: if we write both algorithms, and adaptively switch between them, what’s a good crossover for N? (and how do we decide that? hmm)
Don’t choose. Run both in parallel, return the first result. We can call it “algorithm hedging”, haha :)
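A rough sketch of that hedging idea (threads here are only to illustrate the shape of it; in CPython the GIL means both implementations share one core, so this is not a realistic benchmark setup):

    from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

    def hedged(implementations, *args):
        # Run every candidate implementation concurrently and return whichever
        # result arrives first; the losers keep running until they finish.
        pool = ThreadPoolExecutor(max_workers=len(implementations))
        futures = [pool.submit(impl, *args) for impl in implementations]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        pool.shutdown(wait=False)  # don't block on the slower implementations
        return next(iter(done)).result()

    # e.g. hedged([two_sum_small, two_sum_large], nums, target), reusing the
    # hypothetical implementations sketched earlier in the thread.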
And if instead of just two, you run every possible algorithm along with a search for proof for correctness of every one of these, and return the first result from a provably correct algorithm, you may end up running The Fastest and Shortest Algorithm for All Well-Defined Problems
I was concerned about this too: I thought 1,000 was extremely small, and I was unsure about the accuracy of the Python performance analysis and the Python preallocation discussion. Just plotting the graphs wasn’t convincing for me, so after reading the blog post I went to look at the strace output, and the behaviour differed between 1,000 and 10,000 items, which at the very least suggests the input size was too small and might also indicate that there’s a bit more to the story.
sam@samtop ~ % strace -e trace=memory python
>>> data = range(1_000)
>>> hashmap = {v: i for i, v in enumerate(data)}
>>> data = range(10_000)
>>> hashmap = {v: i for i, v in enumerate(data)}
mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x78190f8f2000
mmap(NULL, 299008, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x78190f8a9000
>>>
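To check whether N = 1,000 really cuts the experiment off too early, one rough way would be something like the following (my own sketch; the sizes and repetition counts are arbitrary, and small_impl/large_impl stand for any quadratic and linear implementation, e.g. the hypothetical two_sum_small/two_sum_large above):

    import random
    import timeit

    def compare(small_impl, large_impl, sizes=(10, 30, 100, 300, 1_000, 3_000)):
        # Time both implementations on random inputs of growing size, using a
        # target that is never present so each one does its full worst-case scan.
        for n in sizes:
            nums = [random.randrange(1_000_000) for _ in range(n)]
            trials = max(1, 30_000 // n)  # fewer repetitions for the big inputs
            t_small = timeit.timeit(lambda: small_impl(nums, -1), number=trials)
            t_large = timeit.timeit(lambda: large_impl(nums, -1), number=trials)
            winner = "quadratic" if t_small < t_large else "linear"
            print(f"n={n:>5}  quadratic={t_small:.4f}s  linear={t_large:.4f}s  -> {winner}")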
Well, you can do better than O(n²) without relying on maps. Steps are:
- sort the array: O(n log n)
- and deduplicate: O(n)
- look up each element’s complement in the sorted array: O(n log n)
So you end up with O(n log n) total, but without the cost of allocating additional memory when there is not enough space, which is especially painful in the case of a hash map, as even with preallocation it can happen. And even with moderately sized arrays there is a chance that the whole array will fit into L3 cache, so random-access performance should not suffer much.
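A minimal sketch of that sort-based approach as I read it (it reports the matching values, since sorting and deduplicating discards the original indexes, as the replies below point out):

    from bisect import bisect_left

    def two_sum_sorted(nums, target):
        # Sort and deduplicate: O(n log n) + O(n).
        values = sorted(set(nums))
        # Binary-search each value's complement: n searches of O(log n) each.
        for v in values:
            complement = target - v
            i = bisect_left(values, complement)
            if i < len(values) and values[i] == complement and complement != v:
                return v, complement  # the values, not the original indexes
        # Caveat: after deduplication a pair of equal values (v + v == target)
        # can no longer be detected; handling that would need duplicate counts.
        return None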
Even better, in O(n): do the sort in O(n) (assuming fixed-length integers, like we did so far) with radix sort.
That’s a nice approach. One observation: radix sort is O(w n) where w is the key length, which is bounded by your assumption of fixed-length integers. Merge/quick/whatever sort is O(n log n), but log n is also a small number (<= 64) for any list on a 64-bit machine, just like w. So there’s a parallel between these two, one optimizing for integer value bounds and one for integer list length bounds. Both are better than the hashmap in terms of big O though, since with probing it could degrade to worst case O(n^2).
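For concreteness, a small LSD radix sort sketch for fixed-width non-negative integers (8-bit digits, so w/8 counting passes; illustrative rather than tuned):

    def radix_sort(nums, key_bytes=8):
        # LSD radix sort: one stable bucket pass per 8-bit digit,
        # so O((w/8) * n) overall for w-bit non-negative integers.
        for shift in range(0, key_bytes * 8, 8):
            buckets = [[] for _ in range(256)]
            for v in nums:
                buckets[(v >> shift) & 0xFF].append(v)
            nums = [v for bucket in buckets for v in bucket]
        return nums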
Indeed.
log n is also a small number (<= 64) for any list on a 64-bit machine
Since we’re nitpicking, being a 64-bit machine is unrelated to the value of log n, because n is the number of elements before deduplication. log n is actually much smaller than 64, because 8n has to fit in RAM (though hypothetically we could have a large zram to accommodate a lot of numbers if they are not too random, but then we have other performance issues).
FYI the author implemented and compared this: https://github.com/MrShiny608/code_profiling_playground/issues/1
Since the return value is the indexes of the target elements, if you sort it in-place and deduplicate it, you lose the original indexes.
You’ll have to store the original indexes somehow, which means a copy of the array / roughly the same extra data as a hashmap.
If the problem was to return the two array values, this would make a lot of sense!
You’ll have to store the original indexes somehow, which means a copy of the array / roughly the same extra data as a hashmap.
Not really: for a hash map you need to have more memory allocated (unless you know all possible input values, and those values aren’t the full range of int64, so you can use a PHF; but if you know that and can do that, then you can do everything in linear time anyway). You need to allocate the extra memory to avoid O(n²) behaviour (a hash map degrades to a linear search through a list in the worst case). So sorting a list of tuples should still be more performant. Especially with the 2nd @val optimisation.
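A sketch of the tuple variant being discussed (my own illustration): sort (value, original index) pairs and walk two pointers inward, so the original indexes survive the sort at the cost of the extra copy.

    def two_sum_sorted_indexes(nums, target):
        # Keep each value's original index alongside it, so sorting doesn't lose it.
        pairs = sorted((v, i) for i, v in enumerate(nums))  # O(n log n), plus a copy
        lo, hi = 0, len(pairs) - 1
        while lo < hi:  # classic two-pointer scan, O(n)
            s = pairs[lo][0] + pairs[hi][0]
            if s == target:
                return pairs[lo][1], pairs[hi][1]  # the original indexes
            if s < target:
                lo += 1
            else:
                hi -= 1
        return None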
As with all performance questions, the only way to answer is: measure. If you’re not measuring, you’re guessing. I wrote a full post on how to do that right over here.
I don’t think this is necessarily true. Often, having good “mechanical sympathy” allows you to make lots of small changes (which individually are indistinguishable from noise) that add up to a generally faster system. If you are comfortable with the abstractions that you’re working with, each improvement can “nudge” performance to be optimal in the limit.
I think your comment and what you quote don’t disagree on facts. Making changes based on what you call “mechanical sympathy” would be “guessing”, albeit educated guessing. The difference seems to me to lie in how negative or positive an attitude one has towards “guessing”.