It's Not Always ICache (2021)
28 points by hyperpape
So, in this case, the main issue is that the inlined version counts up instead of counting down, which adds the cmp: the compiler applies an optimization in the non-inlined version and misses it in the inlined version. But why? Is the optimizer less eager to optimize bigger functions?
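For anyone who hasn't read the post, a minimal sketch of the two loop shapes being discussed (hypothetical code, not the post's actual benchmark): in the count-down form, the decrement itself sets the flags on x86, so the back-edge branch doesn't need a separate compare against the bound.

```rust
// Counting up: each iteration's back-edge needs an explicit `cmp i, n`.
fn sum_up(xs: &[u64]) -> u64 {
    let mut total: u64 = 0;
    let mut i = 0;
    while i < xs.len() {
        total = total.wrapping_add(xs[i]);
        i += 1;
    }
    total
}

// Counting down: `i -= 1` updates the flags, so the branch can just test
// "did we reach zero?" with no separate cmp against the length.
fn sum_down(xs: &[u64]) -> u64 {
    let mut total: u64 = 0;
    let mut i = xs.len();
    while i > 0 {
        i -= 1;
        total = total.wrapping_add(xs[i]);
    }
    total
}

fn main() {
    let xs: Vec<u64> = (1..=10).collect();
    assert_eq!(sum_up(&xs), sum_down(&xs));
    println!("{}", sum_up(&xs)); // 55
}
```

Both compute the same sum; the question in the post is why the compiler rewrites one loop into the cheaper shape but not the other.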
the inlined version counts up instead of counting down
Good catch. I wonder what kind of heuristics are at play here. I'd think that rustc/LLVM would let me decide whether to count upwards or downwards if it impacts performance this much, because if this really is the only reason for a 21% difference in runtime, that's quite significant; the compiler shouldn't mess with it on its own and should let me do the benchmarking and tweaking.
Would be interesting to learn about a rustc or LLVM dev's perspective here.
That's an odd attitude to have I think. There are many many things that an optimizing compiler decides for you that can have more than 20% performance impact.
The inlined version needs branch prediction hints to tell the compiler that the fast path will be taken most of the time. (or that the slow path won't be taken often)
This will potentially change the basic block layout (slow-path blocks will be placed at the end, out of the hot path, similar to the non-inlined version) and can affect register allocation too: if there's enough register pressure, the fast-path code gets the registers while the slow path spills to stack locations.
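Rust has no stable likely/unlikely intrinsic, but the `#[cold]` attribute gives a similar hint. A hypothetical sketch (function names and bodies are made up for illustration):

```rust
// Marking the slow path #[cold] tells LLVM the call is rarely taken,
// which tends to move its blocks to the end of the function and keep
// the hot path's code contiguous.
#[cold]
#[inline(never)]
fn slow_path(x: u64) -> u64 {
    // stand-in for overflow handling, reallocation, error reporting, etc.
    x.rotate_left(1)
}

#[inline]
fn fast_or_slow(x: u64) -> u64 {
    if x < u64::MAX / 2 {
        x * 2 // fast path: stays in the straight-line hot code
    } else {
        slow_path(x) // cold call, laid out out-of-line
    }
}

fn main() {
    println!("{}", fast_or_slow(21)); // 42
}
```

Whether this recovers the non-inlined layout in the post's benchmark would need checking against the actual generated assembly.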
I like this attitude of questioning/testing the received wisdom a lot. Nice post.
That said, the other bit of the lore that I've internalized (and maybe I should question!) is that the icache misses might not make your microbenchmark slow down, but they are a bigger problem in the context of a larger program that is doing other things as well. IOW, I worry about the effect of code bloat on icache pressure more when my microbenchmark speeds up, where the bad effect might be more subtle, than in cases like this blog post where the microbenchmark slows down.
I wanted to comment exactly the same thing. It would be hard to measure the cost of a bit of extra icache use. It's like measuring the number of deaths from air pollution caused by cars: no single death can be attributed directly to any single car ride, but we generally agree it is best to keep air quality as high as we can, and statisticians can come up with some numbers.
It’s still hard to say what perf’s L1-icache-loads means. Judging by the name, it should correspond to cachegrind’s I refs, but it doesn’t.
The article doesn't state exactly what CPU they're using, but I'd guess it's probably not counting instructions that are executed out of a uop cache or similar that's closer/smaller/faster even than the L1 I-cache (an "L0" cache of sorts that I wouldn't expect cachegrind to simulate). For a tight loop like the code in question that could easily dominate the total dynamic instruction count.