Why xor eax, eax?
73 points by krtab
so it can allocate a fresh, dependency-free zero register renamer slot. And, having done that, it removes the operation from the execution queue - that is, the xor takes zero execution cycles! It’s essentially optimised out by the CPU!
It’s actually better than that because it also marks the old value in the rename register as dead after this point. As soon as all uses within the basic block have retired, the rename register can be reused. In a tight loop, killing a false loop-carried dependency like this can speed things up a surprising amount.
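To make the idiom concrete, here is a rough sketch in Intel syntax (the byte values follow from the instruction encoding; the comments about how cores treat them describe typical recent Intel/AMD behaviour, not any one specific chip):

    31 c0             xor eax, eax   ; 2 bytes; recognised as a zeroing idiom at
                                     ; rename: no execution port needed, and no
                                     ; dependency on the old value of eax/rax
    b8 00 00 00 00    mov eax, 0     ; 5 bytes; same architectural effect, but
                                     ; typically still costs a real uop

    ; killing a false dependency in a loop (popcnt on some Intel cores
    ; had a false dependency on its destination register):
    .loop:
        xor    eax, eax              ; breaks the carried dependency on rax
        popcnt rax, qword [rsi]
        add    rbx, rax
        add    rsi, 8
        dec    rcx
        jnz    .loop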
Modern CPUs are witchcraft.
Very true. Reasoning about performance on a modern system is incredibly hard because generally everything is really fast but there are a load of constrained resources and you hit a performance cliff when any one of them is exhausted. And the sizes of these vary between chips from the same manufacturer, in the same generation. Something that should obviously cause a slowdown can speed things up because now it significantly increases consumption of a thing that was nowhere near the cliff and marginally reduces consumption of something that was exhausted. Things like caches and branch predictors, which are associative structures that die horribly in cases where aliasing exceeds a threshold, make this so much worse.
My favourite example here was a story from some Apple folks about getting the first samples back of a new iPhone chip. Everything on the new chip was faster, but for one critical bit of software it was much slower. It turned out that the old chip was incorrectly predicting a branch. The prediction caused it to load something that was not in cache and was used later. The pipeline was short so the invalid prediction cost about ten cycles of wasted work. The newer chip had a better branch predictor and so correctly predicted the branch. Then it stalled for over a hundred cycles waiting for the value from memory. Once they understood that, adding a prefetch where the old predictor jumped to the wrong place made it much faster on the new chip. The root cause analysis for that was, I'm told, very hard. I doubt you could even do it without a detailed performance model of the chip.
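The chip in that story is an Apple design rather than x86, but the shape of the fix translates; a hypothetical sketch of what "add a prefetch where the mispredicted path used to accidentally warm the cache" can look like:

    ; hypothetical sketch: request the line early, so that when the branch
    ; is correctly predicted the later load no longer stalls on memory
        prefetcht0 [rdi]            ; start pulling the cache line in now
        add    rbx, rcx             ; unrelated work overlaps the memory latency
        imul   rbx, rbx, 3
        test   esi, esi
        jnz    .skip
        mov    rax, qword [rdi]     ; by now the line is (hopefully) resident
    .skip: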
A friend who works in μarch design told me x86 assembly is so far decoupled from the underlying execution in modern CPUs that it's reasonable to consider it an IR.
I always liked it because I came up with a great mnemonic for it as an undergraduate: it reminded me of the Greek onomatopoeia for frogs, βρεκεκέξ κουάξ κουάξ (vrekekex koax koax). This does require pronouncing eax as eh-ahx rather than E-A-X to make it work.
And not even remotely related to the fact that on the 8080/Z80, XRA A / XOR A (&af, 4 cycles) was almost twice as fast as MVI A,0 / LD A,0 (&3e 0, 7 cycles). It's the Cool S of Z80 programming.
Note how it’s xor r8d, r8d (the 32-bit variant) even though with the REX prefix (here 45) it would be the same number of bytes to do xor r8, r8, the full width. Probably makes something easier in the compilers, as clang does this too.
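For reference, a sketch of the encodings in question (Intel syntax, bytes on the left):

    31 c0       xor eax, eax    ; legacy register, no REX needed: 2 bytes
    45 31 c0    xor r8d, r8d    ; REX = 45 to reach r8: 3 bytes
    4d 31 c0    xor r8, r8      ; REX = 4d (W bit set for 64-bit): still 3 bytes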
Unless something has changed recently, it's because only the first 8 ("original recipe") registers can use the xor modR/M format: there are only three bits for holding the register number. The REX prefix is required to provide that extra bit to allow access to the 8 new registers.
[edit: Oh, I see that the author is really just bemused that they don't let the instruction refer to the entire r8 register, since it's free. I'll leave the comment in case it's interesting anyway.]
From a practical perspective, I understand why there is this subtle chicken-and-egg interplay between compiler optimizations and the kinds of performance improvements implemented in the hardware: out-of-order instruction scheduling, advanced branch prediction, cache handling, and so on.
The world is not going to recompile itself to work best with your hardware, so you optimize for existing binaries, and vice versa: your software should be compiled to get the most out of what you understand the hardware can do. But each time I dig down to this level, the combined surface of target platforms and compiler [options, versions, ...] is so large that it seems truly unexplainable, at least from my current vantage point.
All in all, these types of considerations only matter in exceptional circumstances, and I'm curious to see what the relationship hardware and software will grow into as we trundle into the future with even bigger demands for performance and efficiency.
An interesting instance of this is the dedicated byte copy instruction sequence rep movsb on Intel. Historically it wasn't very fast (since it was a complex instruction that was difficult for the hardware to handle), leading to software avoiding it, leading to Intel neglecting to optimize it. So there was a weird situation where a huge complicated vectorized copy loop would be faster than a dedicated instruction. I think Intel eventually broke the chicken-egg problem by optimizing it anyway in more recent processors.
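Part of what makes the contrast striking is just how small the instruction side is; a minimal sketch, assuming the System V x86-64 convention of rdi = destination, rsi = source, rdx = length, with the direction flag clear as the ABI requires:

    copy_bytes:                 ; copy_bytes(dst, src, len)
        mov rcx, rdx            ; rep takes its byte count in rcx
        rep movsb               ; copy rcx bytes from [rsi] to [rdi],
                                ; advancing both pointers
        ret

On parts that advertise the enhanced rep movsb feature, this reportedly competes well with hand-written vector copy loops for many sizes.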