You can't fool the optimiser
15 points by fernplus
This is why, when doing benchmarks, you always need to check the disassembly too. As you strip things down, you often hit a point where the optimizer thinks your code didn't actually do anything of value and just deletes the whole test!
When writing microbenchmarks I always have a trivial accumulator (e.g., the sum of the results) that is printed at the end. The sum doesn’t mean anything; it just stops the compiler from constant-propagating the code to nothing.
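A minimal sketch of that accumulator trick in C. The names `work` and `bench_sum` are made up for illustration, and `work` is just a placeholder for whatever is being measured. A compiler may still simplify a loop like this (some can reduce simple sums to a closed form), so checking the disassembly remains worthwhile.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical function under test: any cheap pure computation will do. */
static uint64_t work(uint64_t x) {
    return x * 2654435761u + (x >> 3);
}

/* Runs `iters` iterations, stores the elapsed CPU time in *seconds,
   and returns the accumulated sum. The sum itself is meaningless:
   returning it (and printing it in the caller) makes every result
   observable, so the loop cannot simply be deleted as dead code. */
uint64_t bench_sum(uint64_t iters, double *seconds) {
    uint64_t sum = 0;
    clock_t t0 = clock();
    for (uint64_t i = 0; i < iters; i++) {
        sum += work(i);
    }
    clock_t t1 = clock();
    *seconds = (double)(t1 - t0) / CLOCKS_PER_SEC;
    return sum;
}
```

The caller would print both the time and the returned sum, for example with `printf` from `main`; it is printing the sum that keeps the work alive.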
I have written microbenchmarks where, even though the compiler was faithful, the bloody CPU was so good at pipelining loop iterations that each one took less than a nanosecond. It took me some time to work out how to persuade various flavours of CPU to retire one iteration before starting the next, so that the benchmark framework was able to compare different loop bodies against each other in a meaningful manner.
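One common way to stop iterations overlapping is a loop-carried dependency, where each iteration's input is the previous iteration's output. This is not necessarily what the commenter did, only a sketch; `body`, `run_independent` and `run_chained` are hypothetical names. Note that chaining measures latency rather than throughput, and fences or serialising instructions are another, architecture-specific route.

```c
#include <stdint.h>

/* Hypothetical loop body under test. */
static uint64_t body(uint64_t x) {
    return (x ^ (x >> 7)) * 0x9E3779B97F4A7C15ull;
}

/* Independent iterations: each call depends only on the loop counter,
   so an out-of-order CPU is free to overlap many of them. */
uint64_t run_independent(uint64_t iters) {
    uint64_t sum = 0;
    for (uint64_t i = 0; i < iters; i++) {
        sum += body(i);
    }
    return sum;
}

/* Chained iterations: the input of each call is the output of the
   previous one, so the next computation cannot begin until the
   previous result is available. */
uint64_t run_chained(uint64_t iters, uint64_t seed) {
    uint64_t x = seed;
    for (uint64_t i = 0; i < iters; i++) {
        x = body(x);
    }
    return x;
}
```

Timing both variants over the same iteration count gives a rough idea of how much overlap the CPU was getting away with.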
At low levels it’s worth having a rough idea of clock speed, instructions per clock, memory latency and bandwidth, so that you can spot when the compiler’s or CPU’s optimizers are fooling you.
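A back-of-the-envelope version of that sanity check, as a sketch. The function names and the numbers in the comments are illustrative assumptions, not measurements of any particular CPU.

```c
/* Rough lower bound on time per loop iteration, in nanoseconds,
   given an instruction count per iteration, a sustained
   instructions-per-clock figure, and a clock speed in GHz
   (cycles per nanosecond). */
double min_ns_per_iter(double instructions, double ipc, double ghz) {
    double cycles = instructions / ipc;
    return cycles / ghz;
}

/* Returns 1 if a measured time is below the rough bound, which
   suggests the compiler or CPU skipped or overlapped the work. */
int suspiciously_fast(double measured_ns, double instructions,
                      double ipc, double ghz) {
    return measured_ns < min_ns_per_iter(instructions, ipc, ghz);
}
```

For instance, 12 instructions per iteration at an assumed 4 instructions per clock and 3 GHz is 3 cycles, or 1 ns; a measured 0.2 ns per iteration would be worth a look at the disassembly.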
This is even more obvious in Verilog with Yosys/ABC. I write the code from the inputs towards the outputs, and when I wanted to estimate how many of my FPGA's cells I had used up so far, I was quite surprised to see a complex circuit optimized away to a constant, because I hadn't yet implemented any non-trivial output. I had to ensure that each internal signal I hadn't yet used was somehow reflected in a non-trivial way in the output signal. I didn't want to sum them, because that would have used too many additional cells, but an XOR tree did the trick. (OR and AND won't work if any of your signals happens to be the constant 1 or 0.) The same goes for inputs, but there, expanding a single bit wouldn't work (ISTR that it could deduce that AA...A==01...0 is always false and delete many gates). For example, if I haven't yet implemented communication with external RAM, I have to provide the full N bits of dummy input signals at the top-level module to keep it all alive. There is probably a better way to do all this with Verilog attributes, but I haven't explored that deeply.
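The reason XOR works where OR and AND don't can be modelled in a few lines of C. This is only a model of the logic, not a synthesis example, and the function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* XOR fold: a constant signal only flips or passes bits, so the
   result still depends on every other signal. */
uint32_t fold_xor(const uint32_t *sig, size_t n) {
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++) acc ^= sig[i];
    return acc;
}

/* OR fold: a single all-ones signal forces the result to all ones,
   so everything else feeding it becomes removable. */
uint32_t fold_or(const uint32_t *sig, size_t n) {
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++) acc |= sig[i];
    return acc;
}

/* AND fold: a single all-zeros signal forces the result to zero. */
uint32_t fold_and(const uint32_t *sig, size_t n) {
    uint32_t acc = ~(uint32_t)0;
    for (size_t i = 0; i < n; i++) acc &= sig[i];
    return acc;
}
```

With signals `{x, 0xFFFFFFFF, y}`, `fold_or` returns `0xFFFFFFFF` whatever `x` and `y` are, which is exactly the kind of constant an optimizer will exploit, while `fold_xor` still changes whenever `x` or `y` does.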