The fastest Linux timestamps
19 points by hmpc
I really enjoyed your post!
For the majority of my work, this is far beyond the level of detail or complexity I encounter. The most complex timing problem I commonly have to address is reminding people to use std::chrono::steady_clock (or, e.g., time.perf_counter) instead of std::chrono::system_clock (or, e.g., time.time / datetime.datetime.now()).
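For anyone following along, a minimal C++ sketch of that pattern (do_work() is a hypothetical stand-in for whatever is being timed):

    #include <chrono>
    #include <cstdio>

    // Hypothetical stand-in for the code being timed.
    static void do_work() {
        volatile long sink = 0;
        for (long i = 0; i < 1000000; ++i) sink += i;
    }

    int main() {
        // steady_clock is monotonic: it never jumps under NTP slews or
        // manual clock changes, so intervals measured with it stay sane.
        // system_clock answers "what time is it?", not "how long did it take?"
        auto start = std::chrono::steady_clock::now();
        do_work();
        auto stop = std::chrono::steady_clock::now();
        long long ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        std::printf("elapsed: %lld ns\n", ns);
    }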
Despite my lack of depth in this area, I thought the complexity you presented was very well-motivated.
After all, if you can guarantee that instrumentation imposes a penalty whose tails are bounded to ≤5%, that makes a strong argument for eliminating the instrumented/non-instrumented modality altogether. It occurs to me that eliminating this modality in very high-performance routines is very organisationally beneficial: it's simply very easy to turn off instrumentation to get a boost for the most performance-sensitive code (certainly easier than addressing a fault in the design of the code itself), yet it's probably the case that when something goes wrong (especially an unexpected performance degradation in production), the lack of instrumentation becomes a huge barrier to debugging.
There are a couple of details I have to infer from the code samples you shared (and, of course, this is not my area of expertise, so I can only guess at things…).
For your benchmarks, you set “[a]ll cores [...] to use the performance governor with a fixed 4.1 GHz frequency”, which suggests to me that vdso.mult and vdso.shift should be constant over the runtime of the benchmark, so capturing these in every call to read_clock (and performing the subsequent arithmetic in elapsed) should be unnecessary. I presume that this assumption does not hold for real production workloads (despite their running on “overpowered CPUs used in datacentres”).
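My mental model of the conversion here, inferred from the post rather than quoted from it, is the kernel's usual fixed-point scheme:

    #include <cstdint>
    #include <cstdio>

    // Sketch of the vDSO-style fixed-point conversion as I understand it
    // (not the author's exact code): the kernel picks mult and shift so
    // that nanoseconds ≈ (cycles * mult) >> shift at the current TSC
    // frequency, and rewrites them whenever the frequency is adjusted.
    static inline uint64_t cycles_to_ns(uint64_t cycles, uint32_t mult, uint32_t shift) {
        return (cycles * mult) >> shift;  // overflows for very long deltas
    }

    int main() {
        // Illustrative values only: mult ≈ 2^32 / 4.1 with shift = 32
        // gives nanoseconds per cycle at a fixed 4.1 GHz.
        uint32_t mult = 1047553000u, shift = 32;
        std::printf("%llu ns\n", (unsigned long long)cycles_to_ns(4100u, mult, shift));  // ~1000 ns
    }

At a truly fixed frequency, mult and shift never change, hence my question.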
You first stop rereading these values in elapsed: “Instead of loading the multiplier and shift from the data page at the end of the interval, we cache them in the timer itself at the cost of an extra 8 bytes, avoiding entering the critical region again and the attendant possibility of a cache miss after an update”, noting that “this may produce slightly different measured values than our initial approach in case the multiplier is updated during the timed interval.” You then move to caching these values yourself (out of sight, in an elided read() call): “… frequency adjustments are gradual enough that it won’t matter if we miss a few… we can cache the required data ourselves and refresh it at some acceptable frequency…”
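To check that I've followed, here's my guess at the shape of that first step (all names are my own; __rdtsc() stands in for whatever cycle read you use, and the data-page read is a placeholder rather than the real seqlock-protected one):

    #include <cstdint>
    #include <x86intrin.h>  // __rdtsc()

    // Placeholder for the vDSO data-page read; the real thing goes
    // through a seqlock-protected critical region.
    static void read_mult_shift(uint32_t* mult, uint32_t* shift) {
        *mult = 1047553000u;  // illustrative: ≈ 2^32 / 4.1 for 4.1 GHz
        *shift = 32;
    }

    // My guess at the timer layout: mult/shift snapshotted at start,
    // costing the extra 8 bytes the post mentions.
    struct Timer {
        uint64_t start_cycles;
        uint32_t mult;
        uint32_t shift;
    };

    static void timer_start(Timer* t) {
        read_mult_shift(&t->mult, &t->shift);  // one critical-region entry
        t->start_cycles = __rdtsc();
    }

    static uint64_t timer_elapsed_ns(const Timer* t) {
        // No second data-page read here, so no cache-miss risk after an
        // update; if mult changed mid-interval the result drifts slightly,
        // exactly the trade-off the post calls out.
        return ((__rdtsc() - t->start_cycles) * t->mult) >> t->shift;
    }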
So, if we're assuming that there is some separate mechanism (a refresher thread, say) for maintaining this data, and assuming the latency of writing out to the telemetry system is orders of magnitude greater than what you are trying to optimise, would it be a useful further optimisation to skip all of the arithmetic entirely?
(Given that you note that previous steps have the advantage of “saving us a load, a multiplication, and an addition”…) Is there a reason that it is necessary to compute ((cycles - cache.last) * cache.mult) >> cache.shift inline as part of the instrumentation hot path, rather than doing this as part of the telemetry path?
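In code, what I'm imagining is something like this (entirely my speculation, reusing the hypothetical names from above):

    #include <cstdint>
    #include <x86intrin.h>  // __rdtsc()

    // Speculative sketch of the question: the hot path records only the
    // raw cycle delta, and the conversion moves to the telemetry path,
    // which is orders of magnitude slower anyway.
    struct Sample {
        uint64_t raw_cycles;  // unconverted delta from the hot path
    };

    static inline void record(Sample* s, uint64_t start_cycles) {
        // No load of mult/shift, no multiply, no shift here.
        s->raw_cycles = __rdtsc() - start_cycles;
    }

    // Telemetry/flush path: apply whatever mult/shift the refresher
    // thread has most recently cached.
    static uint64_t sample_to_ns(const Sample* s, uint32_t mult, uint32_t shift) {
        return (s->raw_cycles * mult) >> shift;
    }

The one wrinkle I can see is that the mult/shift in effect at flush time may differ from the one in effect during the interval, but your cached variants already accept that class of error.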