How a 40-Line Fix Eliminated a 400x Performance Gap
39 points by lalitm
Bad title, good article. Don't go for this kind of title if you have good content: that's really underselling.
For people who may not click: the topic is about some clever and undocumented extra API through the POSIX clock_* functions on Linux. By using free bits in the clock identifier values, Linux encodes extra information for the API so that you can request other data and make the kernel do less work (and therefore give you a reply faster).
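To make the bit trick concrete, here is a minimal Python sketch, assuming Linux. The CPUCLOCK_* constants mirror my reading of the kernel's include/linux/posix-timers.h and are not exported by Python's time module; the helper names are hypothetical and this is not the article's actual patch. The idea is to take the per-thread CPU clock id from pthread_getcpuclockid, replace its low two "which clock" bits with the user-time selector, and pass the result to clock_gettime.

```python
import threading
import time

# Assumed layout of a Linux CPU-clock id (from the kernel's posix-timers.h):
# the low two bits select which counter to read, bit 2 marks a per-thread
# clock, and the remaining upper bits hold the bitwise-inverted pid/tid.
CPUCLOCK_CLOCK_MASK = 3
CPUCLOCK_PROF = 0   # user + system time
CPUCLOCK_VIRT = 1   # user time only
CPUCLOCK_SCHED = 2  # scheduler-accounted total time


def thread_user_time_clock_id(clock_id: int) -> int:
    """Hypothetical helper: keep the thread bits, select the user-time counter."""
    return (clock_id & ~CPUCLOCK_CLOCK_MASK) | CPUCLOCK_VIRT


def current_thread_user_seconds() -> float:
    """User-mode CPU time of the calling thread, read with a single syscall."""
    total_id = time.pthread_getcpuclockid(threading.get_ident())
    return time.clock_gettime(thread_user_time_clock_id(total_id))


if __name__ == "__main__":
    sum(i * i for i in range(200_000))  # burn some user-mode CPU first
    print(f"user CPU seconds: {current_thread_user_seconds():.6f}")
```

Only the selector bits change, so the id still refers to the same thread; the kernel just returns a different counter for it.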
Right? I almost didn't look. Something like
"Using undocumented Linux ABI features for performance improvements" or the like would've had me click instantly.
Author here. I hear you. I was torn on the title, but I'm glad the content held up.
Regarding the API: it's a part of the Linux ABI, which is famously stable. It's less "undocumented" and more "hidden in plain sight" because most people never need to go deeper than the aggregate CLOCK_THREAD_CPUTIME_ID.
Today's clever and undocumented extra API bits quite often become tomorrow's legacy maintenance nightmare. "Hey, we'll never use those upper bits …" should come with a mental health warning.
Yes, this can definitely be an issue. I'm torn on this one: it's very old by now and hasn't caused any issues so far. The clock id is also a very sparsely populated type, so I don't feel bad about the abuse: there really are many unused bits.
Yeah, I don't like the 400x gap claim either, because it is specific to this one function that is already pretty fast AND rarely called anyway, so the difference is virtually guaranteed to be imperceptible to the user in almost all actual use. But I agree, the article itself was fine, and I might use this trick in my own program! I was writing a shell in October, and while I got distracted and haven't finished it yet, one of the last bits is making a kind of time command to measure performance, and this looks perfectly fine for fetching that kind of info. (Though I think I still have to look at proc stat for memory use anyway, so maybe not: if I'm pulling from procfs anyway, the data is right there and I might as well use it. I'll see. Maybe there's something similar for fetching the memory stats.)
What libc are they using? I'm surprised to even see a syscall for clock_gettime, it should be a function call to __vdso_clock_gettime, no?
You can see in the flamegraph — it's the vDSO function making the syscall… Because not all clocks are implemented via the fast path there!
Currently it's CLOCK_REALTIME, CLOCK_MONOTONIC, CLOCK_BOOTTIME, CLOCK_TAI, CLOCK_REALTIME_COARSE, CLOCK_MONOTONIC_COARSE, CLOCK_MONOTONIC_RAW, and CLOCK_AUX…CLOCK_AUX_LAST. That list does not include CLOCK_THREAD_CPUTIME_ID and it definitely does not include this "secret" CLOCK_CURRENT_THREAD_USERTIME :)
Any idea how to make those interactive SVGs for free? I see a lot of paid tools
No need to pay for anything. It's very easy to do with Brendan Gregg's flamegraph.pl which is free and open source: https://github.com/brendangregg/FlameGraph
$ perf record -F 99 -a -g -- sleep 60 && perf script > out.perf
# Or in the case of the article, they used async-profiler
# https://github.com/async-profiler/async-profiler
# and generated stacks from that
$ ./stackcollapse.pl out.perf | ./flamegraph.pl > example.svg
Ah nice thanks. I was interested in interactive SVGs in general but I’ll dig into that library to see what I can make work