PMU Counters on Apple Silicon
19 points by ohrv
I’ve always wondered what the reasoning is for not documenting frameworks like this. Is it really about protecting IP? Surely it’s not a lack of resources.
I can't speak to kperf, but my experience with Instruments' UI is that it is not really clear (purely at the UI level) which "bad" events (things like mispredicts and whatnot) actually matter. E.g. it will report "x% of these mispredicted", but the UI makes it difficult to tell "but is that highly mispredicted thing actually responsible for significant amounts of performance loss?" Imagine two different functions that both contain if(rand()%2): the UI does not make it clear which (if either) of those functions is actually causing more of a performance impact than the other. So while I've tried to use the tools available, I've found the UI makes things fairly opaque and hard to reason about.
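Concretely, here's a minimal (hypothetical) sketch of that situation: both branches mispredict at roughly 50%, so a mispredict-rate view ranks them identically, yet only one of them meaningfully affects runtime:

    #include <stdlib.h>

    /* Both branches mispredict ~50% of the time, so a "% mispredicted"
     * view looks the same for each, but hot() runs a million times more
     * often, so only it has a meaningful performance impact.
     * (Function names and iteration counts are made up for illustration.) */
    static long hot(long acc) {            /* called 10M times */
        if (rand() % 2) acc += 1; else acc -= 1;
        return acc;
    }

    static long cold(long acc) {           /* called 10 times */
        if (rand() % 2) acc += 3; else acc -= 3;
        return acc;
    }

    int main(void) {
        long acc = 0;
        for (long i = 0; i < 10 * 1000 * 1000; i++) acc = hot(acc);
        for (long i = 0; i < 10; i++)               acc = cold(acc);
        return (int)(acc & 1);             /* keep the result live */
    }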
There are also some other simply weird UI issues, which mean that even getting the proportional information for anything requires clicking the correct part of the correct row among numerous very similar rows, then choosing the correct option in a dropdown that only contains the required option if you ran the correct PMU-based profiling mode and, as stated, selected the right one of numerous nearly identical UI elements.
> the UI makes it difficult to tell "but is that highly mispredicted thing actually responsible for significant amounts of performance loss?"
Intel has something they call top-down microarchitecture analysis (TMA) [1], with top-level metrics like Front-End Bound, Back-End Bound, and Bad Speculation, which you can use to help answer questions like this. ARM has their own version, but it looks like it might only be for their Neoverse server cores, so I'm not sure there's an equivalent for Apple Silicon. Even in the x86 world it's not possible (including on Zen 5, last I checked) to replicate something like Intel's TMA on AMD cores, because Intel added a bunch of TMA-specific counters to support the methodology (and it's probably a patent minefield like everything else).
[1] The original paper is "A Top-Down Method for Performance Analysis and Counters Architecture" (Yasin, ISPASS 2014) and has some of the technical details.
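For a sense of what the methodology computes, here is a sketch of the level-1 TMA breakdown from the paper, assuming a 4-wide pipeline and Intel's event names (the counter values themselves are made-up example numbers):

    #include <stdio.h>

    int main(void) {
        /* Hypothetical raw counter readings from one measurement run. */
        double cycles       = 1e9;    /* CPU_CLK_UNHALTED.THREAD     */
        double fe_stalls    = 4e8;    /* IDQ_UOPS_NOT_DELIVERED.CORE */
        double uops_issued  = 2.8e9;  /* UOPS_ISSUED.ANY             */
        double uops_retired = 2.5e9;  /* UOPS_RETIRED.RETIRE_SLOTS   */
        double recovery_cyc = 2e7;    /* INT_MISC.RECOVERY_CYCLES    */

        /* Total issue slots: 4 per cycle on a 4-wide machine. */
        double slots = 4.0 * cycles;

        double frontend_bound  = fe_stalls / slots;
        double bad_speculation =
            (uops_issued - uops_retired + 4.0 * recovery_cyc) / slots;
        double retiring        = uops_retired / slots;
        double backend_bound   =
            1.0 - frontend_bound - bad_speculation - retiring;

        printf("Front-End Bound  %5.1f%%\n", 100.0 * frontend_bound);
        printf("Bad Speculation  %5.1f%%\n", 100.0 * bad_speculation);
        printf("Retiring         %5.1f%%\n", 100.0 * retiring);
        printf("Back-End Bound   %5.1f%%\n", 100.0 * backend_bound);
        return 0;
    }

A high Bad Speculation fraction is what would flag the rand()%2 example above as an actual bottleneck, rather than just a scary-looking mispredict percentage.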
Oh yeah, this is 100% a UI problem, I don’t think there’s anything inherent to the data being captured, solely its presentation.
I really hate how this is root only. There are many environments where it's both useful to have this info on demand, and frustrating to have to run as root, e.g. Jupyter notebooks.
For things like Jupyter notebooks I would have thought you’d be fine with basic sampling based profiles?
That's a start, but beyond a certain point (when writing compiled code) it can be useful to look at things like branch mispredictions, cache misses, etc.
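For contrast, Linux lets a process count these events for itself without root (subject to the perf_event_paranoid sysctl) via perf_event_open; there's no public macOS equivalent, since kperf is private and root-only. A minimal sketch of counting branch mispredictions around a region of interest:

    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>

    /* Open a hardware counter for this process on any CPU,
     * counting user-space events only. */
    static int open_counter(unsigned long long config) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size           = sizeof(attr);
        attr.type           = PERF_TYPE_HARDWARE;
        attr.config         = config;
        attr.disabled       = 1;
        attr.exclude_kernel = 1;
        /* pid = 0 (this process), cpu = -1 (any), no group, no flags */
        return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main(void) {
        int fd = open_counter(PERF_COUNT_HW_BRANCH_MISSES);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ... region of interest ... */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        long long misses = 0;
        read(fd, &misses, sizeof(misses));
        printf("branch misses: %lld\n", misses);
        close(fd);
        return 0;
    }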
As an added sadness, for Python kernels at least, most sampling profilers don't work in a useful way with Jupyter. Maybe the new profiler in Python 3.15 will fix that.