Data Access Patterns That Makes Your CPU Really Angry

93 points by weineng

benj

So, this is how internal Windows 11 dev research looks like! /s

That was a fun read!

https://github.com/ob/cache was my experience learning this

Sietsebb

That looks interesting. Can I ask how I should understand these two pieces of the README?

I first saw the technique used for generating these latency numbers in exercise 5.2 on page 476 of Computer Architecture: A Quantitative Approach

vs.

The idea comes from the Ph.D. dissertation of Rafael Héctor Saavedra‑Barrera

Do these refer to different things, or do they contradict each other, or are they about the same thing and is Saavedra-Barrera cited by CA:AQA?

It's not clear to me whether the Claude program was kept out of the README or not[1], so I'd rather check with you first to know if the references are real.

The idea looks cool, hence my interest <3

[1] The Claude program is credited alongside the author as a repo contrbutor
- ob
  
  The book cites the thesis and I tracked down the original[1]. I used Claude to clean up some stuff and had it write a Python program to make it easier to analyze the results. The core of the program is hand-written in like 2009 or so, way before AI :-)
  
  The idea is really cool and I used to run this analysis in every new computer I was programming to get a sense of the memory hierarchy and timings.
  
  [1] I clarified the language in the README, you were right that Claude had messed with it :-/
  - Sietsebb
    
    That's grand, thanks for taking the time to answer my question. And props for citing the original source as well as the source you found it in. I look forward to trying the exercise for myself soon. Cheers!
- shdown
  
  uint32_t random = rand() % remaining;
  
  Uh, on glibc, RAND_MAX is 32767. So this is far from a really random shuffle of all elements (you have 2^26 of them). Not sure if this affects the results, though.
- thecloudlet
  
  Recommending my own simulator for customizing custom cache hierarchy access experience.
  
  https://github.com/TheCloudlet/Stratum
- orib
  
  How did you separate the overhead of calculating the indexes from the cost of accessing them?
  - weineng
    
    I'm not sure what you mean. Do you mean how I measure accumulator cycles without also counting the time taken to create positions? If so, the following is the macro i used:
    
    #define BENCHMARK_ACCESS_PATTERN(arrange_positions) \ do { \ reset(data, positions); \ arrange_positions(data, positions); \ uint64_t start = rdtsc_start(); \ uint32_t sum = accumulator(data, positions); \ uint64_t end = rdtsc_end(); \ print_cycles(#arrange_positions, end - start); \ assert(sum == linear_scan_sum); \ do_not_optimize(sum); \ } while (0)
    
    orib
    
    Yep, that's what I mean.
  - JulianWgs
    
    Very good read! It would be very interesting if it is possible to find the same solution with reinforcement learning (with or without LLMs). The search space is huge, but the solution "only" involves some quite simple patterns.
  - olliej
    
    I was planning on seeing what’s this looked like on my MacBook, and noticed it has (as is so obnoxiously common) hard coded the page size to 4k :-/
    
    Edit: “was” is meant as “so I looked at the code first”, not “so I flipped the (page? :) ) table and walked off
  - trenchant
    
    Lets try with Data Access Patterns that our professors taught us.
    
    We'll start with a SafeNumber.java class. Then we'll need an add(SafeNumber b) member...
    
    Next semester we'll learn how to architect this behind Redis for webscale performance.
    
    Since Claude made it easy, I translated the benchmark to Java and I am impressed to learn that it's only 4x slower.:
    
    Absolute times (nanoseconds, 67M elements) ┌──────────────────────┬────────────────┬───────────────────┐ │ access pattern │ C++ uint32_t[] │ Java SafeNumber[] │ ├──────────────────────┼────────────────┼───────────────────┤ │ linear │ 10,258,584 │ 36,740,667 │ ├──────────────────────┼────────────────┼───────────────────┤ │ fisher_yates_shuffle │ 264,856,042 │ 535,724,208 │ └──────────────────────┴────────────────┴───────────────────┘ --- Relative to C++ linear (1.0×) ┌──────────────────────┬────────────────┬───────────────────┐ │ access pattern │ C++ uint32_t[] │ Java SafeNumber[] │ ├──────────────────────┼────────────────┼───────────────────┤ │ linear │ 1.0× │ 3.6× │ ├──────────────────────┼────────────────┼───────────────────┤ │ fisher_yates_shuffle │ 25.8× │ 52.2× │ └──────────────────────┴────────────────┴───────────────────┘
    
    kev009
    
    The java multiple isn't very good, maybe project valhalla will get it closer?