Optimizing FizzBuzz in Rust
21 points by nrposner
If you’re writing a proc macro that expands fizzbuzz, the optimal approach for low numbers of iterations would likely be expanding the result to a string instead of code. E.g. “1\n2\nFizz\n4\nBuzz…”. This would be pretty much zero runtime cost.
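A minimal sketch of the idea: the helper below is what a hypothetical proc macro could call at expansion time, emitting the result as a `&'static str` literal so the runtime cost is just one write. The function name and shape are my own, not from any existing crate.

```rust
use std::fmt::Write;

/// Build the FizzBuzz output for 1..=n as a single string.
/// A hypothetical proc macro could call this at expansion time
/// and splice the result into the program as a string literal.
fn fizzbuzz_string(n: u32) -> String {
    let mut out = String::new();
    for i in 1..=n {
        match (i % 3, i % 5) {
            (0, 0) => out.push_str("FizzBuzz\n"),
            (0, _) => out.push_str("Fizz\n"),
            (_, 0) => out.push_str("Buzz\n"),
            _ => writeln!(out, "{i}").unwrap(),
        }
    }
    out
}

fn main() {
    // After macro expansion, the program would just print the literal.
    print!("{}", fizzbuzz_string(15));
}
```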
I recall being asked in an interview effectively the same question with a small twist - convert time of day to english (e.g. “A quarter to 3”). My immediate answer was that there are 24*60 items, which for anything except an embedded context is small enough to pre-compute a lookup table, and which likely performs well for all use cases. The interviewer hadn’t expected an answer that breaks the question’s assumptions, so I suggested that we do the programming bit for how we’d generate that table, so that he had an actual data point to measure me against.
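A sketch of that table-generation bit; the exact English phrasing here is my own guess at what the interviewer wanted, not anything from the thread:

```rust
/// Render a time of day in (rough) English. The phrasing rules
/// here are assumptions for illustration.
fn english_time(h: usize, m: usize) -> String {
    let hour12 = |h: usize| match h % 12 {
        0 => 12,
        x => x,
    };
    match m {
        0 => format!("{} o'clock", hour12(h)),
        15 => format!("a quarter past {}", hour12(h)),
        30 => format!("half past {}", hour12(h)),
        45 => format!("a quarter to {}", hour12(h + 1)),
        m if m < 30 => format!("{} past {}", m, hour12(h)),
        m => format!("{} to {}", 60 - m, hour12(h + 1)),
    }
}

/// Precompute all 24 * 60 = 1440 entries once; lookups are then
/// a single index into the table.
fn build_table() -> Vec<String> {
    (0..24 * 60).map(|i| english_time(i / 60, i % 60)).collect()
}

fn main() {
    let table = build_table();
    println!("{}", table[2 * 60 + 45]); // prints "a quarter to 3"
}
```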
At what point should a given piece of software not just be a lookup table?
Probably other cases
If the goal is to squeeze out the maximal performance, deciding whether to use a lookup table as opposed to doing some other constant-time or near-constant-time computation can actually be a bit complex. Just because looking up values by index in a table uses constant time in the input size doesn’t necessarily imply it is the best choice. This post discusses the tradeoffs in using a LUT; see also the Lobsters discussion. In particular, in microbenchmarks, a LUT-based approach will appear to blow the competition out of the water because the table will remain in cache for the entire time. This may not be the case on a real workload, and ensuing cache misses can be very costly.
To give a cool example where doing some constant-time computation is faster than an ‘obvious’ use of a LUT, suppose we want to compute the day-of-year given the month and day-of-month. There is an obvious and fairly efficient solution that uses a LUT mapping months to the total days up to that month, but some efficient date libraries don’t do this! See “Computing day-of-year from month and day-of-month” in this article: the idea is that the function that maps months to total days up to that month is almost linear (with slope between 30 and 31), and with truncating division and some offsetting you can obtain a direct formula. (Of course the lookup table is still going to be very fast.)
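To make the comparison concrete, here is a sketch of both variants for a non-leap-centric calendar. The direct formula is a well-known identity of this family; it may differ in details from the one in the linked article:

```rust
/// Cumulative days before each month in a non-leap year (LUT variant).
const CUM_DAYS: [u32; 12] = [0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334];

/// LUT version: one table lookup plus the day-of-month.
fn day_of_year_lut(month: u32, day: u32, leap: bool) -> u32 {
    CUM_DAYS[(month - 1) as usize] + day + if leap && month > 2 { 1 } else { 0 }
}

/// Direct-formula version: the cumulative-days function is nearly
/// linear in the month, so a truncating division plus a small
/// leap/non-leap offset recovers it exactly, with no table.
fn day_of_year_formula(month: u32, day: u32, leap: bool) -> u32 {
    let adj = if month <= 2 {
        0
    } else if leap {
        1
    } else {
        2
    };
    (367 * month - 362) / 12 - adj + day
}

fn main() {
    // March 1 in a non-leap year is day 60 (31 + 28 + 1).
    assert_eq!(day_of_year_formula(3, 1, false), 60);
    assert_eq!(day_of_year_formula(12, 31, false), 365);
    assert_eq!(day_of_year_formula(12, 31, true), 366);
}
```

The two agree on every valid (month, day) pair, but the formula version replaces a memory load with a multiply and a shift-like division by a constant, which the compiler lowers to cheap integer ops.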
On the topic of fast FizzBuzz: today I stumbled across an update to this famous Code Golf Stack Exchange FizzBuzz optimization problem. (The goal is simply to output the FizzBuzz sequence as far as possible.) There is a well-known x86 Assembly solution that achieves an incredible throughput of ~60 GiB/s; if you haven’t seen it yet, go read it! However, today I realized that, as of 2024, that solution has been dethroned by an even faster C++ solution that achieves a whopping ~200 GiB/s!
Just for context on how incredible those numbers are: the original 60 GiB/s was already faster than memcpy, and achieves zero-copy output by using the vmsplice syscall.
Worth noting that this is a little bit apples to oranges.
I really hate that I am typing this and ruining the fun, but the Assembly variant was run on an Intel CPU (I believe a Skylake derivative, which has a slower L3 cache), while the C++ example was run on an AMD Zen 3.
It makes a substantial difference to benchmarks like these; the C++ and ASM code golfs perform almost identically on my Xeon Platinum 8168 (Skylake) machine.
Reasonable point. The numbers in individual posts are certainly not useful for relative comparison, because they are run on different machines. However, in the original post, the problem author runs all the submitted programs on their machine:
Scores are from running on my desktop with an AMD 5950x CPU (16C / 32T). I have 64GB of 3200Mhz RAM. CPU mitigations are disabled.
and the numbers I quoted are from there.
Holy cow! I had started this as an exercise in improving my Rust skills… looks like I’ll need to brush up on Assembly and C++ and give these a peek: let’s see how well Rust’s inline assembly supports it.
This is terrible: everyone knows that for real performance you need a jit :D :D :D
Yes. Here is a JIT for DIVSPL which generates assembly comparable to what one might hand-write; each program is incrementally compiled to a tree of modulus-and-check routines, each of which is specialized using the multiplicative-inverse-modulo-a-power-of-two trick to avoid divisions. Like many of the other linked solutions, it is I/O-bound.
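For anyone unfamiliar with that trick: for an odd divisor d, divisibility can be tested with one wrapping multiply and one compare, no division at all. A sketch of the standard version (the constants below are what such a JIT would bake into each specialized check; how the linked JIT organizes this exactly is its own business):

```rust
/// Multiplicative inverse of odd `d` modulo 2^32, via Newton's
/// iteration: each step doubles the number of correct low bits,
/// and x = d is already correct to 3 bits (odd squares are 1 mod 8).
fn mod_inverse(d: u32) -> u32 {
    assert!(d % 2 == 1, "inverse exists only for odd d");
    let mut x = d;
    for _ in 0..5 {
        x = x.wrapping_mul(2u32.wrapping_sub(d.wrapping_mul(x)));
    }
    x
}

/// `n` is divisible by odd `d` iff n * inv(d) (wrapping) lands in
/// [0, floor((2^32 - 1) / d)] -- one multiply, one compare.
fn divisible(n: u32, d: u32) -> bool {
    n.wrapping_mul(mod_inverse(d)) <= u32::MAX / d
}

fn main() {
    // Sanity check against plain modulus for the FizzBuzz divisors.
    for n in 0..100u32 {
        assert_eq!(divisible(n, 3), n % 3 == 0);
        assert_eq!(divisible(n, 5), n % 5 == 0);
    }
}
```

In a real compiler or JIT, `mod_inverse` runs once per divisor at compile time, so only the multiply-and-compare survives into the hot loop.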