Rust zero-cost abstractions vs. SIMD
24 points by emschwartz
A zero-cost abstraction promises that we couldn't hand-write faster assembly for the same logic.
Nobody who knows what they're talking about has ever said this. Not once. Never.
It only means the Iterator abstraction has no inherent runtime cost over the alternative natural expression of the same logic in the language. Assembly is not Rust; Rust makes no claims about what you can do with assembly.
This was an interesting read!
Nicole's post and my follow-up show that LLVM can inline a next call and flatten iterators... when it feels like it. I guess it didn't inline in this case because of the complexity of next: "Each call to next() compares across iterators, mutates internal positions, and returns one value."
Which aligns with the broader point of the article: "zero-cost abstractions do not absolve you from practicing mechanical sympathy." Totally agreed!
I don't think I'd understood "zero-cost abstraction" as "couldn't hand-write faster assembly for the same logic". Specifically "assembly": lowering to assembly implies embedding assumptions and choices that weren't in the original code (choice of compiler, ISA, etc.), unless you're using intrinsics, which... seems different.
In the case of an iterator, the iterator abstraction is "just" covering the while-next loop. Function calls are not a zero-cost abstraction; they are an abstraction.
One thing that may be of note is that this implementation could be done by specializing the (nightly-only) next_chunk method on Iterator, which would be nice in terms of keeping good Iterator ergonomics.
(This of course is unrelated to the point of zero-cost)
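To make the batching idea concrete, here is a minimal sketch of a stable-Rust stand-in for the nightly next_chunk method (the function name and signature are my own, not std's API):

```rust
// Hypothetical stable-Rust stand-in for the nightly Iterator::next_chunk:
// pull up to N items out of any iterator into one batch, so the caller
// can loop tightly over the batch instead of calling next() per element.
fn next_batch<I: Iterator, const N: usize>(iter: &mut I) -> Vec<I::Item> {
    let mut buf = Vec::with_capacity(N);
    for _ in 0..N {
        match iter.next() {
            Some(v) => buf.push(v),
            None => break, // underlying iterator exhausted mid-batch
        }
    }
    buf
}
```

A specialized next_chunk on the merge iterator itself could fill the chunk directly from its internal state, which is where the real win would come from; this generic version still pays for one next() per element.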
I’d like to know what their next() functions are actually doing. It certainly seems significantly more complex than a normal iterator, but given that the logic can apparently be autovectorised when written as a normal loop, I’m wondering what the iterator version is actually doing.
They wrote
The scan() function builds the merge from all candidate key range iterators (here just slices of integers with a position), then drains it in a loop, with each call to next() recursing down through the merge structure to produce one value at a time.
“The merge” is some kind of merge tree. They made the loop vectorizable by extracting batches from the merge tree and doing a tight loop over the resulting vec. I think vectorizing is not the main story, but instead most of the performance win came from a more efficient tree walk. The microbenchmark showed a 60x speedup which is a far greater factor than I would expect from the scalar/vector change that the blog post focuses on.
I just wanted to comment on something that stood out to me structurally:
// next() recurses down to produce the smallest current key
let mut sum: u64 = 0;
while let Some(v) = merge_tree.next() {
    // Simple arithmetic: sum if even
    let x = v + 1;
    if x % 2 == 0 { sum += x; }
}
sum
In performance-sensitive code you basically never want to use Iterator::next() (directly or indirectly) for any flatten-style iterator stacks involving chain(), flat_map(), etc., or for tree iterators. The problem is that the inversion of control usually produces very suboptimal machine code from LLVM in non-trivial cases like that.
Instead you want to do everything with try_fold/fold (or the functions based on fold like for_each, map, etc). Assuming all your flatten-style iterators have custom try_fold implementations (not the default next-based implementation) the residual code after inlining is basically what you'd write by hand; it's more like macro expansion than sophisticated optimization. Last I checked all the important flatten-style std::iter iterators satisfied this property. Last I checked that wasn't the case for std's B-tree iterators or the third-party itertools crate.
One downside from a performance perspective about the try_fold design is that it conjoins two capabilities: stopping (the 'try' part) and resuming. This is reflected in the &mut self receiver type. Resumability is obviously important for many cases. But for tree iterators in particular, a non-resumable try_fold (receiving self rather than &mut self) is something you can write as a simple recursive function. To make that resumable, you would need to track an explicit path stack (or another cursor representation), which is extra cost and complexity.
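Here is what that looks like for a hypothetical tree type (the Tree enum is mine, purely for illustration): a non-resumable try_fold that takes self by value is just a plain recursive function. Early exit works via `?`, but once it returns you cannot pick up where it stopped, because no cursor or path stack was ever materialized:

```rust
enum Tree {
    Leaf(u64),
    Node(Box<Tree>, Box<Tree>),
}

impl Tree {
    // Non-resumable try_fold: consumes the tree, recurses left then right.
    // Compare with std's resumable &mut self try_fold, which must keep
    // enough state to continue after an early exit.
    fn try_fold<B, E, F>(self, acc: B, f: &mut F) -> Result<B, E>
    where
        F: FnMut(B, u64) -> Result<B, E>,
    {
        match self {
            Tree::Leaf(v) => f(acc, v),
            Tree::Node(left, right) => {
                let acc = left.try_fold(acc, f)?; // stop propagates up
                right.try_fold(acc, f)
            }
        }
    }
}
```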
Now, maybe for this merge tree use-case you really want or need the pull-oriented interface of next. In which case next_batch is probably the right approach. There's a lot of precedent for that kind of pull-mode batching of interconnected dataflow operators when implementing database queries with the so-called Volcano iterator model. In that case the batching not only amortizes the cost of transferring control (and data if you're not zero-copy) between the dataflow operators, but it also provides a natural Goldilocks granularity for vectorization.
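A rough sketch of that Volcano-style batched pull interface (the trait and names here are hypothetical, not from the article): the operator refills a caller-owned buffer, and the consumer runs a tight loop over each contiguous batch, which is both cheap to iterate and friendly to autovectorization:

```rust
trait BatchIterator {
    type Item;
    // Refill `out` with up to its capacity in items; return false when exhausted.
    fn next_batch(&mut self, out: &mut Vec<Self::Item>) -> bool;
}

// Trivial leaf operator: a scan over a slice, standing in for one of the
// candidate key-range iterators feeding the merge.
struct SliceScan<'a> {
    data: &'a [u64],
    pos: usize,
}

impl<'a> BatchIterator for SliceScan<'a> {
    type Item = u64;
    fn next_batch(&mut self, out: &mut Vec<u64>) -> bool {
        out.clear();
        let cap = out.capacity().max(1);
        let end = (self.pos + cap).min(self.data.len());
        out.extend_from_slice(&self.data[self.pos..end]);
        self.pos = end;
        !out.is_empty()
    }
}

fn sum_even(scan: &mut impl BatchIterator<Item = u64>) -> u64 {
    let mut batch = Vec::with_capacity(1024);
    let mut sum = 0;
    while scan.next_batch(&mut batch) {
        // Tight inner loop over a contiguous batch: easy for LLVM to vectorize.
        sum += batch.iter().filter(|&&x| x % 2 == 0).sum::<u64>();
    }
    sum
}
```

The batch size is the Goldilocks knob: large enough to amortize the per-batch control transfer, small enough to stay in cache.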
But I thought it was worth highlighting the performance impact of pull vs push iteration when it comes to flattening in general and recursive flattening for trees in particular.
My question was that they seemed to imply they had a non-iterator version that was vectorized, i.e. they only had to make this design change because of the iterators, and without them it ran at the original speed.
Otherwise the article makes no sense: it’s saying “zero-cost abstractions aren’t free if a completely different approach is faster by virtue of its architecture”.
Yeah it’s hard to tell what the real story is without seeing the next() functions. Maybe the compiler is able to optimize simpler query operators that are mostly linear scans of packed data, but this particular operator’s loop fell off the happy path; they didn’t notice because they were used to getting fast loops from their iterators in other cases. I can’t tell if it’s just the ContainsAny operator that’s slow, or if ContainsAny is only slow with many values, or why it’s so much faster when batched…