I Made Zig Compute 33 Million Satellite Positions in 3 Seconds. No GPU Required
65 points by badtuple
Submitting this because it was a super clear explanation of applying SIMD concepts to optimize an algorithm. SIMD advice often feels like it's asking you to "draw the rest of the owl," whereas this felt like a very natural working through of the problem. Also satellites are just plain cool.
I almost wish Zig wasn't so emphasized in the title since it may give people the impression that it's another "I rewrote this in a lower level language and now it's faster!" post. On the other hand, the code reads super naturally compared to equivalent Rust or C++...so Zig definitely deserves props.
Thanks to Anthony for writing it. It was a joy to read!
The main limitation of Zig's SIMD intrinsics is that there's no easy way to compile a function for multiple targets and choose between them cheaply at runtime. This is a long-standing issue: https://github.com/ziglang/zig/issues/1018 So unless your target user is compiling from source with -Dtarget=native (usually the default), they're probably going to get a -Dtarget=baseline package that doesn't have the latest and greatest SIMD support.
There are various tricks, and I tried them: pulling the SIMD functions out into a separate Zig module, compiling it multiple times, embedding all of the results in your binary, manually implementing CPU fingerprinting in your main, dlopen-ing your embedded bins, swapping function pointers, etc. etc. etc. You see the complexity?
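(For anyone who hasn't seen the pattern: below is a minimal sketch of compile-a-variant-per-feature-set plus runtime dispatch, written in Rust since its std ships the CPU detection; the kernel is a stand-in for illustration, not anything from the article, and x86_64 is assumed.)

// Sketch: compile the same kernel twice, pick the variant at runtime.

#[target_feature(enable = "avx2")]
unsafe fn kernel_avx2(xs: &mut [f32]) {
    // With AVX2 enabled at compile time, LLVM can autovectorize this body.
    for x in xs.iter_mut() {
        *x *= 0.5;
    }
}

fn kernel_baseline(xs: &mut [f32]) {
    for x in xs.iter_mut() {
        *x *= 0.5;
    }
}

pub fn kernel(xs: &mut [f32]) {
    if is_x86_feature_detected!("avx2") {
        // Safe: we just verified this CPU supports AVX2.
        unsafe { kernel_avx2(xs) }
    } else {
        kernel_baseline(xs)
    }
}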
Ultimately, I personally opted for C++ (the HORROR) (https://github.com/google/highway), which I then call from Zig. Highway handles all the fingerprinting, multi-compilation, etc. Here are Ghostty's SIMD routines and their Zig wrappers: https://github.com/ghostty-org/ghostty/tree/main/src/simd
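The shim itself is small, for what it's worth: the C++ side exports plain C-ABI entry points with Highway's dispatch hidden inside, and the host language just declares them. A hypothetical sketch of the host side (illustrative names, not Ghostty's actual symbols; shown in Rust to match the other snippets in this thread):

// Hypothetical C-ABI declaration for a Highway-backed routine. The C++
// side would export something like:
//   extern "C" size_t simd_index_of(const uint8_t*, size_t, uint8_t);
// with Highway's HWY_DYNAMIC_DISPATCH doing the CPU fingerprinting inside.
extern "C" {
    fn simd_index_of(haystack: *const u8, len: usize, needle: u8) -> usize;
}

pub fn index_of(haystack: &[u8], needle: u8) -> Option<usize> {
    // Pass a pointer/length pair across the ABI; by this sketch's
    // convention the shim returns len when the needle isn't found.
    let idx = unsafe { simd_index_of(haystack.as_ptr(), haystack.len(), needle) };
    (idx < haystack.len()).then_some(idx)
}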
I'd really prefer to use Zig, though. :)
The moment I saw "SIMD ... multiple targets ..." I immediately thought of Google's Highway project, and then I read the rest of your comment referencing just that :o)
I'm not sure how useful it would be to transliterate that into $LANGUAGE, though. It seems it would be best to have $LANGUAGE bindings to it instead.
While this is true, people really ought not to target baseline anymore, but something like x86_64_v3 instead.
I still use hardware predating this daily. There are many folks benefitting from the compatibility of baseline. Please try not to assume that people in general will have recent hardware. That assumption makes sense only within contexts where the set of users is narrower than the general public. So I guess I'm asking you to think about adding a caveat to your recommendation in the future.
I'm usually an opponent of autovectorization when you want to nail down performance in a non-brittle way. But everything in engineering has pros and cons, so this is mostly just to address what is and isn't possible with LLVM's autovectorization:
But in SIMD, different lanes converge at different rates. Lane 0 might converge in 3 iterations while lane 3 needs 5. You can't exit early for just one lane. The solution is to track convergence per lane with a mask and use @reduce to check if everyone's done: [...]
This is the only thing I noticed in the article that would require some care to autovectorize. You need to structure your code in chunks, which makes the desired short-circuiting logic explicit to the compiler. It's almost identical to what you'd write for explicit vectorization, but with chunks (constant-size arrays) instead of vectors:
pub fn converge<const N: usize>(array: &mut [f32]) {
    for chunk in array.as_chunks_mut::<N>().0 {
        loop {
            let mut converged = true;
            for x in chunk.iter_mut() {
                let b = x.abs() < 1.0;
                *x = if b { *x } else { *x * 0.5 };
                converged &= b;
            }
            if converged {
                break;
            }
        }
    }
}
(The entry-wise multiplication by 0.5 is a placeholder for the converging computation.)
For N = 16 the core loop on AVX512 compiles down to this:
.LBB0_2:
        vandps    zmm5, zmm0, zmm1
        vmovaps   zmm4, zmm0
        vcmpnltps k1, zmm5, zmm2
        vmulps    zmm0 {k1}, zmm0, zmm3
        kortestw  k1, k1
        jne       .LBB0_2
Importantly, you don't need to use explicit chunking (or explicit vectorization) for the other parts of the code, which are directly autovectorizable.
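For example, a straight-line per-element loop like this (a generic illustration, not code from the article) needs no special structure; LLVM turns it into packed operations on its own:

// Directly autovectorizable: no early exit and no cross-lane dependency,
// so LLVM emits packed multiplies and adds without any manual chunking.
pub fn scale_add(xs: &mut [f32], a: f32, b: f32) {
    for x in xs.iter_mut() {
        *x = *x * a + b;
    }
}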
Just noticed the convergence logic got inverted by accident. It should have been:
let mut any_active = false;
for x in chunk.iter_mut() {
    let active = x.abs() >= 1.0;
    *x = if active { *x * 0.5 } else { *x };
    any_active |= active;
}
if !any_active {
    break;
}
Here's where things got interesting. SGP4's Kepler solver needs atan2, but LLVM doesn't provide a vectorized builtin for it. Calling the scalar function would break the SIMD implementation.
This surprised me, so I looked it up: it does appear (if I'm understanding correctly) that LLVM does have an atan2 intrinsic for vectors, but Zig doesn't provide @atan2?
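For completeness: when no vector atan2 builtin is exposed, the usual workaround is an atan polynomial plus branch-free range reduction and quadrant fixup, written select-style so it autovectorizes over chunks like the code above. A scalar sketch (coefficients from Abramowitz & Stegun 4.4.47, error around 1e-5; the x = y = 0 case is left unhandled; not necessarily what the article's implementation does):

use std::f32::consts::{FRAC_PI_2, PI};

// Polynomial approximation of atan on [-1, 1].
fn atan_poly(u: f32) -> f32 {
    let u2 = u * u;
    u * (0.9998660
        + u2 * (-0.3302995
        + u2 * (0.1801410
        + u2 * (-0.0851330
        + u2 * 0.0208351))))
}

pub fn atan2_approx(y: f32, x: f32) -> f32 {
    // Range reduction keeps the polynomial argument inside [-1, 1].
    let swap = y.abs() > x.abs();
    let u = if swap { x / y } else { y / x };
    let p = atan_poly(u);
    // atan(t) = sign(t) * pi/2 - atan(1/t) when |t| > 1.
    let a = if swap { FRAC_PI_2.copysign(u) - p } else { p };
    // Quadrant fixup: shift by +/- pi when x < 0.
    if x < 0.0 { a + PI.copysign(y) } else { a }
}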
Interesting read, but the performance bar charts are outright misleading: the y-axes need to start at zero for the charts to be useful at all.