Faster asin() Was Hiding In Plain Sight
28 points by knl
As a readability suggestion, write approximant polynomials using Horner's rule. The resulting polynomial is mostly just a list of coefficients. That way, it's much easier to compare different polynomials. It also gives the programmer more control over the order of operations, since IEEE-754 arithmetic doesn't associate; if one wants a predictable ladder of FMAs then one has to repeatedly alternate multiplication and addition.
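For instance, here's a toy cubic (made-up coefficients, not from the article) written both ways; the Horner version alternates multiply and add, so each rung can contract into one FMA:

```c
/* Toy cubic with made-up coefficients: naive powers vs. Horner form.
   The Horner ladder alternates multiply and add, so each step can be
   contracted into a single fused multiply-add. */
double p_naive(double x)  { return 2.0*x*x*x - 3.0*x*x + 4.0*x - 5.0; }
double p_horner(double x) { return ((2.0*x - 3.0)*x + 4.0)*x - 5.0; }
```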
Lolremez (previously, on Lobsters) supports arc sine. Instead of fishing for constants from old codebases and hoping that they work, we can generate fresh coefficients. Arc sine is an odd function, so we can change Remez variables to add a corresponding structural constraint before iterating. For example, I got:
$ lolremez --debug --degree 4 --range 0:1 "asin(sqrt(x))/sqrt(x)" "1/sqrt(x)"
…
[debug] error: 3.0793806583731183e-2
// Approximation of f(x) = asin(sqrt(x))/sqrt(x)
// with weight function g(x) = 1/sqrt(x)
// on interval [ 0, 1 ]
// with a polynomial of degree 4.
// p(x)=(((1.0632192121365258e+1*x-2.2122094756619801e+1)*x+1.5633874881759952e+1)*x-3.8898539851548752)*x+1.285884258668319
I'll leave it to the reader to follow the linked tutorial and write the wrapper function that calls this.
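For the impatient, here's a sketch of what that wrapper might look like. The shape is my assumption, not code from the tutorial: since the change of variables was x = s², we evaluate asin(s) ≈ s·p(s²) for s in [0, 1], and the weighted Remez error above bounds |asin(s) − s·p(s²)| by roughly 3.1e-2:

```c
#include <math.h>

/* Sketch of a wrapper around the lolremez output above (my assumption
   about the intended usage): the substitution was x = s^2, so evaluate
   asin(s) ~ s * p(s^2) for s in [0, 1].  The weighted Remez error
   bounds |asin(s) - s*p(s^2)| by about 3.1e-2. */
static double asin_approx(double s)
{
    double x = s * s;
    double p = 1.0632192121365258e+1;     /* Horner ladder, high to low */
    p = p * x - 2.2122094756619801e+1;
    p = p * x + 1.5633874881759952e+1;
    p = p * x - 3.8898539851548752;
    p = p * x + 1.285884258668319;
    return s * p;
}
```

For negative inputs, use oddness: asin(−s) = −asin(s).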
Do compilers optimize a*x+b into a fused multiply-add? (With typical recommended compiler options, I mean. I could see fast-math turning that into a FMA, but fast-math is widely recommended against.)
Also, re: Horner's rule, someone on HN was saying there's a tradeoff because Horner evaluation is inherently serial: each instruction depends on the result of the previous one. But if your processor has ILP and multiple ALUs, maybe there are more efficient ways to structure the math?
(The "most correct" answer is to profile on one's actual hardware and see. But you seemed to be talking in general about Horner's, and I value your opinions, which is why I ask. 🙂)
Regarding FMA, it depends on your target. You do have to add some -f flags, but not necessarily -ffast-math. I suppose I think first about GPUs and DSPs, but there's FMA on some consumer CPUs as well. On x86, they're unfortunately a casualty of the war between Intel and AMD; there's a WP page on the x86 FMA instructions.
Horner's rule is serial, yes. You need to have quite a few coefficients before parallelizing it becomes efficient, though. Like, maybe if you have dozens of coefficients? You're right about ILP and profiling, and I encourage folks to do that. However, more important for me is numerical accuracy and stability. Let x be small, so that the bits of its mantissa are very valuable; powers of x lose those bits too quickly. Horner's rule postpones exponentiation as long as possible, letting the bits gently roll off instead. This isn't as big of a deal in double-precision, but in single-precision arithmetic it helps a lot!
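For the curious, the parallel alternative usually goes by Estrin's scheme: split the polynomial into pairs of coefficients so the halves have no dependency on each other. A sketch for a generic degree-4 polynomial (coefficients c[0]..c[4], low order to high; not tied to the asin fit above):

```c
/* Estrin's scheme for a degree-4 polynomial, coefficients low to high.
   `lo` and `mid` have no data dependency on each other, so a CPU with
   spare ALUs can compute them in parallel; the critical path is shorter
   than Horner's strictly serial ladder. */
double estrin4(double x, const double c[5])
{
    double x2  = x * x;
    double lo  = c[1] * x + c[0];   /* c0 + c1*x */
    double mid = c[3] * x + c[2];   /* c2 + c3*x */
    return lo + x2 * (mid + x2 * c[4]);
}
```

Note the rounding pattern differs from Horner's, so the two forms aren't bit-identical.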
As a rule -ffast-math enables unsound optimizations that you don’t want: it’s best to examine the more specific underlying optimization options and pick just the ones you need. It also has global side-effects that can break translation units that are not compiled with -ffast-math.
But those blog posts don’t mention fma.
The clang docs say:
-ffp-contract=<value>
The C and C++ standards permit intermediate floating-point results within an expression to be computed with more precision than their type would normally allow. This permits operation fusing, and Clang takes advantage of this by default (on). Fusion across statements is not compliant with the C and C++ standards but can be enabled using -ffp-contract=fast.
Apparently gcc has -ffp-contract too.
Define "compilers" ;)
In C you get some _mm_... intrinsics that can help. Rust has a straight up mul_add, but also can do less strict math in nightly. Others can do different things.
I believe gcc and clang treat <math.h>'s fma/fmaf as built-ins, so they optimize them to FMA instructions if the target instruction set supports it.
Define "compilers" ;)
Fair enough! I was thinking "C/C++ compilers in general" because the blog post was testing C++ across a few different compilers.
Intrinsic functions make sense. Though I was more curious how generally eager C/C++ compilers are to choose FMA instructions without the user requesting them. I already knew reordering operations was a floating point no-no that you had to specifically ask for, as it seems to be the one that's most talked about when fast-math comes up. But if you literally have a*x+b in your program, applying FMA doesn't reorder anything—but it does change the rounding.
(And on the topic of rounding and how much leeway compilers have: How I joined the bug 323 community from 2 years ago. Guess linking to old Lobsters stories is my thing today.)
This reminds me of the recent announcements about the unsolved math problems. "LLM got a new Erdős problem solution", "wait we actually already had that one in some obscure book and didn't connect the dots". It feels like this is going to be a theme for a while and shows how our search was really really bad.
If AI's cavernous databases end up being a better search tool because of sources they've internalized and later surface in response to a prompt, that might not be a bad outcome.
Reminds me of Embeddings are underrated from last year.
They're not the only way to do search, and for some queries they may be worse than traditional full-text search. But it's a really handy "AI" tool that's distinct from LLMs and that we only recently added to our field's collective toolbox.
Another approximation for asin can be obtained from Bhāskara I's approximation for sin from 629 CE. Surprisingly accurate, though I don't know if it's accurate and fast enough for this purpose.
asin(x) ~= pi * (1/2 - sqrt((1-x)/(4+x)))
https://en.wikipedia.org/wiki/Bh%C4%81skara_I%27s_sine_approximation_formula