I have spent my last few evenings optimizing my C# .NET 7 vectorized exp2 and log2: improving their precision to 1 ULP (in addition to the existing 3 ULP variant) and making the precision a parameter, so that the codegen gets nicely monomorphized
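For readers who haven't written one of these kernels: a minimal scalar sketch of the classic exp2 range reduction (integer/fraction split). This is an illustration of the general technique, not the author's vectorized implementation, and it uses throwaway Taylor coefficients where a real kernel would use a minimax fit.

```python
import math

def exp2_sketch(x: float) -> float:
    """Scalar sketch of the classic exp2 range reduction."""
    # Split x = n + r with n an integer and r in [-0.5, 0.5],
    # so 2^x = 2^n * 2^r and the polynomial only has to cover 2^r.
    n = math.floor(x + 0.5)
    r = x - n
    # 2^r = e^(r*ln 2): truncated Taylor series, illustrative only --
    # a production kernel would use minimax coefficients instead.
    t = r * math.log(2.0)
    p = 1.0
    for k in range(12, 0, -1):   # Horner form of sum(t^k / k!)
        p = 1.0 + p * t / k
    return math.ldexp(p, n)      # p * 2^n via exponent manipulation
```

The `2^n` part is exact (it only touches the exponent bits via `ldexp`), so all the approximation error lives in the small-range polynomial.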

I compared it with SLEEF, which I also used in part to optimize further. It's crazy how much exp2 and log2 code out there is subtly wrong or not as well optimized!

I can now continue building higher-level blocks for my tensor lib, with activation functions for neural networks! 🏎️

@xoofx

Those are some impressive numbers. If it's broadly applicable you should do a full blog post on it.

@rastilin That would definitely be interesting! The code will be OSS under a permissive BSD license as well, so folks will be able to experiment with it and port it.
@xoofx what did you use for the range reduction? I vaguely remember that the last time I did a low-ULP variant of these, years ago, there were some tricks needed to get the input down into the range for the polynomial approximation!
@sheredom yeah, the trickiest one is log2 (exp2 also has an interesting range reduction, which SLEEF uses too). For log2, I'm using the standard trick of evaluating log2((1+x)/(1-x)) over [-1/7, 1/5[, which is almost perfectly linear on this range (picture below). But to reach 1 ULP for f32, I switch my SIMD vector types to double when evaluating the polynomial. SLEEF instead uses float-float (double-float) arithmetic; I tried implementing that, but the codegen was far worse than just switching to double!
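A scalar sketch of that atanh-style reduction, assuming the mantissa is renormalized into [0.75, 1.5) so the transformed argument lands in the [-1/7, 1/5[ interval mentioned above. The coefficients here are plain Taylor terms for illustration; the real kernel would use a minimax fit (and, per the thread, evaluate in double for 1 ULP f32).

```python
import math

def log2_sketch(x: float) -> float:
    """Scalar sketch of an atanh-style log2 range reduction."""
    # x = m * 2^e with m in [0.5, 1); renormalize m into [0.75, 1.5)
    # so that t = (m - 1) / (m + 1) falls in [-1/7, 1/5).
    m, e = math.frexp(x)
    if m < 0.75:
        m *= 2.0
        e -= 1
    t = (m - 1.0) / (m + 1.0)
    # log2(m) = (2 / ln 2) * atanh(t), an odd series in t that is
    # nearly linear on this tiny range. Taylor coefficients shown;
    # a production kernel would substitute minimax coefficients.
    t2 = t * t
    p = 1.0 / 9.0
    for c in (1.0 / 7.0, 1.0 / 5.0, 1.0 / 3.0, 1.0):
        p = c + t2 * p           # Horner: 1 + t^2/3 + t^4/5 + ...
    return e + (2.0 / math.log(2.0)) * t * p
```

The integer exponent `e` contributes exactly, so only `log2(m)` over the narrow range needs approximating.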

@xoofx I think when I did it I used the approach from Robin Green's 2003 GDC talk, but turning log2 into ln and then using:

ln(x) = ln(2^n × f)
      = ln(2^n) + ln(f)
      = n·ln(2) + ln(f)

I remember I had to avoid a divide when I did this 10 years ago, because division was slow on the GPU I was working on!
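That divide-free flavor can be sketched like this: the n·ln(2) + ln(f) decomposition above, with the polynomial evaluated directly on f − 1, so there is no (m−1)/(m+1) division at runtime. This is a guess at the general shape of such a kernel, not the GPU code from the thread; the coefficients are precomputed series reciprocals.

```python
import math

# Precomputed series coefficients for ln(1+u), highest order first;
# constant reciprocals like 1/k live in a table, so the hot path
# needs no runtime divide.
_COEFFS = [(-1.0) ** (k + 1) / k for k in range(40, 0, -1)]

def ln_sketch(x: float) -> float:
    """ln(x) = n*ln(2) + ln(f), with x = 2^n * f; divide-free hot path."""
    f, n = math.frexp(x)          # f in [0.5, 1)
    if f < math.sqrt(0.5):        # renormalize f into [sqrt(1/2), sqrt(2))
        f *= 2.0
        n -= 1
    u = f - 1.0                   # |u| < 0.415, so the series converges
    p = 0.0
    for c in _COEFFS:             # Horner: adds and multiplies only
        p = c + u * p
    return n * math.log(2.0) + u * p
```

Keeping f near 1 (in [√½, √2)) bounds |u| away from 1, which is what makes the plain series usable here at all; the trade-off versus the atanh form is more polynomial terms in exchange for losing the division.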