I have spent my last few evenings optimizing my C# .NET 7 vectorized exp2 and log2: improving their precision to 1 ULP (in addition to the existing 3 ULP variant) and making the precision a parameter, so that the codegen gets nicely monomorphized
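For readers who haven't written one of these kernels: a minimal scalar sketch of the classic exp2 range reduction (integer/fraction split). This is an illustration of the general technique, not the author's vectorized implementation, and it uses throwaway Taylor coefficients where a real kernel would use a minimax fit.

```python
import math

def exp2_sketch(x: float) -> float:
    """Scalar sketch of the classic exp2 range reduction."""
    # Split x = n + r with n an integer and r in [-0.5, 0.5],
    # so 2^x = 2^n * 2^r and the polynomial only has to cover 2^r.
    n = math.floor(x + 0.5)
    r = x - n
    # 2^r = e^(r*ln 2): truncated Taylor series, illustrative only --
    # a production kernel would use minimax coefficients instead.
    t = r * math.log(2.0)
    p = 1.0
    for k in range(12, 0, -1):   # Horner form of sum(t^k / k!)
        p = 1.0 + p * t / k
    return math.ldexp(p, n)      # p * 2^n via exponent manipulation
```

The `2^n` part is exact (it only touches the exponent bits via `ldexp`), so all the approximation error lives in the small-range polynomial.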

I compared it with SLEEF, which I also used in part to optimize further. It's crazy how much exp2 and log2 code out there is subtly wrong or not as well optimized!

I can now continue building higher-level blocks for my tensor lib, with activation functions for neural networks! 🏎️

@xoofx

Those are some impressive numbers. If it's broadly applicable you should do a full blog post on it.

@rastilin That would definitely be interesting! The code will be OSS under a permissive BSD license as well, so folks will be able to experiment with it and port it.
@xoofx what did you use for the range reduction? I vaguely remember that the last time I did a low-ULP variant of these, years ago, there were some tricks needed to get the input down into the range for the polynomial approximation!
@sheredom yeah, the trickiest one is log2 (exp2 also has an interesting range reduction, which SLEEF uses too). For log2, I'm using the standard trick of evaluating log2((1+x)/(1-x)) over [-1/7, 1/5[, which is almost perfectly linear on this range (picture below). But to reach 1 ULP for f32, I switch my SIMD vector types to double when evaluating the polynomial. SLEEF instead uses float-float (double-float) arithmetic; I tried implementing that, but the codegen was far worse than just switching to double!
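A scalar sketch of that atanh-style reduction, assuming the mantissa is renormalized into [0.75, 1.5) so the transformed argument lands in the [-1/7, 1/5[ interval mentioned above. The coefficients here are plain Taylor terms for illustration; the real kernel would use a minimax fit (and, per the thread, evaluate in double for 1 ULP f32).

```python
import math

def log2_sketch(x: float) -> float:
    """Scalar sketch of an atanh-style log2 range reduction."""
    # x = m * 2^e with m in [0.5, 1); renormalize m into [0.75, 1.5)
    # so that t = (m - 1) / (m + 1) falls in [-1/7, 1/5).
    m, e = math.frexp(x)
    if m < 0.75:
        m *= 2.0
        e -= 1
    t = (m - 1.0) / (m + 1.0)
    # log2(m) = (2 / ln 2) * atanh(t), an odd series in t that is
    # nearly linear on this tiny range. Taylor coefficients shown;
    # a production kernel would substitute minimax coefficients.
    t2 = t * t
    p = 1.0 / 9.0
    for c in (1.0 / 7.0, 1.0 / 5.0, 1.0 / 3.0, 1.0):
        p = c + t2 * p           # Horner: 1 + t^2/3 + t^4/5 + ...
    return e + (2.0 / math.log(2.0)) * t * p
```

The integer exponent `e` contributes exactly, so only `log2(m)` over the narrow range needs approximating.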

@xoofx I think when I did it I used the approach from Robin Green's 2003 GDC talk, but turning log2 into ln and then using:

ln(x) = ln(2^n × f)
      = ln(2^n) + ln(f)
      = n·ln(2) + ln(f)

I remember I had to avoid a divide when I did this 10 years ago, because division was slow on the GPU I was working on!
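That divide-free flavor can be sketched like this: the n·ln(2) + ln(f) decomposition above, with the polynomial evaluated directly on f − 1, so there is no (m−1)/(m+1) division at runtime. This is a guess at the general shape of such a kernel, not the GPU code from the thread; the coefficients are precomputed series reciprocals.

```python
import math

# Precomputed series coefficients for ln(1+u), highest order first;
# constant reciprocals like 1/k live in a table, so the hot path
# needs no runtime divide.
_COEFFS = [(-1.0) ** (k + 1) / k for k in range(40, 0, -1)]

def ln_sketch(x: float) -> float:
    """ln(x) = n*ln(2) + ln(f), with x = 2^n * f; divide-free hot path."""
    f, n = math.frexp(x)          # f in [0.5, 1)
    if f < math.sqrt(0.5):        # renormalize f into [sqrt(1/2), sqrt(2))
        f *= 2.0
        n -= 1
    u = f - 1.0                   # |u| < 0.415, so the series converges
    p = 0.0
    for c in _COEFFS:             # Horner: adds and multiplies only
        p = c + u * p
    return n * math.log(2.0) + u * p
```

Keeping f near 1 (in [√½, √2)) bounds |u| away from 1, which is what makes the plain series usable here at all; the trade-off versus the atanh form is more polynomial terms in exchange for losing the division.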