@rygorous general audience aside: For a known divisor you can (almost always) get a correctly rounded division with a FMA and a pair of constants.
https://marc-b-reynolds.github.io/math/2019/03/12/FpDiv.html
@mbr @rygorous I'm interested in doing this in half floating point and it looks like this method should work:
half k0 = 1.0 / 255.0;
half k1 = 0x0001h;
half x = fma(i, k0, i * k1);
but it requires denormal support, which is not available on all GPUs, and explicit fp16 fmas, which are busted on Adreno (they are promoted to fp32).
Any other ideas?
I guess I can just write `i*k0+i*k1` and hope for the best. In the worst case I just get the i*k0 approximation.
@castano @rygorous If I haven't messed up then just multiplying by the recip give 10 inputs that aren't correctly rounded. And I think by reworking the constants avoids required FMA and denormals.
k0 = 0x1.00p-8;
k1 = 0x1.01p-8;
lo = k1*i; // rounds
result = k0*(i+lo) // add rounds
#define half(x) ((__fp16)(x)) #define half_mul(a,b) half((a)*(b)) /* 09 0x1.2140p-5 : 0x1.2100p-5 (0x1.0000p-15) 0d 0x1.a1c0p-5 : 0x1.a180p-5 (0x1.0000p-15) 12 0x1.2140p-4 : 0x1.2100p-4 (0x1.0000p-14) 1a 0x1.a1c0p-4 : 0x1.a180p-4 (0x1.0000p-14) 24 0x1.2140p-3 : 0x1.2100p-3 (0x1.0000p-13) 34 0x1.a1c0p-3 : 0x1.a180p-3 (0x1.0000p-13) 48 0x1.2140p-2 : 0x1.2100p-2 (0x1.0000p-12) 68 0x1.a1c0p-2 : 0x1.a180p-2 (0x1.0000p-12) 90 0x1.2140p-1 : 0x1.2100p-1 (0x1.0000p-11) d0 0x1.a1c0p-1 : 0x1.a180p-1 (0x1.0000p-11) */ int main(void) { uint32_t count = 0; for(uint32_t i=0; i<256; i++) { __fp16 cr = half(half(i)/half(255)); __fp16 f; // multiply by recip f = half_mul(half(0x1.01p-8f), half(i)); // move a digit to the lo constant __fp16 x = half(i); __fp16 lo = half_mul(0x1.01p-8f, x); f = half_mul(0x1.0p-8f,half(x+lo)); if (f != cr) { printf("%02x %1.4a : %1.4a (%1.4a)\n",i,cr,f, cr-f); count++; } } printf("# not correctly rounded: %u\n", count); return 0; }
@castano @rygorous OK: validated using sollya.
https://gist.github.com/Marc-B-Reynolds/e4fd22fd4a98616816e24ecb123fb2e9