@pervognsen @rygorous with fma my somewhat naive approach of splitting the double-precision reciprocal into floats worked:
#include <math.h>
#include <stdio.h>

int main()
{
    float r1 = 1./255;
    float r2 = 1./255 - r1;
    for (int i = 0; i < 256; i++) {
        if (i / 255.f != fmaf(i, r1, i*r2))
            printf("%d\n", i);
    }
}
@pervognsen As a general rule, it is often much easier to make this kind of ansatz work with directed roundings (and thus one-sided errors) than it is with RN; with RN in the mix, it's often easier to work with exact intermediaries (as in my case) than with rounded values.
The issue is that the combination of two-sided errors with rounding tends to be awkward to correct.
You can do it, but it generally pushes you towards two-sum style algorithms that have forced extra steps.
@rygorous Huh. I never considered that the reciprocal of a particular real number might have a less precise fp representation, but it makes sense now that I think about it.
Though I wonder… what if we invented a system where this wasn’t the case?
@rygorous I guess multiplication would be simple (just add the exponent values), and so would division, but you’d pay the price by making addition really hard.
It does seem interesting to me, I might try it…
@rygorous general audience aside: For a known divisor you can (almost always) get a correctly rounded division with a FMA and a pair of constants.
https://marc-b-reynolds.github.io/math/2019/03/12/FpDiv.html
@mbr @rygorous I'm interested in doing this in half floating point and it looks like this method should work:
half k0 = 1.0 / 255.0;
half k1 = 0x0001h;
half x = fma(i, k0, i * k1);
but it requires denormal support, which is not available on all GPUs, and explicit fp16 fmas, which are busted on Adreno (they are promoted to fp32).
Any other ideas?
I guess I can just write `i*k0+i*k1` and hope for the best. In the worst case I just get the i*k0 approximation.
@castano @rygorous If I haven't messed up, then just multiplying by the recip gives 10 inputs that aren't correctly rounded. And I think reworking the constants avoids the FMA and denormal requirements.
k0 = 0x1.00p-8;
k1 = 0x1.01p-8;
lo = k1*i; // rounds
result = k0*(i+lo); // add rounds
#include <stdint.h>
#include <stdio.h>

#define half(x)       ((__fp16)(x))
#define half_mul(a,b) half((a)*(b))

/* recip-only failures:
   09 0x1.2140p-5 : 0x1.2100p-5 (0x1.0000p-15)
   0d 0x1.a1c0p-5 : 0x1.a180p-5 (0x1.0000p-15)
   12 0x1.2140p-4 : 0x1.2100p-4 (0x1.0000p-14)
   1a 0x1.a1c0p-4 : 0x1.a180p-4 (0x1.0000p-14)
   24 0x1.2140p-3 : 0x1.2100p-3 (0x1.0000p-13)
   34 0x1.a1c0p-3 : 0x1.a180p-3 (0x1.0000p-13)
   48 0x1.2140p-2 : 0x1.2100p-2 (0x1.0000p-12)
   68 0x1.a1c0p-2 : 0x1.a180p-2 (0x1.0000p-12)
   90 0x1.2140p-1 : 0x1.2100p-1 (0x1.0000p-11)
   d0 0x1.a1c0p-1 : 0x1.a180p-1 (0x1.0000p-11) */

int main(void)
{
  uint32_t count = 0;

  for (uint32_t i = 0; i < 256; i++) {
    __fp16 cr = half(half(i)/half(255));
    __fp16 f;

    // multiply by recip
    f = half_mul(half(0x1.01p-8f), half(i));

    // move a digit to the lo constant
    __fp16 x  = half(i);
    __fp16 lo = half_mul(0x1.01p-8f, x);
    f = half_mul(0x1.0p-8f, half(x+lo));

    if (f != cr) {
      printf("%02x %1.4a : %1.4a (%1.4a)\n", i, cr, f, cr-f);
      count++;
    }
  }
  printf("# not correctly rounded: %u\n", count);
  return 0;
}
@rygorous Super interesting! I recently ran into this precision problem when converting colors. I came up with a slightly different solution, giving the same exact results and roughly 10% faster. Code in #csharp
static float ByteToFloat(byte value) => (value + value + value) * (1.0f / (3.0f * 255.0f));
@xoofx @rygorous Hmm... Messing around, the best "i" I could find (by how far beyond 255 you need to go before getting an incorrect answer) is 341.
This gives a number that, when represented as float, is accurate to 40 bits (i.e. it has 16 zero bits after the 24 significant bits):
11000000111100001111110100000000000000001... 1/(341*255)
10000000100000001000000010000000100000001... 1/255
i=3 doesn't have this property, but is a little (enough?) closer to its rounded representation than i=1...
@xoofx @rygorous (I feel like the other factor is increasing the number of significant bits on the left hand side, so the multiplication result is wider. I wouldn't be surprised if I was making an error there though.)
This is probably a better way to look at it: https://x.com/corsix/status/1854539270981025903
(I _think_ my ideas are kind of equivalent?)
@rygorous Intuition: multiplication by the reciprocal involves multiplication by integer m then division by 2^e, where m/2^e approximates 1/255. We want a slightly more accurate approximation, which means increasing |m|, but FP format limits |m|. Solution? Factorise m = m_1 * m_2.
@corsix @xoofx @rygorous I think the best I can do on UNORM16 is:
float unorm16(int x) {
    float f = x * (1 / 65536.f);
    return f + f * (0x10001 / 4294967296.f);
}
Not amazing, but the first scale factor is a power-of-two, so it can be folded into the int-to-float operation on A64 (https://docsmirror.github.io/A64/2022-09/scvtf_float_fix.html), such that it's just two ops, SCVTF + FMADD.
It works without FMAs too, and x86 just needs an extra MULSS, but it's less convincing in those cases: https://godbolt.org/z/bsfx7WM7h