New blog post: "Exact UNORM8 to float" https://fgiesen.wordpress.com/2024/11/06/exact-unorm8-to-float/ nobody asked, but here we go.
Exact UNORM8 to float

GPUs support UNORM formats that represent a number inside [0,1] as an 8-bit unsigned integer. In exact arithmetic, the conversion to a floating-point number is straightforward: take the integer and…

The ryg blog
@rygorous Would a 256x4 byte lookup table be terrible?
@pervognsen Very inconvenient for SIMD, anyhow. I'd much rather have 2 mul-pipe ops (and possibly an add if no FMA) than a gather.
@rygorous Oh yeah, for some reason I was thinking of GPU shaders. Although then you'd probably be dealing with bank conflicts for common bytes. Your solution is obviously great.
@rygorous I thought maybe I could replicate the result by using k1 = 1.0/255.0 and brute forcing a k2 due to the constrained combinatorics (only need a second-order correction to match 256 values) but no dice (which makes sense when I thought about the round-off analysis). Although I brute forced an exact k2 that goes with k1 = 65536.0 / 16777216.0.
@rygorous I only mention this because brute-force tuning single-precision constants (even easier in this case, since you can bound k2 relative to k1 very tightly) tends to work pretty well when you have the right first-order ansatz. Obviously it's not necessary in this case, as you showed, especially since exactness is a pass/fail test rather than a goodness-of-fit thing.

@pervognsen @rygorous with FMA, my somewhat naive approach of splitting the double-precision reciprocal into floats worked:

#include <math.h>
#include <stdio.h>

int main()
{
    float r1 = 1. / 255;       // head: 1/255 rounded to float
    float r2 = 1. / 255 - r1;  // tail: the residual, computed in double

    for (int i = 0; i < 256; i++) {
        if (i / 255.f != fmaf(i, r1, i * r2))
            printf("%d\n", i);
    }
    return 0;
}

@amonakov @pervognsen yeah with FMA you generally have plenty of options, the interesting part about this case is that you can do just fine without it too.

@pervognsen As a general rule, it is often much easier to make this kind of ansatz work with directed roundings (and thus one-sided errors) than it is with RN; with RN in the mix, it's often easier to work with exact intermediaries (as in my case) than with rounded values.

The issue is that the combination of two-sided errors with rounding tends to be awkward to correct.

You can do it, but it generally pushes you towards two-sum style algorithms that have forced extra steps.

@rygorous @pervognsen don't feel like gathering? try 8x vpermi2b + mux + transpose or 8x vpermi2d + mux. 8x vpermi2b + mux + transpose: a tasty addition to an inner loop (it was probably memory bound anyway). it will go smoothly down the alu pipes! you will certainly not regret executing 8x vpermi2d + mux

@rygorous Huh. I never considered that the reciprocal of a particular real number might have a less precise fp representation, but it makes sense now that I think about it.

Though I wonder… what if we invented a system where this wasn’t the case?

@rygorous I suppose the only “proper” way to do it would be to map fp values to a geometric series (e.g. [the 256th root of 2] ^ n, -2^30 <= n <= 2^30, plus sign bit, zero value, etc), which is a simpler representation but almost certainly harder to do math on in many cases.

@rygorous I guess multiplication would be simple (just add the exponent values), and so would division, but you’d pay the price by making addition really hard.

It does seem interesting to me, I might try it…

@rygorous general audience aside: For a known divisor you can (almost always) get a correctly rounded division with a FMA and a pair of constants.

https://marc-b-reynolds.github.io/math/2019/03/12/FpDiv.html

Floating point division with known divisor

A small note on division removal

Marc B. Reynolds

@mbr @rygorous I'm interested in doing this in half-precision floating point, and it looks like this method should work:

half k0 = 1.0 / 255.0;
half k1 = 0x0001h;
half x = fma(i, k0, i * k1);

but it requires denormal support, which is not available on all GPUs, and explicit fp16 FMAs, which are busted on Adreno (they get promoted to fp32).

Any other ideas?

I guess I can just write `i*k0+i*k1` and hope for the best. In the worst case I just get the i*k0 approximation.

@castano @rygorous If I haven't messed up, then just multiplying by the recip gives 10 inputs that aren't correctly rounded. And I think reworking the constants avoids both the FMA and the denormals.

k0 = 0x1.00p-8;
k1 = 0x1.01p-8;
lo = k1 * i;            // rounds
result = k0 * (i + lo); // add rounds

https://gcc.godbolt.org/z/cTTKd41x5

Compiler Explorer - C (x86-64 clang (trunk))

#include <stdint.h>
#include <stdio.h>

#define half(x)       ((__fp16)(x))
#define half_mul(a,b) half((a)*(b))

/* inputs not correctly rounded by the plain reciprocal multiply:
   09 0x1.2140p-5 : 0x1.2100p-5 (0x1.0000p-15)
   0d 0x1.a1c0p-5 : 0x1.a180p-5 (0x1.0000p-15)
   12 0x1.2140p-4 : 0x1.2100p-4 (0x1.0000p-14)
   1a 0x1.a1c0p-4 : 0x1.a180p-4 (0x1.0000p-14)
   24 0x1.2140p-3 : 0x1.2100p-3 (0x1.0000p-13)
   34 0x1.a1c0p-3 : 0x1.a180p-3 (0x1.0000p-13)
   48 0x1.2140p-2 : 0x1.2100p-2 (0x1.0000p-12)
   68 0x1.a1c0p-2 : 0x1.a180p-2 (0x1.0000p-12)
   90 0x1.2140p-1 : 0x1.2100p-1 (0x1.0000p-11)
   d0 0x1.a1c0p-1 : 0x1.a180p-1 (0x1.0000p-11) */

int main(void)
{
  uint32_t count = 0;

  for (uint32_t i = 0; i < 256; i++) {
    __fp16 cr = half(half(i)/half(255));
    __fp16 f;

    // multiply by recip
    f = half_mul(half(0x1.01p-8f), half(i));

    // move a digit to the lo constant
    __fp16 x  = half(i);
    __fp16 lo = half_mul(0x1.01p-8f, x);
    f = half_mul(0x1.0p-8f, half(x + lo));

    if (f != cr) {
      printf("%02x %1.4a : %1.4a (%1.4a)\n", i, cr, f, cr - f);
      count++;
    }
  }

  printf("# not correctly rounded: %u\n", count);
  return 0;
}

UNORM8 to half float validation

@mbr @castano @rygorous Never seen sollya before; I assume on this example it’s useful mostly to guarantee that the intermediate computations are all rounded to half, vs C with float16_t where this is easy to miss and may get optimized away?

@zeux @castano @rygorous Yeah...exactly that.

Its main use is for creating function approximations (floating & fixed point).

https://www.sollya.org/

Sollya software tool

@mbr @zeux @rygorous Very neat! How did you come up with that approach? I noticed that the ten inputs that are off by one follow an interesting pattern:
9, 13, 2*9, 2*13, 4*9, 4*13, 8*9, 8*13, 16*9, 16*13
but didn't know what to do with that.

@castano @zeux @rygorous Just by observing that the extended multiplier is:

(2^-8 + 2^-16) + 2^-24

which can be refactored into:

2^-8 *(1 + (2^-8 + 2^-16))

Yeah, the first thing I did as well was look at the incorrectly rounded inputs.

@zeux @mbr @castano Sollya is great! It's a frontend for multiprecision math plus various FP lattice-constrained solvers, with the ability to compute the infinity/supremum norm (same thing: max abs error) for approximations.

It's a pretty narrow use case but it's fantastic at it.

@rygorous @zeux @mbr @castano also has the honour of being one of the very few environments that i can really trust won't do something fucky with fp math (the rest being mostly machine codes :V)
@castano @rygorous I forgot to point out an obvious detail. Since the outer multiplier is now 2^-8, it can be composed error-free with any product that is applied to the dequantized value.

@rygorous Super interesting! I recently ran into this precision problem when converting colors. I came up with a slightly different solution, giving the same exact results and roughly 10% faster. Code in #csharp

static float ByteToFloat(byte value) => (value + value + value) * (1.0f / (3.0f * 255.0f));

@xoofx @rygorous Woah, nice! Is there theory behind that, or just trial and error?
@dougall @rygorous It was purely gut feeling and brute force! 😅 I looked for an integer that would mitigate the loss of precision when dividing solely by 255 and tried i with (value * i) * (1.0f / (i * 255.0f)) and 3 came out quickly. Finding the theory going backward should be possible from there, but I'm a lazy person ☺️

@xoofx @rygorous Hmm... Messing around, the best "i" I could find (by how far beyond 255 you need to go before getting an incorrect answer) is 341.

This gives a number that, when represented as float, is accurate to 40 bits (i.e. it has 16 zero bits after the 24 significant bits):

11000000111100001111110100000000000000001... 1/(341*255)
10000000100000001000000010000000100000001... 1/255

i=3 doesn't have this property, but is a little (enough?) closer to its rounded representation than i=1...

@xoofx @rygorous (I feel like the other factor is increasing the number of significant bits on the left hand side, so the multiplication result is wider. I wouldn't be surprised if I was making an error there though.)

This is probably a better way to look at it: https://x.com/corsix/status/1854539270981025903

(I _think_ my ideas are kind of equivalent?)

Pete Cawley (@corsix) on X

@rygorous Intuition: multiplication by the reciprocal involves multiplication by integer m then division by 2^e, where m/2^e approximates 1/255. We want a slightly more accurate approximation, which means increasing |m|, but FP format limits |m|. Solution? Factorise m = m_1 * m_2.

@xoofx Very nice! Added.
@rygorous I'm curious if you have found an exact version for UNORM16 that is faster than a plain / 65535.0f? I have a basic mul version, but it is just 1% faster, so probably not worth it.
@xoofx @rygorous It looks like exact FP32 div via two muls works for SNORM8 (first mul by 31) and UNORM8 (first mul by 3) and SNORM16 (first mul by 73), but a coefficient search of this particular functional form comes up blank for UNORM16. (NB: The SNORM forms also need an extra clamp to treat -128 as -127 / -32768 as -32767)

@corsix @xoofx @rygorous I think the best I can do on UNORM16 is:

float unorm16(int x) {
    float f = x * (1 / 65536.f);
    return f + f * (0x10001 / 4294967296.f);
}

Not amazing, but the first scale factor is a power-of-two, so it can be folded into the int-to-float operation on A64 (https://docsmirror.github.io/A64/2022-09/scvtf_float_fix.html), such that it's just two ops, SCVTF + FMADD.

It works without FMAs too, and x86 just needs an extra MULSS, but it's less convincing in those cases: https://godbolt.org/z/bsfx7WM7h

SCVTF (scalar, fixed-point) -- A64

@corsix @xoofx @rygorous Oh, looks like that can work for UNORM8 too... The magic number (0x1010102) is a bit cursed, but it'd be hard to beat on A64:

float unorm8(int x) {
    float f = x * (1 / 256.f);
    return f + f * (0x1010102 / 4294967296.f);
}

@corsix @xoofx @rygorous And one for the AVX-512 people...

float unorm8(int x) {
    // fesetround is slow, use _mm_mul_round_ss
    fesetround(FE_TOWARDZERO);
    float result = (float)x * (0x01010102 / 4294967296.f);
    fesetround(FE_TONEAREST);
    return result;
}

@dougall @xoofx @rygorous Also a candidate on RV32F/RV64F due to the per-instruction rounding modes there.
@corsix @xoofx @rygorous Oh, nice! But not the vector instructions? Haha, oh well. I swear every time I learn something cool about RISC-V, I learn something disappointing at the same time.
@rygorous oh man I love this and the *3 solution at the end is so much simpler it's the icing on the cake