New blog post: "Exact UNORM8 to float" https://fgiesen.wordpress.com/2024/11/06/exact-unorm8-to-float/ nobody asked, but here we go.

GPUs support UNORM formats that represent a number inside [0,1] as an 8-bit unsigned integer. In exact arithmetic, the conversion to a floating-point number is straightforward: take the integer and…

The ryg blog

@rygorous general audience aside: For a known divisor you can (almost always) get a correctly rounded division with a FMA and a pair of constants.

https://marc-b-reynolds.github.io/math/2019/03/12/FpDiv.html

Floating point division with known divisor

A small note on division removal

Marc B. Reynolds

@mbr @rygorous I'm interested in doing this in half floating point and it looks like this method should work:

half k0 = 1.0 / 255.0;
half k1 = 0x0001h;
half x = fma(i, k0, i * k1);

but it requires denormal support, which is not available on all GPUs, and explicit fp16 fmas, which are busted on Adreno (they are promoted to fp32).

Any other ideas?

I guess I can just write `i*k0+i*k1` and hope for the best. In the worst case I just get the i*k0 approximation.
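For reference, the split into `k0` and `k1` can be worked out by hand, which also shows where the denormal requirement comes from. This is a sketch that reads `0x0001h` as the binary16 bit pattern 0x0001 (an assumption about the notation) and writes RN for round-to-nearest into half precision:

```latex
\begin{align*}
\frac{1}{255} &= \sum_{n \ge 1} 2^{-8n} = 2^{-8} + 2^{-16} + 2^{-24} + \cdots \\
k_0 &= \operatorname{RN}\!\left(\tfrac{1}{255}\right) = 2^{-8} + 2^{-16} = \texttt{0x1.01p-8} \\
\frac{1}{255} - k_0 &= 2^{-24} + 2^{-32} + \cdots = \tfrac{256}{255} \cdot 2^{-24} \\
k_1 &= \operatorname{RN}\!\left(\tfrac{1}{255} - k_0\right) = 2^{-24}
\end{align*}
```

Half-precision values below 2^-14 are multiples of 2^-24, so `k1` is the smallest positive denormal. On hardware that flushes denormals to zero, `i * k1` vanishes and only the `i * k0` term is left.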

@castano @rygorous If I haven't messed up, then just multiplying by the reciprocal gives 10 inputs that aren't correctly rounded. And I think reworking the constants avoids needing FMA and denormals.

k0 = 0x1.00p-8;
k1 = 0x1.01p-8;
lo = k1*i; // rounds
result = k0*(i+lo); // add rounds

https://gcc.godbolt.org/z/cTTKd41x5

Compiler Explorer - C (x86-64 clang (trunk))

#include <stdio.h>
#include <stdint.h>

#define half(x) ((__fp16)(x))
#define half_mul(a,b) half((a)*(b))

/*
  i  correctly rounded : computed (difference)
  09 0x1.2140p-5 : 0x1.2100p-5 (0x1.0000p-15)
  0d 0x1.a1c0p-5 : 0x1.a180p-5 (0x1.0000p-15)
  12 0x1.2140p-4 : 0x1.2100p-4 (0x1.0000p-14)
  1a 0x1.a1c0p-4 : 0x1.a180p-4 (0x1.0000p-14)
  24 0x1.2140p-3 : 0x1.2100p-3 (0x1.0000p-13)
  34 0x1.a1c0p-3 : 0x1.a180p-3 (0x1.0000p-13)
  48 0x1.2140p-2 : 0x1.2100p-2 (0x1.0000p-12)
  68 0x1.a1c0p-2 : 0x1.a180p-2 (0x1.0000p-12)
  90 0x1.2140p-1 : 0x1.2100p-1 (0x1.0000p-11)
  d0 0x1.a1c0p-1 : 0x1.a180p-1 (0x1.0000p-11)
*/

int main(void)
{
  uint32_t count = 0;

  for(uint32_t i=0; i<256; i++) {
    // reference: the division, rounded to half
    __fp16 cr = half(half(i)/half(255));
    __fp16 f;

    // multiply by recip
    f = half_mul(half(0x1.01p-8f), half(i));

    // move a digit to the lo constant (overwrites the result above)
    __fp16 x  = half(i);
    __fp16 lo = half_mul(0x1.01p-8f, x);
    f = half_mul(0x1.0p-8f, half(x+lo));

    if (f != cr) {
      printf("%02x %1.4a : %1.4a (%1.4a)\n", i, cr, f, cr-f);
      count++;
    }
  }

  printf("# not correctly rounded: %u\n", count);
  return 0;
}
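For compilers without `__fp16`, the same check can be sketched in portable C by doing the arithmetic in `float` and rounding each intermediate to half precision by hand. This is only an illustration: the function names `round_to_half`, `unorm8_to_half_ref`, `unorm8_to_half_recip`, `unorm8_to_half_two_step` and `count_mismatches` are made up here, and the rounding helper assumes its input is zero or a positive value in the half normal range, which holds for every intermediate in this problem.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Round a float to the nearest half-precision value (ties to even) and
   return it as a float. Assumption: v is zero or a positive value in the
   half normal range, so no denormal or overflow handling is needed. */
float round_to_half(float v)
{
    assert(v == 0.0f || (v >= 0x1p-14f && v <= 65504.0f));
    uint32_t bits;
    memcpy(&bits, &v, sizeof bits);
    bits += 0x0FFFu + ((bits >> 13) & 1u); /* round to nearest, ties to even */
    bits &= ~UINT32_C(0x1FFF);             /* clear the 13 bits half lacks */
    memcpy(&v, &bits, sizeof v);
    return v;
}

/* Reference: i/255 rounded to half. */
float unorm8_to_half_ref(uint32_t i)
{
    return round_to_half((float)i / 255.0f);
}

/* Single multiply by the rounded reciprocal 0x1.01p-8. */
float unorm8_to_half_recip(uint32_t i)
{
    return round_to_half(0x1.01p-8f * (float)i);
}

/* Two-step form from the thread: lo = k1*i, result = k0*(i + lo). */
float unorm8_to_half_two_step(uint32_t i)
{
    float x  = (float)i;
    float lo = round_to_half(0x1.01p-8f * x); /* product rounds */
    float s  = round_to_half(x + lo);         /* add rounds */
    return round_to_half(0x1p-8f * s);        /* power-of-two scale */
}

/* Count inputs in [0, 255] where `convert` differs from the reference. */
unsigned count_mismatches(float (*convert)(uint32_t))
{
    unsigned count = 0;
    for (uint32_t i = 0; i < 256; i++)
        if (convert(i) != unorm8_to_half_ref(i))
            count++;
    return count;
}
```

Calling `count_mismatches(unorm8_to_half_recip)` should reproduce the 10 mismatches quoted above, and `count_mismatches(unorm8_to_half_two_step)` should report none. The products and the sum are exact in `float` for these inputs, so each step sees a single rounding to half.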

UNORM8 to half float validation

@mbr @castano @rygorous Never seen sollya before; I assume on this example it’s useful mostly to guarantee that the intermediate computations are all rounded to half, vs C with float16_t where this is easy to miss and may get optimized away?

@zeux @castano @rygorous Yeah...exactly that.

Its main use is for creating function approximations (floating & fixed point).

https://www.sollya.org/

Sollya software tool

@mbr @zeux @rygorous Very neat! How did you come up with that approach? I noticed that the ten inputs that are off by one follow an interesting pattern:
9, 13, 2*9, 2*13, 4*9, 4*13, 8*9, 8*13, 16*9, 16*13
but didn't know what to do with that.

@castano @zeux @rygorous Just by observing that the extended multiplier is:

(2^-8 + 2^-16) + 2^-24

and that can be refactored into:

2^-8 *(1 + (2^-8 + 2^-16))

Yeah. The first thing I did as well was look at the incorrectly rounded inputs.
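Spelled out as a short derivation, using `k0`, `k1` and `lo` as in the snippet with `lo = k1*i`. The symbol RN for rounding to half is notation added here, not something from the thread:

```latex
\begin{align*}
\frac{i}{255} &\approx i\left(2^{-8} + 2^{-16} + 2^{-24}\right) \\
 &= 2^{-8}\left(i + i\left(2^{-8} + 2^{-16}\right)\right) \\
 &= k_0\left(i + k_1 i\right), \qquad k_0 = 2^{-8}, \quad k_1 = 2^{-8} + 2^{-16} \\
\mathit{lo} &= \operatorname{RN}(k_1 i), \qquad
\mathit{result} = k_0 \cdot \operatorname{RN}(i + \mathit{lo})
\end{align*}
```

The final multiply by 2^-8 only changes the exponent, and for i >= 1 every intermediate is at least 2^-8, well above the smallest normal half value 2^-14. That is why this form needs neither an FMA nor denormals.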