New blog post: "Exact UNORM8 to float" https://fgiesen.wordpress.com/2024/11/06/exact-unorm8-to-float/ nobody asked, but here we go.

GPUs support UNORM formats that represent a number inside [0,1] as an 8-bit unsigned integer. In exact arithmetic, the conversion to a floating-point number is straightforward: take the integer and…

The ryg blog

@rygorous Super interesting! I recently ran into this precision problem when converting colors. I came up with a slightly different solution that also gives exact results and is about 10% faster. Code in #csharp

static float ByteToFloat(byte value) => (value + value + value) * (1.0f / (3.0f * 255.0f));
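For anyone who wants to check the claim, here is a minimal C sketch that ports the one-liner above and compares it against a plain division for all 256 inputs. The names `byte_to_float` and `count_unorm8_mismatches` are made up for this illustration, and it assumes the compiler evaluates `float` expressions in single precision (no x87 excess precision, no fast-math).

```c
#include <stdint.h>

/* C port of the C# one-liner: triple the byte, then scale by 1/(3*255). */
static float byte_to_float(uint8_t value) {
    return (float)(value + value + value) * (1.0f / (3.0f * 255.0f));
}

/* Hypothetical helper: counts the inputs where the multiply form
   differs from the correctly rounded division x / 255.0f. */
static int count_unorm8_mismatches(void) {
    int mismatches = 0;
    for (int x = 0; x < 256; ++x) {
        float expected = (float)x / 255.0f;
        if (byte_to_float((uint8_t)x) != expected) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

If the trick is exact as described, `count_unorm8_mismatches()` should return 0.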

@xoofx Very nice! Added.
@rygorous I'm curious whether you have found an exact version for UNORM16 that is faster than a plain / 65535.0f? I have a basic mul version, but it is just 1% faster, so probably not worth it.
@xoofx @rygorous It looks like exact FP32 div via two muls works for SNORM8 (first mul by 31) and UNORM8 (first mul by 3) and SNORM16 (first mul by 73), but a coefficient search of this particular functional form comes up blank for UNORM16. (NB: The SNORM forms also need an extra clamp to treat -128 as -127 / -32768 as -32767)
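A rough C sketch of what the SNORM8 variant could look like, assuming it has the same shape as the UNORM8 one (integer multiply by 31, then a float multiply by 1/(31*127)) and that the clamp is applied first. The function names are invented for this example, and it assumes plain single-precision evaluation without fast-math.

```c
#include <stdint.h>

/* Assumed form: clamp, multiply by 31 in integer, then scale by 1/(31*127). */
static float snorm8_to_float(int8_t value) {
    int x = value < -127 ? -127 : value;   /* treat -128 as -127 */
    return (float)(x * 31) * (1.0f / (31.0f * 127.0f));
}

/* Hypothetical helper: compares against a clamped division by 127.0f
   over the full int8 range and counts the differences. */
static int count_snorm8_mismatches(void) {
    int mismatches = 0;
    for (int v = -128; v <= 127; ++v) {
        int clamped = v < -127 ? -127 : v;
        float expected = (float)clamped / 127.0f;
        if (snorm8_to_float((int8_t)v) != expected) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

With the clamp in place, both -128 and -127 map to -1.0f, and the mismatch count should come out as 0.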

@corsix @xoofx @rygorous I think the best I can do on UNORM16 is:

float unorm16(int x) {
    float f = x * (1 / 65536.f);
    return f + f * (0x10001 / 4294967296.f);
}

Not amazing, but the first scale factor is a power-of-two, so it can be folded into the int-to-float operation on A64 (https://docsmirror.github.io/A64/2022-09/scvtf_float_fix.html), such that it's just two ops, SCVTF + FMADD.

It works without FMAs too, and x86 just needs an extra MULSS, but it's less convincing in those cases: https://godbolt.org/z/bsfx7WM7h
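A small exhaustive check of the UNORM16 formula, as a sketch: it loops over all 65536 inputs and compares against a plain division. The helper name `count_unorm16_mismatches` is made up here. Whether the compiler contracts the last line into an FMA depends on target and flags, so this only tests whichever variant your build produces.

```c
/* Same formula as in the post: scale by 2^-16, then apply a small correction. */
static float unorm16_to_float(int x) {
    float f = (float)x * (1 / 65536.f);
    return f + f * (0x10001 / 4294967296.f);
}

/* Hypothetical helper: counts inputs where the result differs
   from the correctly rounded division x / 65535.0f. */
static int count_unorm16_mismatches(void) {
    int mismatches = 0;
    for (int x = 0; x <= 65535; ++x) {
        float expected = (float)x / 65535.0f;
        if (unorm16_to_float(x) != expected) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

The correction constant works because 1/65535 = 2^-16 * (1 + 2^-16 + 2^-32 + ...), and `f + f * c` reproduces the first three terms of that series.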

@corsix @xoofx @rygorous Oh, looks like that can work for UNORM8 too... The magic number (0x1010102) is a bit cursed, but it'd be hard to beat on A64:

float unorm8(int x) {
    float f = x * (1 / 256.f);
    return f + f * (0x1010102 / 4294967296.f);
}

@corsix @xoofx @rygorous And one for the AVX-512 people...

#include <fenv.h>

float unorm8(int x) {
    // fesetround is slow; real code would use _mm_mul_round_ss instead
    fesetround(FE_TOWARDZERO);
    float result = (float)x * (0x01010102 / 4294967296.f);
    fesetround(FE_TONEAREST);
    return result;
}
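If you don't have AVX-512 handy and don't trust the compiler to keep the multiply between the two fesetround calls, the truncating multiply can be emulated with integers. This is only a sketch of the same idea, not the instruction itself: it forms the exact product x * 0x01010102, drops low bits until 24 significant bits remain (that is the round toward zero), and rescales by powers of two. The names `unorm8_truncating` and `count_truncating_mismatches` are invented for this example.

```c
#include <stdint.h>

/* Emulates (float)x * (0x01010102 / 2^32) with round-toward-zero. */
static float unorm8_truncating(uint8_t x) {
    uint64_t product = (uint64_t)x * 0x01010102u;  /* exact, scaled by 2^32 */
    int shift = 0;
    /* Discard low bits until the value fits in a 24-bit significand. */
    while ((product >> shift) >= (1u << 24)) {
        ++shift;
    }
    uint32_t mantissa = (uint32_t)(product >> shift);
    /* Every step below is exact: small integer times powers of two. */
    return (float)mantissa * (float)(1u << shift) * (1.0f / 4294967296.0f);
}

/* Hypothetical helper: counts inputs where the truncated product differs
   from the round-to-nearest division x / 255.0f. */
static int count_truncating_mismatches(void) {
    int mismatches = 0;
    for (int x = 0; x < 256; ++x) {
        float expected = (float)x / 255.0f;
        if (unorm8_truncating((uint8_t)x) != expected) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

The constant is slightly larger than 2^32 / 255, which is what lets truncation land on the same value that round-to-nearest division gives.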

@dougall @xoofx @rygorous Also a candidate on RV32F/RV64F due to the per-instruction rounding modes there.
@corsix @xoofx @rygorous Oh, nice! But not the vector instructions? Haha, oh well. I swear every time I learn something cool about RISC-V, I learn something disappointing at the same time.