Floating point question:

I have two half precision floating point numbers. I convert them to single precision, add them, and convert the sum back to f16.

What’s the maximum difference I should expect between the sum computed that way and the result of adding the two f16s directly?
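For concreteness, here's one way to probe the two paths empirically in Python, using `struct`'s `'e'` (binary16) and `'f'` (binary32) formats; the helper names are mine. Widening f16 to f32 is exact, and the binary64 sum of two f16 values is also exact, so rounding that f64 sum straight to f16 gives a single-rounding reference:

```python
import struct

def f16_to_float(bits):
    # Reinterpret a u16 bit pattern as binary16, widened to a Python float (exact).
    return struct.unpack('<e', struct.pack('<H', bits))[0]

def float_to_f16(x):
    # Round a Python float (binary64) to binary16, round-to-nearest, ties-to-even.
    return struct.unpack('<H', struct.pack('<e', x))[0]

def round_f32(x):
    # Round a Python float to binary32 (the intermediate precision in question).
    return struct.unpack('<f', struct.pack('<f', x))[0]

def add_via_f32(a, b):
    # The pipeline in question: widen to f32 (exact), add and round in f32,
    # then round again down to f16.
    return float_to_f16(round_f32(f16_to_float(a) + f16_to_float(b)))

def add_reference(a, b):
    # The f64 sum of two f16 values is exact, so this rounds exactly once.
    return float_to_f16(f16_to_float(a) + f16_to_float(b))

# The troublesome input mentioned below: smallest positive subnormal (0x0001)
# plus -2^-12 (0x8C00).
print(hex(add_via_f32(0x0001, 0x8C00)), hex(add_reference(0x0001, 0x8C00)))
```

One caveat to watch for: as far as I can tell, packing a finite value beyond f16's range raises `OverflowError` rather than returning infinity, so this sketch only covers in-range sums.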

I implemented f16 addition in software. Originally I tested it by converting to f32, adding, and converting back to f16, but I was getting large differences on some inputs (e.g., adding the smallest positive subnormal to small negative powers of 2 like -2^-12). By “large” I mean differences of more than 8 in the underlying u16 bit patterns.
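One caveat about measuring differences as raw u16 subtraction: the bit patterns are not monotonic across zero (negative values order in reverse), so a distance like "more than 8" is only meaningful after mapping to a monotonic scale. A small sketch of the usual mapping (names mine):

```python
def f16_key(bits):
    # Map a binary16 bit pattern to an integer that is monotonic in the
    # represented value: positives map to themselves, negatives reflect
    # around zero (so +0 and -0 both map to 0).
    return bits if bits < 0x8000 else 0x8000 - bits

def ulp_distance(a, b):
    # Number of representable f16 steps between a and b (order-based).
    return abs(f16_key(a) - f16_key(b))
```

Within one sign this agrees with plain u16 subtraction; across the sign boundary it counts steps through zero instead of jumping by ~0x8000.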

When I test my implementation against hardware f16, all 2^32 sums are bit-for-bit correct for all finite and infinite results. Every computation involving a NaN gives me a NaN (though I’m not checking that it’s the “right” NaN, just that it’s a NaN).
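Since I obviously can't call your implementation or hardware here, a sketch of that kind of sweep, comparing the via-f32 path against a single-rounding f64 reference instead (helper names mine; the full 2^32 loop is slow in Python, so the pair set is a parameter):

```python
import struct, random

def f16_to_float(bits):
    # Reinterpret a u16 bit pattern as binary16, widened to a float (exact).
    return struct.unpack('<e', struct.pack('<H', bits))[0]

def float_to_f16(x):
    # Round to binary16; struct raises OverflowError for finite values that
    # would round beyond f16 range, which under ties-to-even means infinity.
    try:
        return struct.unpack('<H', struct.pack('<e', x))[0]
    except OverflowError:
        return 0x7C00 if x > 0 else 0xFC00

def round_f32(x):
    # Round a float to binary32, the intermediate precision in question.
    return struct.unpack('<f', struct.pack('<f', x))[0]

def is_finite_f16(bits):
    return (bits & 0x7C00) != 0x7C00   # all-ones exponent field = inf/NaN

def check_pairs(pairs):
    # Compare f16 -> f32 -> add -> f16 against the single-rounding f64 path.
    mismatches = []
    for a, b in pairs:
        if not (is_finite_f16(a) and is_finite_f16(b)):
            continue
        via32 = float_to_f16(round_f32(f16_to_float(a) + f16_to_float(b)))
        via64 = float_to_f16(f16_to_float(a) + f16_to_float(b))
        if via32 != via64:
            mismatches.append((a, b))
    return mismatches

# Exhaustive would be ((a, b) for a in range(1 << 16) for b in range(1 << 16));
# a seeded random sample keeps the sketch fast.
rng = random.Random(0)
sample = [(rng.randrange(1 << 16), rng.randrange(1 << 16)) for _ in range(10_000)]
print(len(check_pairs(sample)))
```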

When I’m back at a computer, I can compute the answer to my question, but I’d like to understand the maximum difference analytically rather than merely empirically.

Related question: how many bits do I need to use to add the significands?

I found that converting the 11-bit significands to 32-bit integers by shifting them left by 19 before aligning to the larger exponent led to correct results with rounding ties to even. Using 16-bit integers and shifting left by 3 doesn’t give me correct results.
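I can only guess at the 16-bit failure mode, but a plain truncating right shift during alignment is the usual culprit: it discards the bits that distinguish "exactly half" from "just above half", so ties-to-even misfires. The textbook fix is to OR everything shifted out into a sticky bit, after which guard + round + sticky positions suffice; a hypothetical helper (name mine):

```python
def shift_right_sticky(sig, n):
    # Shift an integer significand right by n >= 0, ORing every shifted-out
    # bit into bit 0 ("sticky") so round-to-nearest, ties-to-even can still
    # tell an exact tie apart from a value just above it.
    if n == 0:
        return sig
    if n >= sig.bit_length():
        return 1 if sig else 0     # everything shifted out; only stickiness survives
    lost = sig & ((1 << n) - 1)
    return (sig >> n) | (1 if lost else 0)
```

With significands pre-shifted left by 3 so the bottom positions hold guard/round/sticky, aligning with this helper instead of a plain `>>` preserves enough information for correct rounding regardless of the exponent difference.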

My thinking was that I need a few extra low bits to determine how to round, and I wanted to make sure the signed significand addition didn’t overflow my type (either i32 or i16).

My current intuition is that I need twice as many bits (so 22), plus 1 for the sum’s carry and 1 for the sign.
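If it helps to cross-check that intuition, Python's arbitrary-precision integers let you dodge the width question entirely: every finite f16 is an integer multiple of 2^-24, so the exact signed sum spans at most about 41 bits (11 significand bits across a 29-step exponent range, plus carry and sign) and can be rounded exactly once with ties-to-even. A finite-only sketch (no inf/NaN handling; all names are mine):

```python
import struct

def f16_parts(bits):
    # Split a binary16 pattern into (sign, integer n) with value == n * 2**-24.
    sign = bits >> 15
    e = (bits >> 10) & 0x1F
    frac = bits & 0x3FF
    sig = frac if e == 0 else frac | 0x400   # make the implicit bit explicit
    return sign, sig << (max(e, 1) - 1)

def f16_add(a, b):
    # Finite-only f16 addition: exact integer sum, then a single
    # round-to-nearest, ties-to-even.  Inf/NaN inputs are out of scope.
    sa, na = f16_parts(a)
    sb, nb = f16_parts(b)
    total = (-na if sa else na) + (-nb if sb else nb)
    if total == 0:
        # An exact zero result is -0 only when both inputs are negative zeros.
        return 0x8000 if (a & b & 0x8000) == 0x8000 else 0x0000
    sign = 0x8000 if total < 0 else 0
    mag = abs(total)
    shift = max(mag.bit_length() - 11, 0)    # bits below the 11-bit significand
    sig = mag >> shift
    if shift:
        rem = mag & ((1 << shift) - 1)
        half = 1 << (shift - 1)
        if rem > half or (rem == half and (sig & 1)):
            sig += 1                          # round up (ties to even)
            if sig == 0x800:                  # rounding carried out of 11 bits
                sig = 0x400
                shift += 1
    exp = shift + 1                           # biased exponent of the result
    if exp > 30:
        return sign | 0x7C00                  # overflow to infinity
    if sig >= 0x400:
        return sign | (exp << 10) | (sig - 0x400)
    return sign | sig                         # subnormal (exponent field 0)

def _f16_to_float(bits):
    return struct.unpack('<e', struct.pack('<H', bits))[0]

def _float_to_f16(x):
    try:
        return struct.unpack('<H', struct.pack('<e', x))[0]
    except OverflowError:                     # finite value rounding to infinity
        return 0x7C00 if x > 0 else 0xFC00

def reference_add(a, b):
    # The f64 sum of two f16 values is exact, so this rounds exactly once.
    return _float_to_f16(_f16_to_float(a) + _f16_to_float(b))
```

This trades guard/round/sticky bookkeeping for wide integers, so it says nothing about the minimal width, but it gives an independent oracle to test narrower accumulators against.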