Floating point question:

I have two half precision floating point numbers. I convert them to single precision, add them, and convert the sum back to f16.

What’s the maximum difference I should expect between the sum computed that way and the result of adding the two f16s directly?
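For concreteness, here's one way to probe the two paths empirically in Python, using `struct`'s `'e'` (binary16) and `'f'` (binary32) formats; the helper names are mine. Widening f16 to f32 is exact, and the binary64 sum of two f16 values is also exact, so rounding that f64 sum straight to f16 gives a single-rounding reference:

```python
import struct

def f16_to_float(bits):
    # Reinterpret a u16 bit pattern as binary16, widened to a Python float (exact).
    return struct.unpack('<e', struct.pack('<H', bits))[0]

def float_to_f16(x):
    # Round a Python float (binary64) to binary16, round-to-nearest, ties-to-even.
    return struct.unpack('<H', struct.pack('<e', x))[0]

def round_f32(x):
    # Round a Python float to binary32 (the intermediate precision in question).
    return struct.unpack('<f', struct.pack('<f', x))[0]

def add_via_f32(a, b):
    # The pipeline in question: widen to f32 (exact), add and round in f32,
    # then round again down to f16.
    return float_to_f16(round_f32(f16_to_float(a) + f16_to_float(b)))

def add_reference(a, b):
    # The f64 sum of two f16 values is exact, so this rounds exactly once.
    return float_to_f16(f16_to_float(a) + f16_to_float(b))

# The troublesome input mentioned below: smallest positive subnormal (0x0001)
# plus -2^-12 (0x8C00).
print(hex(add_via_f32(0x0001, 0x8C00)), hex(add_reference(0x0001, 0x8C00)))
```

One caveat to watch for: as far as I can tell, packing a finite value beyond f16's range raises `OverflowError` rather than returning infinity, so this sketch only covers in-range sums.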

I implemented f16 addition in software. Originally I tested it by converting to f32, adding, and converting back to f16, but I was getting large differences on some inputs (e.g., adding the smallest positive subnormal to small negative powers of 2 like -2^-12). By “large” I mean differences of more than 8 in the underlying u16 bit patterns.
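One caveat about measuring differences as raw u16 subtraction: the bit patterns are not monotonic across zero (negative values order in reverse), so a distance like "more than 8" is only meaningful after mapping to a monotonic scale. A small sketch of the usual mapping (names mine):

```python
def f16_key(bits):
    # Map a binary16 bit pattern to an integer that is monotonic in the
    # represented value: positives map to themselves, negatives reflect
    # around zero (so +0 and -0 both map to 0).
    return bits if bits < 0x8000 else 0x8000 - bits

def ulp_distance(a, b):
    # Number of representable f16 steps between a and b (order-based).
    return abs(f16_key(a) - f16_key(b))
```

Within one sign this agrees with plain u16 subtraction; across the sign boundary it counts steps through zero instead of jumping by ~0x8000.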

When I test my implementation against hardware f16, all 2^32 sums are bit-for-bit correct for all finite and infinite results. Every computation involving a NaN gives me a NaN (though I’m not checking that it’s the “right” NaN, just that it’s a NaN).
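Since I obviously can't call your implementation or hardware here, a sketch of that kind of sweep, comparing the via-f32 path against a single-rounding f64 reference instead (helper names mine; the full 2^32 loop is slow in Python, so the pair set is a parameter):

```python
import struct, random

def f16_to_float(bits):
    # Reinterpret a u16 bit pattern as binary16, widened to a float (exact).
    return struct.unpack('<e', struct.pack('<H', bits))[0]

def float_to_f16(x):
    # Round to binary16; struct raises OverflowError for finite values that
    # would round beyond f16 range, which under ties-to-even means infinity.
    try:
        return struct.unpack('<H', struct.pack('<e', x))[0]
    except OverflowError:
        return 0x7C00 if x > 0 else 0xFC00

def round_f32(x):
    # Round a float to binary32, the intermediate precision in question.
    return struct.unpack('<f', struct.pack('<f', x))[0]

def is_finite_f16(bits):
    return (bits & 0x7C00) != 0x7C00   # all-ones exponent field = inf/NaN

def check_pairs(pairs):
    # Compare f16 -> f32 -> add -> f16 against the single-rounding f64 path.
    mismatches = []
    for a, b in pairs:
        if not (is_finite_f16(a) and is_finite_f16(b)):
            continue
        via32 = float_to_f16(round_f32(f16_to_float(a) + f16_to_float(b)))
        via64 = float_to_f16(f16_to_float(a) + f16_to_float(b))
        if via32 != via64:
            mismatches.append((a, b))
    return mismatches

# Exhaustive would be ((a, b) for a in range(1 << 16) for b in range(1 << 16));
# a seeded random sample keeps the sketch fast.
rng = random.Random(0)
sample = [(rng.randrange(1 << 16), rng.randrange(1 << 16)) for _ in range(10_000)]
print(len(check_pairs(sample)))
```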

When I’m back at a computer, I can compute the answer to my question, but I’d like to understand the maximum difference analytically rather than merely empirically.

Related question: how many bits do I need to use to add the significands?

I found that converting the 11-bit significands to 32-bit integers by shifting them left by 19 before aligning to the larger exponent led to correct results with rounding ties to even. Using 16-bit integers and shifting left by 3 doesn’t give me correct results.
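I can only guess at the 16-bit failure mode, but a plain truncating right shift during alignment is the usual culprit: it discards the bits that distinguish "exactly half" from "just above half", so ties-to-even misfires. The textbook fix is to OR everything shifted out into a sticky bit, after which guard + round + sticky positions suffice; a hypothetical helper (name mine):

```python
def shift_right_sticky(sig, n):
    # Shift an integer significand right by n >= 0, ORing every shifted-out
    # bit into bit 0 ("sticky") so round-to-nearest, ties-to-even can still
    # tell an exact tie apart from a value just above it.
    if n == 0:
        return sig
    if n >= sig.bit_length():
        return 1 if sig else 0     # everything shifted out; only stickiness survives
    lost = sig & ((1 << n) - 1)
    return (sig >> n) | (1 if lost else 0)
```

With significands pre-shifted left by 3 so the bottom positions hold guard/round/sticky, aligning with this helper instead of a plain `>>` preserves enough information for correct rounding regardless of the exponent difference.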

My thinking was that I need a few extra low bits to determine how to round, and I wanted to make sure the signed significand addition didn’t overflow my type (either i32 or i16).

My current intuition is that I need twice as many bits (so 22), plus 1 for the sum’s carry and 1 for the sign.
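If it helps to cross-check that intuition, Python's arbitrary-precision integers let you dodge the width question entirely: every finite f16 is an integer multiple of 2^-24, so the exact signed sum spans at most about 41 bits (11 significand bits across a 29-step exponent range, plus carry and sign) and can be rounded exactly once with ties-to-even. A finite-only sketch (no inf/NaN handling; all names are mine):

```python
import struct

def f16_parts(bits):
    # Split a binary16 pattern into (sign, integer n) with value == n * 2**-24.
    sign = bits >> 15
    e = (bits >> 10) & 0x1F
    frac = bits & 0x3FF
    sig = frac if e == 0 else frac | 0x400   # make the implicit bit explicit
    return sign, sig << (max(e, 1) - 1)

def f16_add(a, b):
    # Finite-only f16 addition: exact integer sum, then a single
    # round-to-nearest, ties-to-even.  Inf/NaN inputs are out of scope.
    sa, na = f16_parts(a)
    sb, nb = f16_parts(b)
    total = (-na if sa else na) + (-nb if sb else nb)
    if total == 0:
        # An exact zero result is -0 only when both inputs are negative zeros.
        return 0x8000 if (a & b & 0x8000) == 0x8000 else 0x0000
    sign = 0x8000 if total < 0 else 0
    mag = abs(total)
    shift = max(mag.bit_length() - 11, 0)    # bits below the 11-bit significand
    sig = mag >> shift
    if shift:
        rem = mag & ((1 << shift) - 1)
        half = 1 << (shift - 1)
        if rem > half or (rem == half and (sig & 1)):
            sig += 1                          # round up (ties to even)
            if sig == 0x800:                  # rounding carried out of 11 bits
                sig = 0x400
                shift += 1
    exp = shift + 1                           # biased exponent of the result
    if exp > 30:
        return sign | 0x7C00                  # overflow to infinity
    if sig >= 0x400:
        return sign | (exp << 10) | (sig - 0x400)
    return sign | sig                         # subnormal (exponent field 0)

def _f16_to_float(bits):
    return struct.unpack('<e', struct.pack('<H', bits))[0]

def _float_to_f16(x):
    try:
        return struct.unpack('<H', struct.pack('<e', x))[0]
    except OverflowError:                     # finite value rounding to infinity
        return 0x7C00 if x > 0 else 0xFC00

def reference_add(a, b):
    # The f64 sum of two f16 values is exact, so this rounds exactly once.
    return _float_to_f16(_f16_to_float(a) + _f16_to_float(b))
```

This trades guard/round/sticky bookkeeping for wide integers, so it says nothing about the minimal width, but it gives an independent oracle to test narrower accumulators against.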