I successfully implemented approximate reciprocal (1/x) and reciprocal square root (1/sqrt(x)) for MRISC32 today.
I use a 256-entry LUT and get about 7-8 bits of precision (full precision with two Newton-Raphson iterations).
These are quite cheap instructions: single cycle/no latency, and fewer than 40 ALMs in the FPGA for 32-bit floating-point.
Useful for the Quake 3D rendering loops. I got another couple of FPS by switching from FDIV to FRECIPA.
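
For anyone curious how the LUT plus Newton-Raphson combination works, here is a rough software sketch in C. It is only a model of the general technique, not the MRISC32 hardware: the names `recip_lut`, `init_recip_lut`, `recip_approx` and `recip_refine` are made up for illustration, the table stores full floats for simplicity (a hardware table would hold far fewer bits per entry), and only normal, finite, non-zero inputs are handled.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical software model of a 256-entry reciprocal table.
   Not the actual MRISC32 table contents. */
static float recip_lut[256];

/* Entry i holds 1/m for the midpoint of the i-th mantissa interval,
   i.e. m in [1 + i/256, 1 + (i+1)/256). */
static void init_recip_lut(void) {
    for (int i = 0; i < 256; ++i) {
        recip_lut[i] = 1.0f / (1.0f + ((float)i + 0.5f) / 256.0f);
    }
}

/* Approximate 1/x from the table.
   Assumption: x is a normal, finite, non-zero float. Zero, denormals,
   infinities and NaN are not handled in this sketch. */
static float recip_approx(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);              /* safe type punning */
    int e = (int)((bits >> 23) & 0xFFu) - 127;   /* unbiased exponent */
    uint32_t idx = (bits >> 15) & 0xFFu;         /* top 8 mantissa bits */
    float r = ldexpf(recip_lut[idx], -e);        /* (1/m) * 2^(-e) */
    return copysignf(r, x);                      /* restore the sign */
}

/* One Newton-Raphson step for y ~ 1/x: y' = y * (2 - x*y).
   Each step roughly doubles the number of correct bits. */
static float recip_refine(float x, float y) {
    return y * (2.0f - x * y);
}
```

Call `init_recip_lut()` once, then something like `recip_refine(x, recip_refine(x, recip_approx(x)))` gives a result close to full single precision in this model. The reciprocal square root works along the same lines with its own table and the step `y' = y * (1.5 - 0.5*x*y*y)`.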
