Cool numerics rule of thumb shared by @nan on Twitter today:
Computing the inverse p-th root of a matrix, A^{-1/p}, loses about log2(K/p) bits of precision,
where K is the condition number of A (largest over smallest eigenvalue).
So a condition number of 10^6 (not uncommon at all in DL) loses about 19 bits (e.g. log2(10^6 / 2) ≈ 18.9 for p = 2), leaving you with only about 4 of the 23 mantissa bits in fp32.
That's why a very careful implementation of higher-order optimizers like Shampoo is necessary (or very strong "regularization", which may help or hurt overall).
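
To make the arithmetic concrete, here is a minimal sketch of the rule of thumb in Python. The function names (`bits_lost`, `bits_left`) are made up for illustration, and counting fp32 as 23 stored mantissa bits is an assumption (use 24 if you count the implicit leading bit). It only evaluates the log2(K/p) estimate; it does not measure the error of any real matrix-root routine.

```python
import math

# Assumption: count only the 23 explicitly stored fraction bits of fp32.
FP32_MANTISSA_BITS = 23


def bits_lost(cond: float, p: int) -> float:
    """Rule-of-thumb estimate of bits lost computing A^{-1/p}: log2(K / p)."""
    # Clamp at zero: a well-conditioned matrix cannot "gain" precision.
    return max(0.0, math.log2(cond / p))


def bits_left(cond: float, p: int, mantissa_bits: int = FP32_MANTISSA_BITS) -> float:
    """Estimated mantissa bits that remain meaningful after the loss."""
    return max(0.0, mantissa_bits - bits_lost(cond, p))


if __name__ == "__main__":
    cond = 1e6  # condition number K from the example above
    for p in (1, 2, 4):
        print(f"p={p}: lose ~{bits_lost(cond, p):.1f} bits, "
              f"~{bits_left(cond, p):.1f} left in fp32")
```

The estimate depends only weakly on p: going from p = 2 to p = 4 buys back a single bit, so the condition number is what really matters.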





















