Despite not yet being a benchmark, the First Proof project is by far the best measure of model usefulness for science and math research available today, and I very much hope that frontier labs continue to take future rounds seriously.
https://www.daniellitt.com/blog/2026/2/20/mathematics-in-the-library-of-babel

Mathematics in the Library of Babel — Daniel Litt
Mathematics isn't only about saying true things. It's about asking the right questions, being confused, stumbling about, getting distracted, being wrong, recognizing when you're wrong, being stuck. Mostly being stuck. It's about clinging to a giant edifice and feeling it out until you understand som




