Earlier this year, the second #AIMO (Artificial Intelligence Mathematical Olympiad) concluded, with the winning team solving 34 of the 50 math problems in the final set (which had been selected to be harder for AI than those of the first AIMO).
The competition was restricted to open source models and run with a limited amount of compute. The AIMO has now retested these problems for the top two teams from that competition (NemoSkills and imagination research) as well as for OpenAI's o3 model, both at comparable levels of compute and with high resources. Unsurprisingly, the high resource models did better, with the high resource o3 model scoring as high as 47/50, or even 50/50 if given two tries at each question. On the other hand, the gap between the open source models and the commercial models for a fixed amount of compute was relatively slight.
More details of this experiment are available at https://aimoprize.com/updates/2025-09-05-the-gap-is-shrinking