Whenever a frontier lab drops a new model, you see its employees posting things like "You’ll be surprised by how good we made our new model! Throw your hardest problems at it." Today, while trying to build up an intuition for how GRPO works, I think I realised that there are actually two things going on there:
1. "Our model is better than you think, or than you give it credit for."
2. "We need hard tasks and examples to train on so we can make the next version of our model even better."
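
Here is a rough sketch of the intuition behind point 2, as I currently understand it. In GRPO you sample a group of answers to the same prompt, score each one, and compute each answer's advantage relative to the rest of the group. The function name, the binary rewards, and the use of the population standard deviation are my own illustrative assumptions, not any lab's actual training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each reward against its own group (illustrative sketch).

    Assumption: advantage = (reward - group mean) / (group std + eps),
    using the population standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A prompt the model always solves: every sample gets the same reward.
easy = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# A hard prompt the model only solves occasionally.
hard = group_relative_advantages([0.0, 0.0, 1.0, 0.0])

print(easy)
print(hard)
```

If every sample in the group gets the same reward, all the advantages come out as zero and that prompt contributes nothing to the update. That holds whether the model always succeeds or always fails. Only prompts where the model sometimes succeeds and sometimes fails produce a signal, which would be one reason to go looking for your hardest problems.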

