Whenever a frontier lab drops a new model, you see their employees posting things like "you’ll be surprised by how good we made our new model! throw your hardest problems at it". Today, while trying to build up an intuition for how GRPO works, I think I realised that there are actually two things going on there:

1. "Our model is better than you think / give it credit for"
2. "We need hard tasks and examples to train on so we can make the next version of our model even better."

The first is easy to appreciate, but to understand the second you have to understand a bit about GRPO, the RL algorithm that has mostly dominated since early 2025. I wrote up my notes in a blog post (linked below). It's also interesting that we're seeing a bit of a return to the older ways of doing RL (for instance, the critic model approach found in PPO).

For example, GLM 5.2 came out yesterday and they explicitly highlight how they needed PPO for long-horizon agentic RL work. (https://z.ai/blog/glm-5.2)
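
To make the second point in the list above a bit more concrete, here is a minimal sketch of the group-relative advantage at the heart of GRPO, assuming the common formulation where each sampled answer's reward is normalised against the mean and standard deviation of its own group. The function name, the `eps` value and the 0/1 rewards are all made up for illustration and are not taken from any particular library or from the linked posts.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Score each reward against its own group: (r - mean) / (std + eps).

    eps is an illustrative small constant that avoids division by zero
    when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical 0/1 correctness rewards for four sampled answers to one prompt.
easy_task = [1.0, 1.0, 1.0, 1.0]  # the model gets it right every time
hard_task = [0.0, 0.0, 1.0, 0.0]  # the model only occasionally gets it right

easy_adv = group_relative_advantages(easy_task)
hard_adv = group_relative_advantages(hard_task)

print("easy:", easy_adv)
print("hard:", hard_adv)
```

Under these assumptions, the easy task gives every answer an advantage of zero, so there is nothing to learn from it. The hard task gives the one correct answer a positive advantage and the failures a negative one. A task the model never solves would also come out as all zeros, which is why the useful examples are the ones right at the edge of what the model can do.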

The blog post even includes a short video of me talking through my understanding of GRPO (with the obvious and huge caveat that I'm still in the early days of learning this!).

https://alexstrick.com/posts/2026-06-18-grpo-explained.html