| linkedin | https://linkedin.com/in/strickvl |
| github | https://github.com/strickvl |
| blog | https://mlops.systems/ |
Took a little detour into what softmax helps with (and where it might not), and also thought a bit about scheduling for the reward: whether it might make sense to start with more signal, or to start with less and then add more over time by changing the temperature. (It seems this isn't really a thing in RL training in the way it is for the learning rate, where you *do* do scheduling a lot in practice, but fun to consider!)
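To make the temperature idea concrete, here is a minimal sketch, assuming rewards are plain floats per trace and that "scheduling" means changing the softmax temperature over training steps. The function names, the linear schedule, and the start/end values are all hypothetical, picked only for illustration; nothing here comes from a particular RL library.

```python
import math

def softmax_weights(rewards, temperature=1.0):
    """Turn per-trace rewards into weights that sum to 1."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [r / temperature for r in rewards]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def linear_temperature(step, total_steps, start=2.0, end=0.5):
    """Hypothetical schedule: move linearly from `start` to `end`."""
    frac = step / max(total_steps - 1, 1)
    return start + frac * (end - start)

rewards = [0.2, 0.5, 0.9]  # made-up rewards for three traces
for step in range(3):
    t = linear_temperature(step, 3)
    weights = softmax_weights(rewards, t)
    print(step, round(t, 2), [round(w, 3) for w in weights])
```

A high temperature flattens the weights so every trace counts about equally; a low temperature concentrates the weight on the best trace. Going from high to low is one way to "add more signal" as training goes on.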
https://alexstrick.com/posts/2026-06-20-my-first-rl-environment.html

My first hands-on RL day. Before any weights move, three ways to shape good and bad structured-extraction traces — filter, reward-weight (with a surprise softmax-and-temperature detour), or push away with full RL — then my first verifiers environment: a dataset plus a rubric, no trainer.
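As a rough picture of what "a dataset plus a rubric, no trainer" can look like for structured extraction, here is a plain-Python sketch. It deliberately does not use the verifiers library API; the dataset row, the two scoring functions, and the weights are hypothetical stand-ins chosen for illustration.

```python
import json

# Hypothetical toy dataset: a prompt plus the fields we expect back.
DATASET = [
    {"prompt": "Invoice 17 from Acme, total 40 EUR",
     "expected": {"vendor": "Acme", "total": 40}},
]

def parses_as_json(completion):
    """Return 1.0 if the completion is a JSON object, else 0.0."""
    try:
        return 1.0 if isinstance(json.loads(completion), dict) else 0.0
    except json.JSONDecodeError:
        return 0.0

def field_accuracy(completion, expected):
    """Fraction of expected fields the completion got exactly right."""
    try:
        got = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(got, dict):
        return 0.0
    hits = sum(1 for key, value in expected.items() if got.get(key) == value)
    return hits / len(expected)

def score(completion, expected, w_format=0.2, w_fields=0.8):
    """The 'rubric': a weighted sum of the individual checks."""
    return (w_format * parses_as_json(completion)
            + w_fields * field_accuracy(completion, expected))

row = DATASET[0]
print(score('{"vendor": "Acme", "total": 40}', row["expected"]))
print(score("not json at all", row["expected"]))
```

Nothing in this sketch updates any weights. It only says what to ask and how to grade the answer, which is the part that can be built and tested before a trainer is involved.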
And a huge thanks to @adithya_s_k, @_lewtun, @lvwerra, @QGallouedec, @ben_burtenshaw and @sergiopaniego for their banger of a blog back in May which was super useful!
https://huggingface.co/spaces/AdithyaSK/rl-environments-guide
My goal is to build out some environments of my own in various frameworks and also to do the training as well, so I'll be exploring the space in practice in the coming days / weeks. I wrote a little something (including another video) on how I think about placing frameworks within my mental model of the five main things that are going on in RL.
Here's my blog: https://alexstrick.com/posts/2026-06-19-reading-rl-environment-landscape.html
For example, GLM 5.2 came out yesterday and they explicitly highlight how they needed PPO for long-horizon agentic RL work. (https://z.ai/blog/glm-5.2)
Blog even includes a little video of me talking through my understanding of GRPO... (with obviously the huge caveat that I'm early days in my learning etc etc!)
Whenever a frontier lab drops a new model you always see their employees posting things like "you’ll be surprised by how good we made our new model! throw your hardest problems at it". Today (while trying to build up an intuition for how GRPO works) I think I realised that there are actually two things going on there:
1. "Our model is better than you think / give it credit for"
2. "We need hard tasks and examples to train on so we can make the next version of our model even better."
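One possible way to connect the second point to GRPO, assuming the commonly described group-relative advantage (each sampled completion's reward minus the group mean, divided by the group standard deviation): if every sample for a prompt gets the same reward, every advantage is zero and that prompt contributes no learning signal. The sketch below uses made-up rewards, a hypothetical function name, and the population standard deviation; real implementations may differ in those details.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one prompt's sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# A task the model always solves: identical rewards, so no signal.
solved = group_advantages([1.0, 1.0, 1.0, 1.0])

# A harder task the model only sometimes solves: the rare success stands out.
hard = group_advantages([0.0, 0.0, 1.0, 0.0])

print([round(a, 3) for a in solved])
print([round(a, 3) for a in hard])
```

On this reading, tasks that sit at the edge of what the model can do are the useful ones for training, because only they produce a spread of rewards within a group.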