@strickvl

Machine Learning Engineer, researcher (& author of a few books in my old life as a historian).
Love learning languages (machine and human), cats and sharks. Studying Mathematics @ the Open University. Budding J enthusiast.
linkedin: https://linkedin.com/in/strickvl
github: https://github.com/strickvl
blog: https://mlops.systems/
Up next: freezing the data and eval criteria for my real dataset. I'll think those through now, before I start messing around with training (otherwise I'll inevitably bias myself / the results).

Took a little detour into what softmax helps with (and where it might not). I also thought a bit about scheduling for the reward: whether it might make sense to start with more signal, or to start with less and then add more and more by changing the temperature value over time. (It seems this isn't really a thing in RL training in the way it is for the learning rate, where you *do* do scheduling a lot in practice, but fun to consider!)
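
To make the temperature idea concrete, here is a minimal sketch in plain Python. It assumes the rewards are turned into weights with a temperature-scaled softmax, and that the temperature changes linearly over training. The function names, the example rewards and the start/end temperatures are all made up for illustration; they are not taken from the blog post or from any library.

```python
import math


def softmax_weights(rewards, temperature=1.0):
    """Turn raw rewards into positive weights that sum to 1.

    Lower temperature concentrates weight on the best reward;
    higher temperature flattens the weights towards uniform.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [r / temperature for r in rewards]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def linear_temperature(step, total_steps, start=2.0, end=0.5):
    """Hypothetical schedule: move linearly from `start` to `end`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


if __name__ == "__main__":
    rewards = [0.2, 0.5, 0.9]  # illustrative values only
    for step in (0, 50, 100):
        t = linear_temperature(step, 100)
        weights = softmax_weights(rewards, t)
        print(step, round(t, 2), [round(w, 3) for w in weights])
```

With these assumed values the schedule starts with flatter weights (less signal) and sharpens as the temperature drops; swapping `start` and `end` gives the opposite direction.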

https://alexstrick.com/posts/2026-06-20-my-first-rl-environment.html

My first RL environment: three stages, no trainer – Alex Strick van Linschoten

My first hands-on RL day. Before any weights move, three ways to shape good and bad structured-extraction traces — filter, reward-weight (with a surprise softmax-and-temperature detour), or push away with full RL — then my first verifiers environment: a dataset plus a rubric, no trainer.

Alex Strick van Linschoten
Wrote my first RL environment this evening. A very simple one, mind, but 'verifiers' (by @PrimeIntellect and @willccbb) makes it very easy to slot in the pieces.

And a huge thanks to @adithya_s_k, @_lewtun, @lvwerra, @QGallouedec, @ben_burtenshaw and @sergiopaniego for their banger of a blog back in May which was super useful!

https://huggingface.co/spaces/AdithyaSK/rl-environments-guide

The ultimate guide to RL environments: building and scaling them in the LLM era - a Hugging Face Space by AdithyaSK

Building and scaling RL environments for LLM training

My goal is to build out some environments of my own in various frameworks and also to do the training as well, so I'll be exploring the space in practice in the coming days / weeks. I wrote a little something (including another video) on how I think about placing frameworks within my mental model of the five main things that are going on in RL.

Here's the blog post: https://alexstrick.com/posts/2026-06-19-reading-rl-environment-landscape.html

How to read an RL framework without believing its README – Alex Strick van Linschoten

The RL tooling space grows weekly. Rather than memorise frameworks, I read each one against a five-stage mental model — and stay skeptical of what its README claims to do.

I'm now transitioning from the part of my agentic RL exploration where I learned the high-level concepts to seeing what people are doing in practice. Part of that has meant navigating the world of RL environment/training frameworks, which I have to say is slightly overwhelming! (New domain vocabulary, new players, an explosion of projects...)

For example, GLM 5.2 came out yesterday and they explicitly highlight how they needed PPO for long-horizon agentic RL work. (https://z.ai/blog/glm-5.2)

Blog even includes a little video of me talking through my understanding of GRPO... (with obviously the huge caveat that I'm early days in my learning etc etc!)

https://alexstrick.com/posts/2026-06-18-grpo-explained.html

The first is easy to appreciate, but to understand the second you have to understand a bit about the RL algorithm that's mostly ruled the waves since early 2025. I wrote up my notes in a blog below, but it's interesting that we're seeing a bit of a return to the old ways of doing RL (for instance, the critic model approach as found in PPO).

Whenever a frontier lab drops a new model you always see their employees posting things like "you’ll be surprised by how good we made our new model! throw your hardest problems at it". Today (while trying to build up an intuition for how GRPO works) I think I realised that there are actually two things going on there:

1. "Our model is better than you think / give it credit for"
2. "We need hard tasks and examples to train on so we can make the next version of our model even better."
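
One way to see why point 2 follows from GRPO is to sketch the group-relative advantage it uses: each rollout's reward is compared with the mean of its own group and scaled by the group's standard deviation. The snippet below is a minimal illustration in plain Python, not the implementation from any particular trainer. The function name, the `eps` value and the choice of population standard deviation are assumptions made for this sketch.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each reward against its own group of rollouts.

    advantage_i = (reward_i - group_mean) / (group_std + eps)
    No critic model is involved; the group mean acts as the baseline.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (an assumption here)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # A task the model sometimes solves: advantages are non-zero.
    print(group_relative_advantages([0.0, 0.0, 1.0, 1.0]))
    # A task the model always solves: every advantage is zero.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

When every rollout in a group gets the same reward, all the advantages come out as zero, so that task contributes no learning signal. Under this reading, tasks the model only solves some of the time are the ones that carry a training signal.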

https://alexstrick.com/technical.html#category=reinforcement-learning Some of my study-notes blog posts are here, though at this point they're probably more useful for me than for anyone else! I do hope to graduate to doing some more practical mini-projects soon, and those will probably have more external value!
Technical Blog – Alex Strick van Linschoten

Personal and technical writings from Alex Strick van Linschoten
