RLMs (Recursive Language Models) show that there is hidden, unlockable fluid intelligence in LLMs, specifically on tasks that require genuine test-time reasoning, where no amount of memorisation helps, e.g. ARC-AGI-2.
A few condensed points:
- The REPL-as-environment pattern is general, not just a long-context trick. The RLM paper uses long context as the motivating use case for writing symbolic programs over the input prompt, but the pivotools / Symbolica papers show that agentic coding (a persistent REPL + iterative interaction + optional recursive self-calling) dramatically improves reasoning on short-context tasks too.
- RLM is a third scaling axis alongside chain-of-thought and tool calling: here the model itself controls its own context-management behaviour.
- RLM trajectories are a trainable objective: RL fine-tuning on RLM trajectories yields further improvements.
- The underlying mechanism, grounding LLM reasoning in concrete code execution and feedback, has broad applicability.
- Recursive delegation provides a real additional gain.
- Interleaved thinking is currently fragile but transformative: the pivotools authors complained that many inference providers and open-weight models respond without actually performing interleaved reasoning.
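The REPL-as-environment pattern above can be sketched minimally: the root model never ingests the full prompt; instead it writes code against a persistent namespace that holds the prompt as a variable and exposes a recursive model entry point. Everything here is a hypothetical illustration, not the RLM paper's implementation: `stub_llm`, `ReplEnv`, and the chunk size are invented for the sketch, and a real system would substitute an actual model call for the stub.

```python
def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a toy 'answer' for the sketch."""
    return f"summary({len(prompt.split())} words)"


class ReplEnv:
    """Persistent REPL-style environment (hypothetical sketch).

    The long context lives in the namespace as a variable, alongside an
    `llm` entry point the model-written code can call recursively.
    """

    def __init__(self, context: str, llm=stub_llm):
        self.ns = {"context": context, "llm": llm}

    def run(self, code: str):
        # Execute model-written code in the shared namespace; by convention
        # the snippet leaves its result in `_`.
        exec(code, self.ns)
        return self.ns.get("_")


env = ReplEnv("a very long document " * 1000)

# A snippet the model might emit: slice the context into chunks and
# delegate each chunk to a recursive sub-call, keeping the root context small.
partials = env.run(
    "chunks = [context[i:i+4000] for i in range(0, len(context), 4000)]\n"
    "_ = [llm(c) for c in chunks]"
)
print(len(partials), partials[0])
```

The point of the design is that state (variables, partial results) persists across turns, so iterative interaction and recursive self-calls compose naturally.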
#LLM #AI #RLM #Research