Designing agentic loops

Coding agents like Anthropic’s Claude Code and OpenAI’s Codex CLI represent a genuine step change in how useful LLMs can be for producing working code. These agents can now directly …

Simon Willison’s Weblog

I think this is a strictly worse name than "agentic harness", which is already a term used by open-source agentic IDEs (https://github.com/search?q=repo%3Aopenai%2Fcodex%20harness&... or https://github.com/openai/codex/discussions/1174)

Any reason why you want to rename it?

Edit: to say more about my opinions, "agentic loop" could mean a few things -- it could mean the thing you say, or it could mean calling multiple individual agents in a loop ... whereas "agentic harness" evokes a sort of interface between the LLM and the digital outside world which mediates how the LLM embodies itself in that world. That latter thing is exactly what you're describing, as far as I can tell.


I like "agentic harness" too, but that's not the name of a skill.

"Designing agentic loops" describes a skill people need to develop. "Designing agentic harnesses" sounds more to me like you're designing a tool like Claude Code from scratch.

Plus "designing agentic loops" includes a reference to my preferred definition of the term "agent" itself - a thing that runs tools in a loop to achieve a goal.

Context engineering is another name people have given to the same skill?

I think that's actually quite different.

Context engineering is about making sure you've stuffed the context with all of the necessary information - relevant library documentation and examples and suchlike.

Designing the agentic loop is about picking the right tools to provide to the model. The tool descriptions may go in the context, but you also need to provide the right implementations of them.

They feel pretty closely connected. For instance: in an agent loop over a series of tool calls, which tool results should stay resident in the context, which should be summarized, which should be committed to a tool-searchable "memory", and which should be discarded? All context engineering questions and all kind of fundamental to the agent loop.
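
Those retention choices can be made concrete with a small sketch. Everything here is hypothetical (the `ToolResult` type, the size threshold, the tool names); it just illustrates routing each tool result to the resident context, a summarizer, a searchable memory, or the bin:

```python
from dataclasses import dataclass

@dataclass
class ToolResult:
    tool: str
    content: str

def summarize(text: str, max_chars: int = 200) -> str:
    # Stand-in for an LLM summarization call.
    return text[:max_chars]

def retain(result: ToolResult, context: list, memory: dict) -> None:
    """Route each tool result: keep verbatim, summarize, archive, or drop."""
    if result.tool == "read_file":
        # Small reads stay verbatim; large ones get summarized in place.
        context.append(result.content if len(result.content) < 2000
                       else summarize(result.content))
    elif result.tool == "run_tests":
        # Archive full test output in searchable memory, keep a summary resident.
        memory[f"tests:{len(memory)}"] = result.content
        context.append(summarize(result.content))
    # Anything else (e.g. an `ls`) is transient and simply dropped.
```

Each branch is one of the four fates discussed above; a real harness would make these decisions per-tool and per-size rather than hard-coding them.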

Yeah, "connected" feels right to me.

Those decisions feel to me like problems for the agent harness to solve - Anthropic released a new cookbook about that yesterday: https://github.com/anthropics/claude-cookbooks/blob/main/too...

claude-cookbooks/tool_use/memory_cookbook.ipynb at main · anthropics/claude-cookbooks

One thing I'm really fuzzy on is: if you're building a multi-model agent thingy (like, one that can drive with GPT-5 or Sonnet), should you be thinking about context management tools like memory and auto-editing as tools the agent provides, or should you be wrapping capabilities the underlying models offer? Memory is really easy to do in the agent code! But presumably Sonnet is better trained to use its own built-ins.

It boils down to information loss in LLM-driven compaction. Either you carefully design tools that give only compacted output with high information density, so models have to auto-compact or organize information only once in a while, which is eventually going to be lossy anyway.

Or you just give loads of information without thinking much about it, assume the models will have to do frequent compaction and memory organization, and hope it's not super lossy.

Right, just so I'm clear here: assume you decide your design should use a memory tool. Should you build your own with a tool-call interface, or should you rely on a model feature for it, and how much of a difference does it make?
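
For the agent-side option, a memory tool can be as small as a dict behind a tool-call interface. This is an illustrative sketch, not any particular vendor's API; the schema shape loosely follows the JSON-Schema style that tool-calling APIs generally use:

```python
# Illustrative agent-side memory tool: a dict behind a generic tool-call
# interface. The schema is a sketch, not any specific vendor's API.

MEMORY_TOOL = {
    "name": "memory",
    "description": "Store or retrieve short notes that persist across turns.",
    "input_schema": {
        "type": "object",
        "properties": {
            "action": {"enum": ["put", "get"]},
            "key": {"type": "string"},
            "value": {"type": "string"},
        },
        "required": ["action", "key"],
    },
}

_store: dict[str, str] = {}

def handle_memory(call: dict) -> str:
    """Dispatch a parsed tool call from the model against the store."""
    if call["action"] == "put":
        _store[call["key"]] = call["value"]
        return "ok"
    return _store.get(call["key"], "(not found)")
```

The trade-off the thread is circling: this version is trivial to implement and portable across models, while a model's native memory feature may be better trained but ties you to that model.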

I recently built my own coding agent, due to dissatisfaction with the ones that are out there (though the Claude Code UI is very nice). It works as suggested in the article. It starts a custom Docker container and asks the model, GPT-5 in this case, to send shell scripts down the wire which are then run in the container. The container is augmented with some extra CLI tools to make the agent's life easier.
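
The core of that pattern can be sketched in a few lines. `agent-sandbox` and `ask_model` are placeholders (a running container and a wrapper around whatever model API you use), and the `DONE` convention is invented for illustration:

```python
import subprocess

def run_in_container(script: str, container: str = "agent-sandbox") -> str:
    """Run a model-written shell script inside the sandbox; capture output."""
    proc = subprocess.run(
        ["docker", "exec", container, "sh", "-c", script],
        capture_output=True, text=True, timeout=300,
    )
    return proc.stdout + proc.stderr

def agent_loop(mission: str, ask_model, max_turns: int = 50) -> list[str]:
    """Feed the transcript to the model, run each returned script, repeat."""
    transcript = [f"Mission: {mission}"]
    for _ in range(max_turns):
        script = ask_model("\n".join(transcript))
        if script.strip() == "DONE":  # assumed completion convention
            break
        transcript.append(f"$ {script}\n{run_in_container(script)}")
    return transcript
```

Because the scripts only ever run inside the container, there is nothing to approve interactively, which is what makes the fire-and-forget mission style below possible.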

My agent has a few other tricks up its sleeve and it's very new, so I'm still experimenting with lots of different ideas, but there are a few things I noticed.

One is that GPT-5 is extremely willing to speculate. This is partly because of how I prompt it, but it's willing to write scripts that try five or six things at once, including things like reading files that might not exist. This level of speculative execution speeds things up dramatically, especially as GPT-5 is otherwise a very slow model that likes to think about things a lot.

Another is that you can give it very complex "missions" and it will drive things to completion using tactics that I've not seen from other agents. For example, if it needs to check something that's buried in a library dependency, it'll just clone the upstream repository into its home directory and explore that to find what it needs before going back to working on the user's project.

None of this triggers any user interaction, because everything runs in the container; in fact, no user interaction is possible. You set it going and do something else until it finishes. The intended model is very much that you queue up "missions" that then run in parallel, and you merge them together at the end. The agent also has a mode where it takes the mission, writes a spec, reviews the spec, updates the spec given the review, codes, reviews the code, and so on.
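
That spec/review/code mode is essentially a fixed chain of model calls. A minimal sketch, with `ask_model` again standing in for the model API wrapper (prompt in, text out):

```python
# The mission pipeline as a fixed chain of model calls. ask_model is a
# placeholder; the prompts are illustrative, not the author's actual ones.

def run_mission(mission: str, ask_model) -> str:
    spec = ask_model(f"Write a spec for: {mission}")
    review = ask_model(f"Review this spec critically:\n{spec}")
    spec = ask_model(f"Revise the spec given this review:\n{spec}\n{review}")
    code = ask_model(f"Implement this spec:\n{spec}")
    critique = ask_model(f"Review this code:\n{code}")
    return ask_model(f"Fix the code given this review:\n{code}\n{critique}")
```

Each stage spends extra inference time to catch the previous stage's mistakes, which only pays off because nothing here needs a human in the loop.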

Even though it's early days, I've set this agent missions that it spent 20 minutes of continuous, uninterrupted inference time on, and it succeeded excellently. I think this UI paradigm is the way to go. You can't scale up AI-assisted coding if you're constantly needing to interact with the agent. Getting the most out of models requires maximally exploiting parallelism, so sandboxing is a must.

This also feels like the structure that Sketch.dev uses: it's asynchronous, running in YOLO mode in a container on a cloud instance, with very little interaction (the expectation is that you give it tasks and walk away). I have friends who queue up lots of tasks in the morning and prune down to just a couple of successes in the afternoon. I'd do this too, but at the scale I work at, merge conflicts are too problematic.

I'm working on my own dumb agent for my own dumb problems (engineering, but not software development) and I'd love to hear more about the tricks you're spotting.

This is great. I feel like most of the oxygen is going to go to the sandboxing question (fair enough). But I'm kind of obsessed with what agent loops for engineering tasks that aren't coding look like, and also with the tweaks you need for agent loops that handle large amounts of anything (source code lines, raw metrics, OTel span data, whatever).

There was an interval where the notion of "context engineering" came into fashion, and we quickly dunked all over it (I don't blame anybody for that; "prompt engineering" seemed pretty cringe-y to me), but there's definitely something to the engineering problems of managing a fixed-size context window while iterating indefinitely through a complex problem, and there's all sorts of tricks for handling it.

I can imagine an agentic loop that updates dependencies Dependabot/Renovate-style by going through the changelog of a new version, reviewing the new code changes, and evaluating whether it's worth upgrading (or even dangerous to do so, from either a stability or a security point of view). Too often these tools are used to blindly respin builds with the latest and greatest versions, which is what gets most people in trouble when their NPM deps become malicious.
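
A sketch of what the evaluation step might look like, with `fetch_changelog` and `ask_model` as placeholder callables and an invented three-way verdict vocabulary:

```python
# Hypothetical changelog-reviewing upgrade agent. fetch_changelog and
# ask_model are placeholders; UPGRADE/HOLD/INVESTIGATE is invented here.

def evaluate_upgrade(pkg: str, old: str, new: str,
                     fetch_changelog, ask_model) -> str:
    changelog = fetch_changelog(pkg, old, new)
    verdict = ask_model(
        f"Package {pkg} {old} -> {new}.\nChangelog:\n{changelog}\n"
        "Answer UPGRADE, HOLD, or INVESTIGATE (possible supply-chain risk), "
        "with reasoning."
    )
    return verdict.strip().split()[0]  # normalize to the first token
```

The point is that the loop makes a judgment per release instead of blindly respinning the build, and INVESTIGATE gives the supply-chain case an explicit escape hatch.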

You should write it! You might be surprised how easy it is to get something working.