Components of A Coding Agent

How coding agents use tools, memory, and repo context to make LLMs work better in practice

Ahead of AI

> long contexts are still expensive and can also introduce additional noise (if there is a lot of irrelevant info)

I think spec-driven generation is the antithesis of chat-style coding for this reason. With tools like Claude Code, you are the one tracking what was already built, what interfaces exist, and why something was generated a certain way.

I built Ossature[1] around the opposite model. You write specs describing behavior, it audits them for gaps and contradictions before any code is written, and it then produces a TOML build plan where each task declares exactly which spec sections and upstream files it needs. The LLM never sees more than that, and there is no accumulated conversation history to drift from. Every prompt and response is saved to disk, so traceability is built in rather than something you reconstruct by scrolling back through a chat. I used it over the last couple of days to build a CHIP-8 emulator entirely from specs[2]. I have some more example projects on GitHub[3].

1: https://github.com/ossature/ossature

2: https://github.com/beshrkayali/chomp8

3: https://github.com/ossature/ossature-examples
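To make the "each task declares exactly which spec sections and upstream files it needs" idea concrete, a build-plan entry could look roughly like the sketch below. Every field name here is hypothetical, invented for illustration; Ossature's actual TOML schema may differ.

```toml
# Hypothetical build-plan entry; the real schema may differ.
[[task]]
id = "cpu-opcodes"
# The only spec sections placed in this task's prompt:
spec_sections = ["instructions.md#arithmetic", "instructions.md#jumps"]
# Upstream files the task is allowed to see:
inputs = ["src/memory.py", "src/registers.py"]
output = "src/cpu.py"
# Verification command that must pass before the task counts as done:
verify = "pytest tests/test_cpu.py"
notes = "Wrap register math at 8 bits."
```

Because the plan is a plain file, reordering tasks or tightening a task's inputs is just an edit before generation runs.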

I like it a lot. I find the chat-driven workflow very tiring; a lot of information gets lost in translation until LLMs just refuse to be useful.

How does the human intervention work out? Do you use a mix of spec and audit editing to get into the ready-to-generate state? How high is the success/error rate when you generate from tasks to code? Do LLMs forget or mess things up, or does it feel better?

The spec-driven approach is potentially better for writing things from scratch; do you have any plans for existing code?

Thanks!

> How does the human intervention work out? Do you use a mix of spec and audit editing to get into the ready-to-generate state?

Yes, the flow is: you write specs, then validate them with `ossature validate`, which parses them and checks they are structurally sound (no LLM involved). Then you run `ossature audit`, which flags gaps or contradictions in the content and produces a TOML build plan that you can read and edit directly before anything is generated. You can reorder tasks, add notes for the LLM, adjust verification commands, or skip steps entirely. So by the time you run `ossature build` to generate, the structure is already something you have signed off on.
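As a toy illustration of what a purely structural (no-LLM) validation pass can look like, here is a small sketch. The rules, section names, and the `validate_spec` function are all invented for illustration; they are not Ossature's actual checks.

```python
import re

# Invented rule: every spec must contain these top-level sections.
REQUIRED_SECTIONS = ["Overview", "Behavior", "Verification"]

def validate_spec(text: str) -> list[str]:
    """Return a list of structural problems found in a markdown spec.

    Purely mechanical checks: no model call, just parsing the headings.
    """
    problems = []
    headings = re.findall(r"^#+\s+(.+)$", text, flags=re.MULTILINE)
    for section in REQUIRED_SECTIONS:
        if section not in headings:
            problems.append(f"missing required section: {section}")
    # A heading marker with no title is also a structural error.
    if re.search(r"^\s*#\s*$", text, flags=re.MULTILINE):
        problems.append("empty heading found")
    return problems

spec = "# Overview\nAn emulator.\n# Behavior\nRuns opcodes.\n"
print(validate_spec(spec))  # -> ['missing required section: Verification']
```

The point of a pass like this is that it fails fast and deterministically, so the (LLM-driven) audit step only ever sees specs that are at least structurally sound.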

> The spec-driven approach is potentially better for writing things from scratch; do you have any plans for existing code?

Right now it is best for greenfield projects, as you said. I have been thinking about a workflow where you generate specs from existing code and then let Ossature work from those, but I am honestly not sure that is the right model either. The harder case is when engineers want to touch both the code and the specs, and keeping the two in sync through that back-and-forth is something I want to support but have not figured out a clean answer for yet. It's on the list; if you have any thoughts, please feel free to open an issue! First, though, I want to get through some of the issues I am seeing with the spec-editing workflow itself (and re-auditing/re-planning), specifically around how changes cascade through dependent tasks.

Regarding success rate: each task requires a verification command to run and pass after generation, and if it fails, a separate fixer agent tries to repair the output using the error messages. The number of retry attempts is configurable. I did notice that the more concise and clear the spec is, the more likely capable models are to generate code that works (obviously), but that is what auditing is supposed to help with. One interesting case from the CHIP-8 emulator I mentioned above: even naming the correct solution to a specific problem was not enough; I had to spell out the concrete algorithm in the spec (I wrote more details here[1]). But the full prompt and response for every task is saved to disk, so when something does go wrong you can read the exact prompt/response and fix-attempt prompts/responses for each task.
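The generate/verify/fix loop can be sketched as below. `build_task`, `generate`, and `fix` are hypothetical names introduced for this sketch, and the real Ossature machinery is certainly richer; this only shows the control flow described above.

```python
import subprocess

def build_task(generate, fix, verify_cmd: str, max_retries: int = 3) -> bool:
    """Generate code, run the verification command, and on failure hand
    the error output to a fixer agent for a configurable number of retries.
    """
    generate()  # the main agent produces the task's output once
    for attempt in range(max_retries + 1):
        result = subprocess.run(
            verify_cmd, shell=True, capture_output=True, text=True
        )
        if result.returncode == 0:
            return True  # verification passed; task is done
        if attempt < max_retries:
            # A separate fixer agent sees only the error output.
            fix(result.stdout + result.stderr)
    return False  # all retries exhausted
```

The key property is that the fixer's prompt is grounded in the verification command's actual output, so every repair attempt is traceable to a concrete failure.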

I wrote more details in an intro post[2] about Ossature, if useful.

1: https://log.beshr.com/chip8-emulator-from-spec/

2: https://ossature.dev/blog/introducing-ossature/


This looks great, and I've bookmarked it to give it a go.

Any reason you’ve opted for custom markdown formats with the @ syntax rather than using something like frontmatter?

I'm very conscious that this would prevent any markdown rendering on GitHub etc.

Hey, you seem to have a similar view on this. I know ideas are cheap, but hear me out:

You talk with agent A, and it only modifies the spec. You can still chat and say "make it prettier," but that agent only ever touches the spec. The spec could also separate "explicit" intent from "inferred" interpretation.

And of course agent B, which builds, only sees the spec.
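A minimal sketch of that split, with all types and names invented for illustration (no existing tool works exactly this way):

```python
from dataclasses import dataclass, field

@dataclass
class Spec:
    explicit: list[str] = field(default_factory=list)  # verbatim user intent
    inferred: list[str] = field(default_factory=list)  # agent's interpretation

class SpecAgent:
    """Agent A: chats with the user but may only modify the spec."""
    def apply(self, spec: Spec, request: str) -> Spec:
        spec.explicit.append(request)  # intent is stored verbatim
        # In a real system an LLM would expand the request; stubbed here.
        spec.inferred.append(f"interpretation of: {request}")
        return spec

class BuildAgent:
    """Agent B: sees only the spec, never the conversation."""
    def build(self, spec: Spec) -> str:
        return "\n".join(spec.explicit + spec.inferred)
```

The separation is the point: the user reviews agent A's small spec diffs, while agent B's input is fully determined by the spec, with no hidden chat state.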

The user can actually care about diffs generated by agent A again, because nobody wants to verify diffs on agent-generated code full of repetition and produced by search-and-replace. I believe that if somebody implements this right, it will be the way things are done.

And of course, with better models, the spec can be used to actually meaningfully improve the product.

Long story short: what the industry currently misses, and what you seem to understand, is that intent is sacred. It should always be stored, preferably verbatim and always with relevant context ("yes exactly" is obviously not enough). The current generation of LLMs can already handle all of that. It would mean something like 2-3x the cost, but it seems so worth it (and in the long run the cost could likely drop below 1x, given typical workflows and repetition).