"Across multiple coding agents and LLMs, we find that context files tend to reduce task success rates compared to providing no repository context, while also increasing inference cost by over 20%"

I've suspected this all along. Folks spending mucho-plenty time curating project-level .md files have been deluding themselves that it helps.

https://arxiv.org/abs/2602.11988

Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?

A widespread practice in software development is to tailor coding agents to repositories using context files, such as AGENTS.md, by either manually or automatically generating them. Although this practice is strongly encouraged by agent developers, there is currently no rigorous investigation into whether such context files are actually effective for real-world tasks. In this work, we study this question and evaluate coding agents' task completion performance in two complementary settings: established SWE-bench tasks from popular repositories, with LLM-generated context files following agent-developer recommendations, and a novel collection of issues from repositories containing developer-committed context files. Across multiple coding agents and LLMs, we find that context files tend to reduce task success rates compared to providing no repository context, while also increasing inference cost by over 20%. Behaviorally, both LLM-generated and developer-provided context files encourage broader exploration (e.g., more thorough testing and file traversal), and coding agents tend to respect their instructions. Ultimately, we conclude that unnecessary requirements from context files make tasks harder, and human-written context files should describe only minimal requirements.

arXiv.org

The best results I've managed to get were when I kept contexts small and task-specific, solving one problem at a time. They can pay attention (literally) to surprisingly little at a time.

And summaries of the code are out of date the moment the tool starts changing it.

Just like in real life :-)

This is a big tick for some of my AI-Ready Software Developer principles.

But, heck, it just seems so obvious!

https://codemanship.wordpress.com/2025/10/28/the-ai-ready-software-developer-12-ground-truth/

The AI-Ready Software Developer #12 – Ground Truth

When Large Language Models hit the headlines in late 2022, with much speculation about impending Artificial General Intelligence (AGI) and the displacement of hundreds of millions of knowledge work…

Codemanship's Blog

Well, this has stirred a hornet's nest. Some folks luuurve their project context files!

"Yes, I see the hard data, Jason. But in my experience..."

@jasongorman
My thesis is that it's an Illusion of Control problem. The context file gives you the illusion that you have control. Changing it changes the output. Accepting that the changes are basically random takes your only tool of "control" away.
It reminds me of the story that people prefer using a map of the wrong city over giving up the map and using no map at all 😬
@realn2s I think you're very probably right. The problem with probabilistic systems that *seem* to understand us is that we very easily fool ourselves. Confirmation bias is very much in play.

@jasongorman @realn2s Jason, I quote your comment about seeing the face of Jesus in a piece of toast all the time.

Turing, apparently, thought rather too highly of human intelligence.

@jasongorman

The other response, when folks don't like what the data says, is that it just needs more time.

@jasongorman “What do you mean the data shows it’s ineffective? Go find better data!”

Everyone’s in love with ELIZA’s sister 🤷‍♂️

@thirstybear All I need to do is find the 0.1% of clients who actually care what's real and care about what works
@jasongorman Same here, although a slightly different client demographic. It is all very frustrating. Seems everyone is infected with AI psychosis.
@jasongorman so, honest question, how do you feel about all the externalities you are enabling by promoting the use of the slop machine? As in, how the models were trained, how they treated the lack of consent of the sites they scraped, what other uses and effects on society they have, and how many resources they used to train and to run.

@jasongorman

I've been playing with Recursive Language Models and I think some version of that technique will work its way into most attention-heavy and long-context functionalities. And it amounts to what you say, LLMs are bad at attention. RLMs try to tackle that by exposing the (sub)models to less context at a time.
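A toy sketch of the recursive-context idea being described (the actual RLM work differs in its details; `sub_model` is a stand-in for a real LLM call, and the splitting scheme here is invented for illustration): no single call ever sees the whole context, only a bounded chunk, and only short sub-results flow back up.

```python
# Toy sketch of recursive context handling: no single (sub)model call
# is ever exposed to more than CHUNK characters of context.
CHUNK = 1000

def sub_model(prompt, text):
    # Stand-in for an LLM call; here it just keeps lines matching the prompt.
    return "\n".join(line for line in text.splitlines() if prompt in line)

def recursive_query(prompt, lines):
    text = "\n".join(lines)
    if len(text) <= CHUNK:
        return sub_model(prompt, text)  # small enough: one direct call
    # Too big for one call: split the context and recurse,
    # so each sub-call sees only its own half.
    mid = len(lines) // 2
    left = recursive_query(prompt, lines[:mid])
    right = recursive_query(prompt, lines[mid:])
    # Only the (short) sub-results are combined at the level above.
    return recursive_query(prompt, (left + "\n" + right).splitlines())
```

The point of the shape, not the filtering: attention is only ever spent on a bounded window, and the parent call reasons over summaries rather than raw context.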

@jasongorman I just wish people were willing to spend that amount of time and effort curating project-level (and other) documentation for actual humans.

@jasongorman @joe The paper’s conclusion is subtly different from that. It says that auto-generated AGENTS.md files provide no value, but a manually crafted one provides positive marginal returns.

The real takeaway should be:
- You should have project files like an AGENTS.md.
- You should use it to address real issues like “always compile using command xyz” to have the agent work the way you want, rather than auto-gen slop.
- If you do that your cost of inference also won’t shoot up 20%.
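For concreteness, a sketch of what such a minimal, real-issue-addressing file might look like (the project commands and conventions here are invented for illustration, not taken from the paper):

```markdown
# AGENTS.md

## Build & test
- Build with `make build`; never invoke the compiler directly.
- Run `make test` before declaring a task done.

## Conventions
- Error messages are lowercase with no trailing punctuation.
```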

@mergesort @joe It's the "minimal requirements" doing the heavy lifting here. Only what the LLM needs for the task at hand.
@jasongorman @joe Fair enough! I generally agree with that, though I will say my AGENTS.md is like 200-300 lines which doesn’t feel very minimal but is appropriately what I’ve needed to add as I’ve been working with AI in my codebase for the last 1-2 years.
@mergesort @joe Can you modularise it into task-specific files?
@jasongorman @joe It depends. Most of these are around how I want the project itself to build and run, but I do have Skills to handle more granular things now. (Which I would describe as task-specific files.)
@mergesort @joe That's exactly what they are. It's all just context to an LLM :-)
@jasongorman @joe I think we agree! I write a bunch about this over at build.ms (like here: https://build.ms/2025/10/17/your-first-claude-skill), I was just noting a subtle distinction and earnestly wasn’t trying to start a debate over small differences. 😄
Your First Claude (and ChatGPT) Skill

Learn a new and powerful way to build software on-demand, with little more than a simple description. No code required.

@mergesort @joe That's exactly what they are. Anthropic presumably acknowledging here that big global contexts are not a good idea?
@jasongorman @joe I don’t think there’s ever been much debate about that in the AI community; people have been trying to minimize token load since the earliest days of the ChatGPT API. There have been many intermediate solutions (MCP, RAG, etc.), but this is a core reason why agentskills.io has become pretty much a de facto standard, from Claude to Codex to even OpenClaw.
@mergesort @joe You can't beat entropy 🙂
@jasongorman I am confused about this. I have a basic CLAUDE.md and also some CODING_GUIDELINES.md that describe how I would like the generated Go code to look. If I do not include these, those instructions are not followed and I have to specify these things every time I start a new task. Are you saying I should not do these things at all? Or is there a better way?
@st3fan Task-specific context files
@jasongorman pretty much every task is me asking Claude Code to write code for which I want to give the same guidance. I can move these instructions into a skill, but then instead of letting the agent read it once at the beginning of a day, it will read it every time I start the “work on a feature” skill. Which seems less optimal re token spend?
@st3fan Perhaps take smaller steps?

@jasongorman Smaller than "Add a function that ..." ?

I think for me this research is falling apart. It just doesn't match my reality of working with coding agents. I'm getting considerably better results when I add some instructions to the repo. The code and process is much closer to what I want it to be.

(Re smaller steps - it is not a problem really - i'm also having great success with fairly large plans / tasks)

@st3fan How are you measuring those results?

@jasongorman

First, I am not vibe coding - I work more transactionally and read all code.

So I have looked at the code that is generated both with and without instructions added to the repo or globally.

It is probably not a big surprise that my guidelines are not being followed when they are not present. This results in code that does not meet my standards.

I am sure there is some overhead like the paper mentions. But my personal experience is that instructions actually do help in a big way.

@st3fan If it only reads it at the start of the day, does that mean you're letting the context run for the whole day?

@jasongorman I don't think that is actually correct. I am pretty sure that Claude will use your CLAUDE.md files when you start a new session or when you /clear or /compact - I will test this, but I am pretty confident they become part of the always-present system prompt.

I do use long-running sessions but work in smaller features/changes. I find keeping a lot of context alive helps. (Also costs more tokens.)

@st3fan LLMs are stateless. That context will have to be fed into the model with every interaction. It's only a "session" because the client (e.g. Claude Code) maintains state client-side.
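A minimal sketch of the client-side statefulness Jason describes, assuming the common messages-array API shape (`call_model` is a stub standing in for a real API call): the model sees only what arrives in each request, so the "session" is just the client resending the full history, system prompt included, every time.

```python
# Sketch: a stateless model behind a stateful client.
def call_model(messages):
    # Stub for an LLM API call; it sees ONLY the messages passed in
    # this one request -- nothing persists between calls.
    return f"(reply after reading {len(messages)} messages)"

class Session:
    """All 'memory' lives here, client-side; the model remembers nothing."""
    def __init__(self, system_prompt):
        self.history = [{"role": "system", "content": system_prompt}]

    def send(self, user_text):
        self.history.append({"role": "user", "content": user_text})
        # The ENTIRE history, system prompt included, goes into every call.
        reply = call_model(self.history)
        self.history.append({"role": "assistant", "content": reply})
        return reply

session = Session("Contents of CLAUDE.md go here.")
session.send("First task")
reply = session.send("Second task")
# -> "(reply after reading 4 messages)": the second call re-sent everything,
# which is why a big global context file costs tokens on every single turn.
```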
@jasongorman my hunch, and I've not messed around with these tools (sorry, there's a big icky factor around them for me), is that treating it more like some sort of numerical-methods/finite-element-analysis tool, where you specify the thing you're simulating and its parameters, and less like "your plastic pal who's fun to be with," may be more productive.
@emma There's a lot of interesting research in statistical mechanics around LLMs and deep learning.
@jasongorman that makes sense, and I'm also hearing about people running the same prompt on multiple agents which has the flavor of the Genetic Algorithms work from the late 80s and early 90s.
@emma If you want to know the limits of a technology, ask a physicist 🙂
@jasongorman or if you want a new technology. The data stores of the 2010s had their origins in the Babar experiment at SLAC.