"Across multiple coding agents and LLMs, we find that context files tend to reduce task success rates compared to providing no repository context, while also increasing inference cost by over 20%"

I've suspected this all along. Folks spending mucho-plenty time curating project-level .md files have been deluding themselves that it helps.

https://arxiv.org/abs/2602.11988

Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?

A widespread practice in software development is to tailor coding agents to repositories using context files, such as AGENTS.md, by either manually or automatically generating them. Although this practice is strongly encouraged by agent developers, there is currently no rigorous investigation into whether such context files are actually effective for real-world tasks. In this work, we study this question and evaluate coding agents' task completion performance in two complementary settings: established SWE-bench tasks from popular repositories, with LLM-generated context files following agent-developer recommendations, and a novel collection of issues from repositories containing developer-committed context files. Across multiple coding agents and LLMs, we find that context files tend to reduce task success rates compared to providing no repository context, while also increasing inference cost by over 20%. Behaviorally, both LLM-generated and developer-provided context files encourage broader exploration (e.g., more thorough testing and file traversal), and coding agents tend to respect their instructions. Ultimately, we conclude that unnecessary requirements from context files make tasks harder, and human-written context files should describe only minimal requirements.

@jasongorman my hunch, and I've not messed around with these tools (sorry, there's a big icky factor around them for me), is that treating them more like some sort of numerical methods/finite element analysis tool, where you specify the thing you're simulating and its parameters, and less like "your plastic pal who's fun to be with," may be more productive.
@emma There's a lot of interesting research in statistical mechanics around LLMs and deep learning.
@jasongorman that makes sense, and I'm also hearing about people running the same prompt on multiple agents which has the flavor of the Genetic Algorithms work from the late 80s and early 90s.
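The "same prompt on multiple agents" pattern @emma mentions can be sketched as a best-of-N selection loop, which is where the genetic-algorithms flavor comes from: a population of candidates, a fitness function, and selection. The agent dispatch and scoring below are hypothetical stand-ins, not real APIs.

```python
import random

def run_agent(agent_name: str, prompt: str) -> str:
    """Hypothetical stand-in for dispatching a prompt to a coding agent."""
    return f"{agent_name} solution to: {prompt}"

def score(candidate: str) -> float:
    """Hypothetical fitness function, e.g. fraction of tests passed."""
    return random.random()

def best_of_n(prompt: str, agents: list[str]) -> str:
    # Run every agent on the identical prompt, then keep the fittest
    # output -- selection over a population, as in a genetic algorithm.
    candidates = [run_agent(a, prompt) for a in agents]
    return max(candidates, key=score)

result = best_of_n("fix the failing test", ["agent-a", "agent-b", "agent-c"])
print(result)
```

A full GA would also mutate and recombine candidates across generations; the practice described in the thread stops at one round of selection.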
@emma If you want to know the limits of a technology, ask a physicist 🙂
@jasongorman or if you want a new technology. The data stores of the 2010s had their origins in the BaBar experiment at SLAC.