Very thoughtful analysis by @grimalkina of the experimental design and results from the recent METR study on “the impact of early-2025 AI on experienced open-source developer productivity”.

https://www.fightforthehuman.com/are-developers-slowed-down-by-ai-evaluating-an-rct-and-what-it-tells-us-about-developer-productivity/

#metr #cursor

Are developers slowed down by AI? Evaluating an RCT (?) and what it tells us about developer productivity

Seven different people texted or otherwise messaged me about this study which claims to measure “the impact of early-2025 AI on experienced open-source developer productivity.” You know, when I decided to become a psychological scientist I never imagined that “teaching research methods so we can actually evaluate evidence about developers” …

Fight for the Human

The METR paper would have been a perfect fit for the “registered report” format of, e.g., the Empirical Software Engineering journal. That way, several of the issues in the setup could have been identified early on, independent of the (desirability of the) outcomes.

https://2025.msrconf.org/track/msr-2025-registered-reports

MSR 2025 - Registered Reports - MSR 2025

Welcome to the website of the Mining Software Repositories 2025 conference! The Mining Software Repositories (MSR) conference is the premier venue for software analytics research, in which software engineering data is analyzed using a mixture of data science, machine learning/artificial intelligence and qualitative methodologies. The goal of the conference is to improve software engineering practices by uncovering interesting and actionable information about software systems and projects from vast amounts of software data, such as source control systems and defect tracking systems.

A threat to (external) validity not discussed in the METR report (as far as I can see) is that the LLM used (Claude) was most likely trained on the repositories used in the study. Thus, one would expect Cursor to perform worse in industrial settings (with private code bases not seen by Claude before).
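One rough way to probe this would be to check whether a study repository's public history predates the model's training cutoff. A minimal sketch, assuming a hypothetical cutoff date (the real cutoff is not documented) and a placeholder repo name, using the public GitHub API:

```python
# Crude contamination heuristic: was the repo already public before the
# model's training cutoff? (Cutoff date and repo name are hypothetical.)
from datetime import datetime, timezone
import json
import urllib.request

TRAINING_CUTOFF = datetime(2024, 11, 1, tzinfo=timezone.utc)  # assumed, not documented

def repo_predates_cutoff(owner: str, repo: str) -> bool:
    """True if the repo was already public before the assumed cutoff,
    i.e. its contents were plausibly part of the training data."""
    url = f"https://api.github.com/repos/{owner}/{repo}"
    with urllib.request.urlopen(url) as resp:
        created_at = json.load(resp)["created_at"]  # e.g. "2011-01-26T19:01:12Z"
    created = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    return created < TRAINING_CUTOFF

print(repo_predates_cutoff("octocat", "Hello-World"))  # placeholder repo
```

This only shows that the code *could* have been seen, of course; it says nothing about whether it actually was, or how much that matters.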

This paragraph in the study (p.36) also has some validity implications:

> for the duration of the study, we periodically provide feedback to developers on their implementation notes and video recordings. We occasionally email developers with tips on how to use Cursor more effectively if we notice low-hanging fruit (e.g. reminding developers to explicitly tag relevant files when prompting agents) from reviewing their screen recordings.

https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity

@avandeursen Who would have thought that "AI" is just some dudes in California spying on your coding sessions...
@avandeursen oh I missed that....how are they not concerned with contamination from this??
@grimalkina @avandeursen I mean, sure that would contaminate the study, but given their claimed outcomes doesn't it only make their claim stronger? That is, this intervention would only seem to make the AI users more effective than without this intervention... It's like giving the AI users the best possible scenario in a way that wasn't available to those in the non-AI group.

@joshuagrochow @avandeursen it's impossible to know without knowing if that intervention on the part of the researchers was controlled or standardized. If it was a defined part of the study plan, it should be a defined part of the treatment, not an ad hoc thing??

Flippantly dismissing the control part of an RCT means you don't care about the basic logic of it...?

We don't get to put our thumbs on the scale in a study.

This is incorrect about the randomization: ALL users were in both conditions.
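For context, the study's randomization was within-subject: each developer's issues were randomly assigned to an AI-allowed or AI-disallowed condition, so every developer serves as their own control. A minimal sketch of that kind of assignment (issue names and the seed are illustrative, not from the study):

```python
# Within-subject randomization: each issue a developer brings is randomly
# assigned to a condition, so every developer appears in both arms.
import random

def assign_conditions(issues: list[str], seed: int = 42) -> dict[str, str]:
    """Randomly assign each of one developer's issues to a condition."""
    rng = random.Random(seed)
    return {issue: rng.choice(["AI-allowed", "AI-disallowed"]) for issue in issues}

# Hypothetical issue list for a single developer.
dev_issues = ["fix-crash-#101", "refactor-io-#87", "docs-typo-#55", "perf-#42"]
print(assign_conditions(dev_issues))
```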

@avandeursen I am convinced that we need to incorporate a LOT more about the structure of the codebase/repo etc as a necessary element of the kind of LLM performance we're eliciting. I don't know what features should be part of the taxonomy we use for that but this is a really important observation
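To make that concrete, one starting point might be to compute simple structural features per repository and test which ones moderate LLM performance. The features below are purely illustrative placeholders, not a proposed taxonomy:

```python
# Illustrative repo-structure features one might correlate with LLM
# performance. The specific feature choices here are hypothetical.
from pathlib import Path

def repo_features(root: str) -> dict[str, float]:
    """Compute a few crude structural features for one repository checkout."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    tests = [p for p in files if "test" in p.name.lower()]
    return {
        "n_files": len(files),
        "test_file_ratio": len(tests) / max(len(files), 1),
        "mean_file_kb": sum(p.stat().st_size for p in files) / max(len(files), 1) / 1024,
        "max_dir_depth": max((len(p.relative_to(root).parts) for p in files), default=0),
    }

print(repo_features("."))  # run from inside a repo checkout
```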

@avandeursen @grimalkina What I find truly frustrating is that AI promotion is never held to the same standard.

We’re awash in constant hype & anecdotal stories pushing AI usage in places where it should never be used. People are being forced to use it in their jobs & we know that it’s being used in ways that impact real people’s lives, with little to no critical pushback.

We should hold criticism to a high standard, but we consistently fail to do so with the grifters & promoters.

@causticmsngo @avandeursen I feel it's important to do this too, but my work on how developers are feeling threatened by AI and how to prevent it absolutely does not get shared or amplified nearly as much as stuff like this, which makes it somewhat depressing to imagine it'll be worth it to keep working on those topics. Even "AI critical" people don't seem interested in supporting work on the human-centered elements vs just screaming and contempt. I find it all exhausting as a social scientist.

@causticmsngo @avandeursen fwiw there are a lot of voices, especially in education, breaking down and criticizing AI claims.

You might enjoy Dan Meyer, who does some good critical breakdowns of research claims about AI:
https://substack.com/@danmeyer

Dan Meyer | Substack

Currently: @Amplify. Previously: @Desmos, Stanford University, high school math teacher. Always: recreational math user.

@grimalkina @avandeursen Appreciate the link.

Mostly I follow @emilymbender, @alex, @molly0xfff, & listen to https://www.techwontsave.us & https://www.dair-institute.org/maiht3k/ to keep up with AI.

I’m very glad there are competent voices out there. As a practicing technologist, I do speak out on it within my peer group so having reputable social science commentary is helpful.

Tech Won’t Save Us

A left-wing podcast for better technology and a better world.
