LLMs do not meaningfully "refactor" at anything other than a junior engineering level. They can basically do some window dressing and move code around between files. True refactoring means creating new abstractions, which LLMs can't do because they can't form world-models.
LLMs have not been shown to possess the capability of inductive reasoning. If you show one a bunch of planetary orbits, it will produce a Ptolemaic monstrosity of epicycles and complexity. It cannot rediscover Newton from first principles.

Harvard/MIT researchers tested whether foundation models learn true world models (the underlying structure generating the data) by measuring how they adapt to synthetic tasks. Even our best models fail to form such world models, instead relying on task-specific heuristics that do not generalize.

https://arxiv.org/abs/2507.06952

What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models

Foundation models are premised on the idea that sequence prediction can uncover deeper domain understanding, much like how Kepler's predictions of planetary motion later led to the discovery of Newtonian mechanics. However, evaluating whether these models truly capture deeper structure remains a challenge. We develop a technique for evaluating foundation models that examines how they adapt to synthetic datasets generated from some postulated world model. Our technique measures whether the foundation model's inductive bias aligns with the world model, and so we refer to it as an inductive bias probe. Across multiple domains, we find that foundation models can excel at their training tasks yet fail to develop inductive biases towards the underlying world model when adapted to new tasks. We particularly find that foundation models trained on orbital trajectories consistently fail to apply Newtonian mechanics when adapted to new physics tasks. Further analysis reveals that these models behave as if they develop task-specific heuristics that fail to generalize.
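Reading the abstract, the core idea of the probe can be sketched with a toy (this is my own illustration, not the paper's actual code; all function names here are made up): generate data from a postulated world model, then compare a predictor whose inductive bias matches that model against a task-specific heuristic, and see which one survives off the training distribution.

```python
# Toy sketch of the "inductive bias probe" idea (my illustration, not the
# paper's method): data comes from a simple world model (constant-gravity
# free fall), and two predictors are compared outside the training range.

def world_model(t, g=9.8):
    """The 'true' physics generating the data: distance fallen by time t."""
    return 0.5 * g * t * t

# Training data covers only t = 0..10.
train = [(t, world_model(t)) for t in range(0, 11)]

def heuristic_predict(t):
    """Task-specific heuristic: nearest-neighbour lookup of training data.
    Excels on the training range, fails to generalize."""
    _, nearest_y = min(train, key=lambda p: abs(p[0] - t))
    return nearest_y

def physics_predict(t):
    """A predictor whose inductive bias matches the world model:
    recover g from one training point, then apply the law everywhere."""
    g_hat = 2 * train[1][1] / (train[1][0] ** 2)  # from y = g*t^2/2 at t=1
    return 0.5 * g_hat * t * t

t_new = 20  # well outside the training range
print(heuristic_predict(t_new))  # stuck at the t=10 value: 490.0
print(physics_predict(t_new))    # matches world_model(20) = 1960.0
```

Both predictors look equally good on the training data; only the one carrying the right world model extrapolates, which is roughly what the paper claims foundation models fail to do.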

arXiv.org
@nateberkopec
Think of the yarn we could wind with that model.
#yarn
@nateberkopec aka, they have no imagination. you can’t manufacture imagination
@nateberkopec how does this strike you (I hated it but on a limited-understanding basis because I'm not a coder)
https://hachyderm.io/@cate/115979723818923758
cate (@[email protected])

From mid-2025 but interesting on LLMs and coding. "Nobody cares if the logic board traces are pleasingly routed. If anything we build endures, it won’t be because the codebase was beautiful." https://fly.io/blog/youre-all-nuts/

Hachyderm.io
@noodlemaz I think the truly vibeslopped codebases we already have (Beads and Gastown) show the limits of this. If even the simplest project becomes a 200k-line monstrosity, it's hard for both humans and LLMs to work with.
@nateberkopec @jardo wasn’t it in 2023 when AI started producing better sorting algorithms than we could as humans? https://deepmind.google/blog/alphadev-discovers-faster-sorting-algorithms/
AlphaDev discovers faster sorting algorithms

In our paper published today in Nature, we introduce AlphaDev, an artificial intelligence (AI) system that uses reinforcement learning to discover enhanced computer science algorithms – surpassing those honed by scientists and engineers over decades.

Google DeepMind
@eljojo @jardo sure, not the same problem in the least though
@nateberkopec @eljojo yeah, that's neat but AlphaDev isn't an LLM and what it was doing wasn't refactoring.

@jardo @nateberkopec @eljojo

Yeah if you hear that AI was used to do something groundbreaking, it's always* something made specifically to do that groundbreaking thing, very often reinforcement learning, and never an LLM.

*so far

@jardo @nateberkopec AlphaEvolve, using LLMs, has advanced the state of the art in matrix multiplication and other fields like hardware design. https://deepmind.google/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/

how do you define refactoring? what I see here is that we have new algorithms that work on the same input and produce the same output as previous algorithms, but do it better. that’s a refactoring to me.

the software rewrote parts of the Verilog to make it faster.
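The distinction the thread is circling can be made concrete with a toy (my own example, not from either article): two functions with identical input/output behaviour, where one is asymptotically faster. Swapping one for the other is a behaviour-preserving optimization, and whether that counts as "refactoring" is exactly what's being debated.

```python
# Two implementations with identical observable behaviour.
# Replacing the first with the second is an optimization that preserves
# the contract, but it introduces no new abstraction.

def sum_to_n_loop(n):
    """O(n): the 'before' version."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_to_n_closed_form(n):
    """O(1): same contract, faster, via Gauss's formula n(n+1)/2."""
    return n * (n + 1) // 2

# Same input -> same output across the board.
assert all(sum_to_n_loop(n) == sum_to_n_closed_form(n) for n in range(100))
```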

AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms

New AI agent evolves algorithms for math and practical applications in computing by combining the creativity of large language models with automated evaluators

Google DeepMind
@eljojo so? optimization has completely different goals than refactoring despite both sharing the requirement of not changing the output. this all has nothing to do with what Nate was saying
@jardo it really feels like we’re splitting hairs on what “true refactoring” means. Nate’s thesis is that the machine can’t go beyond what a junior engineer can do, because they lack world models, while I see the machine making better code than we humans can. I guess my thesis is that you don’t need a world model in order to write superior abstractions?
@eljojo optimization isn't about abstractions... you've completely missed the point of what Nate was saying... I'm out.

@nateberkopec well, you can tell an LLM how you want to refactor your code, and it's already a big time saver compared to the limited refactoring capabilities of modern IDEs (especially VS Code), right?

I would also add that in my opinion it's not because of a lack of a world model (which might be true), but because LLMs do not actually care about code structure at all: they are text-transform tools. They can work with obfuscated, minified code and still make meaningful changes

@nateberkopec even if you dramatically narrow your definition of “refactoring” to be like “rename this variable” or “extract this block of code to a function”, JetBrains software did this far more reliably 10 years ago.

LLMs are useful and I do use them, but they don’t possess even basic “reasoning” capabilities.
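For the record, the kind of mechanical refactoring being described ("extract this block of code to a function") looks like this; a toy before/after of my own, with made-up names, where behaviour is provably unchanged and only structure moves:

```python
# "Extract function": the mechanical refactoring IDEs like JetBrains have
# automated reliably for years. The observable behaviour is identical.

def report_before(prices):
    """Original version: tax logic inlined."""
    total = sum(prices)
    tax = total * 0.2
    return f"total={total:.2f} tax={tax:.2f}"

def compute_tax(total, rate=0.2):
    """The extracted helper -- the refactoring step."""
    return total * rate

def report_after(prices):
    """Refactored version: same output, tax logic behind an abstraction."""
    total = sum(prices)
    tax = compute_tax(total)
    return f"total={total:.2f} tax={tax:.2f}"

assert report_before([1.0, 2.0]) == report_after([1.0, 2.0])
```

The IDE version of this transformation is purely syntactic and guaranteed safe, which is the reliability bar the post is comparing LLMs against.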

@nateberkopec mostly true. it's possible to teach agents to use LSPs, which helps with simple mechanical refactorings. But creative refactoring is problematic indeed