Harvard/MIT researchers tested whether foundation models learn true world models (the underlying structure that generates the data) by examining how they adapt to synthetic tasks. Even our best models fail to recover these world models, instead relying on task-specific heuristics that do not generalize.

Foundation models are premised on the idea that sequence prediction can uncover deeper domain understanding, much like how Kepler's predictions of planetary motion later led to the discovery of Newtonian mechanics. However, evaluating whether these models truly capture deeper structure remains a challenge. We develop a technique for evaluating foundation models that examines how they adapt to synthetic datasets generated from some postulated world model. Our technique measures whether the foundation model's inductive bias aligns with the world model, and so we refer to it as an inductive bias probe. Across multiple domains, we find that foundation models can excel at their training tasks yet fail to develop inductive biases towards the underlying world model when adapted to new tasks. We particularly find that foundation models trained on orbital trajectories consistently fail to apply Newtonian mechanics when adapted to new physics tasks. Further analysis reveals that these models behave as if they develop task-specific heuristics that fail to generalize.
From mid-2025, but an interesting take on LLMs and coding: "Nobody cares if the logic board traces are pleasingly routed. If anything we build endures, it won’t be because the codebase was beautiful." https://fly.io/blog/youre-all-nuts/

In our paper published today in Nature, we introduce AlphaDev, an artificial intelligence (AI) system that uses reinforcement learning to discover enhanced computer science algorithms – surpassing those honed by scientists and engineers over decades.
Yeah if you hear that AI was used to do something groundbreaking, it's always* something made specifically to do that groundbreaking thing, very often reinforcement learning, and never an LLM.
*so far
@jardo @nateberkopec AlphaEvolve, using LLMs, has advanced the state of the art in matrix multiplication and other fields like hardware design. https://deepmind.google/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/
how do you define refactoring? what I see here is that we have new algorithms that work on the same input and produce the same output as previous algorithms, but do it better. that’s a refactoring to me.
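To make that "same input, same output, but do it better" definition concrete, here is a toy sketch of my own (not from the AlphaDev or AlphaEvolve papers): two functions that are observationally identical but differ in cost, i.e. a behavior-preserving rewrite.

```python
def has_duplicate_naive(items):
    # O(n^2): compare every pair of elements.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False


def has_duplicate_fast(items):
    # O(n): same inputs, same outputs, but faster thanks to
    # constant-time set membership checks.
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False
```

Under the definition above, swapping the first for the second counts as a refactoring: callers cannot tell the difference except in speed.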
the software rewrote parts of the Verilog to make it faster.
@nateberkopec well, you can tell an LLM how you want to refactor your code, and it's already a big time saver compared to the limited refactoring capabilities of modern IDEs (especially VS Code), right?
I would also add that, in my opinion, it's not because of a lack of a world model (which might be true), but because LLMs do not actually care about code structure at all: they are text-transformation tools. They can work with obfuscated, minified code and still make meaningful changes.
@nateberkopec even if you dramatically narrow your definition of “refactoring” to be like “rename this variable” or “extract this block of code to a function”, JetBrains software did this far more reliably 10 years ago.
LLMs are useful and I do use them, but they don’t possess even basic “reasoning” capabilities.
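For reference, the "extract this block of code to a function" operation mentioned above looks like this in a hypothetical example of mine (names are illustrative, not from any quoted post). The point of IDE-style refactoring is that the rewrite is mechanical and provably behavior-preserving:

```python
# Before: the tax computation is inlined in the reporting code.
def report(prices):
    total = 0.0
    for p in prices:
        total += p * 1.2  # add 20% tax
    return f"Total: {total:.2f}"


# After "extract function": the block becomes a named, reusable
# function, and the caller's observable behavior is unchanged.
def total_with_tax(prices, rate=1.2):
    return sum(p * rate for p in prices)


def report_refactored(prices):
    return f"Total: {total_with_tax(prices):.2f}"
```

A tool like IntelliJ performs this by analyzing which variables the block reads and writes, then threading them through as parameters and return values, which is why it can guarantee the rename or extraction is safe in a way a text-transform tool cannot.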