New post: "We mourn our craft" https://nolanlawson.com/2026/02/07/we-mourn-our-craft/
No comment on this one.
@nolan I can't speak to what I haven't seen, and I can't take people's word for this stuff because there's a *massive* amount of hype.
Just endless waves of dubious benchmarks, demos that turn out to be fake or broken, reporting that isn't actually fact-based.
So, I can only speak to what I've seen.
And what I've seen ain't good.
@nolan Importantly, this was also the situation a year ago, and a year ago people also said "just wait six months". And I did, and it's fundamentally the same situation.
The agents can produce more code, larger projects. But that's actually worse, because a larger pile of code is even harder to fix and maintain.
@varx I get the skepticism; there is a lot of junk and bunk out there. My experience comes from working at a small startup where people are already pushing the boundaries of multi-agent orchestrations and whatnot.
I tried to cover this in a recent post; I think my experiment is pretty conclusive. Honestly you could try the experiment yourself with newer models or more loops and probably make the number shoot up: https://nolanlawson.com/2026/01/31/building-a-browser-api-in-one-shot/
@nolan I read that post (I follow the RSS feed) but there's a really important point that you don't seem to cover:
Is that code usable?
It passes a lot of tests. Is it good enough to use in a real browser? (Functionality, performance, security.) Is it easy enough to work with that you could get it into good enough shape to use? Is it maintainable? *How do you know?*
@nolan A lot of my work has been in security. One of the things a lot of people don't appreciate is that security is largely about what "features" *don't* exist. For example, the feature that lets an attacker read your email. 😃 You have to try to prove that negative.
This is important because a lot of people evaluate software by taking it for a test drive and seeing that the happy path works. But that can never work for security.
The way you write secure software is by having a secure development process; by developing and communicating threat models; by recognizing dangerous patterns and steering the design around them.
LLMs are notoriously bad at all of this. I don't think this will be better in six months.
@varx The Web Platform Tests are a pretty high bar of quality. If you read through them, most of them are about bizarre edge cases that, yes, include security, e.g. https://github.com/w3c/IndexedDB/issues/476
The code is probably awful when it comes to maintenance, reusability, etc., but I'm starting to wonder if any of those values matter anymore.
There are of course exceptions, e.g. a common joke in W3C circles is about the "hit testing spec" that doesn't exist, but WPTs are otherwise pretty exhaustive.

@nolan I can see companies successfully discarding reusability (as repugnant as that would be), but maintainability isn't something you can escape. That's my bet.
Security also isn't something you can test your way out of. Tests show that the software *does* a thing, rather than that it *doesn't* do a thing. There are such things as security tests but they're usually written with specific implementations in mind, preventing certain kinds of easy mistakes from creeping into the codebase unnoticed. You can't take a set of security tests for one implementation and trust that they'll do anything for another one. (Many are also regression tests for a specific impl, written in retrospect...)
@varx For sure, the test I mentioned above was in response to a known use-after-free in Firefox. That said, I find it interesting that a lot of people seem to be saying, "AI can do _other_ people's job, but the thing I'm a specialist at? No way." This could be construed as a variant of the Gell-Mann amnesia effect (experts can spot the BS) or just cope.
I'm not sure, but I do know that many people at the security company I work for are taking claims like this seriously: https://socket.dev/blog/the-next-open-source-security-race-triage-at-machine-speed
@nolan I mean, I definitely think it can do an awful lot of jobs... badly. :-) The question for me is whether companies are actually OK with that reduction of quality. I'm concerned that they will be, at least in the short to medium term.
I do think that LLMs can often find vulns, and *in the hands of an expert* can assist in securing software. Otherwise it's just a flood of crap, as the article notes.
But coming back to my original question:
Is that code you generated usable?
- Would you feel comfortable if your daily browser used this generated code instead of hand-written? Would you install it on your family's computers?
- If no, what would it take for your answer to change?
@varx If your question is "Can LLMs generate a production-ready browser that I would trust today?" then the answer is obviously no. However, the fact that it can get within spitting distance with a single prompt should give one pause. It's easy to see how this story ends.
More important I think is your first point – "good enough" or "worse is better" has been a defining trait of most software for a long time. I think we can both agree that we'll see a lot of crap software shipped this year. 😆
@nolan I'm not much into betting, but I wonder if you'd like to make a prediction.
What do you think is the probability that six months from now, you would be able to prompt an LLM to create a program with the following constraints?
- Complexity on the order of IndexedDB
- Not reimplementing something that already exists (I have several reasons for this constraint)
- Accordingly, not driven off of a pre-existing test suite
- High enough quality that you would risk your family's privacy and security on it
- Maintainable code (either by human or further LLMs)
- Takes at most a workday for prompting + manual fixups
- API costs of at most $500
Or, how far out do you think it would be before this seems 90% likely to work?
(Feel free to quibble with any of my constraints. I chose those because that's what it would take to impress me.)