Look, this isn’t at all a defense of slop code, but it has me thinking — how much does code quality matter, and why?

It’s maintenance, right? We care about readability because we know we’ll have to make changes, fix bugs, etc.

But so … imagine a codebase that’s magically bug-free and feature-complete. (I’m aware this is a strawman; that’s the point, it’s a thought experiment.) Does it matter if this codebase is well-written? I’m not sure it does! (1/5)

Code quality has always been ONE factor; it’s never always been the most important one. E.g., we often accept complex internals as the price for a clean external API, and we all write sloppy code for one-offs, prototypes, etc. So part of me accepts the “code quality doesn’t matter” argument. I can see a vision of agentic engineering with systems that prove correctness; if an agent produces code that is provably correct, maybe the quality really doesn’t matter! (2/5)
I’m far from convinced that this is actually possible. It’s certainly not now — and I’m not talking about models. Testing and verification tools are nowhere near where they’d need to be, regardless of model quality. Today, code quality DOES still matter; even the best-case version of agentic engineering can’t produce code that’ll never require maintenance. But I can see a possible future where code quality might not matter, or will matter a lot less, and that’s FASCINATING. (3/5)
Specifically what I find fascinating is: the tooling that would be required to make agentic engineering begin to live up to the hype — much better testing tools, formal business logic specification languages, more powerful and easier to use formal verification tools, better static analysis tooling, etc — would be massively useful to software engineering quite regardless of the existence/utility/quality of LLMs. (4/5)
Will we actually build them? I sort of doubt it: the history of software development, and of course the current trajectory, suggests we’ll continue to yolo our way through it. I wouldn’t exactly say I’m optimistic, but hope springs eternal. (5/5)
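One concrete taste of the “much better testing tools” idea is property-based testing, which already exists today: instead of a few hand-picked examples, you check an invariant across many generated inputs. Below is a hand-rolled, stdlib-only sketch of the shape of it; the run-length codec is an invented example, and real tools like Hypothesis add shrinking and much smarter input generation.

```python
# Hand-rolled sketch of property-based testing: check an invariant
# (decode inverts encode) over many randomly generated inputs.
# The run-length codec is a made-up example for illustration only.
import random
import string

def encode(s: str) -> list[tuple[str, int]]:
    """Run-length encode a string into (char, count) pairs."""
    out: list[tuple[str, int]] = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

def decode(pairs: list[tuple[str, int]]) -> str:
    """Invert encode()."""
    return "".join(ch * n for ch, n in pairs)

def check_roundtrip(trials: int = 1000, seed: int = 0) -> bool:
    """Property: decode(encode(s)) == s for randomly generated strings."""
    rng = random.Random(seed)
    for _ in range(trials):
        s = "".join(rng.choice(string.ascii_lowercase + "ab")
                    for _ in range(rng.randint(0, 30)))
        if decode(encode(s)) != s:
            return False
    return True
```

The point isn’t this particular codec; it’s that stating the invariant is a small step toward the “formal business logic specification” direction, and it’s checkable by machine today.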

@jacob Responding to the thread as a whole, I think code readability will matter just as much as cockpit black boxes matter.

The universe is against us even with bug-free hardware and software. Even the perfect self-driving car will kill some people. The perfect AI doctor will lose some patients. The perfect automated factory will assemble some lemons.

We will need to be able to learn why and how it happened. Step by step. So: logs, and readable code. Otherwise, trusting automation is impossible.

@ambv @jacob Definitely...

I was going to say something about accountability, but this is pretty much my thought too.

Especially as it's incredibly unlikely that anyone would one-shot a perfect implementation, even more so for something that is extremely critical.

If we can't understand "why" something went wrong, how do we account for it? Was it negligence? Can someone be held accountable? Is the prompting wrong or is the code wrong?

I can see why that's a dream scenario for a corporation, though.

@pythonbynight @ambv Ok yes I like this: “explainability” (I need a better word but fine) is one reason why code quality matters. Like, if a dam fails, we’d really like to have the blueprints — and for them to be readable! — so we can figure out where we screwed up. “The dam works because the lake holds water” isn’t sufficient.
@jacob @pythonbynight @ambv "Explainability" is the word generally used for this requirement when it comes to the use of machine learning systems in finance (loan approvals, that kind of thing), so if there's a better word, it hasn't been found yet.
@ancoghlan @jacob @pythonbynight @ambv there's also "legibility", in the Seeing Like a State sense, ie. you have people (or LLMs) who report to you, and you need to be able to make sense of what they're doing in order to hold them accountable.
@jacob I think this is a subtle category error, in the sense that "readability" (and "code quality" more generally) is a transitive adjective. Readable to whom? High quality according to whose taste? We strive for an "objective" sense of code quality because the audience we are usually addressing is the pool of potential candidates who may become future maintainers of the code, and that's a nebulous group. But, that group's nebulousness eventually must be removed as it becomes "the current team"
@jacob all of that is to say: the reason readability matters is that if you want to maintain some code, some specific group of humans must maintain an understanding of its structure, such that they can effectuate *and be held accountable for* required changes. The putative cost reduction of an "agentic" tool inherently assumes that you can shrink that group by replacing some of them with LLMs. Maybe, eventually, some will be able to. But ultimately *somebody* still needs to read all the code.
@jacob the world in which agentic tools could have the level of success that you're imagining with trash-level code quality would seem to me to be the same world where anyone annoyed with a subscription-model app would simply download a 30-year-old abandonware replacement from archive dot org, because they'd be comfortable with long-term stasis. which seems unrealistic to me.
@glyph Yeah, that’s one conclusion I’m coming to thanks to this discussion. I was aware of the slipperiness of “quality”, and was kinda doing that deliberately, because exploring whether quality matters is (to me) more interesting than defining quality precisely. But what I hadn’t noticed was that “done” was just as slippery — and in order for quality (whatever the definition) not to matter, there has to be no maintenance, and that requires “done” to actually be a thing.
@glyph (Tho - I think software engineering would be in a better place if “done” was more common. Like, is 80% of the TCO of a bridge in maintenance? I sort of doubt it. There are times when I think that calling it software “engineering” is a bit of a joke.)
@jacob @glyph I prefer to think of it as aspirational.
@jacob @glyph calling it software engineering is absolutely a joke, it has been a joke for a while, and the rush to adopt LLMs and mostly not read their output is probably the strongest evidence we have that it's been a joke.

@jacob But aren't we then just shifting all the complexity, skill requirement and quality concerns to the business logic specification language, which will maybe be less verbose than code, but certainly not easier to reason about and write perfect specifications in?

The maintenance burden of software isn't just high because there's a bug in that manually-written for loop, it's mostly because requirements weren't clear enough or change.

@jacob I keep coming back to that one physics professor’s joke: in this universe, not only energy, mass, and momentum are conserved, but also complexity. I don’t think it was a joke.

@jacob I keep wanting a deterministic layer in there somewhere. Letting an AI take over the test suite is the piece that feels riskiest to me. That's always been the layer of trust and confidence.

A steadily evolving test suite that I understand is my best guarantee that things will continue to work at least as well as they have in the past. If I give over control of that layer, I don't know where my confidence that things will work comes from.
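One way to make that deterministic layer concrete: keep a small, human-owned suite of pinned regression tests that an agent may run but never edit, so any behavioral drift fails loudly. A minimal sketch, where `slugify` and its pinned cases are hypothetical stand-ins for whatever code is under agentic maintenance:

```python
# Sketch of a human-owned "trust layer": pinned expected values written
# (and only ever changed) by a person. slugify() is a hypothetical
# function standing in for the code an agent maintains.
import re
import unicodedata

def slugify(title: str) -> str:
    """Normalize a title into a URL slug (the code under maintenance)."""
    text = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode()
    return re.sub(r"[^a-zA-Z0-9]+", "-", text).strip("-").lower()

# The pins: a human wrote these, and a human signs off on any change.
PINNED = {
    "Hello, World!": "hello-world",
    "Déjà vu 2024": "deja-vu-2024",
    "  spaces  everywhere  ": "spaces-everywhere",
}

def run_pinned_suite() -> list[str]:
    """Return a list of failures; empty means behavior matches the pins."""
    return [f"{inp!r}: got {slugify(inp)!r}, want {want!r}"
            for inp, want in PINNED.items() if slugify(inp) != want]
```

The design choice is the ownership boundary, not the assertions themselves: the suite stays legible to the humans whose confidence it’s supposed to carry.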

@jacob I have been doing some of this, mostly cranking up the tools that already exist. I’m pretty sure it helps. But I can also see the agent jumping through all the hoops I give it: hoops that are familiar, and that would be incredibly frustrating for me if I were hand-programming.
@jacob @freakboy3742 Yes!! This is a big part of what makes me actually optimistic about the excitement around LLMs and the like: these systems afford us an opportunity to reflect on what we’re actually trying to do, and then create structures conducive to that.

@jacob yeah, exactly. Lots of AI coding boosters are talking about the importance of tests, documentation, small modules… On one hand, yes! Great ideas! OTOH why did we have to wait for LLMs to prioritize this? (And as soon as we can avoid it, will we?)

But, I do like the move to smaller PRs, smaller files, more tests. My biggest challenge is that PR review has become a bigger bottleneck.

@jacob Instead of spending $TRILLIONS on ensloppenators, what if we spent a fraction of that money improving the deterministic tools we have?

@jacob Never thought about the provably correct path. I have seen the "who cares what the code looks like when it's only the AI that sees it" though.

But yeah, that's intriguing.

@jacob A thought when reading this: even if it were possible to make something perfect by today’s definition of perfect (serving your straw man), we know that definition will change over time (e.g. HTTPS everywhere today, but not before Firesheep).

I think there are some parallels here to “I’m not doing anything wrong, so privacy doesn’t matter to me.” The definition of “not doing anything wrong” has changed.

Also difficult maintenance via LLM will get *expensive* when it's unsubsidized.

@sean Ahh yes! “Done” and “correct” are slippery, I was thinking that, but the idea of their definition changing over time totally matches my experience and is a pretty great way of bringing the strawman back to earth.

@jacob code quality does not matter *if the code never has to change*. it’s always a forward looking statement: we want easy to understand code because we will have to modify it later.

if later never comes then i agree entirely;

but in my own very limited explorations imho the LLMs also find good code easier to read and reason about…

@jacob if we stop seeing transformative improvements in model quality, i suspect we’ll get more of a bifurcation along business models? facebook can move fast and break stuff; orgs with high uptime demands can’t. pressure to yolo will be higher with one than the other
@jacob I mean, we don’t care about machine code quality (often not even in performance terms), so I do think we won’t care about code quality in the future. But what I do care about is determinism: the same input (prompt) should *always* produce the same output (generated code). I guess it’ll come down to a new kind of programming-ish language that somehow provides that?
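Short of such a language, one stopgap for that kind of determinism is to treat generated code as a pinned artifact: record a content hash when a human accepts it, and flag any regeneration that differs. A sketch; the prompt/artifact pair is hypothetical:

```python
# Sketch of a poor man's determinism check for generated code: hash the
# artifact at acceptance time, and fail if regenerating from the same
# prompt ever produces different bytes. The prompt and code are made up.
import hashlib

def fingerprint(text: str) -> str:
    """Stable content hash of a generated artifact."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Recorded when the artifact was first reviewed and accepted by a human.
ACCEPTED = {
    "prompt: add two numbers": fingerprint("def add(a, b):\n    return a + b\n"),
}

def check_regeneration(prompt: str, regenerated: str) -> bool:
    """True iff the regenerated code is byte-identical to what was accepted."""
    return ACCEPTED.get(prompt) == fingerprint(regenerated)
```

This doesn’t make generation deterministic, of course; it just makes nondeterminism visible instead of silent.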