@jcoglan I've been telling my friends this:
I do not use LLM code generation because I do not want to review code hallucinations. However, my coworkers use LLMs, and I review their code, so I can't escape it.
There will be AIs checking the code reviews soon enough.
Then all of this nonsense will start crashing, falling out of the sky, or exploding when not actually required to do so.
Actually depends IMO.
Writing production code first, having AI write unit tests afterwards? Terrible idea.
Using AI to generate missing unit tests in a years old legacy system? Better idea.
Doing AI coding in TDD style? Great idea. It turns out that writing the test first really improves the AI's ability to come up with good production code.
TDD fans won't be surprised.
Very good and important questions.
We're talking about a technology that's growing exponentially: whatever the answer is today, it will be different next year.
All we can do today is experiment with all the new tools they release, in order to understand how they work and learn what they can do for us.
From my observation, as of today, AI can write relatively simple source units error-free in one shot, and can extend existing source units with smaller variations. It can often fix issues like failing tests or compiler errors on its own.
Hence, in a typical dialogue, the human can focus on code review, evolving the architecture and keeping the direction towards the requirements. Tests are crucial, as they allow AI to automatically detect and fix errors, which saves lots of time and nerves on the human side, and reduces the effects of review oversight.
Refactoring is possible, but I'm awaiting the integration of well-known automated refactorings into AI assistant tools, which would increase both efficiency and effectiveness.
Currently, I would say that the gain of AI-augmented coding lies in speed (at the expense of quality) and, to some limited extent, automation. But as I said, given the current exponential growth, this will improve, and there may be unforeseen use cases.
Well, other people have had different experiences.
For me, it's my admittedly tiny toy project https://github.com/fxnn/news that I'm currently working on, using Aider and Gemini 2.5 Pro. I'm curious to explore (soon) how AI does with larger legacy code.
Then there's Aider, the tool I'm using for AI-augmented coding; I believe 80-90% of each release's code is usually written by AI: https://github.com/Aider-AI/aider
Then recently one of Cloudflare's repos was discussed, which they developed using Claude: https://github.com/cloudflare/workers-oauth-provider
Not a dumb question at all. Choosing the right step size is utterly important. With older / smaller models, you'll need to take pretty small steps to get useful results, like the "add this param" example you quoted. Far too small to be useful, if you ask me.
Then there are the people who came up with a really involved workflow, composed of multiple tools and prompts. I believe this one can be really powerful, but it just cries for automation, and I don't want to feel like a trained monkey. But definitely have a look; this one is really interesting in terms of the prompts used: https://harper.blog/2025/02/16/my-llm-codegen-workflow-atm/
I'm mostly going for simpler prompts, but that's only possible with advanced models like Sonnet 3.7, o4-mini or (the one I'm using) Gemini 2.5 Pro. Real-world example from my history:
```
Add a separate mode to the application, in which it does not print the downloaded e-mails to stdout, but instead starts an HTTP server with a RESTful API.
We will add the actual API implementation later, some dummy plain-text response is enough for now.
```
This added the necessary commandline switches and the HTTP server boilerplate.
The next one was this:
```
Now we want `startHttpServer` to receive the same input as `processEmails`.
It shall also fetch e-mails, and then provide the aggregated stories from all e-mails under a `/stories` REST endpoint. As usual, start with proper unit tests.
```
And there I had my REST API.
Typically, I switch between such requirements-oriented prompts and lower-level improvements/refactorings, making sure that the code quality and architecture are right, like this one:
```
In `main` func, handle `fetchAndSummarizeEmails` only once, for both `printEmails` and `startHttpServer`, because both need it in the same way. Also, handle errors directly, no need to pass them on to e.g. `startHttpServer`.
```
@fxnn @alanpaxton @akahn there's no reason to assume that reasoning lies on a continuum from whatever it is LLMs do today, such that quantitative improvement will mean they will eventually be able to reason
my recent attempts to drive an LLM using TDD have gone extremely badly, including it fabricating its own tests and showing no ability to comprehend the problem or my instructions
Yes, the steps we take shouldn't be too coarse, and AI-augmented coding is still somewhat adventurous. One of the important things it needs is a good "system prompt": some rules which tell it, for example, to never fix a test by deleting it.
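With Aider, such rules can be kept in a conventions file that is loaded read-only into the chat (e.g. via `aider --read CONVENTIONS.md`); the rules below are just my own examples of what such a file might contain:

```markdown
# Conventions for AI-assisted changes

- Never fix a failing test by deleting or weakening it; fix the production code.
- Write the test first, watch it fail, then implement.
- Keep each change small; don't mix refactorings and new features in one step.
```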
I can recommend @kentbeck's recent blog posts on that matter, paywalled, e.g. https://open.substack.com/pub/tidyfirst/p/persistent-prompting
I can relate. While I get that people want to earn money with their blogs, I believe that there must be a better way than to feed Substack or Medium's pockets.
Anyways, the one good free blog on general LLM topics that I read regularly is @simon's, https://simonwillison.net.
I also got a recommendation for https://harper.blog/posts/, which seems to feature some interesting AI coding posts lately.
Nothing more specific at the moment, unfortunately.
What makes you think that it's zero substance?
I see experienced developers every day applying tools like Claude Code to do the same task the developer would have done, in a fraction of the time, and often with even better quality.
I wonder what's "all hype" in such a technology.
@fxnn not engaging in the wider discussion here, but I'd seriously question "exponential growth" in this field. It had a single large step change, and has been very much incremental since. The improvements are linear at best, while the resources required to produce them are the only thing growing exponentially.
Like, I don't think this is even a particularly sceptical view - just facts on the ground. You can maybe discuss whether linear improvements have step change impacts, but the tech itself requires exponentially increasing resources to manifest linear improvements. That's why they've started switching to working on specialised models, reasoning, meta control systems, etc.
If you're relying on "we'll see continued improvements akin to ChatGPT's version 3 -> 5 trajectory", I think you'll be sorely disappointed.
@alanpaxton
One of my theories, when the first studies came out showing folks loved LLMs for coding but a very high number distrusted the output, was that they find doing code review breezier, and it makes them feel like they are supervising someone, and superior, and therefore they have the illusion of being more productive.
Kind of similar to how folks insisted multi-tasking was superior until all the studies came out showing objectively that their productivity plummeted regardless of how good it subjectively felt to them.
For a field in which we talk about data so much, we do a really bad job of measuring productivity and the factors that affect it, and it shows.
@jcoglan sorry for the literal screenshots, but check this out.
Wrote a library, then got asked to "put the problem into ChatGPT and see how close it gets to yours".
Well, it took me quite a long time to get the over-confident, defensive PoS to admit it f'ed up OCR* on a table.
(*kinda. I still don't know what exactly went wrong along the way, but since it barfs out and executes python code to get there, it might as well have screwed up some indices under the hood to mix up the table cells.)
@jcoglan ah well I think I see it now.. column index off by one
Having LLMs do tax reports and medical decisions will be SWEEET
@jcoglan This is what is freaking me out - I need a job and did software development a while back but can't handle the thought of reviewing AI code.
At least when you do it for a colleague there's an aspect of upskilling for all involved. For AI, only the billionaires benefit.
@yala @jcoglan
When I learnt programming in the 80s, I would always do it together with a friend. Access to a computer was hard to come by, and you would share that time. After a brief stint as a lonely "professional" programmer in the 90s, that was "allowed" again, but you had to call it pair programming.
I've been trying to avoid the ritual commonly known as code review (trckacr) for many reasons. I don't like it as a reviewer. I'm bad at it. I'm not a computer, and something that looks plausible to me may not even compile. Cosmetic or stylistic errors/anomalies are easy, but reviews usually happen when the code is finished. Are you really going to suggest a full rewrite? A good friend of mine was a team lead at Google, and they lamented how impossible code review sessions were and how producing quality code this way seems impossible.
It's not all black and white. I do find review comments helpful sometimes, especially when entering a new project. Usually that's about style or other cultural memes.
If you want to have (give or receive) truly impactful input from a second person on a piece of code, pair program it.
Anyway, I believe that the person who codes alone as opposed to pairing is cultivating the software crisis, and so is trckacr.
Someone I know is really into using LLMs to code: he uses one LLM to code, and a completely different LLM to do the code review.
He acts more like a technical lead or technical manager.