If you replace a junior with an #LLM and make the senior review the output, the reviewer is now scanning for rare but catastrophic errors scattered across a much larger output surface, due to LLM "productivity."

That's a cognitively brutal task.

Humans are terrible at sustained vigilance for rare events in high-volume streams. Aviation, nuclear, radiology all have extensive literature on exactly this failure mode.

I propose that any productivity gains will be consumed by false-negative review failures.

@pseudonym basically this is why this method is a failure from the get-go.
@pseudonym Especially since the sort of mistake that LLMs make is the sort of mistake that's hardest for humans to spot. They produce bad code that looks like good code, because they were trained on a lot of good code and told "Write code that looks like this".

@robinadams yes

I'm not sure if this is a "but" or an "and"...

The recent @squads blogpost by @EmmaDelescolle and @Tiziano notes that LLMs are good at reviews.

In an LLM-friendly context, seniors will of course delegate shit work to the LLM. So now we have the horrid situation where young coders don't learn coding, and seniors' teaching skills atrophy. I'm sure retrospectives on this are delegated to an LLM as we speak somewhere 🤪

Isn't this just the absolutely perfect shitstorm?

@pseudonym

@robinadams @pseudonym It's even worse in some ways. The tools don't just write code, they also write tests, run the tests, fix any failures, clean up, and document it. The result probably runs and does something close to the intent. At this point a human has to understand what's happened and _then_, without the benefit of hands-on involvement, spot the problems. Not easy.

@pseudonym and because the high volume consists of what I’ve dubbed “plausible bullshit”, reviewers will have to battle a plethora of their biases as well.

There are fields (I’ve heard stories about protein and material design, and vulnerability discovery) where filtering the BS for real discoveries can be worth it. I’m guessing it works because there is a reality to test against.

But for the love of humanity, don’t use it for anything descriptive or abstract.

@avuko @pseudonym The main reason that machine learning works so well with material and protein design, weather forecasting, and such, is that there is good data available to “train” the model. The internet is the source of LLM training. It is full of garbage and LLMs are filling it with more garbage. The rule is the same as in 1970: GIGO (garbage in, garbage out). Only the scale is different.

@ELS @avuko @pseudonym Exactly this. The #AI_Slop is growing exponentially, which in turn increases the slop bucket depth and size, which in turn has already degraded the quality and validity of search engine results. Some estimates put the degradation in search accuracy at 20-35% *worse*. So the exponential growth of #AI_Slop is in turn DEcreasing the accuracy and value of *search* exponentially as well. Doing all of that on *bigger and faster* machines and #LLMs will only hasten the processes in play and dramatically increase the probability of truly catastrophic outcomes and consequences.

And that is the case already in play, without bringing in all the issues raised in Bender and Hanna's recent book (mandatory reading):

https://www.amazon.com.au/AI-Fight-Techs-Create-Future/dp/1847928625

My first encounter with so-called "artificial intelligence" was in 1964-5 as an undergrad psychology student in a (snail mail) exchange with one of the pioneer researchers at Stanford. I've been involved in parts of it and tracked it ever since. It is critical to understand that it has taken OVER 60 YEARS to get to the mediocre state we are now in. It didn't happen "yesterday" or even in "the last 2 years" as some snake oil #AI_Salesmen would have everyone believe.
Time to #BeCarefulWhatYouWishFor

And it's now 2026...

The AI Con: How To Fight Big Tech's Hype and Create the Future We Want : Bender, Emily M.: Amazon.com.au: Books


I like to say that LLMs are a great way to reduce junior development time at the cost of senior review time.

@pseudonym It's certainly like that.

FWIW, though, LLMs don't have any shame, or any feeling that they need to manage their reputation.

If you tell the same LLM that produced the report that it is now the QA manager and it must review the report from the standpoints of checking for missing or inaccurate citations, dubious claims or non-concise text, it will rat itself out and can be told to fix what it found.

This is the same LLM entirely...
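(A minimal sketch of that two-pass flow, assuming the OpenAI Python client purely as an example; any chat-completion API works the same way, and the model name and prompts here are placeholders rather than anything from the post above.)

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Pass 1: the model writes the report.
report = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": "Write a short report on <topic>, with citations."}],
).choices[0].message.content

# Pass 2: the very same model is told it is now the QA manager.
review = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are the QA manager. Flag missing or inaccurate "
                                      "citations, dubious claims, and non-concise text."},
        {"role": "user", "content": report},
    ],
).choices[0].message.content

print(review)  # it will happily rat itself out; a third call can ask it to fix what it found
```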

@hopeless @pseudonym you are suggesting that you can just layer more shit onto the shit and after enough layers of shit it becomes not shit.
@nor4 @hopeless @pseudonym if hidden well enough, it's ok to step in it, right 🤪
@nor4 @hopeless @pseudonym maybe it works better as a sewage treatment plant analogy. Probably still full of microplastics though.
@nor4 @hopeless @pseudonym much more likely to become a fuck shit stack. where my gerunds at? https://www.youtube.com/watch?app=desktop&v=CJQU22Ttpwc
@pseudonym This was my experience from the start, and is what made me give up on LLM-assisted coding. Of course, that was before I was aware of the abhorrent externalities that came with using the slop machine...

@ainmosni

Yup.

My thoughts aren't new.

Just felt the need to pack them up into something bite-sized.

To explain where I see one of the fundamental design failures, one that holds even for any potential "good stuff" that may arise.

@pseudonym is the problem the increased volume of code that the LLM is producing (as compared to the junior dev), what you are calling “productivity gains”? Because I can see this same argument being made for code produced by humans as well.
@xrisk @pseudonym Volume is a key factor here. But even if the volume was the same, LLMs are doomed to stagnate as devs—whose code was scraped for training data—are displaced.
@malstrom @pseudonym that’s an interesting claim. I don’t know enough about LLM research to make a judgement. I do know that LLMs trained on synthetic (other LLM-generated) data tend to perform worse, but have we reached the limits of what LLMs are capable of? In my limited understanding, if an LLM can “learn” fundamental programming “concepts” (the same way they can “learn” concepts across human languages — I could be wrong in my understanding here), they should (might?) be able to transfer/apply those concepts to not-before-seen domains (maybe with a bit of “reasoning” prodded in).
@xrisk @malstrom @pseudonym just for clarity, LLMs don't learn concepts

@wronglang @xrisk @malstrom

Correct. They don't learn concepts. That's the key confusion in so much of the discussion and use around them.

They have no world model, and don't reason at all. But they perform a very good facsimile of reasoning, because reasoning is embedded in and has shaped the patterns of speech, text, and code.

They pattern match. That's all. Full stop. But they do it so well it looks like speech, or code, or understanding.

@pseudonym @wronglang @xrisk @malstrom I'm not sure how to formally define learning, concepts, or reasoning, but there is some evidence the models are themselves computationally universal. As I understand it one of the main ways these models are trained is reinforcement learning with the objective of diagnosing and fixing software bugs using command line tools. This seems like more than pattern matching in any traditional sense.
@mirth @pseudonym @xrisk @malstrom they don't do concepts, in the sense that if the correct thing to say is "that's your mom", the errors involved in an LLM generating the text "this is your mom" instead are similar to the errors involved in generating the text "fuck your mom", despite there being vastly different layers of concepts involved.
@wronglang @pseudonym @xrisk @malstrom That's a very specific technical claim, can you elaborate?
@mirth @pseudonym @xrisk @malstrom no, I think the statement stands on its own, and it's true under a fairly broad set of circumstances. There are situations where the tokens "fuck your mom" might be quantitatively less likely in a sequence than the tokens "this is your mom", so the LLM might be less likely to make one mistake than the other, but it wouldn't identify the latter as a conceptual error about your mom.
@wronglang @pseudonym @xrisk @malstrom It depends what you mean by "identify". Those models are just (inordinately expensive) slabs of bits that can be used by people in different ways, one could perfectly well use one to compare attention maps of the most likely phrasing and the two alternatives you specify, or compute embeddings for these, and the results would likely be consistent with varying types of activity and levels of hostility. Just a word calculator, but a pretty fancy one.
@mirth @pseudonym @xrisk @malstrom so no concepts though
@wronglang @pseudonym @xrisk @malstrom In a precise sense of a specific linguistic or philosophical viewpoint, perhaps not. I admit that I am neither a linguist nor a philosopher, just someone who will likely work with computers the rest of my career and would like to understand the forces affecting me. This seems relevant, so I will read deeper. In the first ten minutes I get the strong sense that there is not enough consensus about what "concept" means for broad claims to stand on their own.

@wronglang @pseudonym @xrisk @malstrom Starting with the Encyclopedia of Philosophy page below, one thing that in retrospect is unsurprising is that these debates are very old, because the argument over the relationship between language, consciousness, and intelligence, and which of these animals possess, has itself been going on for a long time. Even the framing of the debate assumes a narrow way of organizing the world into objects that runs counter to e.g. Taoist views.

https://plato.stanford.edu/entries/concepts

Concepts (Stanford Encyclopedia of Philosophy)

@mirth @pseudonym @xrisk @malstrom it would be an extraordinary claim to say that LLMs encode concepts, so thankfully the responsibility is on their proponents to make the argument.

Another counter-example was the whole glue-on-pizza thing. Regardless of how you see reality, a person would recognize that as a conceptual error.

@wronglang @pseudonym @xrisk @malstrom It's not a question of opinion or alignment, making such a strong claim about the models' internal workings requires a level of theoretical understanding that I don't think anybody has, not even the researchers that develop the things. The patterns inside the models overlap with what some philosophers consider "concept", many would disagree, and no serious person is going to argue the models don't emit huge amounts of garbage.
@wronglang @pseudonym @xrisk @malstrom To me this whole debate is similar to the question of whether animals have souls. Intellectually interesting, but a side show to questions like, for example, whether it's humane or ethical to raise, kill, and eat a pig (or how to do any of those things with less harm). And, in my opinion (this is obviously not universal), you don't need to agree or even have an opinion about soul or not to have an opinion about meat production and consumption.
@mirth @pseudonym @xrisk @malstrom no: we can't effectively simulate a relatively straightforward brain whereas we simulate from LLMs all the time. Completely different concepts and all unrelated to souls.

@mirth @pseudonym @xrisk @malstrom the architecture of modern LLMs is relatively boring and there's a pretty broad range of researchers in machine learning and statistics who understand the techniques involved.

The extraordinary claim is that there's something mystical or poorly understood about the resulting program.

So the things you're claiming about researchers not understanding how their models work and how it maps to the idea of a concept, those are just wrong.

@pseudonym I follow many git repositories just out of general interest. In the past month or so, many of their subscription feeds have become unreadable for me because of the agents writing verbose messages all the time. The projects might get a lot of features, but like you wrote, who has the energy to read their outputs?

@pseudonym TIRED: 10x developer

HIRED: 10x junior intern

ALSO TIRED: Senior developer reviewing junior's copious output.

@pseudonym Recent Microsoft update releases seem to be a great case study for that
@pseudonym That and LLM code often looks very nice on the surface so it takes a lot of vigilance and thinking to find the subtle errors. Code from juniors tends to have more immediate signs of errors or wrong mental models.
@moink @pseudonym one of the benefits of people *having* a mental model

@pseudonym This.

I do a lot of "computer science labs", where students learn to write code, and they wave me down when they have questions. When their code doesn't do what they expect, it's often easy to figure out what went wrong because you can spot a bit of code that looks funky. And usually, the problem is in those few lines.

LLM code is meant to look like good code, so you don't get these little shortcuts.
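(To make that concrete, here's a minimal, made-up Python illustration of the kind of bug that hides in tidy-looking code: nothing about it looks "funky" at a glance, and only the comments mark where it silently goes wrong.)

```python
# Illustrative only: a tidy-looking helper hiding a subtle bug.
# It reads like reasonable code, but near the start of the list the
# slice is shorter than the window, so the early averages come out
# silently too small.
def moving_average(values, window=3):
    averages = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1):i + 1]
        averages.append(sum(chunk) / window)  # bug: should divide by len(chunk)
    return averages

print(moving_average([10, 10, 10]))  # [3.33..., 6.67..., 10.0] instead of [10.0, 10.0, 10.0]
```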

@Moutmout

Good example I hadn't thought of.

Yes, human novice code mistakes have a "shape" to them a teacher can recognize quickly, or suspect because of how the error manifests.

These are different classes of "good looking" failures.

@Moutmout @pseudonym

Dunning-Kruger as a Service 🙃

@Moutmout and not just code: it took me just as long to find all the crucial mistakes in an AI translation as it would have taken me to do the translation myself.

Evaluation of the risks: https://www.draketo.de/software/ai-translation-evaluated#completely-changed
@pseudonym

AI Translation Evaluated: Effort and Risks


@pseudonym I have posed this conundrum before and the answer I received is that there is also an opportunity cost to not moving faster and the risk of a catastrophic bug may not outweigh the risk of being overtaken by competitors, especially since that was already happening before LLMs anyway.

Also, it *seems* models are improving at detecting these bugs, so they are being used to review changes, which, for the reasons you point out, they might be better at than people.

@toldtheworld @pseudonym I didn't think I'd see the day when I'd want to ask CEOs "If all your friends jumped off a cliff, would you do it too?"

Overtaken by competitors how? How is it "overtaken by" when what is actually happening is "my competitors are introducing fundamental flaws into their business model that will completely vitiate it as a workable product so all I have to do is wait for them to fail"?

Apparently the free market doesn't turn people into money-making machines that build products other people want, it turns CEOs into lemmings. Who knew?

@toldtheworld

The models may indeed get better at finding and fixing their own mistakes, and they would not be subject to human fatigue, that's true. But it is never perfect, so you still need a human in the loop. You've just pushed back, a bit, the point at which a harder-to-detect error gets missed. Which is inevitable, because hallucinations / confabulations are a feature, not a bug, of essential LLM operations.

So you make more, faster, harder-to-spot errors. Better LLM checkers increase the risk.

@pseudonym also, when the senior retires, who replaces them?
@pseudonym This, 100%. The Glass Cage by Nicholas Carr dives into this in depth with examples from aviation, and how full automation of flight makes it harder for pilots to recover from a disaster situation.

@max

Thanks for the reference. Didn't know that one.

@pseudonym @mayintoronto … and: there will be no juniors to grow into seniors. 😨

@deborahh @mayintoronto

Yup. This is my biggest structural concern, really. But I only had 500 characters to consider the previous post, and wanted to focus on the review cost of any "gains" one might have.

There are more related topics to discuss, but the breaking of the funnel to train the next generation of skilled people is huge.

@pseudonym - and by the costs of false positives.
@pseudonym

Yesterday, I was working on some PowerShell-based automation. I'm a UNIX/Linux guy. I'm used to Bash. I'm used to Python and pythonic DSLs. I'm… You get the drift. I'm not a Windows guy and I'm not a PowerShell guy.

A few days ago, I got an email from Google telling me that, because I have a storage plan (mostly for photo storage), use of Gemini was now included. So, I opted to try to use Gemini to bridge my PowerShell knowledge gaps. I came to a couple conclusions:

• If you're a truly junior "coder" (haven't mastered at least one "language" and regularly applied that mastery to "the real world"), relying on LLMs is likely to lead you to creating smoking holes
• Those "smoking holes" are the results of the LLM sometimes providing partially or wholly incorrect answers: I've had to correct Gemini several times
• Even where "smoking holes" aren't a risk, LLMs are not adequately speculative. To illustrate, I was trying to solve a problem. Gemini suggested a given path to take. The suggested path looked more generalizable, so I asked, "I feel like there's a good chance I can do similar within this other, very analogous component. I'm going to run a test to validate." Gemini's response was effectively, "don't bother: the documentation doesn't indicate that that will work." With a couple decades' experience under my belt, I know that documentation is sometimes incomplete or wrong (out of date). So, I proceeded to test my suspicion and, lo and behold, it worked. If you're lacking "feel" for things, you'd likely take the LLM's "don't bother" guidance and go down a different path, a path that might be a lot more byzantine.

@ferricoxide

Same background (Unix grey beard) with current focus on security, and your experience matched my own.

I was soaking in a lot more AI tools at my last job, and experience and insight are key.

Recently I had a system suggest multiple times to do it "the easy way" which emphatically was not how I wanted it to work. I was able to gently guide it back to what I wanted.

Letting a senior dev do the work of a senior guiding a junior is about right. But it still can't replace either.

@pseudonym

A lot of the problems I run into are "that's one way you could do it, but definitely not the optimized way" in nature. You can tell that a lot of the answers come from scraping SubStack and similar forums rather than carefully-crafted code.

Personally, I'm not a fan of setting tons of ephemeral variables just so I can evaluate the variable. If I can directly evaluate the return from a subshell (or similar method), I generally prefer to do so …but it's not a method that you see in a lot of SubStack-type posts (thus, not something that LLMs typically recommend).
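(A made-up Python analogue of the two styles, just to illustrate the difference; the command is arbitrary and the snippet isn't from the thread.)

```python
import subprocess

# Style the LLMs tend to echo back: bind everything to throwaway variables first.
result = subprocess.run(["uname", "-s"], capture_output=True, text=True)
kernel = result.stdout.strip()
if kernel == "Linux":
    print("variable-heavy style: Linux host")

# Evaluating the command's return directly, with no ephemeral variables.
if subprocess.run(["uname", "-s"], capture_output=True, text=True).stdout.strip() == "Linux":
    print("direct-evaluation style: Linux host")
```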

@ferricoxide @pseudonym Gemini 3 Pro just can't compete with even Claude Sonnet, let alone Claude Opus; it's a bit hopeless.

When I was playing with improving my Rust skills, I liked to write code and then ask the AI to make it more idiomatic, which was helpful.

And also when prototyping an idea it helps to just let the AI do it, to get a sense from the output of how complex the implementation is and some ideas even if it needs substantial rework.

@pseudonym Yes. Very well put. I’m gonna use this …

@wendynather

Please do.

Glad it had some value.

Just my late night noodling about things.

@pseudonym Unless they're using LLMs in aviation, nuclear, and radiology, who cares?