If you replace a junior with an #LLM and have the senior review its output, the reviewer is now scanning for rare but catastrophic errors scattered across a much larger output surface, thanks to LLM "productivity."

That's a cognitively brutal task.

Humans are terrible at sustained vigilance for rare events in high-volume streams. Aviation, nuclear power, and radiology all have extensive literature on exactly this failure mode.

I'd wager that any productivity gains will be consumed by false-negative review failures.
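To make the argument concrete, here's a back-of-envelope sketch. Every number in it is hypothetical, made up purely for illustration; the point is the shape of the arithmetic: if output volume goes up and the human miss rate holds steady (or, per the vigilance literature, worsens), the absolute number of undetected bugs rises even when nothing else changes.

```python
# Toy model: expected catastrophic bugs that survive human review.
# All parameter values below are hypothetical.

def undetected_bugs(loc_reviewed: float, bugs_per_kloc: float,
                    human_miss_rate: float) -> float:
    """Expected number of bugs that slip past the reviewer."""
    total_bugs = loc_reviewed / 1000 * bugs_per_kloc
    return total_bugs * human_miss_rate

# Baseline: a junior's output, reviewed carefully.
baseline = undetected_bugs(loc_reviewed=5_000, bugs_per_kloc=0.2,
                           human_miss_rate=0.1)

# LLM era: 4x the output surface; assume vigilance fatigue also
# doubles the miss rate (a guess, but directionally supported).
llm_era = undetected_bugs(loc_reviewed=20_000, bugs_per_kloc=0.2,
                          human_miss_rate=0.2)

print(baseline, llm_era)  # 0.1 vs 0.8 expected misses: an 8x increase
```

Even if you dispute the specific numbers, the structure holds: misses scale multiplicatively with volume, so the productivity gain has to outrun both the volume increase and any fatigue effect just to break even.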

@pseudonym I have posed this conundrum before, and the answer I received is that there is also an opportunity cost to not moving faster: the risk of a catastrophic bug may not outweigh the risk of being overtaken by competitors, especially since that was already happening before LLMs anyway.

Also, it *seems* the models are improving at detecting these bugs, so they are increasingly being used to review changes, a task they might, for the reasons you point out, be better at than people.

@toldtheworld

The models may indeed get better at finding and fixing their own mistakes, and they aren't subject to human fatigue, that's true. But they are never perfect, so you still need a human in the loop. You've just pushed back, a bit, the moment when you miss a harder-to-detect error. Which is inevitable, because hallucinations / confabulations are a feature, not a bug, of how LLMs fundamentally operate.

So you get more errors, produced faster, and harder to spot. Better LLM checkers increase the risk.
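There's a selection effect hiding in that last claim that a quick simulation can show. Assume (hypothetically) that each error has a "subtlety" score and that a checker catches an error with probability proportional to how obvious it is. Then the better the checker, the more the *surviving* errors skew subtle, which is exactly the population the fatigued human downstream is worst at catching. All parameters here are invented for illustration.

```python
# Toy selection-effect sketch: a better automated checker strips out
# the obvious errors, so the errors that survive are the subtle ones.
# All parameters are hypothetical.
import random

random.seed(0)

def surviving_subtlety(checker_skill: float, n: int = 100_000) -> float:
    """Mean subtlety of errors that slip past a checker.

    Catch probability = checker_skill * (1 - subtlety), so obvious
    errors (subtlety near 0) are caught most often.
    """
    survivors = []
    for _ in range(n):
        subtlety = random.random()          # 0 = obvious, 1 = near-invisible
        p_catch = checker_skill * (1 - subtlety)
        if random.random() >= p_catch:      # error slips through the checker
            survivors.append(subtlety)
    return sum(survivors) / len(survivors)

print(surviving_subtlety(0.3))  # weak checker: survivors near average subtlety
print(surviving_subtlety(0.9))  # strong checker: survivors skew subtle
```

Fewer errors reach the human, but each one is, on average, harder to spot, and the human's vigilance has meanwhile decayed because the checker "usually" catches everything.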