If you replace a junior with an #LLM and make the senior review the output, the reviewer is now scanning for rare but catastrophic errors scattered across a much larger output surface, thanks to LLM "productivity."

That's a cognitively brutal task.

Humans are terrible at sustained vigilance for rare events in high-volume streams. Aviation, nuclear power, and radiology all have extensive literature on exactly this failure mode.

I predict that any productivity gains will be consumed by false-negative review failures.

@pseudonym is the problem the increased volume of code that the LLM is producing (as compared to the junior dev) — what you are calling “productivity gains”? Because I can see this same argument being made for code produced by humans as well.
@xrisk @pseudonym Volume is a key factor here. But even if the volume was the same, LLMs are doomed to stagnate as devs—whose code was scraped for training data—are displaced.
@malstrom @pseudonym that’s an interesting claim. I don’t know enough about LLM research to make a judgement. I do know that LLMs trained on synthetic (other LLM-generated) data tend to perform worse, but have we reached the limits of what LLMs are capable of? In my limited understanding, if an LLM can “learn” fundamental programming “concepts” (the same way they can “learn” concepts across human languages — I could be wrong in my understanding here), they should (might?) be able to transfer/apply those concepts to not-before-seen domains (maybe with a bit of “reasoning” prodded in).
@xrisk @malstrom @pseudonym just for clarity, LLMs don't learn concepts

@wronglang @xrisk @malstrom

Correct. They don't learn concepts. That's the key confusion in so much of the discussion and use around them.

They have no world model, and don't reason at all. But they perform a very good facsimile of reasoning, because reasoning is embedded in and has shaped the patterns of speech, text, and code.

They pattern match. That's all. Full stop. But they do it so well it looks like speech, or code, or understanding.

@pseudonym @wronglang @xrisk @malstrom I'm not sure how to formally define learning, concepts, or reasoning, but there is some evidence the models are themselves computationally universal. As I understand it, one of the main ways these models are trained is reinforcement learning with the objective of diagnosing and fixing software bugs using command-line tools. That seems like more than pattern matching in any traditional sense.
@mirth @pseudonym @xrisk @malstrom they don't do concepts, in the sense that if the correct thing to say is "that's your mom", the errors involved in an LLM generating the text "this is your mom" instead are similar to the errors involved in generating the text "fuck your mom", despite there being vastly different layers of concepts involved.
@wronglang @pseudonym @xrisk @malstrom That's a very specific technical claim, can you elaborate?
@mirth @pseudonym @xrisk @malstrom no, I think the statement stands on its own, and it's true under a fairly broad set of circumstances. There are situations where the tokens "fuck your mom" might be quantitatively less likely in a sequence than the tokens "this is your mom", so the LLM might be less likely to make one mistake than the other, but it wouldn't identify the latter as a conceptual error about your mom.
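A minimal sketch of the quantitative point (toy bigram model over a hypothetical four-sentence corpus, nothing like a real LLM): the model ranks one continuation as likelier than another, but there is no separate machinery that flags the unlikely one as a conceptual error rather than merely rare.

```python
from collections import Counter
from math import log

# Hypothetical toy corpus; a real LLM is trained on vastly more text.
corpus = (
    "this is your mom . this is your dad . "
    "that's your mom . this is your house ."
).split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab = len(unigrams)

def logprob(sentence: str) -> float:
    """Add-one-smoothed bigram log-probability of a word sequence."""
    total = 0.0
    words = sentence.split()
    for prev, cur in zip(words, words[1:]):
        total += log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))
    return total

# The model ranks sequences by likelihood, nothing more: low probability
# is just "rare", not "conceptually wrong".
print(logprob("this is your mom") > logprob("fuck your mom"))  # True
```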
@wronglang @pseudonym @xrisk @malstrom It depends what you mean by "identify". Those models are just (inordinately expensive) slabs of bits that can be used by people in different ways, one could perfectly well use one to compare attention maps of the most likely phrasing and the two alternatives you specify, or compute embeddings for these, and the results would likely be consistent with varying types of activity and levels of hostility. Just a word calculator, but a pretty fancy one.
@mirth @pseudonym @xrisk @malstrom so no concepts though
@wronglang @pseudonym @xrisk @malstrom In a precise sense of a specific linguistic or philosophical viewpoint, perhaps not. I admit that I am neither a linguist nor a philosopher, just someone who will likely work with computers for the rest of my career and would like to understand the forces affecting me. This seems relevant, so I will read more deeply. In the first ten minutes I get the strong sense that there is not enough consensus about what "concept" means for broad claims to stand on their own.

@wronglang @pseudonym @xrisk @malstrom Starting with the Encyclopedia of Philosophy page below, one thing that in retrospect is unsurprising is that these debates are very old, because the debate over the relationship between language, consciousness, and intelligence (and over which of these animals possess) has itself been going on for a long time. Even the framing of the debate assumes a narrow way of organizing the world into objects that runs counter to, e.g., Taoist views.

https://plato.stanford.edu/entries/concepts
@mirth @pseudonym @xrisk @malstrom it would be an extraordinary claim to say that LLMs encode concepts, so thankfully the responsibility is on their proponents to make the argument.

Another counter-example was the whole glue-on-pizza thing. Regardless of how you see reality, a person would recognize that as a conceptual error.

@wronglang @pseudonym @xrisk @malstrom It's not a question of opinion or alignment, making such a strong claim about the models' internal workings requires a level of theoretical understanding that I don't think anybody has, not even the researchers that develop the things. The patterns inside the models overlap with what some philosophers consider "concept", many would disagree, and no serious person is going to argue the models don't emit huge amounts of garbage.
@wronglang @pseudonym @xrisk @malstrom To me this whole debate is similar to the question of whether animals have souls. Intellectually interesting, but a side show to questions like, for example, whether it's humane or ethical to raise, kill, and eat a pig (or how to do any of those things with less harm). And, in my opinion (this is obviously not universal), you don't need to agree or even have an opinion about souls to have an opinion about meat production and consumption.
@mirth @pseudonym @xrisk @malstrom no: we can't effectively simulate a relatively straightforward brain whereas we simulate from LLMs all the time. Completely different concepts and all unrelated to souls.

@mirth @pseudonym @xrisk @malstrom the architecture of modern LLMs is relatively boring and there's a pretty broad range of researchers in machine learning and statistics who understand the techniques involved.

The extraordinary claim is that there's something mystical or poorly understood about the resulting program.

So the claims that researchers don't understand how their models work, or how that maps to the idea of a concept, are just wrong.