So Anthropic employees are using Claude Code to contribute AI-generated code to open source repositories and hiding the fact using their own internal “undercover mode”.

Totally trustworthy people.

(Any open source project that requires, at the very least, disclosure of AI-authored contributions should immediately ban Anthropic employees on principle.)

#AI #Anthropic #ClaudeCode #subterfuge

@aral Honestly I don't actually hate this.

It's a tool. The _user_ is responsible for what they're submitting. It puts the code they generated under their own name. I think this is actually good.

@aredridel @aral I really can’t agree with this, because it’s a question of accurate labeling, not of “responsibility” or “authorship”. Co-authored-by is perhaps the wrong method for labeling such things, but consider raw milk. Ultimately, it is indeed the producer’s responsibility to ensure their product is free of contamination. But disclosure of its method of production is explicitly the kind of requirement that allows consumers of said product to make safe choices.

@glyph Yeah, I disagree. Code isn't ingredients, and it's not “contamination” any more than you should label “I used search and replace on this”.

What you want to know is whether it was well engineered or not.

And in fact, this is almost entirely orthogonal to “safety”. This is an engineering product. The safety comes from processes and from whether or not _anyone checked that the work done was right_, not from the inputs.

@aredridel "raw milk" isn't ingredients either, the difference is one of process, which is why I used it as an example. Raw milk contamination is more likely because the processes to keep it safe are harder to follow, require more continuous diligence on the part of the operators of that process, and thus contribute to more frequent failures. LLM output is exactly the same: it provokes vigilance decay.
@aredridel “search and replace” is not a fair comparison, because search and replace does *not* cause vigilance decay, risk of unknowing copyright infringement, etc., in the same way. “Raw milk” and “grass fed” are just… completely different disclosures with different consequential implications.
@glyph Actually search and replace _does_ do that, and in fact I was bitten by vigilance decay in a search and replace problem literally yesterday. The comparison was intended.
@aredridel You are technically correct (and indeed any automated tool with repeated human interaction may provoke _some_ measure of vigilance decay; one could argue that “flaky tests” cause it too), but I feel like you're talking past the actual argument here.

@glyph I'm specifically arguing that it's the _exact same phenomenon writ larger_ (which is a meaningful difference!)

But it's a difference in amount not kind.

Either you build processes to check things (“do engineering”) or you don't (“vibes”).

@aredridel There are scales where differences in degree _become_ differences in kind.

Consider a more closely related phenomenon. There are many tools to check C/C++ code for memory safety errors. And, unsafe Rust code may exhibit exactly the same unsafe behaviors. Yet, C/C++ code and Rust code are categorically different in terms of the level of memory safety one may expect them to provide.
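To make the C/C++ vs Rust comparison concrete, here is a minimal illustrative sketch (mine, not from the thread): safe Rust routes out-of-bounds access through the type system, while an explicit `unsafe` block can opt back into C-style unchecked access. The *behavior* available is the same; the categorical difference is that the escape hatch must be declared.

```rust
fn main() {
    let v = vec![10, 20, 30];

    // Safe Rust: an out-of-range access is caught by construction;
    // `get` returns None instead of reading past the buffer.
    assert_eq!(v.get(3), None);

    // Unsafe Rust: `get_unchecked` skips the bounds check entirely,
    // like raw indexing in C. This call is in bounds, so it's fine,
    // but an out-of-bounds index here would be undefined behavior.
    let x = unsafe { *v.get_unchecked(1) };
    assert_eq!(x, 20);
}
```

The point of the analogy: both languages can exhibit identical unsafe behavior, yet the explicit, greppable `unsafe` marker changes what reviewers may reasonably expect, which is a difference in kind, not just degree.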

@aredridel Here we have an established “engineering” process, i.e. code review and continuous integration, designed to catch defects and process failures in the good-faith production of code by humans with an understanding of the system under development. That process is then subjected to a new type of code generation, where a machine that *maximizes plausibility while minimizing effort* is throwing much larger volumes of code against the same mechanism. That's not the same process!

@glyph Yes, though I disagree with parts of it: it's changed the system and now we're dealing with the bottlenecks appearing in new places. Not always good ones!

But I don't think this is a change in kind. It's moved the problem in _really familiar_ ways to me, actually. It's what happens when you unleash people on a codebase who don't care for others, who offload work. You can rein that in, but you need feedback in the system to do it.

@aredridel @glyph I think it's different in that the impact on people who _do_ care is still very much there.

For some people there's a very positive emotional response to generating code (it's fun to build things! it's magic! no need to _learn_, which is always unpleasant, though that last one is likely less conscious).

OTOH code review is never fun, and now you have to do 4× as much of it, if not more, so you have a very negative emotional response.

And so there's a very strong emotional push to auto-generate code, and a very strong emotional push to start skipping reviews and post-facto rationalizing why this is OK (there are tests! The AI can fix it later even if you don't understand it! etc.; you can watch people going through this in real time).

And this process can happen to people who care about others. Taking away this unpleasant burden of code review is helping your coworkers suffer less, after all. Taking away the emotional pain of thinking and learning is also helping your coworkers suffer less.

@itamarst @aredridel @glyph It's not even just 4× as much; every MR requires 4× (or more) the effort of a human-written one, because the modes of failure are completely different. For human-written MRs, a general heuristic of “if it looks good, it's good” is applicable to some extent, but LLMs are optimized to generate code that “looks good”, that makes reviewers' eyes glaze over, and that passes review successfully, regardless of its actual quality.
@IngaLovinde Huh, I don't find this at all. It looks like a featureless soup; that “eyes glaze over” effect, I guess, is a fail to me.
@IngaLovinde Actually, backing up, I think that's where I'm already a little sketched out by it. “Looks good, probably is good” is how a lot of supply chain attacks have slipped in.

@aredridel and that's one of the reasons why we have a web of trust of some kind, and changes by first-time contributors deserve extra scrutiny, and nobody would accept a huge new feature or a huge refactoring from an unknown first-time contributor.

And still, most of the time one can expect that contributors, even first-time ones, are acting in good faith and can be reviewed in good faith as collaborative contributors, not as adversaries who purposefully try to slip a vulnerability past code review, purposefully writing it so that it looks plausibly like benign code.
With LLM-generated code, code review should treat it as written by an adversary _every_ time. And reviewing code written by an adversary consumes much, much more effort than reviewing code written by a collaborator in good faith… and why would one even spend any effort reviewing code written by an adversary, when discarding that code and closing the MR is an option?

@IngaLovinde @aredridel
Are you describing CVE-2024-3094?

https://research.swtch.com/xz-timeline

Trusting long-time contributors and not really reviewing their code is what got almost all Linux boxes pwned…


@illogical_me @aredridel I'm not saying that you should not really review long-time contributors' code. I'm saying that a regular _thorough_ review (and in my experience, what most people in a corporate setting typically do is _way_ less than that) is nowhere near the kind of review that's needed to catch sophisticated adversaries.
xz is a small project, at least. But when you're working on a typical corporate project with hundreds of lines changed per person per day, and not all people are good at reviewing even regular code, those who are good at reviews simply won't have enough time in the day to review all the code in “adversarial” mode with extra scrutiny.
@IngaLovinde @aredridel When people describe corporate environments, I'm often shocked. You've worked in nicer places than I have. If someone accepts a basic code review comment like “you copy-pasted this instead of extending the function”, rather than shopping for another reviewer, I call that a win. That may be why AI doesn't upset me much: with moderate feedback, I get better results from it than I often did from humans.

@illogical_me @aredridel
> If someone accepts a basic code review comment like ‘you copy pasted this instead of extending the function,’ rather than shopping for another reviewer, I call that a win.

Sounds like your place doesn't really have any code reviews, only a complete cargo cult of a code review process.
Thankfully, none of the places I worked at were like that. Although most places I worked at before 2012 didn't have a code review process in any form, not even a cargo-cult one (probably because before 2012 it wasn't that well known).