So Anthropic employees are using Claude Code to contribute AI-generated code to open source repositories and hiding the fact using their own internal “undercover mode”.

Totally trustworthy people.

(Any open source project that requires, at the very least, disclosure of AI-authored contributions should immediately ban Anthropic employees on principle.)

#AI #Anthropic #ClaudeCode #subterfuge

@aral Honestly I don't actually hate this.

It's a tool. The _user_ is responsible for what they're submitting: it's putting code they generated under their own name. I think this is actually good.

@aredridel @aral I really can’t agree with this, because it’s a question of accurate labeling, not of “responsibility” or “authorship”. co-authored-by is perhaps the wrong method for labeling such things, but consider raw milk: ultimately, it is indeed the producer’s responsibility to ensure their product is free of contamination, but disclosure of its method of production is explicitly the kind of requirement that allows consumers of said product to make safe choices

@glyph Yeah, I disagree. Code isn't ingredients and it's not “contamination” any more than you should label “I used search and replace on this”

What you want to know is whether it was well engineered or not.

And in fact, this is almost entirely orthogonal to “safety”. This is an engineering product. The safety comes from processes and whether or not _anyone checked that the work done was right_, not the inputs.

@aredridel "raw milk" isn't ingredients either, the difference is one of process, which is why I used it as an example. Raw milk contamination is more likely because the processes to keep it safe are harder to follow, require more continuous diligence on the part of the operators of that process, and thus contribute to more frequent failures. LLM output is exactly the same: it provokes vigilance decay.
@aredridel "search and replace" is not a fair comparison, because search and replace does *not* cause vigilance decay, or risk of unknowing copyright infringement, etc., in the same way. "raw milk" and "grass fed" are just like… completely different disclosures with different consequential implications
@glyph Actually, search and replace _does_ do that, and in fact I was bitten by vigilance decay in a search and replace problem literally yesterday. The comparison was intended.
@aredridel you are technically correct here (and indeed any automated tool with repeated human interaction may provoke _some_ measure of vigilance decay; one could argue that "flaky tests" cause it too), but I feel like you're talking past the actual argument here.

@glyph I'm specifically arguing that it's the _exact same phenomenon writ larger_ (which is a meaningful difference!)

But it's a difference in amount not kind.

Either you build processes to check things (“do engineering”) or you don't (“vibes”)

@aredridel There are scales where differences in degree _become_ differences in kind.

Consider a more closely related phenomenon. There are many tools to check C/C++ code for memory safety errors, and unsafe Rust code may exhibit exactly the same unsafe behaviors. Yet C/C++ code and Rust code are categorically different in terms of the level of memory safety one may expect them to provide.

@aredridel Here we have an established "engineering" process, i.e. code review and continuous integration, designed for catching defects and process failures in good-faith code produced by humans with an understanding of the system under development. That process is then subjected to a new type of code generation, where a machine that *maximizes plausibility while minimizing effort* is throwing much larger volumes of code against the same mechanism. That's not the same process!

@glyph Yes, though I disagree with parts of it: it's changed the system and now we're dealing with the bottlenecks appearing in new places. Not always good ones!

But I don't think this is a change in kind. It's moved the problem in _really familiar_ ways to me, actually. It's what happens when you unleash people on a codebase who don't care for others, who offload work. You can rein that in, but you need feedback in the system to do it.

@aredridel @glyph I think it's different in that the impact on people who _do_ care is still very much there.

For some people there's a very positive emotional response to generating code (it's fun to build things! it's magic! no need to _learn_, which is always unpleasant, though that last one is likely less conscious).

OTOH code review is never fun, and now you have to do 4× as much, if not more, so you have a very negative emotional response.

And so there's a very strong emotional push to auto-generate code, and a very strong emotional push to start skipping reviews and start post-facto rationalizing why this is OK (there's tests! The AI can fix it later even if you don't understand it! etc.; you can watch people going through this in real time).

And this process can happen to people who care about others. Taking away this unpleasant burden of code review is helping your coworkers suffer less, after all. Taking away the emotional pain of thinking and learning is also helping your coworkers suffer less.

@itamarst @aredridel @glyph it's not even just 4× as much; every MR requires 4× (or more) as much effort as a human-written one, because the modes of failure are completely different. For human-written MRs a general heuristic of "if it looks good, it's good" is applicable to some extent, but LLMs are optimized to generate code that "looks good", that makes reviewers' eyes glaze over, and that passes review successfully, regardless of its actual quality.
@IngaLovinde Huh, I don't find this at all. It looks like a featureless soup — that ‘eyes glaze over’, I guess, is a fail to me.
@IngaLovinde Actually, backing up, I think that's where I'm already a little sketched out by it. “Looks good, probably is good” is how a lot of the supply chain attacks have slipped in.