So Anthropic employees are using Claude Code to contribute AI-generated code to open source repositories and hiding the fact using their own internal “undercover mode”.

Totally trustworthy people.

(At the very least, any open source project that requires disclosure of AI-authored contributions should immediately ban Anthropic employees on principle.)

#AI #Anthropic #ClaudeCode #subterfuge

@aral Honestly I don't actually hate this.

It's a tool. The _user_ is responsible for what they're submitting; the generated code goes out under their name. I think this is actually good.

@aredridel @aral I really can’t agree with this, because it’s a question of accurate labeling, not of “responsibility” or “authorship”. Co-authored-by is perhaps the wrong method for labeling such things, but consider raw milk. Ultimately, it is indeed the producer’s responsibility to ensure their product is free of contamination. But disclosure of its method of production is explicitly the kind of requirement that allows consumers of said product to make safe choices.

@glyph Yeah, I disagree. Code isn't ingredients and it's not “contamination” any more than you should label “I used search and replace on this”

What you want to know is whether it was well engineered or not.

And in fact, this is almost entirely orthogonal to “safety”. This is an engineering product. The safety comes from processes and whether or not _anyone checked that the work was right_, not the inputs.

@aredridel "raw milk" isn't ingredients either, the difference is one of process, which is why I used it as an example. Raw milk contamination is more likely because the processes to keep it safe are harder to follow, require more continuous diligence on the part of the operators of that process, and thus contribute to more frequent failures. LLM output is exactly the same: it provokes vigilance decay.
@aredridel "search and replace" is not a fair comparison because search and replace does *not* cause vigilance decay, or risk of unknowing copyright infringement, etc. in the same way that "raw milk" and "grass fed" are just like… completely different disclosures with different consequential implications
@glyph Actually search and replace _does_ do that, and in fact I was bitten by vigilance decay in a search and replace problem literally yesterday. The comparison was intended.
@aredridel You are technically correct here (and indeed any automated tool with repeated human interaction may provoke _some_ measure of vigilance decay; one could argue that "flaky tests" cause it too), but I feel like you're talking past the actual argument here.

@glyph I'm specifically arguing that it's the _exact same phenomenon writ larger_ (which is a meaningful difference!)

But it's a difference in amount, not kind.

Either you build processes to check things ("do engineering") or you don't (“vibes”)

@aredridel There are scales where differences in degree _become_ differences in kind.

Consider a more closely related phenomenon. There are many tools to check C/C++ code for memory safety errors. And, unsafe Rust code may exhibit exactly the same unsafe behaviors. Yet, C/C++ code and Rust code are categorically different in terms of the level of memory safety one may expect them to provide.
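(A minimal sketch of that distinction, hedged as an illustration rather than anything from the thread: safe Rust rejects at the language level the out-of-bounds access that C, or a raw-pointer `unsafe` block, would leave as undefined behavior for external tools to maybe catch.)

```rust
fn main() {
    let buf = [0u8; 4];

    // Safe Rust: out-of-bounds access is caught by the language itself.
    // `buf[10]` would panic at runtime; the checked API returns None.
    assert_eq!(buf.get(10), None);

    // In C, or behind `unsafe { *buf.as_ptr().add(10) }`, the same access
    // is undefined behavior: sanitizers *may* flag it, nothing guarantees it.
    println!("out-of-bounds read rejected, not merely detected");
}
```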

@aredridel Here we have an established "engineering" process, i.e. code review and continuous integration, designed to catch defects and process failures in good-faith code produced by humans with an understanding of the system under development. That process is then subjected to a new type of code generation, where a machine that *maximizes plausibility while minimizing effort* is throwing much larger volumes of code against the same mechanism. That's not the same process!

@glyph Yes, though I disagree with parts of it: it's changed the system and now we're dealing with the bottlenecks appearing in new places. Not always good ones!

But I don't think this is a change in kind. It's moved the problem in _really familiar_ ways to me, actually. It's what happens when you unleash people on a codebase who don't care for others, who offload work. You can rein that in, but you need feedback in the system to do it.

@aredridel @glyph I think it's different in that the impact on people who _do_ care is still very much there.

For some people there's a very positive emotional response to generating code (it's fun to build things! it's magic! no need to _learn_ which is always unpleasant, though that last one is likely less conscious).

OTOH code review is never fun, and now you have to do 4× as much, if not more, so you have a very negative emotional response.

And so there's a very strong emotional push to auto-generate code, and a very strong emotional push to start skipping reviews and start post-facto rationalizing why this is OK (there's tests! The AI can fix it later even if you don't understand it! etc, you can watch people going through this in real time).

And this process can happen to people who care about others. Taking away this unpleasant burden of code review is helping your coworkers suffer less, after all. Taking away the emotional pain of thinking and learning is also helping your coworkers suffer less.

@itamarst Yes, this! This is one of the failure modes we need to steer around.

One of the things we can do is turn UP the standards, rather than down. If you're generating code for PRs, you now have no excuse not to Get It Right. And it's extremely reasonable to be quite rude to someone who's dumped slop on us.

@aredridel I've been interviewing for jobs, and I've asked about AI tools, and one guy told me "if you submit slop you'll be flayed" and that probably has better outcomes, yes.

But also I've heard "we're trying to figure out how to deal with quality" and that... didn't seem promising.

And I'm sure there are organizations where if you push back on quality, management response is to take away the requirement for code review. And that is another qualitative difference: the push for LLM code generation is often aggressively top-down. So the CEO who previously paid no attention to development processes is now intervening to change how they're done.

@itamarst Yeah, those exist! And sometimes that's even the right answer.

And we're _all_ trying to figure out quality right now, because this has been a change to the system.

@aredridel I am skeptical that we _all_ care about quality. My impression is a huge proportion of management level believe in the Magic of AI, or at least the magic of getting more work out of those fucking expensive workers they're wasting money on, and therefore cannot conceive or admit that quality might be a problem. Let alone identify long-term issues like skill and knowledge degradation, whose impacts show up as reduced quality.

@itamarst @aredridel The typical exec only cares about quality to the degree that it impacts immediate profitability. This is why addressing technical debt is generally deprioritized relative to feature work, despite its immense (but less visible) organizational cost.

People keep saying "more guardrails!" will solve this, but every time we disconnect from the implementation and prompt our way through, we have a harder time understanding what we're building. It's the path of least resistance.

@dandean Curious: what is that opinion about the 'typical exec' based on?
@aredridel 20+ years of working with execs