So Anthropic employees are using Claude Code to contribute AI-generated code to open source repositories and hiding the fact using their own internal “undercover mode”.

Totally trustworthy people.

(Any open source project that requires, at the very least, disclosure of AI-authored contributions should immediately ban Anthropic employees on principle.)

#AI #Anthropic #ClaudeCode #subterfuge

@aral Honestly I don't actually hate this.

It's a tool. The _user_ is responsible for what they're submitting. The code it generates goes out under their name. I think this is actually good.

@aredridel @aral I really can’t agree with this, because it’s a question of accurate labeling, not of “responsibility” or “authorship”. Co-authored-by is perhaps the wrong method for labeling such things, but consider raw milk: ultimately, it is indeed the producer’s responsibility to ensure their product is free of contamination, but disclosure of its method of production is explicitly the kind of requirement that allows consumers of that product to make safe choices.

@glyph Yeah, I disagree. Code isn't ingredients, and it's not “contamination” any more than you should label “I used search and replace on this”.

What you want to know is whether it was well engineered or not.

And in fact, this is almost entirely orthogonal to “safety”. This is an engineering product. The safety comes from processes and from whether _anyone checked that the work done was right_, not from the inputs.

@aredridel @glyph It is ingredients. It's not search-and-replace. It's literally incorporating parts of an unknown set of almost-surely-copyrighted works, without license or attribution, into the submission the person is misrepresenting as their own.

@aredridel @glyph What "AI coding tools" *should* be putting in commit messages is:

Co-Authored-By: An unknown and unknowable set of people who did not consent to their work being used this way, and for whose work there is no license permitting inclusion.

@dalias Morally arguable but not actually true under the copyright regime that exists.

At what point does learning from others constitute their authorship?

@aredridel LLM slop is nothing like "learning from others".

But if you recall, we even took precautions against that. FOSS projects reimplementing proprietary things were careful to exclude anyone who might have read the proprietary source, disassembled proprietary code, worked at the companies that wrote or had access to that code, etc.

@dalias Yes. Do you know why?
@aredridel So that it would be abundantly clear, in any plausibly relevant jurisdiction, that the work was not derivative and not infringing.
@dalias Right. It's a massive hedge on a specific facet of copyright law.

@dalias @aredridel A test which LLMs fail by the very nature of how they work.

It's all fundamentally derivative of the training dataset, and the model has been exposed both to AGPL-licensed and to proprietary code.

@lispi314 Has any legal authority weighed in on that claim yet?
@aredridel The legal authority is irrelevant when the source code and empirical evidence are present to back my assertion.

The corruption of the courts does not change the fact that the provenance can be verified as a derivation of the input.

@lispi314 If you're making claims about copyright law (like whether something is derivative) and about legal documents like the AGPL, legal authority is very much relevant.

Programmers really gotta stop treating licenses like they're code that gets executed. That's not how it works. That's not how any of this works.

@aredridel

> legal authority is very much relevant.

The court of today is not the court of tomorrow. Therefore you do not take any risk on the matter if you want to be absolutely safe (and you do want that, especially if your code is infrastructure of any sort and has to be valid everywhere).

That being said, I am also taking the argument from an ideological stance.

One does not simply include Proprietary Malware into Free Software.

It is disrespectful to the users' Freedoms and to oneself.

And in the case of plagiarized Free Software, it is still disrespectful not to provide due reference to the source's original author.

@aredridel @lispi314 The facts of the matter are completely and utterly obvious.

Now, we live in a world where legal authorities are under complete capture by billionaires pushing this drug, so I am not going to make any predictions about how courts will rule. Even if they do rule in favor of these companies, those rulings will not be treated as precedents that benefit us.

And they will not be accepted by our communities.

What defines FOSS is not whether a court says it's non-infringing, but whether our communities agree that it was made respecting the intent and consent of the authors who licensed it.

@dalias Have you checked with the Free Software Foundation about that?

(Seriously, if it's a moral argument you're making, it's way stronger if you actually make it!)

Now "respect the intent of the author" is a fascinating concept and one worth examining!

@aredridel The FSF is a fan club for a sex pest, so no, I have not checked with them. I am speaking for the communities I would want to be a part of.

@dalias Right. You're appealing to a definition of "FOSS" that isn't entirely clear. And the definitions that do have (some) claim to authority, the common uses of the term, are not the ones you're using.

I'm sympathetic to that, but I can't evaluate an appeal to an unstated norm of a community that I can't quite identify.

@aredridel @dalias it's an existing community that's pretty well-defined as:

Everyone who believes that the *intent* of open-source licenses should be respected regardless of whether legal machinations actually enforce that.

It's super interesting to observe right now how that community is smaller than "people who say they're committed to FOSS", but it is clearly at least a substantial subset of open-source contributors and maintainers. Regardless of what happens with the whole current "AI" debacle, we're mostly going to continue building human-authored code, giving it away for free (with an attribution requirement or more), hoping that others will respect that simple requirement, and shaming/shunning those who flout it and brag about doing so (or, in this case, try as hard as they can to maliciously break the good citation practices and attributions that are part of the lifeblood of the community).

Some think this community will shrink and atrophy over time; others imagine it will be around cleaning up the mess after the AI bubble bursts. Whatever your expectation, saying "I think it's fine for Anthropic employees to actively undermine open-source attribution principles" tells everyone clearly that you're not interested in being part of the community that cares about those principles.

@tiotasram @dalias Yeah, that's never been at all unified. As long as I've participated in free software and free culture movements, there's always been a legalist side, an ideological side, and a community oriented side at least, plus the schism of 'permissive' vs 'copyleft'.

Never mind the corporate vs hacker aspects.

@aredridel @dalias

There's certainly a ton of different ideological approaches that contradict each other coming into play; IMO that makes for a healthy community from the anarchist perspective, and events like this where incompatible parts of it shuffle off are normal and acceptable. I'm going to vigorously oppose LLM-generated code and those who defend/promote it. The people in that camp who once believed themselves to be part of the "open source community" are going to have to reckon sooner or later with the fact that the tools they promote are inimical to that community, and one side or the other will win the ideological battle over the term "open source"; but the two camps won't be collaborating as freely any more, given that one of them is actively preying upon the work of the other.

So far to me this schism seems much deeper than the permissive vs. copyleft debate (although to some extent it cuts along some of the same fault lines).

@tiotasram @aredridel @dalias @timnitGebru @emilymbender The revelations about Claude and its ecosystem this week are increasingly weighting me against Generative AI in FLOSS projects generally. Bruce Perens clearly saw the writing on the wall when he left the Open Source Initiative and founded PostOpen.org.
@aredridel @tiotasram @dalias "OK, we're gonna do #enshittification now, watch this" -- Ed Zitron being sarcastically funny when paraphrasing @pluralistic
@tiotasram @aredridel @dalias My response to agentic AI and the unwarranted IP theft is... to go all-in on some of the ideas being promoted by Bruce Perens at PostOpen.org. You may find we see a shift to "source available", in a far more limited capacity, as small independent players like me seek to limit the scope of the damage and economic distortion of the false expectations Generative AI is creating when it is arguably being completely misused. The benefits still only accrue to the few!

@bms48 Yep. Not the model I want to see, but one I think is quite justified (and probably the right one in a few cases)

I want to see more people look at making software _designed_ in its structure to be open and liberatory, and not so usable in corporate control systems. It means making things that are shaped differently: fewer OAuth2 flows and free libraries implementing common control structures, and more systems built to give users direct control over their structure: plugins and primitives that can be combined and reordered, self-hosting, and peer-to-peer.

Like, the license has never particularly kept things good, and the stuff that _really_ gets widely used is permissively licensed anyway.

I've always thought that _what_ we are building should make it undesirable to copy and use in a corporate setting.

@aredridel You have captured some of the position I seem to be converging on. Yes, permissive licenses have their advantages generally, but enforcing otherwise unwritten or implied social contracts by legal fiat is difficult, if not impossible. I'd have thought copyrighting APIs under the English CDPA would offer some protection, but it may not extend internationally in scope; the enforceability of the Google v. Oracle (OpenJDK) SCOTUS judgement at English law is the pertinent question here.
@aredridel Case in point: my code powers every single iPhone in current circulation. I don't get a royalty, because I made my code freely available at the point of use. I arguably scored my PhD scholarship on the basis of Apple taking and using my BSD licensed code, so I can't totally deride them for being free riders. In turn, I claimed and received a code bounty in 2009-2010 to cover the cost of the work, and support myself. Such are the pros and cons of permissive licensing for authors...

@bms48 Really good case in point!

Most of my stuff is also permissively licensed, though I've experimented with other models a bit. I think in the end I've reaped quite a bit of reward, but none of it directly financial. Lots of "clout" I guess you'd call it, and a fair number of jobs where the amount of open work I've done sure factored into it. So in many ways, similar, if less definite.

@aredridel So, it may be necessary for me to employ a combination of source-available techniques, proprietary licensing (TANSTAAFL!), and perhaps the Mozilla Public License 2.0 as opposed to the GPL. Interestingly Oxide Computer's new offerings in an adjacent network infrastructure space to me and my current work are doing the latter.
@tiotasram @aredridel @dalias Without respect for authors and attribution, the whole system breaks down. SPDX tagging source (which I've been doing this week, using LLMs to disambiguate which license class was actually in use based on how it was worded, ironically, but just using regexp myself to tag the source itself) is just the beginning, but I fear SPDX is arguably abused by the convenience of some who just see FLOSS as a source of free product, not something with its own processes.
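A minimal sketch of the regexp-only tagging step described above. The license patterns, function names, and comment style below are my own illustrative assumptions, not the poster's actual script:

```python
import re

# Hypothetical sketch: classify a license text with plain regular
# expressions (no LLM involved) and prepend an SPDX tag to a source
# file. The patterns are deliberately simplistic examples.
LICENSE_PATTERNS = {
    "MIT": re.compile(r"Permission is hereby granted, free of charge", re.I),
    "BSD-3-Clause": re.compile(r"Redistributions of source code must retain", re.I),
    "GPL-2.0-only": re.compile(r"GNU General Public License.*version 2", re.I | re.S),
}

def guess_spdx(license_text):
    """Return the first SPDX identifier whose pattern matches, else None."""
    for spdx_id, pattern in LICENSE_PATTERNS.items():
        if pattern.search(license_text):
            return spdx_id
    return None

def tag_source(source, spdx_id):
    """Prepend an SPDX tag unless the file already carries one."""
    if "SPDX-License-Identifier:" in source:
        return source
    return f"// SPDX-License-Identifier: {spdx_id}\n{source}"
```

The real work, as noted above, is disambiguating which license class the wording actually corresponds to; the regex only mechanizes the tagging once that call has been made.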
@tiotasram @aredridel @dalias FWIW Qwen hallucinated on the task of identifying OSS license from text on Monday, but the other major cloud models generally got it right.
@aredridel @dalias that’s just deflecting, asking dalias to define something that is not even important to the point he’s making
@mirabilos Yeah I can't tell what point is being made because it's unstated.
@dalias @aredridel @lispi314 Ultimately legal definitions do matter, though I was semi-privately accused of "empty sophistry" in one forum for making this point. My personal conviction has always been that you cannot enforce sharing by fiat, it just doesn't work, and taking away your own ability to retain control of the fruits of your labour by adopting copyleft at law may not be in your rational self-interest even when actively seeking to share code! Hence BSD, not GPL, for me, for 25+ years.
@dalias @aredridel @timnitGebru The LLMs cannot or will not cite or respect license-mandated attribution clauses without deliberate system prompting to do so, and because of their stochastic nature, there is no guarantee that they will. There is circumstantial evidence to suggest they were system prompted NOT to cite, so as to obfuscate the nature of the IP theft that was arguably taking place. This represents a deliberate, cynical reverse wealth transfer at the expense of the rest of society.
@dalias @aredridel @timnitGebru But it got even worse this week with the #Claude code revelations. System prompting your agent to mimic humans deliberately in an effort to evade flagging at code reviews? REALLY? This is #Anthropic 's #Carter #Burke from #Aliens moment. Flagrant abuses of otherwise unwritten social contracts like this are exactly what needs to get these companies sh!tcanned from democratic society. Have they no shame? It was fair to assume they were scooping up user data at scale

@aredridel @dalias
> but not actually true under the copyright regime that exists

Under the copyright regime that exists in the US specifically, the generated code is at best not copyrightable at all (and therefore cannot be included in any projects whose licenses rely on copyright).
Of course maintainers of said projects might decide to yolo it, but they also might decide not to; and in that case, the intentional deception by Anthropic becomes even more significant fraud.

@IngaLovinde @dalias That's the thing. If it matters, _tell the people submitting PRs_. The tool is just a tool (a capricious, annoying, frustrating tool), but it's the _people_ doing this who need to be accountable.
@IngaLovinde @aredridel The ruling you're talking about was a case about actually *generated* code, before "gen AI" was a real thing, not obviously-derivative transformations of a corpus.

@dalias @IngaLovinde @aredridel AFAICT it merely confirms that AI output cannot be copyrighted as new work of its own (naturally, as the human creativity aspect is missing and it’s merely an algorithmic transformation on a deterministic machine, PRNG inputs notwithstanding).

It does not reduce the claims of the authors of the works it ingested to regurgitate the output.

@aredridel @dalias it is true.

And LLMs cannot learn. They are merely a lossy compression/decompression thing. They regurgitate a somewhat averaged completion of the prompt from the other works they ingested.

@mirabilos
Not strictly true, though I get what you're getting at.

There's a few phenomena going on that shape these tools beyond that.

- Emergent complexity
- "Memory" records
- Embedded context
- Incorporating social inputs

I still think they are strictly tools, but ones that can self-adapt with alarming power if you configure them right.

They have a medianizing effect on a lot of their output. (That's actually one of the reasons they're good at code: we generally want code to be "normal". It's also one of the many reasons they're pretty bad for more artistic creative work, morally and technically.)

But that's not the same as only repeating the median. The temperature, the randomness injected in, makes them jump to stuff that is at times nonsensical but also at times clever. It's just randomness, but then with a heap of context and congruence applied, which is rather interesting.
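As a toy illustration of that temperature knob (not any vendor's actual sampler; the logit values in the usage are made up): dividing logits by a temperature before softmax sampling sharpens the distribution toward the most likely token when temperature is low, and flattens it toward uniform randomness when it is high.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random.random):
    """Sample an index from softmax(logits / temperature).

    Low temperature collapses onto the highest-logit token (the
    "median-ish" behavior); high temperature flattens the distribution
    so unlikely tokens get picked more often.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling from the resulting distribution.
    r = rng()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1  # guard against floating-point rounding
```

For example, `sample_with_temperature([2.0, 0.5, 0.1], 0.01)` all but always returns index 0, while a high temperature spreads picks across all three indices.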

@aredridel that’s all not learning because learning is not something an algorithm can do

@dalias Got examples to show to support that position?

Remember copyright is a _legal_ regime and the legal regime seems quite oriented toward that NOT being the case.

@aredridel I am not going to argue over the complete capture of the US legal system by fascists and try to prove that we have or will have some court rule in our favor.

FOSS is global, the arc of justice is long, and it is completely irresponsible to assume that there will not be a reckoning, one day, somewhere, that renders former-FOSS full of this slop infringing in at least some jurisdictions.

This is stuff we thought deeply about and took serious precautions against, not relying on public domain working the same way everywhere, etc., until the AI bros just came along and said "oh nothing matters anymore just do whatever you want as long as you're powerful".

@aredridel @dalias the Text and Data Mining exception to copyright law only applies ⓐ to models for analytics (discovery of patterns, trends and correlations; §44b UrhG), and ⓑ to works whose right holders didn’t opt out (ibid. p.3); there’s absolutely no basis on which “genAI” even could be considered permissible, as both the reproduction (§16) and the right to make changes, editions and other derivatives (§23) are protected by law, by default, and always require a licence.

So, thrice denied, by existing law.