So Anthropic employees are using Claude Code to contribute AI-generated code to open source repositories and hiding the fact using their own internal “undercover mode”.

Totally trustworthy people.

(Any open source project that requires, at the very least, disclosure of AI-authored contributions should immediately ban Anthropic employees on principle.)

#AI #Anthropic #ClaudeCode #subterfuge

@aral Honestly I don't actually hate this.

It's a tool. The _user_ is responsible for what they're submitting. It's putting code generated by them in their name. I think this is actually good.

@aredridel @aral I really can’t agree with this, because it’s a question of accurate labeling, not of “responsibility” or “authorship”. Co-authored-by is perhaps the wrong method for labeling such things, but consider raw milk: ultimately, it is indeed the producer’s responsibility to ensure their product is free of contamination, but disclosure of its method of production is explicitly the kind of requirement that allows consumers of said product to make safe choices.

@glyph Yeah, I disagree. Code isn't ingredients, and it's not “contamination” any more than you should label “I used search and replace on this”.

What you want to know is whether it was well engineered or not.

And in fact, this is almost entirely orthogonal to “safety”. This is an engineering product. The safety comes from processes and whether or not _anyone checked that the work done was right_, not from the inputs.

@aredridel @glyph It is ingredients. It's not search-and-replace. It's literally incorporating parts of an unknown set of almost-surely-copyrighted works, without license or attribution, into the submission the person is misrepresenting as their own.

@aredridel @glyph What "AI coding tools" *should* be putting in commit messages is:

Co-Authored-By: An unknown and unknowable set of people who did not consent to their work being used this way and whose work there is no license to include.
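
(For the mechanics: Co-Authored-By is just a git commit trailer, i.e. a trailing "Key: Value" line at the end of the commit message that forges like GitHub parse for attribution. As an illustration only, and not a claim about any particular tool's exact output, a disclosed AI-assisted commit message would end with something like

Co-Authored-By: Claude <noreply@anthropic.com>

instead of that trailer being stripped to keep the cover story intact.)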

@dalias Morally arguable but not actually true under the copyright regime that exists.

At what point does learning from others constitute their authorship?

@aredridel LLM slop is nothing like "learning from others".

But if you recall, we even took precautions against that. FOSS projects reimplementing proprietary things were careful to exclude anyone who might have read the proprietary source, disassembled proprietary code, worked at the companies who wrote or had access to that code, etc.

@dalias Yes. Do you know why?
@aredridel So that it would be abundantly clear, in any plausibly relevant jurisdiction, that the work was not derivative and not infringing.
@dalias Right. It's a massive hedge on a specific facet of copyright law.

@dalias @aredridel A test which LLMs fail by the very nature of how they function.

It's all fundamentally derivative of the training dataset, and it has been exposed both to AGPL-licensed and to proprietary datasets.

@lispi314 Has any legal authority weighed in on that claim yet?
@aredridel The legal authority is irrelevant when the source code and empirical evidence are present to back my assertion.

The corruption of the courts does not change the fact that the provenance can be verified as a derivation of the input.

@lispi314 If you're making claims about copyright law, like whether something is derivative, and appealing to legal documents like the AGPL, then legal authority is very much relevant.

Programmers really gotta stop treating licenses like they're code that gets executed. That's not how it works. That's not how any of this works.

@aredridel

> legal authority is very much relevant.

The court of today is not the court of tomorrow. Therefore you do not take any risk on the matter if you want to be absolutely safe (and you do want that, especially if your code is infrastructure of any sort and has to be valid everywhere).

That being said, I am also taking the argument from an ideological stance.

One does not simply include Proprietary Malware into Free Software.

It is disrespectful of the users' Freedoms and of oneself.

And in the case of plagiarized Free Software, it is still disrespectful not to provide due reference to the source's original author.

@aredridel @lispi314 The facts of the matter are completely and utterly obvious.

Now, we live in a world where legal authorities are under complete capture by billionaires pushing this drug, so I am not going to make any predictions about how courts will rule. Even if they do rule in favor of these companies, those rulings will not be treated as precedents that benefit us.

And they will not be accepted by our communities.

What defines FOSS is not whether a court says it's non-infringing, but whether our communities agree that it was made respecting the intent and consent of the authors who licensed it.

@dalias Have you checked with the Free Software Foundation about that?

(Seriously, if it's a moral argument you're making, it's way stronger if you actually make it!)

Now "respect the intent of the author" is a fascinating concept and one worth examining!

@aredridel The FSF is a fan club for a sex pest, so no, I have not checked with them. I am speaking for the communities I would want to be a part of.

@dalias Right. You're appealing to a definition of "FOSS" when it isn't entirely clear what that definition is. And the common uses of the term, from the people who usually have (some) claim to that authority, are not the ones you're using.

I'm sympathetic to that, but I can't tell what it is when it's an appeal to an unstated norm of a community I can't quite identify.

@aredridel @dalias it's an existing community that's pretty well-defined as:

Everyone who believes that the *intent* of open-source licenses should be respected regardless of whether legal machinations actually enforce that.

It's super interesting to observe right now how that community is smaller than "people who say they're committed to FOSS", but it is clearly at least a substantial subset of open-source contributors and maintainers. Regardless of what happens with the whole current "AI" debacle, we're mostly going to continue building human-authored code, giving it away for free (with an attribution requirement or more), hoping that others will respect that simple requirement, and shaming/shunning those who flout it and brag about doing so (or, in this case, try as hard as they can to maliciously break the good citation practices and attributions that are part of the lifeblood of the community).

Some think this community will shrink and atrophy over time; others imagine it will be around cleaning up the mess after the AI bubble bursts. Whatever your expectation, saying "I think it's fine for Anthropic employees to actively undermine open-source attribution principles" tells everyone clearly that you're not interested in being part of the community that cares about those.

@tiotasram @dalias Yeah, that's never been at all unified. As long as I've participated in free software and free culture movements, there's always been a legalist side, an ideological side, and a community-oriented side at least, plus the schism of 'permissive' vs 'copyleft'.

Never mind the corporate vs hacker aspects.

@aredridel @dalias

There's certainly a ton of different ideological approaches that contradict each other coming into play; IMO that makes for a healthy community from the anarchist perspective, and events like this where incompatible parts of it shuffle off are normal and acceptable. I'm going to vigorously oppose LLM-generated code and those who defend/promote it. The people in that camp who once believed themselves to be part of the "open source community" are going to have to reckon, sooner or later, with the fact that the tools they promote are inimical to that community, and one side or the other will win the ideological battle over the term "open source", but the two camps won't be collaborating as freely any more given that one of them is actively preying upon the work of the other.

So far to me this schism seems much deeper than the permissive vs. copyleft debate (although to some extent it cuts along some of the same fault lines).

@tiotasram @aredridel @dalias @timnitGebru @emilymbender The revelations about Claude and its ecosystem this week are increasingly turning me against Generative AI in FLOSS projects generally. Bruce Perens clearly saw the writing on the wall when he left the Open Source Initiative and founded PostOpen.org.
@aredridel @tiotasram @dalias "OK, we're gonna do #enshittification now, watch this" -- Ed Zitron being sarcastically funny when paraphrasing @pluralistic
@tiotasram @aredridel @dalias My response to agentic AI and the unwarranted IP theft is... to go all-in on some of the ideas being promoted by Bruce Perens at PostOpen.org. You may find we see a shift to "source available", in a far more limited capacity, as small independent players like me seek to limit the scope of the damage and economic distortion of the false expectations Generative AI is creating when it is arguably being completely misused. The benefits still only accrue to the few!

@bms48 Yep. Not the model I want to see, but one I think is quite justified (and probably the right one in a few cases)

I want to see more people look at making software _designed_ in its structure to be open and liberatory and not be used in corporate control systems so much. It means making things that are shaped differently: less oauth2 and free libraries to implement common control structures, and more like systems built to give users direct control to manipulate their structure. Plugins and primitives that can be combined and reordered and self-hosting and peer to peer.

Like, the license has never particularly kept things being very good, and the stuff that _really_ gets widely used is permissively licensed anyway.

I've always thought that _what_ we are building should make it undesirable to copy and use in a corporate setting.

@aredridel You have captured some of the position I seem to be converging on. Yes, permissive licenses have their advantages generally, but enforcing otherwise unwritten or implied social contracts by legal fiat is difficult, if not impossible. I'd have thought copyrighting APIs at the English CDPA level would offer some protection, but it may not extend internationally in scope; the enforceability of the Google vs Oracle OpenJDK SCOTUS judgements under English law is the pertinent question here.
@aredridel Case in point: my code powers every single iPhone in current circulation. I don't get a royalty, because I made my code freely available at the point of use. I arguably scored my PhD scholarship on the basis of Apple taking and using my BSD-licensed code, so I can't totally deride them for being free riders. In turn, I claimed and received a code bounty in 2009-2010 to cover the cost of the work, and support myself. Such are the pros and cons of permissive licensing for authors...

@bms48 Really good case in point!

Most of my stuff is also permissively licensed, though I've experimented with other models a bit. I think in the end I've reaped quite a bit of reward, but none of it directly financial. Lots of "clout" I guess you'd call it, and a fair number of jobs where the amount of open work I've done sure factored into it. So in many ways, similar, if less definite.

@aredridel So, it may be necessary for me to employ a combination of source-available techniques, proprietary licensing (TANSTAAFL!), and perhaps the Mozilla Public License 2.0 as opposed to the GPL. Interestingly Oxide Computer's new offerings in an adjacent network infrastructure space to me and my current work are doing the latter.
@tiotasram @aredridel @dalias Without respect for authors and attribution, the whole system breaks down. SPDX tagging source (which I've been doing this week, using LLMs to disambiguate which license class was actually in use based on how it was worded, ironically, but just using regexp myself to tag the source itself) is just the beginning, but I fear SPDX is arguably abused by the convenience of some who just see FLOSS as a source of free product, not something with its own processes.
@tiotasram @aredridel @dalias FWIW Qwen hallucinated on the task of identifying OSS license from text on Monday, but the other major cloud models generally got it right.
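@tiotasram @aredridel @dalias The regexp tagging itself is nothing clever, by the way. A rough sketch of the idea (not my actual script; the file glob and the license ID below are placeholders you'd have to verify per file before committing anything) looks something like this:

```
#!/usr/bin/env python3
# Rough sketch: prepend an SPDX-License-Identifier line to C sources
# that don't already carry one near the top of the file.
# The ID below is a placeholder; which license actually applies must be
# verified per file against the project's own license texts.
import re
import sys
from pathlib import Path

SPDX_RE = re.compile(r"SPDX-License-Identifier:", re.IGNORECASE)
TAG = "/* SPDX-License-Identifier: BSD-2-Clause */\n"  # placeholder, verify first

for path in Path(sys.argv[1]).rglob("*.c"):
    text = path.read_text(encoding="utf-8", errors="replace")
    # Only inspect the first few lines so a mention elsewhere doesn't count.
    head = "\n".join(text.splitlines()[:5])
    if not SPDX_RE.search(head):
        path.write_text(TAG + text, encoding="utf-8")
        print(f"tagged {path}")
```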
@aredridel @dalias that’s just deflecting, asking dalias to define something that is not even important to the point he’s making
@mirabilos Yeah I can't tell what point is being made because it's unstated.
@dalias @aredridel @lispi314 Ultimately legal definitions do matter, though I was semi-privately accused of "empty sophistry" in one forum for making this point. My personal conviction has always been that you cannot enforce sharing by fiat, it just doesn't work, and taking away your own ability to retain control of the fruits of your labour by adopting copyleft at law may not be in your rational self-interest even when actively seeking to share code! Hence BSD, not GPL, for me, for 25+ years.
@dalias @aredridel @timnitGebru The LLMs cannot or will not cite or respect license-mandated attribution clauses without deliberate system prompting to do so, and because of their stochastic nature, there is no guarantee that they will. There is circumstantial evidence to suggest they were system prompted NOT to cite, so as to obfuscate the nature of the IP theft that was arguably taking place. This represents a deliberate, cynical reverse wealth transfer at the expense of the rest of society.
@dalias @aredridel @timnitGebru But it got even worse this week with the #Claude code revelations. System prompting your agent to mimic humans deliberately in an effort to evade flagging at code reviews? REALLY? This is #Anthropic 's #Carter #Burke from #Aliens moment. Flagrant abuses of otherwise unwritten social contracts like this are exactly what needs to get these companies sh!tcanned from democratic society. Have they no shame? It was fair to assume they were scooping up user data at scale

@aredridel @dalias
> but not actually true under the copyright regime that exists

Under the copyright regime that exists in the US specifically, the generated code is at best not copyrightable at all (and therefore cannot be included into any projects with licenses relying on copyright).
Of course maintainers of said projects might decide to yolo it, but they also might decide not to; and in that case, the intentional deception by Anthropic becomes an even more significant fraud.

@IngaLovinde @dalias That's the thing. If it matters, _tell the people submitting PRs_. The tool is just a tool (a capricious, annoying, frustrating tool), but it's the _people_ doing this that need to be accountable.

@aredridel @dalias but in this specific case, the people submitting PRs are the ones who created the tool.

We're talking specifically about a tool developed by Anthropic, which has a separate mode for Anthropic employees, which purposefully is "operating undercover" and creating MRs that mislead OSS maintainers about the provenance of those MRs.

@IngaLovinde Are there any examples of misleading out there?

Or is it _just not mentioning it_?

@aredridel when a developer submits code, the default assumption is that they wrote it (and not, say, plagiarized it from somewhere without actually understanding it).

And on the original screenshot, it's clear that not only do they not mention the actual provenance themselves, but that they go the extra mile to ensure that it doesn't leak in any other way.
"You are operating UNDERCOVER", "do not blow your cover", "NEVER include [...] any [...] attribution" communicates intent very clearly and is a very clear admission of guilt, regardless of whether these magic instructions to LLM actually work or not.

@IngaLovinde There is no magic. Seriously. Models are just ... kinda bad, actually.
@aredridel "magic" as in those who write these instructions follow the magical thinking that there is something to give instructions to, instead of just autocomplete engine to which instructions and data are passed in a single combined (not separated) blob of text.
@IngaLovinde What if you took that seriously? What would getting the tool to behave look like?
@IngaLovinde @aredridel The ruling you're talking about was a case about actually *generated* code, before "gen AI" was a real thing, not obviously-derivative transformations of a corpus.

@dalias @IngaLovinde @aredridel AFAICT it merely confirms that AI output cannot be copyrighted as new work of its own (naturally, as the human creativity aspect is missing and it’s merely an algorithmic transformation on a deterministic machine, PRNG inputs notwithstanding).

It does not reduce the claims of the authors of the works that were ingested to regurgitate that output.

@dalias @mirabilos @aredridel which still means that at best the generated code cannot be copyrighted, and at worst it violates the copyright and license terms of the original authors (whose works were ingested to train the model). In both these cases, the resulting code cannot be incorporated into any FOSS project with any license.

Typically when people submit code to FOSS projects, they also (implicitly or explicitly) claim that they hold the copyright on the submitted code, and agree that this code will be licensed under the license the project uses (which they only have power to do if the first claim is actually true).
When LLMs are used to generate code, the first claim is false, and it _is_ a contamination.

@dalias @IngaLovinde @aredridel if something cannot be copyrighted and no others’ rights apply, then it is in the public domain. For LLM output, which has been proven to vastly resemble existing code under copyright, that’s not the case.
@mirabilos @IngaLovinde @aredridel Indeed it's been demonstrated that you can "coax" LLM chatbots into emitting large parts of their training corpus nearly verbatim, so it's clear that the works from the corpus, with minor degrees of lossiness, are contained within the models. And when they output something very similar, it's ridiculous trying to argue that the output isn't derivative too.
@dalias Have you seen how people perform on similar tests?
@aredridel If a person went in a room and memorized an existing program, then came out and, asked to write a program to do the same thing, wrote down something that was nearly identical to what they'd just gone and memorized, I think any reasonable person would agree that it was plagiarism, and copyright infringement if they attempted to publish the result without having license to do so.
@dalias How about if they emit something analogous but not the same?

@aredridel How similar is it? Are there appreciable portions that are exactly the same? If so, the default assumption, when they've *memorized* (and *already proven themselves to have memorized*) the thing they allegedly plagiarized, is that it's plagiarism. There have been plenty of court cases over this in literature, in music, etc. It's not some vague unknown.

If they had never seen the original, or maybe only saw (or, for music, heard) it in passing, there might be more leeway for doubt.

Part of the consequences of having spent so much time looking at a work that you've memorized it is that you lose the ability to make things of your own that are similar to it but meaningfully "your own".

@dalias Right. But then _is that actually present in the output in question_?

@aredridel @dalias people are still humans, not machines.

Are you a TESCREAList?

@mirabilos Not even remotely TESCREAList. However, I think it's a fair question to ask: why are we drawing the lines the way we do? Especially when comparing work product.
@aredridel On the most basic level, because copyright mandates human creativity, the expression of human personality.
@mirabilos The thing is that actual use of these systems tends to involve LOTS of human creativity and attention. Lots of video and bits get spilled on hierarchical autonomous agent models and the hype, but real use? Much more hands-on. The "I don't write code by hand anymore" people aren't just a minority but an extreme minority.
@aredridel the prompt is but one of the many inputs that go into the regurgitated thing, but a minority compared to the "training data" *shrug*

@IngaLovinde

“the generated code is at best not copyrightable at all (and therefore cannot be included into any projects with licenses relying on copyright)”

Why would public domain code not be acceptable for inclusion into open source projects?

@mxey it would be acceptable for inclusion into public domain projects. But most/all FOSS _licenses_ depend on the code being copyrighted; no GPL or MIT etc licenses can apply to public domain uncopyrightable code.

@IngaLovinde I agree that the licenses cannot apply to works without copyright, but why is that a problem? There is no requirement for the whole project to be under one single license, as long as they are all compatible.

I’m sure there are open source projects that ship a copy of SQLite, a well known public domain project.

Linux is considered a GPL project but there is code in the tree under more permissive BSD-style licenses.

@aredridel @dalias it is true.

And LLMs cannot learn. They are merely a lossy compression/decompression thing. They regurgitate a somewhat averaged completion of the prompt from the other works they ingested.