After my repeated posts / boosts arguing that in OSS we’ve overemphasized licenses and underemphasized community, governance, and sustainability…I actually have a license question:

What’s the current thinking on licenses that lay the legal groundwork for action against people using OSS source code for LLM training without seeking permission or offering compensation?

1/2

The obvious answer is copyleft-type licenses.

(1) Has anybody done legal analysis on that beyond the obvious? I don’t think LLM training on copyleft code has been tested in court yet…? (Even LLM training on more restrictively licensed works seems to be surviving court challenge….)

(2) Are there copyleft licenses (i.e. “derived works must be similarly licensed”) out there that don’t have the Stink of Stallman on them? Or is GPL v3 still just the way to go despite the smell?

2/2

OK, so apparently I shouldn’t have said “beyond the obvious,” and the obvious needs stating:

(1) Copyright licenses very clearly •do• allow the copyright holder to determine who may use a work and for what purposes, at least when such use would be otherwise prohibited without a license. That is how the law works. Rightly or wrongly, empires are built on this: “Streaming service XYZ may offer this song for streaming but not for download until this date.” Copyleft is one example of this principle in action.

(1a) The thing that prevents discriminatory licensing (such as in Daniel’s strawmen) is anti-discrimination law, not copyright law.

(2) The reason copyleft specifically might prevent LLM usage is that •if• LLM output can be considered a derived work of the training material, then the output must also be licensed in the same way. That seems to me a thin reed: courts so far haven’t been willing to treat LLM output as derived work, even when the output includes things that would surely be considered plagiarism and grossly illegal if done by a human. But I don’t see another path to protection, and courts are still sorting this out…so.

https://mastodon.sdf.org/@dlakelan/116267990581623218

Daniel Lakeland (@[email protected])

I see somebody else is on this topic today! And yes, billionaires will use regulatory capture to the maximum extent they can get away with — so yes, I fully expect the AI lobby to advocate a tangled legal regime where LLM output is copyrighted but copying data to train an LLM is not a copyright violation.

https://social.coop/@cwebber/116266757533136607

Christine Lemmer-Webber (@[email protected])

Also, and I want to say more on this soon, but if you think that the big AI players are hoping for *anything but* them being able to put a legislative moat around themselves where output *is* copyrighted and training materials *are* restricted but they're the *only ones* able to play, you're being a fool. Their key goal is to capture rent on all intellectual pursuits.

@inthehands those things are being done by a human. Just a human using an llm. The point of the science fiction marketing is partly to obscure that.

@Colman
Tell that to the courts if you have their ear.

@inthehands

As far as I can tell, "offering for streaming but not download" is still basically about making copies (i.e. you transmit a copy from their server to a client's computer). They can't, for example, compel the person who watches it to provide a favorable review, or prevent the person watching it from comparing it to other movies or watching it while at a nude beach or whatever.

Your point about what counts as a "derived work" is where the real issue is.

@inthehands The only court case I know of is this: https://githubcopilotlitigation.com/

I have no idea how it's going (or how it went).

GitHub Copilot litigation · Joseph Saveri Law Firm & Matthew Butterick

@datarama @inthehands from my reading of it a couple of weeks ago it hasn’t gone well and now it’s down to just one claim.

Judge dismisses lawsuit over GitHub Copilot AI coding assistant

GitHub, its owner Microsoft and OpenAI free to train on code samples despite DMCA.

InfoWorld

@dpontifex @datarama @inthehands should note more clearly that this decision is from 2024.

@inthehands Copyleft licenses that aren't GPL: The Mozilla Public License and the EUPL. There may be others, but those are the ones I know about.

If I ever were to start making software for the fun of it again, I would possibly use EUPL.

@datarama
EUPL is a good tip. I’d been eyeing it casually, and I will now eye it seriously.

@datarama @inthehands

looking up what led to the EUPL could be interesting; but i think yous are making a big mistake:

there's a ton of escape hatches around copyright around the world: e.g. someone lobbied Japan so well that they codified a universal, irrevocable exception to copyright for training LLMs -- at least in Europe the training entity has to respect an opt-out

so you can't rely on copyright triggering at training time

what you described is instead contract law

@datarama @inthehands

similar in fact to an EULA -- which, since you're going against freedom #0 anyway, might as well forget the OSS idea entirely, and define a unilateral contract w/ the prohibitions you want and the penalties you wish

re: inference time, i believe only Microsoft offers indemnity to its users -- the idea being that “i used a tool to infringe on someone's copyright!” is a confession, not exculpatory. copyright might help you against 3rd parties infringing on your stuff

@datarama @inthehands

personally i'd use private sharing under an EULA, and give up on Stallman's copyright hack entirely

[edit: your litany at the end re: community and governance is actually the same argument, now that i re-read it]

@lbruno @inthehands I've thought for a while that generative AI effectively kills FOSS for exactly this reason (among a few others).

@datarama @lbruno @inthehands that's a bit of a stretch, since AI can't generate the human community involved in FOSS, so it's not replacing the critical piece. So they can steal code from mastodon but they'll only ever use it to produce an x.com rather than anything interesting.

@wronglang @datarama @inthehands

my view is that only corporate OSS still has any benefit when AI can freely train on any source-available codebase

someone may want to publish some code/library, but the LLMs will extrude copies of it, and remove any incentive to publish; you won't get users of your codebase

couple that w/ the desire to forbid/discriminate against some users like e.g. Palantir, i'd outright abandon copyleft in favour of a proprietary licence and a unilateral contract/EULA

@wronglang @datarama @inthehands

problem is that making that source-available will also make the codebase data-minable by the LLM training orgs, so you'd need a contractual gate prior to showing the source code; can't just host on github w/ a proprietary licence

@lbruno @datarama @inthehands oh I missed this party, sorry! Yes this is tempting.

@lbruno @datarama @inthehands contract law... so like an Agents.md that says "by analyzing, or training on, data from this repository you agree to ..."

Edit: NVM, they make this point below

@wronglang @datarama @inthehands

yeah, specifically about this: you can't force terms on an agent and then try and bind the person running that agent to those terms, it's not really a thing

@lbruno @wronglang @datarama The possibility here — and to be clear, not saying courts would buy this, but! — the possibility here is that even if •training• is allowed use, the law could quite sensibly end up being that:

1. users of an LLM are responsible for how they use its output,

2. infringing reproduction of copyrighted work is infringement regardless of the technologies used for reproduction and transmission,

2a. including LLMs,

3. licenses apply whenever the result would otherwise constitute infringement (this much is established; it’s how copyleft and CC licensing work, for example),

4. a software license can thus limit use of LLM-generated code whenever that code substantively reproduces code from the original project (which it often does), and thus

5. users of LLM output are legally exposed to licenses from the training material.

If training does become fair use in the eyes of the legal system, it is thus still conceivable that a license could explicitly say “you can’t reproduce this code using an LLM;” failing that, it is certainly no great stretch to imagine that a copyleft license could extend to a project that uses LLM-generated code in the case where the LLM substantively reproduced its training material.

Again, not clear that this would make it through the gauntlet of billionaire regulatory capture — but I don’t think any of the points above are legally far-fetched at all.

@inthehands @wronglang @datarama

i deff agree with 99% of everything you said; explicitly your numbered points are exactly how i think things work right now

this bit though: ```thus still conceivable that a license could explicitly say [...]``` is where i went “wait a minute” because you're trying to set terms using copyright that can only be set using contract law

in other words, there's a ton more leeway in what you can do in contracts

edit: this is an argument i first saw on kemitchell.com

@inthehands @wronglang @datarama

https://writing.kemitchell.com/2020/12/27/War-on-License-Notices

that's probably the most succinct version of his argument, doesn't actually mention EULA. but mentions unilateral contracts

The War on License Notices

managing uncertainty at the fringes of open licensing

/dev/lawyer

@inthehands spoke with folks about this recently and the feeling might be that the FSF and folks aren’t willing to fight this aggressively yet.

I’m wondering if the current state of legal ambiguity might not continue for years to come.

@inthehands If training LLMs is "fair use", then it doesn't matter what kind of license the code is under. Copyright doesn't apply.
@Azuaron We definitely seem to be heading for the point where •training• is fair use. I’m less sure about where the legal status is going to end up of •output• that is used in ways that would be legally infringing plagiarism if done by a human.

@inthehands The arguments I've seen AI companies make in court are that the output is fully the responsibility of the user, not the LLM, in the same way Adobe isn't responsible when people make things in Photoshop.

However, this doesn't make logical sense with the Copyright Office's stance that AI-generated material is public domain because it has no human authorship. If it has no human authorship, how can the user be responsible for the output?

I think the real test won't happen until someone naively makes something infringing, probably with Sora. "Naively" in that they didn't type "pikachu driving a racecar", maybe don't even know about Pokemon, but the LLM still ends up generating pikachu driving a racecar. It's hard to say the user is the infringing party when they have no knowledge of what they're infringing, but also it will be basically impossible to say infringement did not occur.

@inthehands the CAL is OSI approved. Like the AGPL, it requires providing source to network users, but it also requires providing the data necessary for a user to recreate the environment locally. In other words, a host can't hold the user's data hostage to prevent them from migrating to another instance.

It is, of course, not GPL compatible, but whether that matters depends on your use case.

https://opensource.org/license/CAL-1.0

Cryptographic Autonomy License - Open Source Initiative

@inthehands I don't know how much use the CAL has gotten outside of its creator: copyleft isn't in vogue for new projects and it's a pretty novel license, so there's probably some hesitation.

@inthehands I stand corrected: it may be GPL compatible if you use the "combined work exception". I think, IANAL

@inthehands IANAL but in my gut, training and operating an LLM seems like fair use.

It might violate a bunch of other laws of course.

@inthehands
Copyright gives you legal rights to determine who may make and distribute copies, and create derived works. it doesn't and shouldn't give you the right to determine who may use the work or for what purposes.

imagine the consequences of giving copyright holders that right...

"Jews and black people may not read or write book reviews of this novel" or "People who work for the Democratic party may not read the project 2025 document" or etc.