Typical ML argument: "If I can read something legally, why can't I train an LLM on it?"

Humans are capable of reading things and later writing a similar thing that is still a copyright violation. If I go and write a book that follows the plot line of Star Wars, that's still a copyright violation, even if no text is literally the same. If I play the melody to a song on my piano and release it without the appropriate mechanical cover license, that's also a copyright violation.

The reason this does not happen often is that, as humans, we are aware that that's plagiarism and there are rules. Sometimes it happens by accident, and people still get sued and lose.

LLMs have no such awareness and routinely output things which are blatant copyright violations when appropriately prompted. That means the model weights encode that work and are therefore themselves a derivative work.

Your brain encodes a massive amount of copyrighted information. You are not a walking copyright violation because humans aren't data, can't be copied and distributed en masse, have human rights, etc. This is why "mind reading machines" are a classic dystopian plot point (monetizing your thoughts etc).

An LLM is not a human, does not have human rights, nor human privileges. It is data, and if it encodes copyrighted information, that's a derivative work. If you aren't following the license of the training data, that's a copyright violation.

@lina just considering the LLM itself as a derivative work, wouldn't it be legal to train one on CC BY-SA or GPL text as long as the weights are released under the same license? (Which wouldn't seem like a big deal for those "open" models)

@florian It would, if and only if:

1) You train only on compatible license content
2) You meet all the attribution requirements of those licenses (this is a big one)
3) Your weights are licensed under a compatible license themselves (usually the same for most copyleft licenses)
4) You understand that the output of the model may be copyrighted and require the same licensing and attribution (making the model unsuitable for, say, creative generation for publishing, but it would still be fine for a local voice assistant or something like that) and make your users aware of this.

Unfortunately, I haven't been able to find any models that meet any of those conditions, let alone all of them (other than KL3M, which credibly claims to be trained on PD and non-copyrightable content only).
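As a rough illustration of conditions 1 and 2 above, here is a minimal Python sketch that filters a corpus against a license allowlist and collects the attribution lines to ship with the weights. Everything in it is an assumption made for illustration: the `Document` fields, the function names, and especially the allowlist, which only pretends that these three licenses are compatible with releasing weights under CC BY-SA 4.0. Real compatibility is a legal question, not a set lookup.

```python
from dataclasses import dataclass

# Hypothetical allowlist: licenses ASSUMED compatible with releasing the
# weights under CC BY-SA 4.0. This is an illustration, not legal advice.
COMPATIBLE_WITH_WEIGHTS_LICENSE = {"CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0"}


@dataclass(frozen=True)
class Document:
    source_url: str
    license_id: str   # SPDX-style identifier, assumed known per document
    authors: tuple    # names or handles needed for attribution
    text: str


def select_training_documents(corpus):
    """Keep only documents with an allowed license and usable attribution."""
    kept, rejected = [], []
    for doc in corpus:
        if doc.license_id not in COMPATIBLE_WITH_WEIGHTS_LICENSE:
            rejected.append((doc.source_url, "incompatible license"))
        elif not doc.authors and doc.license_id != "CC0-1.0":
            # Attribution licenses are unusable without an author list.
            rejected.append((doc.source_url, "missing attribution"))
        else:
            kept.append(doc)
    return kept, rejected


def attribution_notice(kept):
    """Build the attribution text to distribute alongside the weights."""
    lines = []
    for doc in sorted(kept, key=lambda d: d.source_url):
        who = ", ".join(doc.authors) or "(no attribution required)"
        lines.append(f"{doc.source_url} | {doc.license_id} | {who}")
    return "\n".join(lines)
```

Conditions 3 and 4 (licensing the weights and warning users about the output) are not something code can check; they are about how the model is published.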

@lina actually attribution might still be a major hurdle. You can totally supply all the attribution strings for your training corpus with the model, but is it really attribution if you can't point to where in the model a specific person's work is encoded (which you obviously can't)?

@florian I don't think there's any need to care about where a work is encoded. You do need a list of all authors though.

I think there's some flexibility in that you might not need to literally list every individual, though IANAL. For example, if you train on Wikipedia (only), you could plausibly get away with specifying the exact database dump you used (where the edit history data is available) without having to extract and collate the author list yourself. But if you're scraping something like GitHub that does not make explicit dumps available, yeah, you'd better at *least* list every project and commit ID you scraped (and even that might not be enough).
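To make the "list every project and commit ID" idea concrete, here is a minimal Python sketch of a provenance manifest. The tuple layout, the field names, and the `build_manifest` helper are all hypothetical, and whether such a manifest would actually satisfy any given license's attribution terms is exactly the open question above.

```python
import hashlib
import json


def build_manifest(sources):
    """Serialize scraped sources into a deterministic JSON manifest.

    sources: iterable of (repo_url, commit_id, license_id) tuples
    (a hypothetical layout chosen for this sketch).
    Returns the JSON text and its SHA-256 digest, so the exact manifest
    that shipped with a model release can be pinned and verified later.
    """
    entries = [
        {"repo": repo, "commit": commit, "license": license_id}
        # set() drops duplicates, sorted() makes the output order stable
        for repo, commit, license_id in sorted(set(sources))
    ]
    body = json.dumps(entries, indent=2, sort_keys=True)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return body, digest
```

Recording the commit ID rather than just the repository name matters because the author list and even the license of a project can change between commits.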

@lina @florian

Who has standing and losses to legally attack anyone for this?

Say I didn't follow these (arbitrary...) demands and I downloaded an MIT project from GitHub, as anyone can, and I put it into an LLM.

Does the maintainer of the MIT project have enough mental flexibility to look up from the concise, liberal text of the MIT license and formulate a complaint to the judge about his losses and his ownership of the (completely different) code the LLM emitted?

I guess we will all see, right?

@hopeless @florian Your argument is basically "open software licenses don't matter because nobody is actually going to sue people for violating them"

This is not a good argument.

@lina @florian

My point is that to attack someone through the courts, you must have standing - it's your copyright - and be able to show damages.

The people who licensed their work under MIT are already at peace with $0.

"Permission is hereby granted, free of charge, to any person obtaining a copy of this software... to deal in the Software without restriction, including without limitation the rights .."

What damages would they show? Copyright maximalism is not compatible with FOSS.

@hopeless @florian That's really not how it works.

https://www.law.cornell.edu/uscode/text/17/504

They get to claim the profits made off of their copyrighted work, or statutory damages, at their choice. The latter doesn't require any money to move anywhere. You can be on the hook for up to $30k per work, or up to $150k for willful infringement.

@lina @florian

No, it is exactly how it works.

> "... the copyright owner’s actual damages and any additional profits of the infringer..."

Actual damages for works under MIT: $0. Additional profits when your own work is FOSS: $0.

> "... statutory damages..."

"...to be eligible for statutory damages and attorney's fees, the creator must have registered their work with the U.S. Copyright Office before the infringement occurred (or within three months of publication)."

https://uslawexplained.com/17_usc_504

@hopeless @florian Right, except:

> "In establishing the infringer’s profits, the copyright owner is required to present proof only of the infringer’s gross revenue, and the infringer is required to prove his or her deductible expenses and the elements of profit attributable to factors other than the copyrighted work."

So they get to claim all your revenue and you are on the hook for proving none of it derived from the infringement.

That is not easy. For example, if you only released the model for free online, but then landed an AI job, you would have to prove the model release was not a factor in you landing that job.

@lina @florian I'm glad you're making money, but my gross revenue is $0. It is very easy for me to show $0 profit on my FOSS work that is part-written by coding assists.

And I remind you, Hugging Face would be the first victim if any of this were relevant.

Also, the point stands: they had to register their works at the US Copyright Office within 3 months to "get nukes". I guess some people like MSFT did this, but people who put their sources on GitHub to be scraped under a liberal license?

@lina @Plettigoal

Imagine Gordon Gekko looking for an easy meal: he still has to point to some of your code and say, "Your Honour, this for loop is almost identical to the for loop in my beloved SCO Linux", only to get a chart in the courtroom from the defence showing 100,000 examples of the same structure from prior art gleaned from FOSS on GitHub.

Unless it's verbatim, which modern coding assists never are, they cannot even show any supposed lineage from The Precious.

@hopeless @Plettigoal You're making a bunch of arguments I disagree with, and again it all boils down to "if I can get away with breaking licenses it doesn't matter", which I find a rather unpalatable direction, so let's end this conversation here.

@lina @Plettigoal Just to be clear, licenses have some power only because they can provide additional grants to negate the underlying law.

If the law doesn't provide penalties for the situation, e.g. fair use, or the law isn't applicable to your situation, the license is simply impotent. You're then not "breaking" any license; most of my stuff is MIT and I follow any requirements to compose other compatible projects.

Anyway have a nice day.