Typical ML argument: "If I can read something legally, why can't I train an LLM on it?"

Humans are capable of reading things and later writing a similar thing that is still a copyright violation. If I go and write a book that follows the plot line of Star Wars, that's still a copyright violation, even if no text is literally the same. If I play the melody to a song on my piano and release it without the appropriate mechanical cover license, that's also a copyright violation.

The reason this does not happen often is that, as humans, we are aware that that's plagiarism and there are rules. Sometimes it happens by accident, and people still get sued and lose.

LLMs have no such awareness and routinely output things which are blatant copyright violations when appropriately prompted. That means the model weights encode that work, and therefore, are themselves a derivative work.

Your brain encodes a massive amount of copyrighted information. You are not a walking copyright violation because humans aren't data, can't be copied and distributed en masse, have human rights, etc. This is why "mind reading machines" are a classic dystopian plot point (monetizing your thoughts etc).

An LLM is not a human, does not have human rights, nor human privileges. It is data, and if it encodes copyrighted information, that's a derivative work. If you aren't following the license of the training data, that's a copyright violation.

Yes, this means that anyone downloading "open" models potentially puts themselves in as much legal risk as downloading a movie does.

Just *using* a cloud-based LLM may or may not be safe, depending on how a bunch of legal unknowns go, and even then only if the output happens not to qualify as a copyright violation. Of course, since you have no idea whether it will, you're basically playing copyright Russian roulette.

BTW, there is almost certainly a model size and architecture threshold here. If I train an image upscaler on copyrighted data, for example, it's quite improbable that it will be able to generate copyrighted data from an unrelated input (for typical upscaler designs and sizes). Most likely that is safe.

Is there a size threshold for LLMs where, for typical training data distributions, the LLM is highly unlikely to "memorize" any copyrightable information? Is that size threshold a usable LLM? These are open questions (that nobody seems to be interested in researching).

I'm willing to bet if such a size threshold exists, it's much smaller than the minimum LLM size useful for "general purpose" prompting though.
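
For what it's worth, a crude first probe of "did the model memorize this?" is to check how much of an output appears verbatim in a known work. Below is a minimal sketch under my own assumptions: the function names and the default n-gram length are hypothetical, not from any existing tool. Verbatim overlap also under-counts badly (it misses paraphrase, plot, and melody) and says nothing about what is legally copyrightable.

```python
def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(output, reference, n=8):
    """Fraction of the output's word n-grams that also appear in reference.

    Hypothetical helper: n=8 is an arbitrary choice, not an established
    threshold for memorization.
    """
    out = word_ngrams(output, n)
    if not out:
        # Output is shorter than n words, so there is nothing to compare.
        return 0.0
    return len(out & word_ngrams(reference, n)) / len(out)
```

A score near 1.0 for long n would suggest regurgitation of that reference; a low score proves nothing either way.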

@lina

"You stand accused of downloading and then redistributing the BritDonnaAguilera model, which is able to generate music extremely close to that of popular pop singers."

@wakame I'm waiting for people using Suno or whatever to get sued for copying melodies. There has to be some juicy stuff in there...
@lina it’s only legal if you’re a mega corporation or rich. /s
@lina I don’t know if I will redistribute the vibecoded test suite I wrote. The only LLM-generated code I have published were trivial bugfixes that were clearly based on my own code.
@lina just considering the LLM itself as a derivative work, wouldn't it be legal to train one on CC BY-SA or GPL text as long as the weights are released under the same license? (Which wouldn't seem like a big deal for those "open" models)

@florian It would, if and only if:

1) You train only on compatible license content
2) You meet all the attribution requirements of those licenses (this is a big one)
3) Your weights are licensed under a compatible license themselves (usually the same for most copyleft licenses)
4) You understand that the output of the model may be copyrighted and require the same licensing and attribution (making the model unsuitable for, say, creative generation for publishing, but it would still be fine for a local voice assistant or something like that) and make your users aware of this.

Unfortunately, I haven't been able to find any models that meet any of those conditions, let alone all of them (other than KL3M, which credibly claims to be trained on PD and non-copyrightable content only).

@lina actually attribution might still be a major hurdle. You can totally supply all the attribution strings for your training corpus with the model, but is it really attribution if you can't point to where in the model a specific person's work is encoded (which you can't obviously)?

@florian I don't think there's any need to care about where a work is encoded. You do need a list of all authors though.

I think there's some flexibility in that you might not need to literally list every individual, though IANAL. For example, if you train on Wikipedia (only), you could plausibly get away with specifying the exact database dump you used (where the edit history data is available) without having to extract and collate the author list yourself. But if you're scraping something like GitHub that does not make explicit dumps available, yeah, you'd better at *least* list every project and commit ID you scraped (and even that might not be enough).
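
To make the "list every project and commit ID" idea concrete, here is a minimal sketch of a training-data manifest. Everything in it is an assumption for illustration: the `Source` fields and `build_manifest` are hypothetical names, and whether a manifest like this would satisfy any given license's attribution requirement is exactly the open question (IANAL).

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Source:
    name: str      # project or dataset name
    url: str       # where the exact snapshot can be fetched
    license: str   # license identifier as stated by the source
    revision: str  # dump ID or commit hash pinning the exact content


def build_manifest(sources):
    """Serialize sources to JSON, sorted by name, rejecting unpinned entries."""
    for s in sources:
        if not s.revision:
            raise ValueError(f"{s.name}: missing revision (dump ID or commit)")
    ordered = sorted(sources, key=lambda s: s.name)
    return json.dumps([asdict(s) for s in ordered], indent=2)
```

Pinning a revision is the point: a database dump ID or commit hash lets someone else recover the author list from the source's own history.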

@lina @florian

Who has standing and losses to legally attack anyone for this?

Say I didn't follow these (arbitrary...) demands and I downloaded an MIT project from github as anyone can, and I put it into an LLM.

Does the maintainer of the MIT project have enough mental flexibility to look up from the concise, liberal text of the MIT license and formulate a complaint to the judge about his losses and ownership of the (completely different) code the LLM emitted?

I guess we will all see, right?

@hopeless @florian Your argument is basically "open software licenses don't matter because nobody is actually going to sue people for violating them"

This is not a good argument.

@lina @florian

My point is that to attack someone through the courts, you must have standing - it's your copyright - and be able to show damages.

The people who licensed their work under MIT are already at peace with $0

"Permission is hereby granted, free of charge, to any person obtaining a copy of this software... to deal in the Software without restriction, including without limitation the rights .."

What damages would they show? Copyright maximalism is not compatible with FOSS.

@hopeless @florian That's really not how it works.

https://www.law.cornell.edu/uscode/text/17/504

They get to claim the profits made off of their copyrighted work, or statutory damages, at their choice. The latter doesn't require any money to move anywhere. You can be on the hook for up to $30k per work, or up to $150k per work for willful infringement.

@lina @florian

No it is exactly how it works.

> "... the copyright owner’s actual damages and any additional profits of the infringer..."

Actual damages for works under MIT: $0. Additional profits when your own work is FOSS: $0.

> "... statutory damages..."

"...to be eligible for statutory damages and attorney's fees, the creator must have registered their work with the U.S. Copyright Office before the infringement occurred (or within three months of publication)."

https://uslawexplained.com/17_usc_504

@hopeless @florian Right, except:

In establishing the infringer’s profits, the copyright owner is required to present proof only of the infringer’s gross revenue, and the infringer is required to prove his or her deductible expenses and the elements of profit attributable to factors other than the copyrighted work.

So they get to claim all your revenue and you are on the hook for proving none of it derived from the infringement.

That is not easy. For example, if you only released the model for free online, but then landed an AI job, you would have to prove the model release was not a factor in you landing that job.

@lina @florian I'm glad you're making money, but my gross revenue is $0. It is very easy for me to show $0 profit on my FOSS work that is part-written by coding assists.

And I remind you, huggingface would be the first victim if any of this was relevant.

Also the point... they had to register their works at the US Copyright Office within 3 months to "get nukes". I guess some people like MSFT did this, but people who put their sources on github to be scraped under a liberal license?

@lina @Plettigoal

If you imagine yourself as Gordon Gekko looking for an easy meal, he still has to point to some of your code and say, "Your Honour, this for loop is almost identical to the for loop in my beloved SCO Linux", only to get a chart in the courtroom from the defence showing 100,000 examples of the same structure from prior art gleaned from FOSS on github.

Unless it's verbatim, which modern coding assists never are, they cannot even show any supposed lineage from The Precious.

@hopeless @Plettigoal You're making a bunch of arguments I disagree with, and again it all boils down to "if I can get away with breaking licenses it doesn't matter", which I find a rather unpalatable direction, so let's end this conversation here.

@lina @Plettigoal Just to be clear, licenses have some power only because they can provide additional grants to negate the underlying law.

If the law doesn't provide penalties for the situation (e.g., fair use), or the law doesn't apply to your situation, the license is simply impotent. You're then not "breaking" any license; most of my stuff is MIT and I follow any requirements to compose other compatible projects.

Anyway have a nice day.

@hopeless

> The people who licensed their work under MIT are already at peace with $0

You're ignoring the "subject to the following conditions" part. Assuming the LLM would infringe on the author's copyright without a license and doesn't follow the attribution requirement, they could say they would have been willing to sell a license without that requirement and claim lost revenue from that.

@MildDrop72 As I mentioned, if I compose an MIT project in my own MIT project, I am very careful to keep their attribution on those parts.

But when that project was learning material among hundreds of thousands of others and I get a coding assist to write new code, to my specifications, what does that have to do with that one MIT drop in the ocean? Please, go ahead and show the judge.

> MIT ... Lost revenue ... other license

Maybe for GPL3, but not for a free license that already gives it all away.

@florian @lina > (which you can't obviously)?

You can. But none of them do it in part because it's storage-heavy and it kills performance.
@lina also the whole argument falls apart when companies don't even pay for a single copy of the work they're using to train on.

@lina

> Yes, this means that anyone downloading "open" models potentially puts themselves in as much legal risk as torrenting a movie does.

Huggingface still seems to be operating?

Also there is a big difference between bidirectionally torrenting something and "downloading" it... they are not interchangeable terms; normal humans are not getting into legal peril for downloading / streaming movies or music (because it is not their "performance" of it but that of whoever is sending it to them).

@hopeless

> Huggingface still seems to be operating?

People get away with torrenting movies all the time too, doesn't mean it's legal.

Fair point on torrenting vs downloading though (because seeding), edited. But yes people do get in trouble for just downloading too. Depends on the country.

@lina @hopeless > People get away with torrenting movies all the time too, doesn't mean it's legal.

It really should be, and in a lot of places it is. Only the most awful of places have made it criminal to share things without profit.
@lispi314 @lina Yeah. I don't know why there is such a Copyright Maximalism speedrun going on here.
@lina
(Let's just hope this doesn't lead to giving the LLMs human rights)
@lina
Unfortunately the lawmakers and courts have been bought by the people who own these LLMs.
@lina If I retell Star Wars with different character names, not as a publication but just around a campfire with my family, that still isn't illegal, right? Lucasfilm/Disney only get to send in the legal team to black-bag me if I try to, e.g., publish a wax cylinder of my campfire stories (as sincere non-parody). Just saying a plot aloud isn't a problem, nor is sketching Mickey Mouse on a napkin, right?

@paul The story is still copyrighted, but telling it to your family wouldn't count as a "public performance" so wouldn't infringe copyright. Telling it to a crowd at a park probably would, though.

Copyright of characters is complicated and varies by jurisdiction. That said, the original Steamboat Willie version of Mickey Mouse is in the public domain in the US now, so a sketch in that style is totally fine as long as you aren't trying to sell it or pass it off as legitimate Disney merchandise (because Mickey is still trademarked).

(Disclaimer: IANAL, this is just my understanding.)

@lina On a related note, does camp fall under parody law if it's not intentional? Like, this image is probably covered because "Bugs Bunny + Spiderman" but I didn't specifically prompt for that. Diffusion models are just bad at multi-subject stuff.

@paul I have no idea tbh... ^^;;

Parody rights are also not universal, it's a very jurisdiction specific thing.

@lina Need to make sure you end up in a state court with a sense of humor. So, like, avoid the 5th circuit.
@lina @paul It's horrendous that people have truly tried to kill public storytelling & sharing of stories.

How is this not disgusting to more people?
@lispi314 @lina The legal framework does become philosophically intractable for a lot of edge cases. It's especially indefensible when the original IP depended on a folktale with unknown authorship. Feels like Monsanto patenting genes; very "Wait, they can DO that?"
@paul @lina Even in general, most of the remotely valid arguments for copyright that I've heard are really more arguments against capitalism.
@lina well said. The right to read does not apply to a chatbot, despite what Cory Doctorow might believe. From a tech standpoint, I’m pretty disappointed US district court judges aren’t willing to view the model itself as an infringing derivative though (the “walking copyright violation” scenario). So the only real source of liability is the output at the moment, and even then you still have to clear the fair use bar which involves tackling arguments that get support in the front end of the training.
@lina > This is why "mind reading machines" are a classic dystopian plot point (monetizing your thoughts etc)

All the copyright aspects don't matter if that atrocity is abolished though.

And the notion of not having backups is existential horror.
@lina US and EU law seem to be pointing in the direction of the model providers being potentially liable, but model users not being liable unless they do something stupid (like prompting the model to get those violations out).

@lina

Important tangent: copyright has little to do with plagiarism.

Copyright is an industry law from about 300 years ago, made to protect mass producers, nowadays overgrown to ridiculous extents and backed by more propaganda than the WW2 Nazi, Soviet, and American kinds put together.
There is no ethical framework for it; it is (and has always been) a purely economic issue.

Plagiarism and denunciations of it have existed for THOUSANDS of years, at the very least.
They have to do with honesty, transparency and accountability [or rather, their lack], but they have no legal encoding.
[And no, copyright sure as hell ain't.]

LLMs have issues on both ends, but they have little (nothing) to do with each other.

@lina Another take: An industrial regulation created by publishers to curtail competition, descended from and modeled on the state censorship law that it replaced, shouldn't be our framework for figuring out the ethics of LLMs nor, for that matter, the ethics of anything else.

There are plenty of critical things one can say about LLMs, but the fact that training them involves copyright violation just puts them in the same company as, say, jazz musicians. It's not a point against them; it might even be a point in their favor.

@kfogel My argument does not require any system of copyright beyond basic author's moral rights. Since LLMs can reproduce copyrighted works, it follows that the copyrighted work is embedded in the LLM. Any interpretation that dismisses this to the extent the original author would have no interest nor control in their work when it is in the form of LLM/AI output (or any intermediate form capable of producing such output, or any output that is derivative, which LLMs can also do) would require essentially complete abolition of copyright as we know it, or otherwise would become a backdoor capable of negating the effect of any system of copyright.

While endless discussions could be had about the flaws of copyright legislation as it exists today, I don't think you'll find that "copyright shouldn't exist at all" is a very popular take, especially among creative fields.

Should you wish to make an argument against copyright-based opposition to GenAI, I think you'll have to present a workable and popularizable alternate system of copyright first to have a credible argument.

Either that or get UBI rolled out first, but good luck with that when the billionaire class is busy taking the world's money to deploy and run AI. And those billionaires are, morally, the furthest thing from jazz musicians.

@kfogel TL;DR if you want to argue copyright shouldn't be the foundation of an ethics argument, you're going to have to first build up the ethics of copying creative works from first principles, because "it's always ethical to copy" isn't going to get a lot of people on your side.
@lina I'm aware it's an argument that struggles for popular acceptance :-).