Typical ML argument: "If I can read something legally, why can't I train an LLM on it?"

Humans are capable of reading something and later writing something similar that is still a copyright violation. If I go and write a book that follows the plot line of Star Wars, that's still a copyright violation, even if no text is literally the same. If I play the melody to a song on my piano and release it without the appropriate mechanical cover license, that's also a copyright violation.

The reason this does not happen often is that, as humans, we are aware that that's plagiarism and there are rules. Sometimes it happens by accident, and people still get sued and lose.

LLMs have no such awareness and routinely output things which are blatant copyright violations when appropriately prompted. That means the model weights encode that work and are therefore themselves a derivative work.

Your brain encodes a massive amount of copyrighted information. You are not a walking copyright violation because humans aren't data, can't be copied and distributed en masse, have human rights, etc. This is why "mind reading machines" are a classic dystopian plot point (monetizing your thoughts etc).

An LLM is not a human, does not have human rights, nor human privileges. It is data, and if it encodes copyrighted information, that's a derivative work. If you aren't following the license of the training data, that's a copyright violation.

Yes, this means that anyone downloading "open" models potentially puts themselves at as much legal risk as someone downloading a movie.

Just *using* a cloud-based LLM may or may not be safe, depending on how a bunch of legal unknowns go, and that's only if the output happens to not qualify as a copyright violation. Of course, since you have no idea whether it will, you're basically playing copyright Russian roulette.

BTW, there is almost certainly a model size and architecture threshold here. If I train an image upscaler on copyrighted data, for example, it's quite improbable that it will be able to generate copyrighted data from an unrelated input (for typical upscaler designs and sizes). Most likely that is safe.

Is there a size threshold for LLMs where, for typical training data distributions, the LLM is highly unlikely to "memorize" any copyrightable information? Is that size threshold a usable LLM? These are open questions (that nobody seems to be interested in researching).
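
For what it's worth, here is a rough sketch of how "memorization" could be probed in an experiment like that: take a model's output and a known reference text, and measure the longest run of words they share verbatim. Everything here is an assumption for illustration (the function name `longest_shared_run`, the whitespace tokenization, the case folding). It is not an established memorization metric and certainly not a legal test, and it cannot catch non-literal copying like the Star Wars plot example above.

```python
def longest_shared_run(generated: str, reference: str) -> int:
    """Length, in words, of the longest contiguous word sequence
    that appears in both texts (case-insensitive, whitespace-split)."""
    gen = generated.lower().split()
    ref = reference.lower().split()
    best = 0
    # prev[j] holds the length of the shared run ending at the previous
    # generated word and at ref[j - 1] (classic longest-common-substring DP).
    prev = [0] * (len(ref) + 1)
    for g in gen:
        cur = [0] * (len(ref) + 1)
        for j, r in enumerate(ref, start=1):
            if g == r:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best
```

Running this over many prompts for models of different sizes, and looking at how often long shared runs show up, would be one crude way to look for the threshold. A long run suggests verbatim regurgitation, but a short one proves nothing.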

I'm willing to bet if such a size threshold exists, it's much smaller than the minimum LLM size useful for "general purpose" prompting though.

@lina

"You stand accused of downloading and then redistributing the BritDonnaAguilera model, which is able to generate music extremely close to that of popular pop singers."

@wakame I'm waiting for people using Suno or whatever to get sued for copying melodies. There has to be some juicy stuff in there...