Typical ML argument: "If I can read something legally, why can't I train an LLM on it?"

Humans are capable of reading things and later writing a similar thing that is still a copyright violation. If I go and write a book that follows the plot line of Star Wars, that's still a copyright violation, even if no text is literally the same. If I play the melody to a song on my piano and release it without the appropriate mechanical cover license, that's also a copyright violation.

The reason this does not happen often is that, as humans, we know that's plagiarism and that there are rules. Sometimes it happens by accident, and people still get sued and lose.

LLMs have no such awareness and routinely output things which are blatant copyright violations when appropriately prompted. That means the model weights encode that work and are therefore themselves a derivative work.

Your brain encodes a massive amount of copyrighted information. You are not a walking copyright violation because humans aren't data, can't be copied and distributed en masse, have human rights, etc. This is why "mind reading machines" are a classic dystopian plot point (monetizing your thoughts etc).

An LLM is not a human and has neither human rights nor human privileges. It is data, and if it encodes copyrighted information, that's a derivative work. If you aren't following the license of the training data, that's a copyright violation.

@lina just considering the LLM itself as a derivative work, wouldn't it be legal to train one on CC BY-SA or GPL text as long as the weights are released under the same license? (Which wouldn't seem like a big deal for those "open" models)

@florian It would, if and only if:

1) You train only on content under compatible licenses
2) You meet all the attribution requirements of those licenses (this is a big one)
3) Your weights are licensed under a compatible license themselves (usually the same for most copyleft licenses)
4) You understand that the output of the model may be copyrighted and require the same licensing and attribution (making the model unsuitable for, say, creative generation for publishing, but it would still be fine for a local voice assistant or something like that) and make your users aware of this.

Unfortunately, I haven't been able to find any models that meet any of those conditions, let alone all of them (other than KL3M, which credibly claims to be trained on PD and non-copyrightable content only).

@lina actually attribution might still be a major hurdle. You can totally supply all the attribution strings for your training corpus with the model, but is it really attribution if you can't point to where in the model a specific person's work is encoded (which you can't obviously)?

@florian @lina > (which you can't obviously)?

You can. But none of them do it, in part because it's storage-heavy and kills performance.