WGA is bargaining to block use of their written work to train AI. (ht @harikunzru)

This is a smart move. Brief 🧵:

Generally, writers do not retain copyright for scripts written for or bought by production companies. Which means those companies have the right to do what they want with the written work. Including, potentially, using the works to train AI.
Companies can train AI on writers' work in a way that generates output similar to what those writers would have written. Once this gets efficient and effective enough, expect companies to try to eliminate writing jobs altogether...

We've already seen tons of examples of AI-generated writing, art, and music made to sound or look similar to known human creators.

(Honestly IMO the tech is not quite there yet. Often what you get is a pale imitation. But tech is always improving!)

WGA is now trying to negotiate a hard stop on use of their members' writing to train AI. This will mean that companies who buy a script or hire a writer can’t exploit writers’ works by training AI to put those same writers out of a job. This is a smart move!

Of course, this doesn’t address how much writing is publicly available through other means or writing that has already been used to train AI. And it doesn’t tell you how WGA members are meant to enforce their rights if they find a company has misused their works. (And how would they even find out in the first place?)

But still, a good start and definitely something to watch.

@tiffanycli amazing. I expected something like this to start happening but I did not expect it so soon.

Question: one way OpenAI et al try to evade this whole thing is by leaning into the "well it was made public on Teh Intertubes and we just scraped it 🤷‍♀️" defense. Who will be / should be liable if a script gets published on the open Web, and then sucked into an LLM training set?

Also, many everyday tools (office suites, for example) start integrating "AI". Similar question as above?

@rysiek @tiffanycli "It was on the internet" is not a defence to copyright infringement, in any way. Copyright specifically applies to things that are published or performed publicly.

More of a grey area what qualifies as copying/infringement in relation to an LLM, since it isn't something specifically contemplated by the law. But nothing prevents having multiple people involved in the copying, in different ways, *all* be liable.

@TorontoWill @tiffanycli sure. This is complicated IMVHO because:

1. the EU Copyright Directive's text-and-data-mining exception (which apparently covers LLM training)

2. the fact that copyright law "kicks in" when something gets published, not when something gets "ingested" so to speak.

@TorontoWill @tiffanycli regarding 2.:

IANAL, but as far as I understand copyright law, it kicks in when something gets published. You either are allowed to publish something somewhere, or not.

But asking (quite reasonably in general!) for certain copyrighted work not to be used for training LLMs means asking to limit how certain material is used *post-publication*.

I.e. "you are allowed to publish this as long as no LLM will ever be trained on it". Which is… difficult to comply with?

@rysiek @tiffanycli I don't know specifics of EU law, but in general copyright is much more than just the right to publish. It's also the right to make copies (of something already published), the right to sell it, the right to convert it to something else (like a movie of a book), the right to perform it (e.g. a song), the right to exhibit it (e.g. art). Publication is just the tip of the iceberg.

@TorontoWill @tiffanycli yes, absolutely.

But my point is: if you have the right to publish something but you are barred from letting LLMs be trained on that something — do you actually have a right to publish?

Everything published (made public) might get sucked into an LLM training set, as we've seen with OpenAI etc. At this point anyone publishing anything already has to be aware of that…

@TorontoWill @tiffanycli we can ignore for now the legality of OpenAI sucking it into a training set for an LLM — I am specifically trying to wrap my head around the concept of someone having the right to publish something but not to let LLMs get trained on it.

And if WGA wants LLMs not to be trained on their scripts, they have to demand exactly that: not just "you shall not train LLMs on our scripts" but also "you will not allow any LLMs to be trained on our scripts."

@TorontoWill @tiffanycli otherwise that would be a loophole:

"Your honor, we did not ourselves train any LLMs on the material in question. We were allowed to publish and we did.

Third parties used the published materials to train LLMs on it, without our knowledge or consent.

Now, WGA's terms did not stipulate we are not allowed to use such LLMs that happened to be trained on this material without our involvement…"

@rysiek @tiffanycli you're conflating copyright with contract. If the studios had agreed to what WGA proposed, then contractually they'd be prohibited from feeding writers' work into an LLM themselves, and they couldn't use an LLM that's been trained on writers' work (or indeed use any LLM at all as the basis for a script). Whether someone else feeds a script into an LLM is outside their agreement, and it doesn't strictly matter whether that would be copyright infringement.
@rysiek @tiffanycli As to whether it *is* infringement, like I said at the beginning, "grey area", and may vary among jurisdictions in the language of the law and how it's interpreted. But certainly it is very possible that feeding a writer's work into LLM infringes copyright (regardless of whether the work is published), as does using something from an LLM that has been trained on a writer's work.

@TorontoWill @tiffanycli you're right of course.

Let me re-phrase: if the WGA only asks the studios not to feed the scripts into LLMs themselves, but studios still get full copyright to the scripts, this leaves a trivial loophole: studios are allowed to publish the scripts online; third parties train LLMs on them; studios use those LLMs to replace writers.

@TorontoWill @tiffanycli so my point is that I think WGA needs to also ask for either or both of:

a) studios will ensure no LLMs are trained by anyone on the material

b) studios will refrain from using any LLMs possibly trained on these scripts by anyone

Do you see my point?

@rysiek @tiffanycli

OpenAI are trying to evade those things, but it doesn't mean that they have. We are seeing pushback on many fronts.

Samsung has already banned employees from using the tools after a leak of code.

And there's an ongoing class action suit against the image generators.

(link: "Samsung 'bans' employees from using OpenAI's ChatGPT", Tech Monitor)

@rysiek @tiffanycli honestly, I think we'll start seeing the first big fights when people start publishing movies featuring Disney characters "tweaked" by AI.

It's going to take money on multiple sides of this issue to actually resolve it and it's going to take a few years for that to happen.

@gatesvp sorry for the non sequitur, but how did you do these links directly on words?!

@tiffanycli

Tech workers should do the same.

Negotiate a hard stop to AI companies using Git repositories, Stack Overflow, and other code sources to train AI to take their jobs...

(link: "GitHub Copilot · Your AI pair programmer", GitHub)

@tiffanycli

Remain curious to know how the WGA deals with (or intends to deal with) their writers' content that's *already* been
– ingested by #AIs
– surreptitiously used to train #AIs

Legal remedies (copyright violations, disgorgement, 'algorithmic destruction', etc.) seem hella-nebulous at best.