WGA is bargaining to block use of their written work to train AI. (ht @harikunzru)
This is a smart move. Brief 🧵:
We've already seen tons of examples of AI-generated writing, art, and music made to sound or look similar to known human creators.
(Honestly IMO the tech is not quite there yet. Often what you get is a pale imitation. But tech is always improving!)
@tiffanycli amazing. I expected something like this to start happening but I did not expect it so soon.
Question: one way OpenAI et al try to evade this whole thing is by leaning into the "well it was made public on Teh Intertubes and we just scraped it 🤷‍♀️" defense. Who will be / should be liable if a script gets published on the open Web, and then sucked into an LLM training set?
Also, many everyday tools (office suites, for example) are starting to integrate "AI". Does the same question apply there?
@rysiek @tiffanycli "It was on the internet" is not a defence to copyright infringement, in any way. Copyright specifically applies to things that are published or performed publicly.
It's more of a grey area what qualifies as copying/infringement in relation to an LLM, since that isn't something specifically contemplated by the law. But nothing prevents multiple people involved in the copying, in different ways, from *all* being liable.
@TorontoWill @tiffanycli sure. This is complicated IMVHO because:
1. the EU Copyright Directive's text-and-data-mining exemption (which apparently covers LLM training)
2. the fact that copyright law "kicks in" when something gets published, not when something gets "ingested" so to speak.
@TorontoWill @tiffanycli regarding 2.:
IANAL, but as far as I understand copyright law, it kicks in when something gets published. You either are allowed to publish something somewhere, or you aren't.
But asking (quite reasonably in general!) for certain copyrighted work not to be used for training LLMs means asking to limit how certain material is used *post-publication*.
I.e. "you are allowed to publish this as long as no LLM will ever be trained on it". Which is… difficult to comply with?
@TorontoWill @tiffanycli yes, absolutely.
But my point is: if you have the right to publish something but are barred from letting LLMs be trained on it, do you actually have the right to publish?
Everything published (made public) might get sucked into an LLM training set, as we've seen with OpenAI etc. At this point anyone publishing anything already has to be aware of that…
@TorontoWill @tiffanycli we can ignore for now the legality of OpenAI sucking it into a training set for an LLM — I am specifically trying to wrap my head around the concept of someone having the right to publish something but not to let LLMs get trained on it.
And if WGA wants LLMs not to be trained on their scripts, they have to demand exactly that: not just "you shall not train LLMs on our scripts" but also "you will not allow any LLMs to be trained on our scripts."
@TorontoWill @tiffanycli otherwise that would be a loophole:
"Your honor, we did not ourselves train any LLMs on the material in question. We were allowed to publish and we did.
Third parties used the published materials to train LLMs on it, without our knowledge or consent.
Now, WGA's terms did not stipulate we are not allowed to use such LLMs that happened to be trained on this material without our involvement…"
@TorontoWill @tiffanycli you're right of course.
Let me re-phrase: if the WGA only asks studios not to feed the scripts into LLMs themselves, but studios still get full copyright to the scripts, there is a trivial loophole: studios are allowed to publish the scripts online; third parties train LLMs on them; studios then use those LLMs to replace writers.
@TorontoWill @tiffanycli so my point is that I think WGA needs to also ask for either or both of:
a) studios will ensure no LLMs are trained by anyone on the material
b) studios will refrain from using any LLMs possibly trained on these scripts by anyone
Do you see my point?