When it comes to AI art (or "art"), it's hard to find a nuanced position that respects creative workers' labor rights, free expression, copyright law's vital exceptions and limitations, and aesthetics.

--

If you'd like an essay-formatted version of this thread to read or share, here's a link to it on pluralistic.net, my surveillance-free, ad-free, tracker-free blog:

https://pluralistic.net/2024/05/13/spooky-action-at-a-close-up/#invisible-hand

1/

I am, on balance, opposed to AI art, but there are some important caveats to that position. For starters, I think it's unequivocally wrong - as a matter of law - to say that scraping works and training a model with them infringes copyright. This isn't a moral position (I'll get to that in a second), but rather a technical one.

2/

Break down the steps of training a model and it quickly becomes apparent why it's technically wrong to call this a copyright infringement. First, the act of making transient copies of works - even billions of works - is unequivocally fair use. Unless you think search engines and the @internetarchive shouldn't exist, you should support scraping at scale:

https://pluralistic.net/2023/09/17/how-to-think-about-scraping/

3/

And unless you think that Facebook should be allowed to use the law to block projects like Ad Observer, which gathers samples of paid political disinformation, you should support scraping at scale, *even when the site being scraped objects* (at least sometimes):

https://pluralistic.net/2021/08/06/get-you-coming-and-going/#potemkin-research-program

After making transient copies of lots of works, the next step in AI training is to subject them to mathematical analysis. Again, this isn't a copyright violation.
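
To make that step concrete, here's a minimal sketch (in Python, with placeholder URLs) of what "transient copy, then analysis" looks like: the fetched text lives only long enough to update aggregate counts, and then it's gone.

```python
from collections import Counter
import re
import urllib.request

def word_counts(urls):
    totals = Counter()
    for url in urls:
        with urllib.request.urlopen(url) as resp:  # transient copy, in memory only
            text = resp.read().decode("utf-8", errors="replace")
        totals.update(re.findall(r"[a-z']+", text.lower()))
        # `text` goes out of scope here: no permanent copy is retained,
        # only the quantitative observations accumulated in `totals`.
    return totals

print(word_counts(["https://example.com/"]).most_common(10))
```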

4/

Making quantitative observations about works is a longstanding, respected and important tool for criticism, analysis, archiving and new acts of creation. Measuring the steady contraction of the vocabulary in successive Agatha Christie novels turns out to offer a fascinating window into her dementia:

https://www.theguardian.com/books/2009/apr/03/agatha-christie-alzheimers-research
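
A toy version of that kind of measurement might look like the sketch below. The filenames are hypothetical and the real study used far more careful methods, but the principle - counting, not copying - is the same:

```python
import re

def vocabulary_richness(path, window=10_000):
    # Distinct words in the first `window` running words: a crude
    # stand-in for the study's richer vocabulary measures.
    with open(path, encoding="utf-8") as f:
        words = re.findall(r"[a-z']+", f.read().lower())
    sample = words[:window]
    return len(set(sample)) / len(sample)

# Hypothetical filenames: an early novel vs. a late one.
for novel in ["mysterious_affair_1920.txt", "elephants_can_remember_1972.txt"]:
    print(novel, round(vocabulary_richness(novel), 3))
```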

5/

Programmatic analysis of scraped online speech is also critical to the burgeoning formal analyses of the language spoken by minorities, producing a vibrant account of the rigorous grammar of dialects that have long been dismissed as "slang":

https://www.researchgate.net/publication/373950278_Lexicogrammatical_Analysis_on_African-American_Vernacular_English_Spoken_by_African-Amecian_You-Tubers

6/

Since 1988, UCL's Survey of English Usage has maintained its "International Corpus of English," and scholars have plumbed its depths to draw important conclusions about the wide variety of Englishes spoken around the world, especially in postcolonial English-speaking countries:

https://www.ucl.ac.uk/english-usage/projects/ice.htm

7/

The final step in training a model is publishing the conclusions of the quantitative analysis of the temporarily copied documents as software code. Code itself is a form of expressive speech - and that expressivity is key to the fight for privacy, because the fact that code is speech limits how governments can censor software:

https://www.eff.org/deeplinks/2015/04/remembering-case-established-code-speech/

8/

Are models infringing? Well, they certainly *can* be. In some cases, it's clear that models "memorized" some of the data in their training set, making the fair use, transient copy into an infringing, permanent one. That's generally considered to be the result of a programming error, and it could certainly be prevented (say, by comparing the model to the training data and removing any memorizations that appear).
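
Here's a hedged sketch of what such a check could look like: index every long word-sequence in the training set, then flag any model output that reproduces one verbatim. The function names are illustrative, not any model-maker's actual tooling:

```python
def ngrams(tokens, n=12):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_index(training_texts, n=12):
    # Every 12-word sequence in the training set, as a set for fast lookup.
    index = set()
    for text in training_texts:
        index |= ngrams(text.split(), n)
    return index

def looks_memorized(model_output, index, n=12):
    # Any verbatim 12-word overlap with training data is a red flag
    # worth investigating (and possibly suppressing).
    return any(gram in index for gram in ngrams(model_output.split(), n))
```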

9/

Not every seeming act of memorization *is* a memorization, though. While specific models vary widely, the amount of data from each training item retained by the model is *very* small. For example, Midjourney retains about one byte of information from each image in its training data. If we're talking about a typical low-resolution web image of, say, 300kB (300,000 bytes), that would be one three-hundred-thousandth (about 0.00033%) of the original image.
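
The arithmetic, for anyone who wants to check it:

```python
retained_bytes = 1          # roughly what Midjourney keeps per training image
original_bytes = 300_000    # a ~300kB web image
fraction = retained_bytes / original_bytes
print(fraction)             # ~3.33e-06, i.e. one three-hundred-thousandth
print(f"{fraction:.5%}")    # 0.00033%
```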

10/

Typically in copyright, when one work contains 0.00033% of another work, we don't even raise fair use. Rather, we dismiss the use as *de minimis* (short for *de minimis non curat lex*, or "the law does not concern itself with trifles"):

https://en.wikipedia.org/wiki/De_minimis

Busting someone who takes 0.00033% of your work for copyright infringement is like swearing out a trespassing complaint against someone because the edge of their shoe touched one blade of grass on your lawn.

11/

But some works, or elements of works, appear *many* times online. For example, the Getty Images watermark appears on millions of similar images of people standing on red carpets and runways, so a model that takes even an infinitesimal sample of each one of those works might still end up being able to produce a whole, recognizable Getty Images watermark.

12/

The same is true for wire-service copy or widely syndicated texts: there might be dozens or even hundreds of copies of these works in training data, resulting in the memorization of long passages from them.

This *might* be infringing (we're getting into some gnarly, unprecedented territory here), but again, even if it is, it wouldn't be a big hardship for model makers to post-process their models by comparing them to the training set, deleting any inadvertent memorizations.
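
One plausible mitigation - sketched here on the assumption that duplication in the training set is the root cause - is to deduplicate near-identical passages before training, so no single text recurs often enough to be memorized:

```python
import hashlib

def dedupe_passages(passages):
    # Normalize whitespace and case so trivially reformatted copies of the
    # same wire story hash identically, then keep only the first copy.
    seen, unique = set(), []
    for passage in passages:
        key = hashlib.sha256(" ".join(passage.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(passage)
    return unique
```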

13/

Even if the resulting model had *zero* memorizations, this would do nothing to alleviate the (legitimate) concerns of creative workers about the creation and use of these models.

So here's the first nuance in the AI art debate: as a *technical* matter, training a model isn't a copyright infringement. Creative workers who hope that they can use copyright to prevent AI from changing the creative labor market are likely to be very disappointed in court:

https://www.hollywoodreporter.com/business/business-news/sarah-silverman-lawsuit-ai-meta-1235669403/

14/

But copyright law isn't a fixed, eternal entity. We write new copyright laws all the time. If *current* copyright law doesn't prevent the creation of models, what about a *future* copyright law?

Well, sure, that's a possibility. The first thing to consider is the possible collateral damage of such a law. The legal space for scraping enables a wide range of scholarly, archival, organizational and critical purposes.

15/

We'd have to be *very* careful not to inadvertently ban, say, the scraping of a politician's campaign website, lest we enable liars to run for office and renege on their promises, while they insist that they never made those promises in the first place. We wouldn't want to abolish search engines, or stop creators from scraping their own work off sites that are going away or changing their terms of service.

16/

Now, onto quantitative analysis: counting words and measuring pixels are *not* activities that you should need permission to perform, with or without a computer, even if the person whose words or pixels you're counting doesn't want you to. You should be able to look as hard as you want at the pixels in Kate Middleton's family photos, or track the rise and fall of the Oxford comma, and you shouldn't need anyone's permission to do so.
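
For example, here's a deliberately naive Oxford-comma counter - the kind of counting no one should need a license to perform:

```python
import re

OXFORD = re.compile(r"\w+, \w+, (?:and|or) \w+")
NON_OXFORD = re.compile(r"\w+, \w+ (?:and|or) \w+")

def comma_style(text):
    return {"oxford": len(OXFORD.findall(text)),
            "non_oxford": len(NON_OXFORD.findall(text))}

print(comma_style("We invited the strippers, JFK, and Stalin."))
# {'oxford': 1, 'non_oxford': 0}
```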

17/

Finally, there's publishing the model. There are plenty of published mathematical analyses of large corpuses that are useful and unobjectionable. I love me a good Google n-gram:

https://books.google.com/ngrams/graph?content=fantods%2C+heebie-jeebies&year_start=1800&year_end=2019&corpus=en-2019&smoothing=3

And large language models fill all kinds of important niches, like the Human Rights Data Analysis Group's LLM-based work helping the Innocence Project New Orleans extract data from wrongful conviction case files:

https://hrdag.org/tech-notes/large-language-models-IPNO.html

18/

So that's nuance number two: if we decide to make a new copyright law, we'll need to be *very* sure that we don't accidentally crush these beneficial activities that don't undermine artistic labor markets.

This brings me to the most important point: *passing a new copyright law that requires permission to train an AI won't help creative workers get paid or protect our jobs*.

19/

Getty Images pays photographers the *least* it can get away with. Publishers' contracts have transformed, by inches, into miles-long, ghastly rights grabs that take *everything* from writers but *still* shift legal risks onto them:

https://pluralistic.net/2022/06/19/reasonable-agreement/

Publishers like the *New York Times* bitterly oppose their writers' unions:

https://actionnetwork.org/letters/new-york-times-stop-union-busting

20/

These large corporations already control the copyrights to *gigantic* amounts of training data, and they have means, motive and opportunity to license these works for training a model in order to pay us less, and they are engaged in this activity *right now*:

https://www.nytimes.com/2023/12/22/technology/apple-ai-news-publishers.html

21/

Big games studios are *already* acting as though there were a copyright in training data, requiring their voice actors to begin every recording session with words to the effect of "I hereby grant permission to train an AI with my voice." If you don't like it, you can hit the bricks:

https://www.vice.com/en/article/5d37za/voice-actors-sign-away-rights-to-artificial-intelligence

22/

If you're a creative worker hoping to pay your bills, it doesn't matter whether your wages are eroded by a model produced without paying your employer for the right to do so, or whether your employer got to double dip by selling your work to an AI company to train a model, and then used that model to fire you or erode your wages:

https://pluralistic.net/2023/02/09/ai-monkeys-paw/#bullied-schoolkids

23/
