I'm quoted in this @arstechnica piece about that recent "AI generated" George Carlin special

I don't think it was written by AI

I found the whole thing grossly disrespectful, but I do slightly appreciate the meta-joke here that the AI-generated text is fake and was actually written by humans

https://arstechnica.com/ai/2024/01/did-an-ai-write-that-hour-long-george-carlin-special-im-not-convinced/

Did an AI write that hour-long “George Carlin” special? I’m not convinced.

"Everyone is ready to believe that AI can do things, even if it can't."

Ars Technica
“The real story here is… everyone is ready to believe that AI can do things, even if it can't,” Willison told Ars. “In this case, it's pretty clear what's going on if you look at the wider context of the show in question. But anyone without that context, [a viewer] is much more likely to believe that the whole thing was AI-generated… thanks to the massive ramp up in the quality of AI output we have seen in the past 12 months.”

Confirmed by the New York Times:

> Danielle Del, a spokeswoman for Sasso, said Dudesy is not actually an A.I.
>
> “It’s a fictional podcast character created by two human beings, Will Sasso and Chad Kultgen,” Del wrote in an email. “The YouTube video ‘I’m Glad I’m Dead’ was completely written by Chad Kultgen.”

https://www.nytimes.com/2024/01/26/arts/carlin-lawsuit-ai-podcast-copyright.html

George Carlin’s Estate Sues Podcasters Over A.I. Episode

The lawsuit claims that an hourlong comedy special on YouTube violated Carlin’s copyright.

The New York Times
@simon I’m not able to read the article, but it sounds like a copyright claim issue. Why would it be any less of a copyright violation if it wasn’t A.I.? That is, they claim they wrote it and not A.I., so does that change the copyright infringement claim?

@ramsey I don't see how it's a copyright violation if someone wrote an hour of original material trying to imitate George Carlin's style - where's the copyrighted content they are duplicating?

The lawsuit still has legs though, see point 81: "Defendants have knowingly and intentionally utilized and continue to utilize the name, image and likeness of Carlin without the consent of Plaintiffs"

That's "rights of publicity" which I believe is a separate thing from copyright

https://deadline.com/wp-content/uploads/2024/01/George-Carlin-AI-lawsuit.pdf

@simon > I don't see how it's a copyright violation if someone wrote an hour of original material trying to imitate George Carlin's style - where's the copyrighted content they are duplicating?

This is where I’m interested in understanding how the court will respond to cases like this. In a sense, the author of the material trained their brain on George Carlin’s copyrighted material and produced a work that imitates his style.

How is an LLM any different?

@ramsey this is effectively the same argument that's core to the NYT lawsuit against OpenAI and Microsoft - the argument is that the LLM itself is a derived work of the content that was used to train it, and that it falls outside of "fair use" criteria - that's the key question which needs to be decided in court
@simon How is the LLM responding when I ask it to quote from specific books? For example, I just prompted ChatGPT 3.5 to give me the first few paragraphs from The Hobbit, and it gave them to me verbatim.
@simon Not sure whether you saw my question here, but I’m still very curious and perplexed by this. If an LLM doesn’t store the full text of materials it was trained on, then how does it produce output like what I’m seeing?
@ramsey @simon I don’t know the details, specifically, but isn’t this somewhat like how you know what number comes after 1827391723793472349 without ever having counted to it?
@sean @simon Maybe? So, it can quote entire passages from books, based on that premise?

@ramsey @simon I’m not sure, either. Maybe it tokenizes and stores popular excerpts like the first few paragraphs.

I should probably have just stayed out of this; I admittedly don’t know what I’m talking about. (-:

@sean @simon Haha. It’s fun to guess (hypothesize) at what it does. 🤷‍♂️

I’m asking Simon because I know he’s done a lot of research on this. I’m very close to leaning towards LLMs not violating copyright if they don’t store copyrighted material and are only “learning” patterns. In that way, it’s very similar to the human brain. But if an LLM can reproduce the first few pages of copyrighted material, then that’s problematic, for me.

@ramsey @sean @simon Training LLMs on data for which no permission has been given is problematic to me.
@derickr @sean @simon I’m not saying it’s not problematic to me, but I’m open to thinking about it.

@ramsey @sean @simon

I dunno man. I'm pretty far on the other side. Giving model builders free range to train their stuff on things humans have built seems like a large transfer of wealth from the creative class to the technology class.

Also, if my kids' school wants to teach my kids music, they need to pay for that music. Even though it's just for training! Why give these model building billionaires a free ride?

@preinheimer @sean @simon I’m not saying they shouldn’t have to pay the creators.
@ramsey Thank you for correcting me!
@preinheimer I can’t tell whether this is sarcasm. How did I correct you?

@ramsey It's not sarcasm!

Just your clarification that you weren't suggesting that they shouldn't pay creators.

@preinheimer Stealing from creators to train their models is wrong and evil. My comment about (potentially) not violating copyright was more about how the LLM stores the information.

@ramsey @simon
My mental model of an LLM is that it's a "probability machine": given some input, it generates the most probable output.

If you want to go deeper, I have found this article by Stephen Wolfram quite helpful: https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/

What Is ChatGPT Doing … and Why Does It Work?

Stephen Wolfram explores the broader picture of what's going on inside ChatGPT and why it produces meaningful text. Discusses models, training neural nets, embeddings, tokens, transformers, language syntax.
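
To make the "probability machine" idea concrete, here is a minimal sketch in Python. It is a toy word-pair counter, not how ChatGPT is actually built (real LLMs use neural networks over tokens); the corpus and the function names `train`, `next_word_probabilities` and `most_probable_next` are made up for illustration.

```python
from collections import Counter, defaultdict


def train(text):
    """Count which word follows each word in the training text."""
    words = text.split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def next_word_probabilities(counts, word):
    """Turn the raw follow-up counts for `word` into probabilities."""
    total = sum(counts[word].values())
    return {w: c / total for w, c in counts[word].items()}


def most_probable_next(counts, word):
    """Pick the single most frequent follow-up word."""
    return counts[word].most_common(1)[0][0]


# Hypothetical toy corpus: "the" is followed by "cat" twice and "mat" once.
corpus = "the cat sat on the mat and the cat slept"
model = train(corpus)

print(next_word_probabilities(model, "the"))
print(most_probable_next(model, "the"))
```

Given the input "the", this toy model rates "cat" as twice as likely as "mat" and so outputs "cat". A real model does something loosely similar over a far larger context, and it usually samples from the probabilities rather than always taking the top choice.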

@ramsey my current mental model is that memorization can happen if it's seen multiple copies of the same text, such that it effectively encodes an extremely high probability of word 60 in that text following words 1 through 59
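
A rough way to see that mental model in action is a toy next-word counter in Python that looks at the previous two words. This is only an illustration under loud assumptions: real LLMs are neural networks rather than count tables, and the passage, the competing documents, the two-word context and the names `train`, `probability` and `complete` are all invented for this sketch.

```python
from collections import Counter, defaultdict

CONTEXT = 2  # assumed context length: predict from the previous two words


def train(documents):
    """Count which word follows each two-word context across all documents."""
    counts = defaultdict(Counter)
    for doc in documents:
        words = doc.split()
        for i in range(len(words) - CONTEXT):
            context = tuple(words[i:i + CONTEXT])
            counts[context][words[i + CONTEXT]] += 1
    return counts


def probability(counts, context, word):
    """Estimated probability that `word` follows `context`."""
    total = sum(counts[context].values())
    return counts[context][word] / total if total else 0.0


def complete(counts, prompt, max_words=20):
    """Greedy decoding: always append the most probable next word."""
    words = prompt.split()
    for _ in range(max_words):
        options = counts.get(tuple(words[-CONTEXT:]))
        if not options:
            break
        words.append(options.most_common(1)[0][0])
    return " ".join(words)


# Hypothetical training data: one passage plus documents that compete with it.
passage = "the quick brown fox jumps over the lazy dog"
other = [
    "the quick brown bear sleeps",
    "the quick brown bear eats",
    "jumps over the fence",
]

seen_once = train([passage] + other)
seen_often = train([passage] * 10 + other)  # passage duplicated ten times

print(probability(seen_once, ("quick", "brown"), "fox"))
print(probability(seen_often, ("quick", "brown"), "fox"))
print(complete(seen_once, "the quick"))
print(complete(seen_often, "the quick"))
```

With a single copy, "fox" is less likely than "bear" after "quick brown", so the completion wanders off the passage. With ten copies, every step of the passage becomes the most probable continuation and the toy model reproduces it word for word, even though it only stores counts and no copy of the text as such.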
@simon I guess the question the courts will have to answer is whether capturing the probability at such a high level is enough to constitute holding a copy of the work, since the work can be reproduced with such a low level of effort, when prompted.
@ramsey yeah that feels like the right question to me - and honestly I don't think there's an obvious "right" answer to it, no idea how this will shake out in court
@ramsey but... the NYT lawsuit has lots of examples of it memorizing full articles - were those present multiple times in the training data or did OpenAI mark NYT content as specifically "high quality" in a way that made it more likely to memorize them?