Okay, people keep telling me to read this NY Mag profile of Emily Bender, and they're right. It's a fantastic read. However, this line is... wrong (or misleading). Everything that ChatGPT trains on is also covered by copyright. The idea that it can't do books because of copyright is just wrong. It can't train based on ebooks, because the ebooks are locked up and not publicly available (without great cost).

https://nymag.com/intelligencer/article/ai-artificial-intelligence-chatbots-emily-m-bender.html

@mmasnick also it's dead wrong about all e-books on Gutenberg or other open-access platforms
@mmasnick Is a bot trained by humans and their output expected to be better than humans - and wouldn't that cause ethical issues as well?
@mmasnick Enjoyed the article, but this quote from the article was cringeworthy as well. It's not that very few people understand how to make LLMs; it's that very few people can afford to train LLMs. As for the very precise $15.7 trillion estimate ... no comment.
@mmasnick Everything ChatGPT trains on is covered by copyright -- but some of it is explicitly licensed for transformative use (e.g., Wikipedia, and a lot of Creative Commons stuff). And among the stuff that isn't, there's still, well... a spectrum of likelihoods that the copyright owner would actually sue. Trawling libraries is more likely to attract well-funded litigants than, say, random blogs.
@rst that's true, but even then, based on the rulings in book scanning cases (both Google books and Hathitrust), there's no way scanning for AI purposes isn't fair use.
@mmasnick Wow. Yeah, the idea that copyright law only applies to books is... um... Well, it's dead wrong for one!
@SarahAnneDipity i mean, if we go back to 1790, copyright only applied to books, maps, and charts, so not the internet. But, also, the internet didn't exist. And copyright law has... changed.
@mmasnick I had the exact same thought. It also occurred to me that some LLMs probably have been trained on books (looks in the direction of Google Books). I also agree that it's a good profile.

@mmasnick The beginning and middle were excellent, and overall the article is a great antidote to the hype around LLMs. I had some issues with the article towards the end.

I’m unconvinced that everyone, or even most people, will end up confusing chatbots for people, at least not with the current line of research involving LLMs. Sure, some tech people might claim that they think there’s no difference between chatbots and humans, but for the time being I attribute this to tech bros boosting a tech bubble they benefit from, rather than anything more malicious.

And I do think it will be a bubble this time around: given the poor performance of those chatbots on tasks requiring logic, let alone those requiring real-world knowledge, they will fail to meet the vast amount of hype that’s been building behind them. (There is a huge risk that this will lead to larger amounts of spam, phishing, and short-form disinformation, but much of that problem is an issue of sheer quantity rather than inability to distinguish between man and machine.)

I am also not convinced that the fascism scenarios the article discusses in this context are likely; in fact, I almost want to invoke Godwin’s Law, though I admittedly do not have either the knowledge or the confidence in my social awareness to conclusively do so. When it comes to fascism, I am far more worried about real-world strongmen like Putin, Orbán, and, on our soil, DeSantis, who are a clear and current authoritarian threat.

@mmasnick There is compelling evidence that ChatGPT *was* trained on books under copyright, because it can be prompt-engineered into emitting them as outputs.

E.g., https://medium.com/@neonforge/chatgpt-copyright-concerns-and-potential-legal-consequences-for-openai-56feb6974c27

@mmasnick *won't. Because they can't just ignore copyright on those and take the information for free.
@mmasnick please read this article, it's really interesting.
@mmasnick it’s why they don’t do correct references (and so shouldn’t pass college assessments. Ever.). Because they are constrained not to quote copyrighted material, they instead draw a picture that looks kinda like it might be right if you squint. Like AI artists putting text in pictures.