Tech companies have gotten increasingly secretive about the data, scraped from the internet without compensation or consent, used to train their AI models. So we looked closer. Here's our analysis of the 15 million websites in just one highly filtered CommonCrawl web scrape used to train models like Google's T5 and Facebook's LLaMA.
We found:
- the copyright symbol appears more than 200 million times
- pirated sites, including one for e-books
- half of the top 10 were news sites

https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning
See the websites that make AI bots like ChatGPT sound so smart

An analysis of a chatbot data set by The Washington Post reveals the proprietary, personal, and often offensive websites that go into an AI’s training data.

The Washington Post

@nitashatiku If you’re going to do this, why don’t you retrieve the ‘robots.txt’ from each site and see how many of them (1) have one, (2) don’t disallow bots, and (3) have _explicit sitemaps_ to content?

The bots were invited in. You may hate it now, but they were invited.

That’s because _this is how it works_. Folks wanted SEO, so they offered up their content to be found.

I get the frustration, I really do, but it’s super-clear to me: we invited the bots in to read our content and…they did.
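The robots.txt audit described above can be sketched with Python's standard `urllib.robotparser`. This is a minimal illustration, not the Post's methodology; the example robots.txt body and the bot names (`CCBot` is Common Crawl's crawler) are assumptions chosen for demonstration:

```python
from urllib.robotparser import RobotFileParser

def bot_allowed(robots_txt: str, user_agent: str, path: str = "/") -> bool:
    """Check whether a robots.txt body permits `user_agent` to fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

# Hypothetical robots.txt: blocks Common Crawl's bot, allows everyone else.
example = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow:
"""
```

With this policy, `bot_allowed(example, "CCBot")` returns `False` while `bot_allowed(example, "Googlebot")` returns `True`; in a real audit you would fetch each site's `/robots.txt` over HTTP and tally the three conditions.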

@cypherfox @nitashatiku Opting-in for content to be searchable shouldn't be the same as opting-in for it to be copied, though
@cypherfox @nitashatiku For commercial purposes specifically, I might add

@Quisley @nitashatiku I understand that that’s how you feel, but I don’t think that’s how it’ll play out in a court. And… In what way are search engines not commercial purposes?

They run ads next to your site links in search results. They have a money-printing press, for goodness sake. 🤣

Opting in to indexing is definitely opting in to a commercial use. That LLMs are not the commercial use you had in mind…well that’ll be a fascinating argument to watch, but I wouldn’t put money on either side.