Tech companies have gotten increasingly secretive about the data, scraped from the internet without compensation or consent, used to train their AI models. So we looked closer. Here's our analysis of the 15 million websites in just one highly-filtered CommonCrawl web scrape, used to train models like Google's T5 & Facebook's LLaMA.
We found:
- the copyright symbol appears >200M times
- pirated sites, 1 for e-books
- half of the top 10 were news sites
https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning
See the websites that make AI bots like ChatGPT sound so smart

An analysis of a chatbot data set by The Washington Post reveals the proprietary, personal, and often offensive websites that go into an AI’s training data.

The Washington Post

@nitashatiku If you’re going to do this, why don’t you retrieve the ‘robots.txt’ from each site? See how many of them (1) have one, (2) don’t disallow bots, and (3) have _explicit sitemaps_ to content.

The bots were invited in. You may hate it now, but they were invited.

That’s because _this is how it works_. Folks wanted SEO, so they offered up their content to be found.

I get the frustration, I really do, but it’s super-clear to me: we invited the bots in to read our content and…they did.
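
The robots.txt check proposed above could be sketched roughly as follows, using Python's standard `urllib.robotparser`. This is only an illustration, not how The Washington Post or Common Crawl did anything: the helper names `fetch_robots` and `analyze_robots`, the example.com URLs, and the choice of `CCBot` (Common Crawl's user-agent token) as the default bot are all assumptions made for the example.

```python
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser


def fetch_robots(site):
    """Return a site's robots.txt body, or None if it is missing or unreachable."""
    try:
        with urlopen(f"{site.rstrip('/')}/robots.txt", timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:  # covers URLError and HTTPError (e.g. a 404)
        return None


def analyze_robots(robots_txt, user_agent="CCBot"):
    """Answer questions (2) and (3) for one robots.txt body.

    (1) "has one" is simply whether fetch_robots() returned a body.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        # (2) may this bot fetch the site root?
        "allows_root": parser.can_fetch(user_agent, "/"),
        # (3) which sitemaps does the file list explicitly?
        "sitemaps": parser.site_maps() or [],
    }


# Hypothetical robots.txt: open to all bots, with an explicit sitemap.
open_site = "User-agent: *\nDisallow:\nSitemap: https://example.com/sitemap.xml\n"
# Hypothetical robots.txt: turns away CCBot only.
closed_site = "User-agent: CCBot\nDisallow: /\n"

open_result = analyze_robots(open_site)
closed_result = analyze_robots(closed_site)
```

Running `fetch_robots` over the full site list and tallying the `analyze_robots` results would give the three counts asked for. Note that this only reports what a robots.txt says today, not what it said when a given crawl happened.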

@cypherfox @nitashatiku “I want my site to be discoverable in a search engine” and “I want to train someone else’s LLM” are two very different things

@chucker @nitashatiku The bulk of the data comes from the Common Crawl, an open source project that crawls sites whose robots.txt permits it.

https://commoncrawl.org

Read their process and reason for existing. If you disagree with it, that’s okay too, but they’re really open about it all.

It’s not like OpenAI or other organizations did the crawl themselves (for the most part, afaik). They’re relying on these open data projects.

It’s also easy to block it if you want to.

Common Crawl - Open Repository of Web Crawl Data

We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone.
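
For the "easy to block" point above, a minimal sketch: Common Crawl's crawler identifies itself as `CCBot`, so blocking it usually means a `Disallow` rule for that user agent in robots.txt. The rules below are an example, checked with Python's standard `urllib.robotparser`; the `is_allowed` helper and the `BLOCK_CCBOT` name are hypothetical, and this assumes the crawler honors robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt rules: turn away Common Crawl's crawler (CCBot)
# while leaving every other crawler unrestricted.
BLOCK_CCBOT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt, user_agent, path="/"):
    """Return True if robots_txt lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


ccbot_allowed = is_allowed(BLOCK_CCBOT, "CCBot")
search_bot_allowed = is_allowed(BLOCK_CCBOT, "Googlebot")
```

A rule like this only affects future crawls; it does nothing about pages already collected in earlier ones.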

@chucker @cypherfox @nitashatiku Indeed: Getty Images found its watermark displayed prominently in many “AI art” (strong quotes) all over the Internet.
#AItheft