Tech companies have gotten increasingly secretive about the data, scraped from the internet without compensation or consent, used to train their AI models. So we looked closer. Here's our analysis of the 15 million websites in just one highly-filtered CommonCrawl web scrape-used to train models like Google's T5 & Facebook's LLaMA.
We found
-the copyright symbol appears >200M times
-pirated sites, 1 for e-books
-half of the top 10 were news sites https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning
See the websites that make AI bots like ChatGPT sound so smart

An analysis of a chatbot data set by The Washington Post reveals the proprietary, personal, and often offensive websites that go into an AI’s training data.

The Washington Post
I've been obsessed with this topic ever since I read ver since I read this excellent paper from
Jesse Dodge Allen Institute, @meg
@mmitchell_ai and others
https://arxiv.org/pdf/2104.08758.pdf and saw their graph of the top websites in Google's C4, definitely worth your time