Tech companies have gotten increasingly secretive about the data, scraped from the internet without compensation or consent, used to train their AI models. So we looked closer. Here's our analysis of the 15 million websites in just one highly filtered Common Crawl web scrape used to train models like Google's T5 & Facebook's LLaMA.
We found:
- the copyright symbol appears >200M times
- pirated sites, including one for e-books
- half of the top 10 were news sites
https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning
See the websites that make AI bots like ChatGPT sound so smart

An analysis of a chatbot data set by The Washington Post reveals the proprietary, personal, and often offensive websites that go into an AI’s training data.

The Washington Post

@nitashatiku If you’re going to do this, why don’t you retrieve the ‘robots.txt’ from each site. See how many of them (1) have one, and (2) don’t disallow bots? And (3) have _explicit sitemaps_ to content.

The bots were invited in. You may hate it now, but they were invited.

That’s because _this is how it works_. Folks wanted SEO, so they offered their content up to be found.

I get the frustration, I really do, but it’s super-clear to me: we invited the bots in to read our content and…they did.
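The robots.txt check proposed above is easy to automate. Here is a minimal sketch using Python's standard-library `urllib.robotparser`; it parses a policy and reports whether a given crawler may fetch the site root and whether the file lists explicit sitemaps. `CCBot` is Common Crawl's published user agent; the domain in the sample policy is a placeholder.

```python
# Sketch: inspect a robots.txt policy the way the thread suggests:
# does it disallow a given bot, and does it point at a sitemap?
from urllib import robotparser

def crawl_permission(robots_txt: str, user_agent: str = "CCBot") -> dict:
    """Parse a robots.txt body and report what it says for one crawler."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {
        # An absent or empty robots.txt means "allow everything".
        "root_allowed": rp.can_fetch(user_agent, "/"),
        # site_maps() returns the listed Sitemap URLs, or None if there are none.
        "sitemaps": rp.site_maps(),
    }

# A site that blocks Common Crawl's bot but leaves everyone else alone:
policy = """
User-agent: CCBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

print(crawl_permission(policy))               # CCBot is disallowed
print(crawl_permission(policy, "Googlebot"))  # unlisted agents are allowed
```

In a real survey you would fetch `https://<domain>/robots.txt` for each of the 15 million domains (e.g. with `RobotFileParser.set_url()` and `read()`) rather than parsing inline strings; note that a missing file is treated as permission, which is exactly the point being argued in this thread.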

@cypherfox @nitashatiku

stop defending the rich criminals, they'll never reward you for it

@troglodyt @nitashatiku Hah; you’re funny, I like it! 🤣

I don’t care about them; I care about the technology, and the law. Making crawling illegal because of copyright would make LLM technology (and search engines and other things I find deeply valuable) impossible, and I’m not okay with that.

Plus it’s always good to know what the law IS, rather than what you feel the law SHOULD be. See Field v. Google, Inc. for a useful example on crawling/scraping/indexing.

But you know…you do you. 👋

@cypherfox @nitashatiku

i think you should stop defending rich criminals, it's embarrassing and bad

@cypherfox @nitashatiku

you could spend a week defending poor people who shoplift, that ought to change your perspective a bit

@troglodyt @nitashatiku Now you’re getting weird; I’m explaining the legal situation and pointing out that if you DON’T want bots reading your content, take even the smallest step to block them by making your robots.txt hostile to them.

I get that you conflate pointing out the way it works with approving of the people doing it, but…that’s not a ‘me’ problem.

Maybe take a look at https://commoncrawl.org and see if you really DO disapprove of their methods. You might be surprised.

Best of luck.
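For concreteness, the "hostile robots.txt" mentioned above is only a few lines. A sketch (CCBot is Common Crawl's documented user agent; GPTBot is OpenAI's; adjust the list to whichever crawlers you want to exclude):

```
# Block AI-training crawlers; allow everyone else.
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
```

Serving this file at `https://yoursite.example/robots.txt` is the "smallest step" being described: well-behaved crawlers check it before fetching anything else.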

Common Crawl - Open Repository of Web Crawl Data

We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone.