Tech companies have gotten increasingly secretive about the data, scraped from the internet without compensation or consent, used to train their AI models. So we looked closer. Here's our analysis of the 15 million websites in just one highly-filtered CommonCrawl web scrape, used to train models like Google's T5 & Facebook's LLaMA.
We found:
-the copyright symbol appears >200M times
-pirated sites, 1 for e-books
-half of the top 10 were news sites
https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning
See the websites that make AI bots like ChatGPT sound so smart

An analysis of a chatbot data set by The Washington Post reveals the proprietary, personal, and often offensive websites that go into an AI’s training data.

The Washington Post

@nitashatiku If you’re going to do this, why don’t you retrieve the ‘robots.txt’ from each site? See how many of them (1) have one, (2) don’t disallow bots, and (3) have _explicit sitemaps_ to content.

The bots were invited in. You may hate it now, but they were invited.

That’s because _this is how it works_. Folks wanted SEO, so they offered up their content to be found.

I get the frustration, I really do, but it’s super-clear to me: we invited the bots in to read our content and…they did.
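
For anyone who wants to try the robots.txt survey suggested in this reply, here is a minimal sketch using only Python's standard library. The function names `fetch_robots` and `summarize_robots` and the result keys are hypothetical, chosen for illustration. It assumes Python 3.8+ (for `site_maps()`), treats an HTTP 404 as "no robots.txt", and checks only whether the site root is allowed for one user agent, which is a simplification of "don't disallow bots".

```python
from urllib.error import HTTPError
from urllib.parse import urljoin
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser


def fetch_robots(site_url, timeout=10):
    """Download a site's robots.txt; return None if the site has none (404)."""
    robots_url = urljoin(site_url, "/robots.txt")
    try:
        with urlopen(robots_url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except HTTPError as err:
        if err.code == 404:
            return None
        raise


def summarize_robots(robots_txt, site_url, user_agent="*"):
    """Answer the three questions for one site.

    robots_txt is the file body as a string, or None if the site has none.
    """
    if robots_txt is None:
        # Assumption: no robots.txt means crawlers are not restricted.
        return {"has_robots": False, "allows_bots": True, "sitemaps": []}

    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        "has_robots": True,                                      # (1)
        "allows_bots": parser.can_fetch(user_agent, site_url),   # (2)
        "sitemaps": parser.site_maps() or [],                    # (3)
    }
```

Calling `summarize_robots(fetch_robots(url), url)` for each site and tallying the three fields would give the counts asked for. Passing `user_agent="CCBot"` (the name Common Crawl's crawler identifies itself with) checks the rules for that crawler specifically instead of the wildcard entry.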

@cypherfox @nitashatiku

stop defending the rich criminals, they'll never reward you for it

@troglodyt @nitashatiku Hah; you’re funny, I like it! 🤣

I don’t care about them; I care about the technology, and the law. Making crawling illegal because of copyright would make LLM technology (and search engines and other things I find deeply valuable) impossible, and I’m not okay with that.

Plus it’s always good to know what the law IS, rather than what you feel the law SHOULD be. See Field v. Google, Inc. for a useful example on crawling/scraping/indexing.

But you know…you do you. 👋

@cypherfox @nitashatiku

i think you should stop defending rich criminals, it's embarrassing and bad

@cypherfox @nitashatiku

you can spend a week defending the poor who shoplift; that ought to change your perspective a bit