Tech companies have gotten increasingly secretive about the data, scraped from the internet without compensation or consent, used to train their AI models. So we looked closer. Here's our analysis of the 15 million websites in just one highly filtered Common Crawl web scrape, used to train models like Google's T5 & Facebook's LLaMA.
We found
-the copyright symbol appears >200M times
-pirated sites, 1 for e-books
-half of the top 10 were news sites https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning
See the websites that make AI bots like ChatGPT sound so smart

An analysis of a chatbot data set by The Washington Post reveals the proprietary, personal, and often offensive websites that go into an AI’s training data.

The Washington Post
I've been obsessed with this topic ever since I read this excellent paper from Jesse Dodge (Allen Institute), @mmitchell_ai and others
https://arxiv.org/pdf/2104.08758.pdf and saw their graph of the top websites in Google's C4. Definitely worth your time.
@nitashatiku
Checked a few of my sites. Low down in the order, sure, but all have copyright statements and robot blockers. Hmmn.

@nitashatiku If you’re going to do this, why don’t you retrieve the ‘robots.txt’ from each site. See how many of them (1) have one, and (2) don’t disallow bots? And (3) have _explicit sitemaps_ to content.

The bots were invited in. You may hate it now, but they were invited.

That’s because _this is how it works_. Folks wanted SEO, so they offered their content up to be found.

I get the frustration, I really do, but it’s super-clear to me: we invited the bots in to read our content and…they did.
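The three checks proposed above (does a site have a robots.txt, does it leave bots un-disallowed, and does it list explicit sitemaps) can be sketched with Python's standard library. This is only a rough sketch under assumptions not stated in the thread: `CCBot` is used as the crawler name because it is Common Crawl's user agent, a missing file is treated as "crawling allowed" per the usual robots.txt convention, and any fetch failure is lumped in with "no file". The function names are made up for illustration.

```python
import urllib.error
import urllib.request
from urllib.robotparser import RobotFileParser


def fetch_robots(domain, timeout=10):
    """Return the robots.txt text for a domain, or None if it can't be fetched.

    Simplification: every fetch error (404, DNS failure, timeout) counts as
    "no robots.txt". A real survey would separate these cases.
    """
    url = f"https://{domain}/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, TimeoutError):
        return None


def summarize_robots(robots_txt, user_agent="CCBot", url="https://example.com/"):
    """Answer the three questions for one site's robots.txt text."""
    if robots_txt is None:
        # Assumption: no robots.txt means nothing is disallowed.
        return {"has_robots": False, "allows_bot": True, "sitemaps": []}
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        "has_robots": True,
        "allows_bot": parser.can_fetch(user_agent, url),
        # site_maps() returns None when no Sitemap: lines are present.
        "sitemaps": parser.site_maps() or [],
    }
```

Running `summarize_robots(fetch_robots(domain))` over a list of domains and tallying the three fields would give the counts asked for. It only looks at the site root, so a site that blocks some paths but not others still counts as allowing the bot.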

@cypherfox @nitashatiku Opting-in for content to be searchable shouldn't be the same as opting-in for it to be copied, though
@cypherfox @nitashatiku For commercial purposes specifically, I might add

@Quisley @nitashatiku I understand that that’s how you feel, but I don’t think that’s how it’ll play out in a court. And… In what way are search engines not commercial purposes?

They run ads next to your site links in search results. They have a money-printing press, for goodness sake. 🤣

Opting in to indexing is definitely opting in to a commercial use. That LLMs are not the commercial use you had in mind…well that’ll be a fascinating argument to watch, but I wouldn’t put money on either side.

@cypherfox @nitashatiku

stop defending the rich criminals, they'll never reward you for it

@troglodyt @nitashatiku Hah; you’re funny, I like it! 🤣

I don’t care about them; I care about the technology, and the law. Making crawling illegal because of copyright would make LLM technology (and search engines and other things I find deeply valuable) impossible, and I’m not okay with that.

Plus it’s always good to know what the law IS, rather than what you feel the law SHOULD be. See Field v. Google, Inc. for a useful example on crawling/scraping/indexing.

But you know…you do you. 👋

@cypherfox @nitashatiku

i think you should stop defending rich criminals, it's embarrassing and bad

@cypherfox @nitashatiku

you can spend a week defending the poor shoplifting, that ought to change your perspective a bit

@troglodyt @nitashatiku Now you’re getting weird; I’m explaining the legal situation and pointing out that if you DON’T want bots reading your content, take even the smallest step to block them by making your robots.txt hostile to them.

I get that you conflate pointing out the way it works with approving of the people doing it, but…that’s not a ‘me’ problem.

Maybe take a look at https://commoncrawl.org and see if you really DO disapprove of their methods. You might be surprised.

Best of luck.

Common Crawl - Open Repository of Web Crawl Data

We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone.

@cypherfox @nitashatiku “I want my site to be discoverable in a search engine” and “I want to train someone else’s LLM” are two very different things

@chucker @nitashatiku The bulk of the data comes from the Common Crawl, an open source project to crawl sites which have robots.txt open to doing so.

https://commoncrawl.org

Read their process and reason for existing. If you disagree with it, that’s okay too, but they’re really open about it all.

It’s not like OpenAI or other organizations did the crawl themselves (for the most part, afaik). They’re relying on these open data projects.

It’s easy to block it if you want to, also.
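As a rough illustration of what blocking looks like: Common Crawl's crawler identifies itself as `CCBot`, so a robots.txt can disallow that agent while leaving everyone else alone. The rules below are an example, not anything quoted in the thread, and the Python wrapper (with the hypothetical helper `is_allowed`) only checks how a standards-following parser would read them. robots.txt is advisory and does nothing about pages already collected in earlier crawls.

```python
from urllib.robotparser import RobotFileParser

# Example rules (an assumption for illustration): block Common Crawl's
# crawler, allow all other user agents.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url="https://example.com/"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("CCBot", "Googlebot"):
        print(agent, is_allowed(ROBOTS_TXT, agent))
```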


@chucker @cypherfox @nitashatiku Indeed: Getty Images found its watermark displayed prominently in many “AI art” (strong quotes) all over the Internet.
#AItheft
@nitashatiku i don't have them handy, but that makes me want to prompt "copyright AFP" or something like it to various AI image generators.
@nitashatiku great work!!! This is awesome.
@nitashatiku Note another post shares a gift link, not a paywall: https://elk.zone/mastodon.garden/@kevinschaul/110225457242842258
Kevin Schaul (@kevinschaul@mastodon.garden)

Just published this analysis on the websites that power AI chatbots. We looked specifically at C4, used by Google, Facebook and others. Some of my favorite findings:
- RT.com (Russian propaganda) is the 65th-ranked site in here
- Two of the top 100 sites are voter registration databases with names, addresses, party (why?)
- More than 500,000 personal blogs, including mine
- "©" appears > 200 million times
More interesting nuggets in here: https://wapo.st/3AcdDUm

Mastodon Garden
@nitashatiku do you have the breakdown of volume from each source, instead of the unique token contribution? There could be a factor of word frequency skewing the insight here on importance. (I.e., some sites with fewer unique tokens may still be contributing much more to the model’s probabilities due to volume and token repetition.)
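The distinction raised here, unique tokens versus total volume, can be made concrete with a toy example. This is a minimal sketch with assumptions throughout: the site names and documents are invented, and tokens are just lowercased whitespace-separated words, which is not how C4 or the Post's analysis tokenized anything.

```python
from collections import Counter


def token_stats(docs_by_site):
    """Return total and unique token counts per site.

    Tokens are lowercased whitespace-separated words (a deliberate
    simplification for illustration).
    """
    stats = {}
    for site, docs in docs_by_site.items():
        counts = Counter(tok for doc in docs for tok in doc.lower().split())
        stats[site] = {"total": sum(counts.values()), "unique": len(counts)}
    return stats


if __name__ == "__main__":
    # Hypothetical data: one site repeats the same short text many times,
    # the other contributes a single, more varied document.
    toy = {
        "repetitive.example": ["the cat sat"] * 3,
        "varied.example": ["one two three four"],
    }
    for site, s in token_stats(toy).items():
        print(site, s)
```

In this toy data the repetitive site has fewer unique tokens (3 versus 4) but more than twice the total volume (9 versus 4), which is the kind of skew the question is asking about.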
@nitashatiku So sorry, paywalled, can't read.
@nitashatiku What’s a ‘pirated site’?
@ravigupta my poor shorthand for: an online market for pirated or counterfeit goods. The Office of the U.S. Trade Representative, which identifies them in an annual report, calls them "notorious markets."
@nitashatiku just fantastic work thank you!!
@nitashatiku "the copyright symbol appears >200M times" 😂 🙈
@nitashatiku I'm really enjoying your coverage of AI in the Washington Post. This new piece is another banger https://www.washingtonpost.com/technology/2023/07/05/ai-apocalypse-college-students/ Keep it up!
How elite schools like Stanford became fixated on the AI apocalypse

Student-led groups focused on AI Safety have popped up at Stanford University and other schools, backed by billionaires fixated on the AI apocalypse

The Washington Post