We found
- the copyright symbol appears >200M times
- pirated sites, including one for e-books
- half of the top 10 were news sites
https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning
@nitashatiku If you’re going to do this, why don’t you retrieve the ‘robots.txt’ from each site? See how many of them (1) have one, (2) don’t disallow bots, and (3) have _explicit sitemaps_ pointing to their content.
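For what it’s worth, that check is scriptable in a few lines. A rough sketch using only Python’s stdlib; “CCBot” (Common Crawl’s crawler) is just my example user agent, swap in whichever bot you care about:

```python
# Rough sketch: for each domain, answer (1) is there a robots.txt,
# (2) does it let this bot in, (3) does it advertise explicit sitemaps?
import urllib.request
import urllib.robotparser

def check_site(domain: str, agent: str = "CCBot") -> dict:
    url = f"https://{domain}/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError:
        # no robots.txt reachable (404, network error, timeout, ...)
        return {"has_robots_txt": False, "allows_bot": None, "sitemaps": None}
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(body.splitlines())
    return {
        "has_robots_txt": True,
        "allows_bot": rp.can_fetch(agent, f"https://{domain}/"),
        "sitemaps": rp.site_maps(),  # list of Sitemap: URLs, or None
    }

for site in ["example.com", "wikipedia.org"]:
    print(site, check_site(site))
```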
The bots were invited in. You may hate it now, but they were invited.
That’s because _this is how it works_. Folks wanted SEO, so they offered up their content to be found.
I get the frustration, I really do, but it’s super-clear to me: we invited the bots in to read our content and…they did.
@chucker @nitashatiku The bulk of the data comes from Common Crawl, an open data project that crawls sites whose robots.txt permits it.
Read about their process and their reasons for existing. If you disagree with them, that’s okay too, but they’re really open about it all.
It’s not like OpenAI or other organizations did the crawl themselves (for the most part, afaik). They’re relying on these open data projects.
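They’re open enough that you can check whether (and when) your own site shows up: their CDX index has a public query API. A sketch; “CC-MAIN-2023-14” is just one crawl snapshot, the full list is at https://index.commoncrawl.org/:

```python
# Rough sketch: query Common Crawl's public CDX index for captures of a
# domain. The server answers with one JSON object per captured URL.
import json
import urllib.parse
import urllib.request

def cc_captures(domain: str, crawl: str = "CC-MAIN-2023-14") -> list[dict]:
    query = urllib.parse.urlencode({"url": f"{domain}/*", "output": "json"})
    url = f"https://index.commoncrawl.org/{crawl}-index?{query}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return [json.loads(line) for line in resp.read().splitlines()]

# note: the server returns 404 if a domain has no captures in that crawl
for rec in cc_captures("example.com")[:5]:
    print(rec["timestamp"], rec["url"], rec.get("status"))
```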
It’s also easy to block if you want to.
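Two lines of robots.txt do it for Common Crawl (their crawler identifies itself as “CCBot”). A sketch that also sanity-checks the rule with the stdlib parser:

```python
# The robots.txt rule that shuts Common Crawl's bot out entirely,
# verified with Python's stdlib parser.
import urllib.robotparser

ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())
print(rp.can_fetch("CCBot", "https://example.com/any/page"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/"))      # True: rule only targets CCBot
```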