We found:
- the copyright symbol appears >200M times
- pirated sites, 1 for e-books
- half of the top 10 were news sites
https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning
@nitashatiku If you’re going to do this, why don’t you retrieve the ‘robots.txt’ from each site? See how many of them (1) have one, (2) don’t disallow bots, and (3) have _explicit sitemaps_ to content.
The bots were invited in. You may hate it now, but they were invited.
That’s because _this is how it works_. Folks wanted SEO, so they offered up their content to be found.
I get the frustration, I really do, but it’s super-clear to me: we invited the bots in to read our content and…they did.
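For anyone who wants to try the robots.txt check suggested above, here is a rough sketch in Python using the standard library's `urllib.robotparser`. The function names (`summarize_robots`, `check_site`), the default `"*"` user agent, and the choice to test only the site root are my own assumptions for illustration, not anything from the article. It also treats any fetch failure as "no robots.txt", which is a simplification.

```python
from urllib.error import URLError
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser


def summarize_robots(robots_text, site_url, user_agent="*"):
    """Answer questions (2) and (3) for robots.txt content already in hand."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    root = site_url.rstrip("/") + "/"
    return {
        # (2) may this user agent fetch the site root?
        "allows_root": parser.can_fetch(user_agent, root),
        # (3) explicit Sitemap: lines (site_maps() returns None if absent)
        "sitemaps": parser.site_maps() or [],
    }


def check_site(site_url, user_agent="*", timeout=10):
    """Fetch robots.txt for one site and answer questions (1), (2), and (3)."""
    robots_url = site_url.rstrip("/") + "/robots.txt"
    try:
        with urlopen(robots_url, timeout=timeout) as resp:
            text = resp.read().decode("utf-8", errors="replace")
    except (URLError, OSError):
        # Assumption: any failure (404, DNS error, timeout) counts as "no robots.txt".
        return {"has_robots": False, "allows_root": None, "sitemaps": []}
    # (1) the file exists and was readable
    return {"has_robots": True, **summarize_robots(text, site_url, user_agent)}
```

Run `check_site("https://example.com")` over a list of domains and tally the three fields. Checking only the root path is crude; a site can allow `/` and still disallow most of its content, so a real survey would want to test representative URLs too.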
stop defending the rich criminals, they'll never reward you for it
@troglodyt @nitashatiku Hah; you’re funny, I like it! 🤣
I don’t care about them; I care about the technology, and the law. Making crawling illegal because of copyright would make LLM technology (and search engines and other things I find deeply valuable) impossible, and I’m not okay with that.
Plus it’s always good to know what the law IS, rather than what you feel the law SHOULD be. See Field v. Google, Inc. for a useful example on crawling/scraping/indexing.
But you know…you do you. 👋
i think you should stop defending rich criminals, it's embarrassing and bad
you can spend a week defending the poor shoplifting, that ought to change your perspective a bit