If you've been blocking AI from scraping your website, there's another one to add. This time it's Google.

I've updated my post on the subject.

https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/

#ThisShouldBeOptIn

Block the Bots that Feed “AI” Models by Scraping Your Website – Neil Clarke

@clarkesworld

Great post, thanks for all the research. Maybe add FacebookBot to block Meta’s efforts?

“FacebookBot crawls public web pages to improve language models for our speech recognition technology.”

https://developers.facebook.com/docs/sharing/bot
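For anyone following along, the corresponding robots.txt rule would look like this (assuming Meta honors the token name from its own documentation):

```
User-agent: FacebookBot
Disallow: /
```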

@clarkesworld Also I added a link from my somewhat popular #Django robots.txt post:

https://adamj.eu/tech/2020/02/10/robots-txt/

How to add a robots.txt to your Django site - Adam Johnson

robots.txt is a standard file to communicate to “robot” crawlers, such as Google’s Googlebot, which pages they should not crawl. You serve it on your site at the root URL /robots.txt, for example https://example.com/robots.txt.