If you've been blocking AI from scraping your website, there's another one to add. This time it's Google.

I've updated my post on the subject.

https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/

#ThisShouldBeOptIn

I've modified this post to reflect that Google-Extended ISN'T a bot. It's a way of letting Google know you don't want your site used in this manner. It MUST be in your robots.txt to work at all.
@clarkesworld When did they add this? I swear I don't remember it from last time I went through https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers looking for "bots" to block
@clarkesworld Note that according to the linked Google article, "Google-Extended" isn't an actual crawler with its own user agent string - rather, it's a name that Google's existing crawlers interpret as a request to opt out of AI when they see it in robots.txt. So blocking it at the firewall etc would have no effect.
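To illustrate, the opt-out is just a plain robots.txt entry (a minimal example; the blanket Disallow is one choice, you could scope it to specific paths instead):

```
# Google-Extended is a robots.txt token, not a crawler with its own
# user agent: Google's existing crawlers read it here and treat it
# as an AI-training opt-out. Normal search indexing is unaffected.
User-agent: Google-Extended
Disallow: /
```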
@jmorahan Hmmm. Ok. Updating document.

@clarkesworld

Thank you for this. I'm not entirely sure if it's necessary for me if I use GitHub Pages, but I added the file nonetheless.

@robertoqs @clarkesworld Also, does it work when the robots.txt file is in a sub-folder and not the root?
@eklem @clarkesworld

I think it has to be in the root directory.

@robertoqs @clarkesworld Thanks. I created a repo with the name [username].github.io as suggested by @hugovk and put a robots.txt in it, so I should be covered now.

@eklem @clarkesworld @hugovk

Well, if you were already using GitHub Pages, you already had a repository named [username].github.io, if I'm not mistaken. The point about robots.txt is to add it to the root directory, i.e. the top level of the repository's publishing branch (usually main) on GitHub.

@robertoqs @eklem @clarkesworld

I didn't have a [username].github.io repo until I created it this morning. But I did have other repos using GitHub Pages, and they are served at [username].github.io/other-repo

But until I created [username].github.io with robots.txt, there was nothing at [username].github.io/robots.txt for the others. As you say, it must be at the root.
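In other words, the layout that ends up working looks like this (a sketch; other-repo stands in for any project repo):

```
[username].github.io/            user-site repo, served at the domain root
├── robots.txt                   visible at https://[username].github.io/robots.txt
└── index.html

[username].github.io/other-repo  project site, served under a sub-path;
                                 a robots.txt inside it would not be honored
```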

Some links to docs: https://mastodon.social/@hugovk/111146631566244808

@hugovk @eklem @clarkesworld

Ah, I see. Then [username].github.io is only required when not using a custom domain, like in my website's case. That's what I was thinking about.

So what I did was simply to put robots.txt next to my HTML files. Also, thanks for the documentation. I love Stack Overflow, by the way.

@clarkesworld Google joining the 'bot party' is hardly surprising.

@clarkesworld

Great post, thanks for all the research. Maybe add FacebookBot to block Meta’s efforts?

“FacebookBot crawls public web pages to improve language models for our speech recognition technology.”

https://developers.facebook.com/docs/sharing/bot
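If you do want to add it, the robots.txt entry follows the usual pattern (Disallow: / blocks everything; adjust the path to taste):

```
User-agent: FacebookBot
Disallow: /
```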

@clarkesworld Also I added a link from my somewhat popular #Django robots.txt post:

https://adamj.eu/tech/2020/02/10/robots-txt/
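If you serve the site from an app rather than static files, a minimal sketch of generating the robots.txt content in Python might look like this (the bot list and function names are illustrative; in Django you would return this string from a view as an HttpResponse with content_type="text/plain", as the post above describes):

```python
# Illustrative list of AI-related bots to disallow; adjust to taste.
AI_BOTS = {
    "GPTBot": "/",           # OpenAI's crawler (example entry)
    "Google-Extended": "/",  # Google's AI-training opt-out token
    "FacebookBot": "/",      # Meta's language-model crawler
}

def render_robots_txt(rules):
    """Render one User-agent/Disallow pair per bot, blank-line separated."""
    lines = []
    for agent, path in rules.items():
        lines.append(f"User-agent: {agent}")
        lines.append(f"Disallow: {path}")
        lines.append("")
    return "\n".join(lines)

print(render_robots_txt(AI_BOTS))
```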


@clarkesworld Thanks! Three were already there and I added the other two.
@clarkesworld Thanks fam. Glad to see this mag is still running btw.
@clarkesworld Thanks! That got me around to adding my robots.txt
@clarkesworld I'm pretty much a #Firefox guy now; hoping there's a difference.