If you've been blocking AI from scraping your website, there's another one to add. This time it's Google.

I've updated my post on the subject.

https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/

#ThisShouldBeOptIn


@clarkesworld

Thank you for this. I'm not entirely sure it's necessary in my case, since I use GitHub Pages, but I added the file nonetheless.

@robertoqs @clarkesworld Also, does it work when the robots.txt file is in a sub-folder and not in the root?
Block the AI bots · hugovk/hugovk.github.io@79a14a0

* https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/
* https://stackoverflow.com/a/47652485/724176

@eklem @clarkesworld

I think it has to be in the root directory.

@robertoqs @clarkesworld Thanks. I created a repo with the name [username].github.io as suggested by @hugovk and put a robots.txt in it, so I should be covered now.

@eklem @clarkesworld @hugovk

Well, if you were already using GitHub Pages, you already had a repository named [username].github.io, if I'm not mistaken. The point about robots.txt is to put it in the root directory of the branch GitHub Pages publishes from (typically main).

@robertoqs @eklem @clarkesworld

I didn't have a [username].github.io repo until I created it this morning. But I did have other repos using GitHub Pages, and they are served like [username].github.io/other-repo

But until I created [username].github.io with robots.txt, there was nothing at [username].github.io/robots.txt for the others. As you say, it must be at the root.
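For anyone following along, a minimal robots.txt in the spirit of Neil Clarke's post might look something like this (which user agents you block is up to you; GPTBot, Google-Extended, and CCBot are the crawlers for OpenAI, Google's AI training, and Common Crawl):

```
# Must be served from the site root, e.g. https://[username].github.io/robots.txt
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```

See the linked post for the full, up-to-date list of agents.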

Some links to docs: https://mastodon.social/@hugovk/111146631566244808

@hugovk @eklem @clarkesworld

Ah, I see. Then [username].github.io is only required when not using a custom domain, like in my website's case. That's what I was thinking about.

So all I did was put robots.txt next to my HTML files. Also, thanks for the documentation links. I love Stack Overflow, by the way.