If you've been blocking AI from scraping your website, there's another one to add. This time it's Google.
I've updated my post on the subject.
https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/
Thank you for this. I'm not entirely sure if it's necessary for me if I use GitHub Pages, but I added the file nonetheless.
@eklem @robertoqs @clarkesworld
For GitHub Pages, add robots.txt to a repo called <username>.github.io and then it will appear at <username>.github.io/robots.txt
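As a sketch of what that robots.txt might contain — the specific user agents listed here are examples from the AI-blocking discussion, not a complete or authoritative list:

```
# Google's AI-training crawler (the one discussed in the linked post)
User-agent: Google-Extended
Disallow: /

# Other AI-training crawlers commonly blocked (illustrative, not exhaustive)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```

Note that Google-Extended is a control token rather than a separate crawler: blocking it opts your content out of AI training without affecting normal Google Search indexing.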
For example:
https://github.com/hugovk/hugovk.github.io/commit/79a14a01d37d574e2a76127722cdaf25cc1b9293
https://github.com/hugovk/hugovk.github.io
https://hugovk.github.io/robots.txt
More info:
https://stackoverflow.com/a/47652485/724176
https://docs.github.com/en/pages/getting-started-with-github-pages/about-github-pages
#GitHub #GitHubPages #robotstxt
I think it has to be in the root directory.
Well, if you were already using GitHub Pages, you already had a repository named [username].github.io, if I'm not mistaken. The point about robots.txt is to add it to the root directory of whichever branch GitHub Pages publishes from (typically main).
@robertoqs @eklem @clarkesworld
I didn't have a [username].github.io repo until I created it this morning. But I did have other repos using GitHub Pages, and they are served at [username].github.io/other-repo
But until I created [username].github.io with robots.txt, there was nothing at [username].github.io/robots.txt for the others. As you say, it must be at the root.
Some links to docs: https://mastodon.social/@hugovk/111146631566244808
Ah, I see. Then [username].github.io is only required when not using a custom domain, which is the case for my website. That's what I was thinking about.
So what I did was simply put robots.txt next to my HTML files. Also, thanks for the documentation. I love Stack Overflow, by the way.
Great post, thanks for all the research. Maybe add FacebookBot to block Meta’s efforts?
“FacebookBot crawls public web pages to improve language models for our speech recognition technology.”
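If that quote from Meta's documentation is accurate, blocking it would just be two more lines in the same robots.txt (assuming FacebookBot honors the standard Disallow directive):

```
User-agent: FacebookBot
Disallow: /
```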
@clarkesworld Also I added a link from my somewhat popular #Django robots.txt post: