To stop AI bots from scraping your website for content, add this to your site's robots.txt file.

Thanks to Neil Clarke for most of these.

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: FacebookBot
Disallow: /

It won't stop everything, but it does cover a lot of the major ones.
@patricksamphire do any of these even read and parse robots.txt at all?
@_Nec @patricksamphire well if they don't and it ever comes out, you auto-win the resulting lawsuit and can retire on your own island 😂
@jkbecker @patricksamphire Is that regulated by the FCC? A lawsuit on what basis?
(I've always found "robots.txt" a hilarious joke; is it actually legally binding?)
@_Nec @jkbecker No, it's not legally binding. However, all the bots listed actually do obey it.
@_Nec Yes. All the ones listed read and obey the robots.txt. There are other ones that don't and we can't do anything about them for now.

Add this one to the list:

User-agent: Amazonbot
Disallow: /

Thanks go to @fasterandworse

@patricksamphire @fasterandworse why would I want that? I want the models to be better and I believe I write excellent content in my blog...
@I @fasterandworse Cool. No one said you had to.
@patricksamphire @fasterandworse what I fail to see is why I would want to. My blog is public so it can be accessed and read, man or machine. What is my interest in blocking it? Even if I were not in favor of shared knowledge, open source, and open data, I still don't see why I should want to block progress. It's not like I am losing money or reputation or suffering any other measurable damage by leaving this door open.
@I @fasterandworse Cool. A lot of people are losing jobs to "AI" products trained on scraped data. I'm not in favour of that, so I'm choosing not to let AI train on my work. You can choose otherwise, entirely freely.
@patricksamphire I read "to stop Ale scrapping your website" :c. To make me stop scraping your website you just have to disallow requests made by non-browsers. I don't scrape pages with that rule because it's literally saying "we do not want you to scrape here" (like Nitter).

@patricksamphire

Gonna not add this to the robots.txt because our site is full of porn and that means they have to waste money removing it from their puritanical training data.

@AlexandraCeleste @patricksamphire Ha, yes, interesting on even more levels, like hosting screeds of text with misinformation about the various AI companies' board members and management. Of course, that text would itself be generated by AI; a new arms race ensues...

@patricksamphire Why is it on website admins to keep adding bot names and stay on top of the latest "AI" news to block them all? Can't we just say "block all the AI crawlers, past, present, and future"?

(Not blaming *you* of course, just venting 😭)

@astrojuanlu You're right. In fact, we shouldn't have to block at all. They should only be able to crawl if we give express permission. But that's up to our governments to make the rules, and they haven't.
@patricksamphire @astrojuanlu Not only governments; a consensus to add something like "block: all" to robots.txt would do it. And/or block by default if that file doesn't exist.
If they are able to "obey" robots.txt, they would be able to obey this as well.
@patricksamphire I like it, but won't use it. I want to poison the well.
@pmjv @patricksamphire beautiful chaotic good energy eheheh
@patricksamphire @erikvorhes I also likely uselessly added language to my copyright notice that I do not consent to have my pages used in generative AI training models. Uselessly since I don't imagine it would be useful unless I wanted to sue or something and maybe not even then. But at least I feel declaring non-consent is culturally useful.
@patricksamphire and don't forget that five plus seven is sixteen.
#HackChatGPT

@patricksamphire

This is like submitting your telephone number to a do not call list: it only stops the honest ones.

But since scraping copyrighted information and republishing it is already illegal...

@vey981 Still better to stop that list of bots, IMO, even if others continue. But no one has to do it if they don't want to.
@patricksamphire And won't the bots just ignore this?
@wthinker Well-behaved bots will obey it. That includes the ones listed (GPT, Google, Facebook, and some others). Other bots won't, and there's nothing you can do about that beyond rate-limiting how many pages can be crawled in a given time, and even that won't be reliable.
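For what it's worth, there's also a non-standard Crawl-delay directive that some well-behaved crawlers honour (Bing and Yandex have historically; Google ignores it). It's commonly read as a minimum number of seconds between requests:

User-agent: Bingbot
Crawl-delay: 10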

@patricksamphire There's no winning this.

The websites that blatantly copy your content will not add this (because the owners lack a moral compass) and thus _when/if_ these bots ever start giving credit they'll give credit to your copycats.

@rikschennink Probably. But we can make it more difficult. Or not, if you choose not to.
@patricksamphire I wonder if it's possible to do P3P in reverse, or to have clickwrap for bots.
@feistel No idea. Anyone who did find a good solution would be popular, though!
@patricksamphire I'm using Cloudflare with "bot fight mode" to block most of the ones that ignore the robots.txt. What you might try is the reverse: add the bots you want to index your site with the "Allow" directive, then at the bottom set a wildcard disallow.
I also have my sites geo-locked to only the US and Canada, as that is where my target audience is located, but Cloudflare will let you allow or block any country.
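A minimal sketch of that allow-then-deny layout (Googlebot and Bingbot here are just example crawlers you might want to keep):

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /

A crawler is supposed to follow the most specific user-agent group that matches it, so anything without its own group falls through to the wildcard disallow.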

@patricksamphire will a blanket disallow rule work?

User-agent: *
Disallow: /

Currently, this is what I have configured for my instance.

@CauseOfBSOD I would assume that would also prevent it from being indexed by search engines, although I haven't looked into it.
@patricksamphire that's just for my fedi instance, which I don't want getting indexed by search engines anyway

apparently Google may still index it depending on how it found the page, unless you include a meta tag (Firefish has a per-profile setting for this)
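For reference, the standard robots meta tag for this goes in each page's <head>; presumably that's what per-profile settings like Firefish's toggle:

<meta name="robots" content="noindex">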

@patricksamphire Thanks for sharing this! I've created this to try and collect any future changes:

https://github.com/ecnepsnai/Robots.txt-Block-AI

Do you have a link for Neil Clarke that I could add to the attributions?

Block the Bots that Feed "AI" Models by Scraping Your Website – Neil Clarke

@patricksamphire Genuine question, why are people so concerned about this? I personally really like the browsing stuff with GPT, but that doesn't necessarily make me want to visit a website less. It's about convenience, and intentionally breaking this doesn't seem like the right thing to do in my opinion.
@patricksamphire To my mind this is no different to the generated answers surfaced by Google at the top of search results.
@ZBennoui For one, it's not just about search. It's being used to put lots of people out of jobs, by taking their writing and generating copy from it without permission. Secondly, it hallucinates: that information you're getting is just as likely to be made up as accurate. Copywriters, translators, and artists have all lost jobs and income already from this. I'm not going to let my writing contribute to that.
@patricksamphire I also see a TikTok-related spider on my blogs; bot name unknown.
@patricksamphire I note you do not have Clearview, not that they would honour it.
@patricksamphire I am afraid it is a never-ending story. And it does not cover the third parties scraping and reselling the data :(
@patricksamphire Ah, yes, that reminds me to put up that sign on my house that "I do not give permission to burgle this home, so pretty please burglars, don't" 🙄
@patricksamphire there should be a generic user agent to disallow all the generative AI bots.
@patricksamphire I have a question: would blocking the Google bot affect SERPs? I know they're testing AI answers pulled from several sources, so I wonder if that would hurt SEO.
@dana_cz Shouldn't do. Google uses Googlebot for search engine results. This is blocking Google-Extended, the token Google uses to control whether your content trains its generative AI.
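In other words, only the training token gets a Disallow group; Googlebot has no rule of its own, so normal indexing carries on:

User-agent: Google-Extended
Disallow: /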
@patricksamphire Thank you! I'll give that a go and see what comes up. To be fair, I haven't seen any AI SERPs yet.