To stop AI bots from scraping your website for content, add this to your site's robots.txt file.

Thanks to Neil Clarke for most of these.

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: FacebookBot
Disallow: /

It won't stop everything, but it does cover a lot of the major ones.
@patricksamphire do any of these even read and parse robots.txt at all?
@_Nec @patricksamphire well if they don't and it ever comes out, you auto-win the resulting lawsuit and can retire on your own island 😂
@jkbecker @patricksamphire Is that regulated by the FCC? A lawsuit on what basis?
(I've always found "robots.txt" a hilarious joke; is it actually legally binding?)
@_Nec @jkbecker No, it's not legally binding. However, all the bots listed actually do obey it.
@_Nec Yes. All the ones listed read and obey the robots.txt. There are other ones that don't and we can't do anything about them for now.

Add this one to the list:

User-agent: Amazonbot
Disallow: /

Thanks go to @fasterandworse

@patricksamphire @fasterandworse why would I want that? I want the models to be better and I believe I write excellent content in my blog...
@I @fasterandworse Cool. No one said you had to.
@patricksamphire @fasterandworse what I fail to see is why I would want to. My blog is public so it can be accessed and read, man or machine. What is my interest in blocking it? Even if I were not in favor of shared knowledge, open source, and open data, I still don't see why I should want to block progress. It's not like I am losing money or reputation or suffering any other measurable damage by leaving this door open.
@I @fasterandworse Cool. A lot of people are losing jobs to "AI" products trained on scraped data. I'm not in favour of that, so I'm choosing not to let AI train on my work. You can choose otherwise, entirely freely.
@patricksamphire I read "to stop Ale scrapping your website" :c. To make me stop scraping your website you just have to disallow requests made by non-browsers. I don't scrape pages with that rule because it's literally saying "we do not want you to scrape here" (like Nitter).

@patricksamphire

Gonna not add this to the robots.txt because our site is full of porn and that means they have to waste money removing it from their puritanical training data.

@AlexandraCeleste @patricksamphire Ha, yes, interesting on even more levels, like hosting screeds of text with misinformation about the various AI companies' board members and management. Of course, that text would itself be generated by AI; a new arms race ensues...

@patricksamphire Why is it on website admins to keep adding bot names and stay on top of the latest "AI" news to block them all? Can't we just say "block all the AI crawlers, past, present, and future"?

(Not blaming *you* of course, just venting 😭)

@astrojuanlu You're right. In fact, we shouldn't have to block at all. They should only be able to crawl if we give express permission. But that's up to our governments to make the rules, and they haven't.
@patricksamphire @astrojuanlu Not only governments; a consensus to add something like "block: all" to robots.txt would do it. And/or block by default if that file doesn't exist.
If they are able to "obey" robots.txt, they would be able to obey this as well.
@patricksamphire I like it, but won't use it. I want to poison the well.
@pmjv @patricksamphire beautiful chaotic good energy eheheh
@patricksamphire @erikvorhes I also likely uselessly added language to my copyright notice that I do not consent to have my pages used in generative AI training models. Uselessly since I don't imagine it would be useful unless I wanted to sue or something and maybe not even then. But at least I feel declaring non-consent is culturally useful.
@patricksamphire and don't forget that five plus seven is sixteen.
#HackChatGPT

@patricksamphire

This is like submitting your telephone number to a do not call list: it only stops the honest ones.

But since scraping copyrighted information and republishing it is already illegal...

@vey981 Still better to stop that list of bots, IMO, even if others continue. But no one has to do it if they don't want to.
@patricksamphire And won't the bots just ignore this?
@wthinker Well-behaved bots will obey it. That includes the ones listed (GPT, Google, Facebook, and some others). Other bots won't, and there's nothing you can do about that beyond rate-limiting how many pages can be crawled in a given time, and even that won't be reliable.
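For what it's worth, there's also a non-standard Crawl-delay directive that some well-behaved crawlers honour (Bing and Yandex have historically; Google ignores it). It's commonly read as a minimum number of seconds between requests:

User-agent: Bingbot
Crawl-delay: 10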

@patricksamphire There's no winning this.

The websites that blatantly copy your content will not add this (because the owners lack a moral compass) and thus _when/if_ these bots ever start giving credit they'll give credit to your copycats.

@rikschennink Probably. But we can make it more difficult. Or not, if you choose not to.
@patricksamphire I wonder if it's possible to do P3P in reverse, or to have clickwrap for bots.
@feistel No idea. Anyone who did find a good solution would be popular, though!
@patricksamphire I'm using Cloudflare with "bot fight mode" to block most of the ones that ignore the robots.txt. What you might try is the reverse: add the bots you want to index your site with the "Allow" directive, then at the bottom set a wildcard disallow.
I also have my sites geo-locked to only the US and Canada, as that is where my target audience is located, but Cloudflare will let you allow or block any country.
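A minimal sketch of that allow-then-deny layout (Googlebot and Bingbot here are just example crawlers you might want to keep):

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /

A crawler is supposed to follow the most specific user-agent group that matches it, so anything without its own group falls through to the wildcard disallow.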

@patricksamphire will a blanket disallow rule work?

User-agent: *
Disallow: /

Currently, this is what I have configured for my instance.

@CauseOfBSOD I would assume that would also prevent it from being indexed by search engines, although I haven't looked into it.
@patricksamphire that's just for my fedi instance, which I don't want getting indexed by search engines anyway

apparently Google may still index it depending on how it found the page, unless you include a meta tag (Firefish has a per-profile setting for this)
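For reference, the standard robots meta tag for this goes in each page's <head>; presumably that's what per-profile settings like Firefish's toggle:

<meta name="robots" content="noindex">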

@patricksamphire Thanks for sharing this! I've created this to try and collect any future changes:

https://github.com/ecnepsnai/Robots.txt-Block-AI

Do you have a link for Neil Clarke that I could add to the attributions?

Block the Bots that Feed "AI" Models by Scraping Your Website – Neil Clarke

@patricksamphire Genuine question, why are people so concerned about this? I personally really like the browsing stuff with GPT, but that doesn't necessarily make me want to visit a website less. It's about convenience, and intentionally breaking this doesn't seem like the right thing to do in my opinion.
@patricksamphire To my mind this is no different to the generated answers surfaced by Google at the top of search results.
@ZBennoui For one, it's not just about search. It's being used to put lots of people out of jobs, by taking their writing and generating copy from it without permission. Secondly, it hallucinates: that information you're getting is just as likely to be made up as accurate. Copywriters, translators, and artists have all lost jobs and income already from this. I'm not going to let my writing contribute to that.
@patricksamphire I also see a TikTok-related spider on my blogs; bot name unknown.
@patricksamphire I note you do not have Clearview, not that they would honour it.
@patricksamphire I am afraid it is a never-ending story. And it does not cover the third parties scraping and reselling the data :(
@patricksamphire Ah, yes, that reminds me to put up that sign on my house that "I do not give permission to burgle this home, so pretty please burglars, don't" 🙄
@patricksamphire there should be a generic user agent to disallow all the generative AI bots.
@patricksamphire I have a question: would blocking the Google bot affect SERPs? I know they're testing AI answers pulled from several sources, so I wonder if that would hurt SEO.
@dana_cz Shouldn't do. Google uses Googlebot for search engine results. This is blocking Google-Extended, the token Google uses to control whether your content trains its generative AI.
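In other words, only the training token gets a Disallow group; Googlebot has no rule of its own, so normal indexing carries on:

User-agent: Google-Extended
Disallow: /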
@patricksamphire Thank you! I'll give that a go and see what comes up. To be fair, I haven't seen any AI SERPs yet.