important info if you have a blog, content website, etc. that you don't want used for OpenAI's data vacuuming

➡️ add this to robots.txt
User-agent: GPTBot
Disallow: /

➡️ or block these IP ranges
20.15.240.64/28
20.15.240.80/28
20.15.240.96/28
20.15.240.176/28
20.15.241.0/28
20.15.242.128/28
20.15.242.144/28
20.15.242.192/28
40.83.2.64/28
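
e.g. on Apache 2.4 with mod_authz_core, a minimal sketch of blocking those ranges at the server might look like this (adjust for your own setup):

<RequireAll>
    # allow everyone except GPTBot's published ranges
    Require all granted
    Require not ip 20.15.240.64/28
    Require not ip 20.15.240.80/28
    Require not ip 20.15.240.96/28
    Require not ip 20.15.240.176/28
    Require not ip 20.15.241.0/28
    Require not ip 20.15.242.128/28
    Require not ip 20.15.242.144/28
    Require not ip 20.15.242.192/28
    Require not ip 40.83.2.64/28
</RequireAll>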

🔗 https://platform.openai.com/docs/gptbot

@filipw Added to robots.txt, thanks for sharing that!

(Would add IP blocks, too, but I don't think either GitHub Pages or Netlify support that except on very expensive plans.)

@xgranade I hope they get forced (European Commission perhaps?) to have this as opt-in, not opt-out!
@filipw Yeah, I get indexing, but having all of my writing and presentations scraped to help people replace me entirely? I can't say that I'm alright with that!

@filipw it would be nice to have the same for Google and other invasive ones.

Or if we could sue them for that kind of behaviour ...

Google Crawler (User Agent) Overview | Google Search Central

Google crawlers discover and scan websites. This overview will help you understand the common Google crawlers, including the Googlebot user agent.

@filipw just added to my robots.txt, thanks

@filipw

to be clear: robots.txt is processed from top to bottom, so that line needs to go before any User-agent: * blocks.

Do I have that right?
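
For example, would GPTBot still be blocked with something like this, where the wildcard group comes first?

User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /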

@filipw
Of course, by now, all the data has been sucked up, and the models are trained 😅 😂

@filipw Well, this part is also very interesting: https://platform.openai.com/docs/gptbot/customize-gptbot-access
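
Per those docs you can also scope GPTBot per directory instead of blocking everything outright; something like this (the directory names here are just placeholders):

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/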

What would happen if a lot of people just dumped a lot of random, meaningless stuff into a specific file for ChatGPT (probably something that "looks like" meaningful text)? xD

@filipw
Is there any reason to think they will honor this? Does OpenAI do the scraping themselves? They used some sort of ethics washing by paying universities to do the scraping, right? "Fair use" because of research.
@filipw Thanks for the IP ranges. Toast!
@filipw why not both?

@filipw since the bots are actively scraping, I think both (as @genehack suggested) plus redirections such as…

RewriteCond %{HTTP_USER_AGENT} " +https://openai."
RewriteRule . - [R=451]
RewriteCond %{HTTP_USER_AGENT} "GPTBot/"
RewriteRule . - [R=451]

… to get them if they acquire new IPs. (Perhaps exclude robots.txt from that rule, who knows.)
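
For that robots.txt exclusion, a minimal sketch, still assuming Apache mod_rewrite as above, would be an extra condition in front of each rule:

# skip the block for robots.txt itself
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{HTTP_USER_AGENT} "GPTBot/"
RewriteRule . - [R=451]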

@filipw Thank you, but any determined scraper will also scrape from sites and APIs such as the Internet Archive, archive.is, Google's and Cloudflare's cached copies of websites, or just outright buy data from other scraper individuals/companies anyway.
@filipw I wish licenses were enforceable. This post is available for inclusion in language model training for a fee of $1,000,000 (one million USD) per output generated by such model.
@filipw All those 20.*.*.* blocks are new since I first found out about this last week…
@filipw what they don't tell you: they're already scraping like wild, and robots.txt isn't rechecked…