important info if you have a blog, content website, etc. that you don't want used for OpenAI's data vacuuming

➡️ add this to robots.txt
User-agent: GPTBot
Disallow: /

➡️ or block these IP ranges
20.15.240.64/28
20.15.240.80/28
20.15.240.96/28
20.15.240.176/28
20.15.241.0/28
20.15.242.128/28
20.15.242.144/28
20.15.242.192/28
40.83.2.64/28
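
e.g. on Apache 2.4 with mod_authz_core, a minimal sketch of blocking those ranges at the server might look like this (adjust for your own setup):

<RequireAll>
    # allow everyone except GPTBot's published ranges
    Require all granted
    Require not ip 20.15.240.64/28
    Require not ip 20.15.240.80/28
    Require not ip 20.15.240.96/28
    Require not ip 20.15.240.176/28
    Require not ip 20.15.241.0/28
    Require not ip 20.15.242.128/28
    Require not ip 20.15.242.144/28
    Require not ip 20.15.242.192/28
    Require not ip 40.83.2.64/28
</RequireAll>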

🔗 https://platform.openai.com/docs/gptbot

@filipw Added to robots.txt, thanks for sharing that!

(Would add IP blocks, too, but I don't think either GitHub Pages or Netlify support that except on very expensive plans.)

@xgranade I hope they get forced (European Commission perhaps?) to have this as opt-in, not opt-out!
@filipw Yeah, I get indexing, but having all of my writing and presentations scraped to help people replace me entirely? I can't say that I'm alright with that!

@filipw it would be nice to have the same for Google and other invasive ones.

Or if we could sue them for that kind of behaviour ...

Google Crawler (User Agent) Overview | Google Search Central

Google crawlers discover and scan websites. This overview will help you understand the common Google crawlers, including the Googlebot user agent.

@filipw just added to my robots.txt, thanks

@filipw

to be clear: robots.txt is processed from top to bottom, so that line needs to go before any User-agent: * blocks.

Do I have that right?
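
For example, would GPTBot still be blocked with something like this, where the wildcard group comes first?

User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /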

@filipw
Of course, by now, all the data has been sucked up, and the models are trained 😅 😂

@filipw Well, this part is also very interesting: https://platform.openai.com/docs/gptbot/customize-gptbot-access
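
Per those docs you can also scope GPTBot per directory instead of blocking everything outright; something like this (the directory names here are just placeholders):

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/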

What would happen if a lot of people just dumped a lot of random, meaningless stuff into a specific file for ChatGPT (probably something that "looks like" meaningful text)? xD

@filipw
Is there any reason to think they will honor this? Does OpenAI do the scraping themselves? They used some sort of ethics washing by paying universities to do the scraping, right? "Fair use" because of research.
@filipw Thanks for the IP ranges. Toast!
@filipw why not both?

@filipw since the bots are actively scraping, I think both (as @genehack suggested) plus redirections such as…

RewriteCond %{HTTP_USER_AGENT} " +https://openai."
RewriteRule . - [R=451]
RewriteCond %{HTTP_USER_AGENT} "GPTBot/"
RewriteRule . - [R=451]

… to get them if they acquire new IPs. (Perhaps exclude robots.txt from that rule, who knows.)
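
For that robots.txt exclusion, a minimal sketch, still assuming Apache mod_rewrite as above, would be an extra condition in front of each rule:

# skip the block for robots.txt itself
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{HTTP_USER_AGENT} "GPTBot/"
RewriteRule . - [R=451]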

@filipw Thank you, but any determined scraper will also scrape from sites and APIs such as the Internet Archive, archive.is, Google's and Cloudflare's cached copies of websites, or just outright buy data from other scraper individuals/companies anyway.
@filipw I wish licenses were enforceable. This post is available for inclusion in language model training for a fee of $1,000,000 (one million USD) per output generated by such model.
@filipw All those 20.*.*.* blocks are new since I first found out about this last week…
@filipw what they don't tell you: they're already scraping like wild, and robots.txt isn't rechecked…