Speaking of which, hot new robots.txt entry just dropped:

User-agent: GPTBot
Disallow: /

https://platform.openai.com/docs/gptbot


@olivierlacan Alternatively, reply to that user agent with a blank page and 402 Payment Required
@jstepien @olivierlacan @brainwane Oh god, that is the obviously correct response, and now I have to figure out how to do it.
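A minimal, framework-agnostic sketch of that idea (the function name and agent list are made up for illustration, not any real API): check the User-Agent header and answer GPTBot with an empty body and a 402 status.

```python
# Sketch: answer GPTBot with a blank page and 402 Payment Required.
# BLOCKED_AGENTS and respond() are illustrative names, not a real framework.

BLOCKED_AGENTS = ("GPTBot",)

def respond(user_agent: str, page_body: str) -> tuple[int, str]:
    """Return (status_code, body) for a request with this User-Agent."""
    if any(bot in user_agent for bot in BLOCKED_AGENTS):
        return 402, ""         # blank page, 402 Payment Required
    return 200, page_body      # everyone else gets the real page
```

In a real server you'd wire this into middleware or an nginx rule rather than application code.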
@olivierlacan Can you copy the content of the page? It requires an account.
@olivierlacan wild that you need to log in to see this link lol
@olivierlacan Just applied to all pages on my site. Thanks.

@olivierlacan I just do this:

# Drop inbound traffic from OpenAI's published egress ranges at the firewall
echo "Blocking OpenAI egress IP ranges..."
iptables -A INPUT -s 23.102.140.112/28 -j DROP
iptables -A INPUT -s 23.98.142.176/28 -j DROP
@olivierlacan ironic that I needed to pass the Cloudflare “are you a human” check to see that web page
@olivierlacan Even better to return a bunch of GPT generated garbage to poison the model instead of just disallow
@Beldantazar @olivierlacan now that is a great idea. Have a single directory on your site that is a bunch of lorem ipsum or nonsense-language pages that you allow GPTBot to access. Surely some nice soul will make a generator for this soon. Just make sure to point the agent only at that dodgy directory. Of course, that assumes you trust them to do the right thing, and I think they’ve already demonstrated they shouldn’t be trusted.
@Danwwilson @olivierlacan i mean, you're already trusting them if you use the disallow anyway. But rather than return a bunch of lorem ipsum that would be easily detected, the ideal thing is to return ChatGPT-generated trash: that's harder for them to filter out, and AI models degrade fast when they feed on their own outputs. Ideally, have the same set of pages return either the normal data or the GPT data depending on the user agent, so it's even harder to detect.
@olivierlacan Would be a shame if a bunch of websites responded to this user agent with megabytes of nonsensical English text. That might screw up its model
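The cloaking idea above could be sketched like this (purely illustrative; `select_body` and its parameters are hypothetical, and the decoy text would be pre-generated rather than built per request):

```python
# Sketch: serve the same URL with different bodies depending on User-Agent.
# Real visitors get the page; the crawler gets pre-generated decoy text.
# Function and parameter names are hypothetical.

def select_body(user_agent: str, real_page: str, decoy_page: str) -> str:
    """Choose which body to serve for one URL."""
    if "GPTBot" in user_agent:
        return decoy_page   # poisoned copy; only the crawler ever sees it
    return real_page        # humans and other agents see the real content
```

Serving both variants from the same URL is what makes this harder to detect than a single obviously fake directory.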

@olivierlacan

How can I add that to Mastodon?

@olivierlacan @darren we need to extend the concept of robots.txt to machine-readable content licenses, specifically to disallow use for ML training.
@sminnee Do you think this is something that Creative Commons licensing could/should cover?
@candidexmedia that would potentially be a way forward, if they added “no ML training” as one of their pre-packaged options.

@olivierlacan
So:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

Any others I'm missing? What about the browser.ai and axiom.ai crapola? Are they good netizens who can be kept at bay via robots.txt?

@oblomov @olivierlacan

It would be nice to intercept them and send them to a script generating a page with a few MB of "you're full of shit".

@olivierlacan so if you now opt out, will they actually remove your data from their training model? Doubt it.
@olivierlacan and if people do get their data removed from AI, will it be like the end of 2001 as HAL has his modules removed? #DaisyDaisy
@olivierlacan I am happy that they are providing methods for web developers to opt out, which I think is really nice. I am hoping that SOURCES of training data (images and the like) implement something similar. It could become a factor in choosing which site to post stories or images on: does that site make a good-faith effort to prevent data from being used to train AI?
@olivierlacan unfortunate that this even has to happen 😞 it'd be nice if websites just weren't scraped for AI training, or it was strictly opt-in.
@olivierlacan This is cool but technically shouldn’t even be necessary. Anything you create is protected by copyright, automatically and implicitly. Even if the ChatGPT bot is free to lumber through your site, it is violating copyright law if it copies your stuff, uses it to train its LLMs, and then makes that derived work publicly available.
@olivierlacan @dblume so will they actually honor it?
@JetForMe @olivierlacan I'm often pessimistic and sarcastic. Instead, I'm going to be honest here: Yeah, I believe they will. I try to be charitable with voluntary claims about robots.txt.

@olivierlacan There are at least 3 people here with a similar idea... a script that feeds it an infinite number of generated, cross-linked nonsense pages, each linking to more of the same, wasting resources and poisoning the dataset.

They don't even need to be anything advanced. Any basic Markov library spitting out vaguely (but unhelpfully) semi-coherent nonsense would do. It just needs to look enough like language, while being garbage, wrapped in HTML tags.

Please! Create or boost!

#LLM #GPT #OpenAI
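As a sketch of that suggestion (all names invented; a real Markov library would do this better), here is a first-order, word-level Markov babbler that renders endlessly cross-linked junk pages:

```python
import html
import random

def markov_model(corpus: str) -> dict:
    """Build a first-order, word-level Markov model from some seed text."""
    words = corpus.split()
    model: dict[str, list[str]] = {}
    for a, b in zip(words, words[1:]):
        model.setdefault(a, []).append(b)
    return model

def babble(model: dict, n: int = 50, seed: int = 0) -> str:
    """Emit n words of vaguely language-shaped nonsense from the model."""
    rng = random.Random(seed)
    word = rng.choice(sorted(model))
    out = [word]
    for _ in range(n - 1):
        # Fall back to a random word when the chain reaches a dead end.
        word = rng.choice(model.get(word) or sorted(model))
        out.append(word)
    return " ".join(out)

def nonsense_page(model: dict, page_id: int) -> str:
    """Render one junk page whose links lead only to more junk pages."""
    body = html.escape(babble(model, seed=page_id))
    links = "".join(
        f'<a href="/trap/{page_id * 2 + i}">more</a>' for i in (1, 2)
    )
    return f"<html><body><p>{body}</p>{links}</body></html>"
```

The `/trap/` URL scheme is made up; the point is only that every generated page links to two more generated pages, so a crawler that follows them never runs out of garbage.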

@olivierlacan Imagine if someone made something like this and then published it as a #WordPress plugin! 
@olivierlacan or just allow a directory filled with (generated) rubbish? :D

@olivierlacan it seems weird to post this with a link to an OpenAI page that requires a login.

No thank you.

@olivierlacan this is actually deviously evil. They are pulling up the drawbridge now that they've scraped the internet without consent.

I'm sure OpenAI will also push for legislation requiring new AI companies to respect robots.txt. They are trying to cement their monopoly on data before credible competitors emerge.