"AI" companies think that we should have to opt out of data-scraping bots that take our work to train their products. There isn't even a required no-scraping period between the announcement and when they start. Too late? Tough.
Not acceptable.
#RequireOptIn
@clarkesworld Also, some of us have sites where we don't control the code side of things, and so can't add arbitrary code to our headers.
@clarkesworld That's only because if it was opt-in, no one would allow it at all.

@clarkesworld
Is it possible to add copyrighted text or robots.txt content to the LICENSE so source files won't be scraped?
I could confirm that ChatGPT was trained on the public GitHub repo, but OpenAI requires a legal name for reports.

Edit: e.g. there is "Am I in The Stack?", a Hugging Face Space by bigcode that lets you check whether your GitHub repositories are part of The Stack dataset:
https://huggingface.co/spaces/bigcode/in-the-stack

@clarkesworld

'THE COMMONS IS MINE! MIIIIIIINE!'

@clarkesworld I mean, it could be seen as an improvement on their previous position of "we don't give a shit whether you consent or not"
@clarkesworld Plus, they are keeping everything they scraped before.
@clarkesworld Also, scraping (edit: the kind that the AI companies use) and DDoS attacks are functionally the same to the receiving server. Both are basically an excessive number of seemingly legitimate requests.
@clarkesworld Hopefully OpenAI goes bankrupt pretty soon since this AI thing is really expensive apparently lol.
@clarkesworld
Biological data-scraping bots, i.e. corpus linguists, have been getting away with this for years.
#corpora
@clarkesworld
I believe that it should be opt out.