I've been helping some friends and colleagues block some of the site scraping bots that are feeding "AI" models. Decided to take some of my notes and make something others could use too. It's a work-in-progress. Happy to add to or correct things.
https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/

@clarkesworld This link gives me: “Error establishing a database connection”.

[Edit: OK, it works for me too now.]

@gregeganSF @clarkesworld Works For Me, fwiw. Maybe just a reload will fix it?
@gregeganSF Should be fine now. For some reason that happens sometimes when I post on Mastodon. (We're making changes to fix that right now.)
@clarkesworld Thank you! Bookmarking this for when I have spoons!

@clarkesworld Thank you for this.

Maybe worth making clear that CCBot is not like the others, in that it's not solely intended for gathering data for AI training? Data in the Common Crawl archives HAS been used to train ML models, but it's also used for other, arguably more benign purposes.

It's a fine distinction, to be sure, but it might matter to some people.

@angusm Unfortunately, it's all-or-nothing with them. Considering how many models depend on CC data, allowing them to continue would be the same as allowing everyone to continue.
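
For reference, here is a minimal sketch of what blocking Common Crawl alone looks like, checked with Python's standard-library robots.txt parser. The `CCBot` token comes from the post above; `ExampleBot` and the example.com URL are placeholders made up for illustration. It also shows the all-or-nothing point: robots.txt can only ask the crawler to stay away entirely, and has no way to say which downstream uses of the data are acceptable.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that turns away only Common Crawl's crawler.
BLOCK_CCBOT = """\
User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/story.html"  # placeholder URL
    for agent in ("CCBot", "ExampleBot"):
        print(agent, is_allowed(BLOCK_CCBOT, agent, page))
```

This is only a request to the crawler. It works to the extent that the crawler chooses to honor robots.txt.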

@clarkesworld

AI #companies should respect an opt-in #policy for #authors, not force authors to opt out. #Copyright must be respected; anyone who does otherwise is simply a #thief or a #pirate.

#Microsoft #ai #ia #chatgpt #bard #Google #apple #Amazon

@elijax respect? from this lot? lol @clarkesworld
@mensrea @clarkesworld cannot really understand your comment, you can explain if you wish!
@elijax just that none of those companies (and others like openai) have any chance of respecting anyone. and they have all specifically shown contempt for creative production @clarkesworld
@mensrea @clarkesworld
I do agree! The exploitation of copyrighted material began with Google and YouTube making it impossible for authors to monetize.
@elijax it began before google. emi, random house, paramount, ... all attempt to do the same thing. the likes of google, amazon, apple, ... have just taken the next step @clarkesworld

@mensrea @elijax @clarkesworld Yup, and at this point that includes creative companies like Disney. Some companies actually steal art from artists directly.

Honestly wish these companies could be punished. :(

@clarkesworld doing my bit… https://github.com/revk/ASCII

Would love to see this in some AI results.


@clarkesworld

Update the site's robots.txt with this handy dandy boilerplate language that, obvs.,

..... The 🚫AI 🤖's 'respect' ☜ (↼_↼)

⚠️👇
https://infosec.exchange/@infosec_jcp/110941117442757422

@clarkesworld this is like a great sci-fi novel. Thx.
@clarkesworld Thanks for this, very useful. Coincidentally enough I was just reading a post on legal responses to scraping -- potentially complementary. https://blog.ericgoldman.org/archives/2023/08/web-scraping-for-me-but-not-for-thee-guest-blog-post.htm

@clarkesworld FYI robots.txt allows opt-in behavior too. How come ppl don't know this?

Just disallow user-agent: * and allow GoogleBot etc. That's opt-in, and basically every big website has used it for over a decade now. See https://Twitter.com/robots.txt
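
The allowlist approach described above can be sketched as follows, again using Python's standard-library `urllib.robotparser` to check the result. The robots.txt text is an illustrative guess at such a policy, not a copy of any real site's file, and `SomeNewBot` and the example.com URL are made-up placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical opt-in policy: everything is disallowed by default,
# and only crawlers named explicitly are let in.
ALLOWLIST_ROBOTS = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/art/gallery.html"  # placeholder URL
    for agent in ("Googlebot", "CCBot", "SomeNewBot"):
        print(agent, may_crawl(ALLOWLIST_ROBOTS, agent, page))
```

With this layout a crawler nobody has heard of yet is refused by default, which is the sense in which the policy is opt-in. It still depends on each crawler actually honoring robots.txt.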

@wraptile not really. True opt-in requires no action on behalf of the site owner.
@colo_lee Are you getting involved with this? Looks cool.
@clarkesworld thanks for posting this. I'm concerned about AI-plagiarism of images of my artworks and just updated my robots.txt accordingly.