Mastodawn

David Stuart Platt, PhD, MLIS Aug 25, 2023

I've been helping some friends and colleagues block some of the site scraping bots that are feeding "AI" models. Decided to take some of my notes and make something others could use too. It's a work-in-progress. Happy to add to or correct things.
https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/

Greg Egan Aug 24, 2023

@clarkesworld This link gives me: “Error establishing a database connection”.

[Edit: OK, it works for me too now.]

⊥ᵒᵚ Cᵸᵎᶺᵋᶫ∸ᵒᵘ ☑️ Aug 24, 2023

@gregeganSF @clarkesworld wfm

@[email protected] Aug 24, 2023

@gregeganSF @clarkesworld Works For Me, fwiw. Maybe just a reload will fix it?

Clarkesworld Aug 24, 2023

@gregeganSF Should be fine now. For some reason that happens sometimes when I post on Mastodon. (We're making changes to fix that right now.)

Alastair Temple Aug 24, 2023

@clarkesworld Thanks for sharing Neil!

Ruby Jones Aug 24, 2023

@clarkesworld Thank you! Bookmarking this for when I have spoons!

Angus McIntyre Aug 24, 2023

@clarkesworld Thank you for this.

Maybe worth making clear that CCBot is not like the others, in that it's not solely intended for gathering data for AI training? Data in the Common Crawl archives HAS been used to train ML models, but it's also used for other, arguably more benign purposes.

It's a fine distinction, to be sure, but it might matter to some people.

Clarkesworld Aug 24, 2023

@angusm Unfortunately, it's all-or-nothing with them. Considering how many models depend on CC data, allowing them to continue would be the same as allowing everyone to continue.

Elijax (Elia Andrea Corazza) Aug 24, 2023

AI #companies should respect an opt-in #policy for #authors, not force authors to opt out. #Copyright must be respected; whoever does otherwise is simply a #thief or a #pirate.

#Microsoft #ai #ia #chatgpt #bard #Google #apple #Amazon

cuan_knaggs Aug 24, 2023

@elijax respect? from this lot? lol @clarkesworld

Elijax (Elia Andrea Corazza) Aug 24, 2023

@mensrea @clarkesworld I can't really understand your comment; feel free to explain if you wish!

cuan_knaggs Aug 24, 2023

@elijax just that none of those companies (and others like openai) have any chance of respecting anyone. and they have all specifically shown contempt for creative production @clarkesworld

Elijax (Elia Andrea Corazza) Aug 24, 2023

@mensrea @clarkesworld
I do agree! The exploitation of copyrighted material began with Google and YouTube making it impossible for authors to monetize.

cuan_knaggs Aug 24, 2023

@elijax it began before google. emi, random house, paramount, ... all attempted to do the same thing. the likes of google, amazon, apple, ... have just taken the next step @clarkesworld

Starlingmap Aug 25, 2023

@mensrea @elijax @clarkesworld Yup, and at this point that includes creative companies like Disney. Some companies actually steal art from artists directly.

Honestly wish these companies could be punished. :(

@clarkesworld doing my bit… https://github.com/revk/ASCII

Would love to see this in some AI results.

GitHub - revk/ASCII: Adrian's Standard Code for Information Interchange

@infosec_jcp 🐈🃏 done differently Aug 24, 2023

Update the site's robots.txt with this handy dandy boilerplate language that, obvs.,

..... The 🚫AI 🤖's 'respect' ☜ (↼_↼)

⚠️👇
https://infosec.exchange/@infosec_jcp/110941117442757422

@infosec_jcp 🆓🐦🐈🃏 done differently (@[email protected])

Attached: 3 images · Content warning: BoilerPlate from https://govtrack.us/legal hits really really really hard #ToS wise

Made in DNA Aug 24, 2023

@clarkesworld this is like a great sci-fi novel. Thx.

Jon Aug 24, 2023

@clarkesworld Thanks for this, very useful. Coincidentally enough, I was just reading a post on legal responses to scraping, which is potentially complementary. https://blog.ericgoldman.org/archives/2023/08/web-scraping-for-me-but-not-for-thee-guest-blog-post.htm

Web Scraping for Me, But Not for Thee (Guest Blog Post) - Technology & Marketing Law Blog

by guest blogger Kieran McCarthy There are few, if any, legal domains where hypocrisy is as baked into the ecosystem as it is with web scraping. Some of the biggest companies on earth—including Meta and Microsoft—take aggressive, litigious approaches to...

Medea Vanamonde🏳️‍⚧️ ♀ Aug 24, 2023

wraptile Aug 25, 2023

@clarkesworld FYI robots.txt allows opt-in behavior too. How come ppl don't know this?

Just disallow user-agent: * and allow GoogleBot etc. That's opt-in and has literally been used by basically every big website for over a decade now. See https://Twitter.com/robots.txt
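
For anyone who wants to see what that allow-list pattern looks like, here is a minimal sketch. The robots.txt text, the `example.com` URL, and the `allowed_agents` helper are all illustrative assumptions, not taken from any real site. It uses Python's standard `urllib.robotparser` to check which crawlers the rules would admit, and it only tells you what a well-behaved bot should do, not what a bot actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical allow-list robots.txt: one named crawler is let in,
# every other user agent is disallowed by the wildcard group.
OPT_IN_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def allowed_agents(robots_txt, agents, url="https://example.com/story.html"):
    """Return {agent: bool} saying whether each agent may fetch `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


results = allowed_agents(OPT_IN_ROBOTS_TXT, ["Googlebot", "GPTBot", "CCBot"])
for agent, ok in results.items():
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Any crawler not named in its own group falls through to the `User-agent: *` rules, which is what makes this an allow list rather than a block list.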

Clarkesworld Aug 25, 2023

@wraptile not really. True opt-in requires no action on the part of the site owner.

Folk London magazine Aug 25, 2023

@clarkesworld Thanks! This is very helpful

Djembro, RO, supports 🇺🇦🇬🇪 Aug 25, 2023

@colo_lee Are you getting involved with this? Looks cool.

*_jߍyrope Aug 25, 2023

GPTBot IP ranges to block: https://openai.com/gptbot-ranges.txt, by way of @｛Hans｝
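
A rough sketch of how a list like that could be used, assuming the file is plain text with one CIDR range per line (an assumption about its format). The ranges below are reserved documentation addresses used as placeholders, not real GPTBot ranges, and `parse_ranges` / `is_blocked` are hypothetical helper names. It relies only on Python's standard `ipaddress` module.

```python
import ipaddress

# Placeholder ranges from the reserved documentation blocks; swap in the
# contents of the real ranges file. These are NOT actual GPTBot ranges.
PLACEHOLDER_RANGES = """\
192.0.2.0/24
198.51.100.0/28
"""


def parse_ranges(text):
    """Parse one CIDR range per line, skipping blanks and '#' comments."""
    networks = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        networks.append(ipaddress.ip_network(line, strict=False))
    return networks


def is_blocked(ip, networks):
    """Return True if `ip` falls inside any of the listed networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


networks = parse_ranges(PLACEHOLDER_RANGES)
print(is_blocked("192.0.2.17", networks))
print(is_blocked("203.0.113.5", networks))
```

In practice the same check is usually done in the web server or firewall configuration instead of application code; this just shows the matching logic.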

Martin Jambon 🌍🌎🌏 Aug 26, 2023

@clarkesworld thanks for posting this. I'm concerned about AI-plagiarism of images of my artworks and just updated my robots.txt accordingly.