https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/
@clarkesworld This link gives me: “Error establishing a database connection”.
[Edit: OK, it works for me too now.]
@clarkesworld Thank you for this.
Maybe worth making clear that CCBot is not like the others, in that it's not solely intended for gathering data for AI training? Data in the Common Crawl archives HAS been used to train ML models, but it's also used for other, arguably more benign purposes.
It's a fine distinction, to be sure, but it might matter to some people.
AI #companies should respect an opt-in #policy for #authors, not force authors to opt out. #Copyright must be respected; whoever does otherwise is simply a #thief or a #pirate.
@mensrea @elijax @clarkesworld Yup, and at this point that includes creative companies like Disney. Some companies actually steal art from artists directly.
Honestly wish these companies could be punished. :(
@clarkesworld doing my bit… https://github.com/revk/ASCII
Would love to see this in some AI results.
Update the site's robots.txt with this handy dandy boilerplate language that, obvs.,
..... The 🚫AI 🤖's 'respect' ☜ (↼_↼)
Content warning: BoilerPlate from https://govtrack.us/legal hits really really really hard #ToS wise
by guest blogger Kieran McCarthy There are few, if any, legal domains where hypocrisy is as baked into the ecosystem as it is with web scraping. Some of the biggest companies on earth—including Meta and Microsoft—take aggressive, litigious approaches to...
@clarkesworld FYI robots.txt allows opt-in behavior too. How come ppl don't know this?
Just disallow user-agent: * and allow GoogleBot etc. That's opt-in, and it has literally been used by basically every big website for over a decade now. See https://Twitter.com/robots.txt
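For anyone who wants to try the opt-in idea, here is a minimal sketch, not anyone's actual config. It embeds an allowlist-style robots.txt in a Python string and checks it with the standard library's `urllib.robotparser`. The `example.com` URL, the `build_parser` helper, and the `SomeNewBot` name are made up for illustration; Googlebot and CCBot are just the crawlers mentioned in this thread. Keep in mind that robots.txt is only a request, so this shows what a well-behaved crawler would conclude, not what every bot will do.

```python
from urllib.robotparser import RobotFileParser

# Opt-in style robots.txt: one named crawler is allowed,
# every other user agent falls through to the "*" group and is disallowed.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
page = "https://example.com/some-story"  # hypothetical URL

for agent in ("Googlebot", "CCBot", "SomeNewBot"):
    verdict = "allowed" if parser.can_fetch(agent, page) else "blocked"
    print(f"{agent}: {verdict}")
```

The point of the design is that a crawler nobody has heard of yet is blocked by default, so the list only needs updating when you decide to let a new one in.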