Clarkesworld Aug 24, 2023

I've been helping some friends and colleagues block some of the site scraping bots that are feeding "AI" models. Decided to take some of my notes and make something others could use too. It's a work-in-progress. Happy to add to or correct things.
https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/

Angus McIntyre

@clarkesworld Thank you for this.

Maybe worth making clear that CCBot is not like the others, in that it's not solely intended for gathering data for AI training? Data in the Common Crawl archives HAS been used to train ML models, but it's also used for other, arguably more benign purposes.

It's a fine distinction, to be sure, but it might matter to some people.

Clarkesworld Aug 24, 2023

@angusm Unfortunately, it's all-or-nothing with them. Considering how many models depend on CC data, allowing them to continue would be the same as allowing everyone to continue.
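
For anyone following along, here is a rough sketch of what "all-or-nothing" looks like in practice. It assumes the block is done through robots.txt, which can only allow or disallow a crawler by its user-agent name and has no way to say what the collected data may later be used for. The `is_allowed` helper, the sample rules, and the example.com URL are made up for illustration; the check uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt rules (an assumption, not copied from the linked post):
# CCBot is refused everywhere; no other crawler is mentioned.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Hypothetical helper: report whether these rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/some-story/"
    # The rule applies to the crawler as a whole, whatever the data is used for later.
    print("CCBot:", is_allowed(ROBOTS_TXT, "CCBot", page))
    print("SomeOtherBot:", is_allowed(ROBOTS_TXT, "SomeOtherBot", page))
```

A rule like this only helps with crawlers that choose to honor robots.txt, and it shuts out the benign uses of the archive along with the model training.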