I've been helping some friends and colleagues block some of the site-scraping bots that feed "AI" models. I decided to turn some of my notes into something others could use too. It's a work in progress, and I'm happy to add to or correct things.
https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/

@clarkesworld Thank you for this.

Maybe worth making clear that CCBot is not like the others, in that it's not solely intended for gathering data for AI training? Data in the Common Crawl archives HAS been used to train ML models, but it's also used for other, arguably more benign purposes.

It's a fine distinction, to be sure, but it might matter to some people.

@angusm Unfortunately, it's all-or-nothing with them. Considering how many models depend on CC data, allowing them to continue would be the same as allowing everyone to continue.
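
For anyone wondering what "all-or-nothing" looks like in practice, here is a minimal sketch. It assumes the block is done with a robots.txt rule, which the thread itself doesn't confirm. A rule like this targets a crawler by its user-agent token, not by what the collected data is later used for, so there is no way to say "yes to archiving, no to training". The `CCBot` token is the one named above; the `is_allowed` helper, the sample rules, and the example.com URL are hypothetical, and robots.txt is only advisory, so this checks what a well-behaved crawler would do.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block CCBot everywhere, leave other crawlers alone.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler honoring these rules may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/some-story"  # placeholder URL
    print("CCBot allowed:", is_allowed(ROBOTS_TXT, "CCBot", page))
    print("Other bot allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", page))
```

The rule has no field for purpose, only for who is asking and which paths, which is why blocking CCBot also blocks the more benign uses of the Common Crawl archive.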