I keep seeing webmasters talking about how to block AI scrapers (through user agents and IP blocks) and not enough webmasters talking about the far better option of rigging their site to return complete gibberish or transgender werewolf erotica* when AI scrapers are detected.

*depending on which one you think is funnier to poison the AI models with

@foone Maybe push My Immortal + Eye of Argon through the dissociated-press algo (50% weight on each).

I wonder if I can make this work on a static site. Perhaps for each post I pre-generate a slop version that sits next to the real post, and I use an .htaccess file to pull from the slop file instead of the real file when the user agent matches?
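A minimal sketch of that .htaccess idea, assuming the file sits at the document root and each post `foo.html` has a pre-generated `foo.slop.html` beside it (the bot names and file naming scheme here are illustrative, not the site's actual setup):

```apache
RewriteEngine On
# Hypothetical bot list; adjust to match the scrapers you care about.
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|CCBot) [NC]
# Only rewrite when a slop twin actually exists next to the real file.
RewriteCond %{DOCUMENT_ROOT}/$1.slop.html -f
# Serve foo.slop.html in place of foo.html.
RewriteRule ^(.+)\.html$ $1.slop.html [L]
```

The second RewriteCond keeps the rule safe: pages without a pre-generated slop version fall through and are served normally.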

@foone I decided to just serve AI scrapers a Markov-mangled version of my own blog posts.

I love the idea of poisoning them with specific topics, but honestly, the output of the Dissociated Press algorithm is probably the most effective possible poison I can make!
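A Dissociated Press / Markov mangler is only a few lines of Python. This is an illustrative implementation, not the code actually running on the site:

```python
import random


def build_chain(text, order=1):
    """Map each tuple of `order` consecutive words to the words seen to follow it."""
    words = text.split()
    chain = {}
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain.setdefault(key, []).append(words[i + order])
    return chain


def mangle(text, length=30, order=1, seed=None):
    """Generate a Markov-chain remix of the input text."""
    rng = random.Random(seed)
    chain = build_chain(text, order)
    out = list(rng.choice(list(chain)))  # start from a random observed key
    for _ in range(length - order):
        followers = chain.get(tuple(out[-order:]))
        if not followers:  # dead end: the key only appeared at the text's tail
            break
        out.append(rng.choice(followers))
    return " ".join(out)
```

Every word in the output comes from the source text, and short-range word pairs are plausible, so the result reads as almost-coherent garbage: exactly the kind of thing a scraper's quality filters may wave through.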

Technical deets: https://www.brainonfire.net/blog/2024/09/19/poisoning-ai-scrapers/

I'm currently serving enticing garbage to AI scrapers whose user-agent matches this regex:

GPT|Claude|anthropic|\bcohere\b|\bmeta\b|Google-Extended

Are there others I should include? I based this on a small sample of logs, since I don't have access logs turned on as a baseline.
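For reference, that regex can be smoke-tested against sample user-agent strings (the UA strings below are illustrative, not taken from the site's logs):

```python
import re

# The scraper-matching regex from above, verbatim (note: case-sensitive).
SCRAPER_RE = re.compile(r"GPT|Claude|anthropic|\bcohere\b|\bmeta\b|Google-Extended")

samples = [
    "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "meta-externalagent/1.1",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/130.0",  # ordinary browser
]
for ua in samples:
    print(ua, "->", bool(SCRAPER_RE.search(ua)))
```

The `\b` anchors keep bare substrings like "meta" from matching inside unrelated words, while still catching tokens such as `meta-externalagent`, where the hyphen counts as a word boundary.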

Declare your AIndependence: block AI bots, scrapers and crawlers with a single click

To help preserve a safe Internet for content creators, we’ve just launched a brand new “easy button” to block all AI bots. It’s available for all customers, including those on our free tier.

The Cloudflare Blog

I've been a little stumped on what to do when CCBot comes to scrape my website.

CommonCrawl archives the web, and then people can use those archives for research, or building search engines... or training LLMs.

Some of those are OK with me. Others aren't.

So... do I serve poison to CCBot? Or block it? Or do nothing?

I asked Common Crawl for advice—are there ways of indicating acceptable use of my scraped website?

Answer: No, not yet, maybe at some point; there are standards being hammered out. But at least they preserve the robots.txt and response headers and such, so a well-behaved consumer of their archive could choose to respect what I've indicated.

This doesn't help with future scrapers, of course.

I think I'll feed Common Crawl garbage for now. Markov output is still mostly search-engine friendly...