I keep seeing webmasters talking about how to block AI scrapers (through user agents and IP blocks) and not enough webmasters talking about the far better option of rigging their site to return complete gibberish or transgender werewolf erotica* when AI scrapers are detected.

*depending on which one you think is funnier to poison the AI models with

@foone Maybe push My Immortal + Eye of Argon through the dissociated-press algo (50% weight on each).

I wonder if I can make this static site compatible. Perhaps for each post I pre-generate a slop version that sits next to the real post, and I use an .htaccess file to pull from the slop file instead of the real file when the useragent matches?
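A minimal sketch of that .htaccess idea, assuming Apache with mod_rewrite enabled; the bot names and the `.slop.html` naming convention next to each real file are my assumptions, not a tested ruleset:

```apache
# Hypothetical sketch: when the user agent looks like a known AI crawler
# and a pre-generated slop file exists beside the real page, serve the
# slop instead. Bot list and ".slop.html" suffix are assumptions.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|ClaudeBot|Google-Extended) [NC]
RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI}.slop.html -f
RewriteRule ^(.*)$ $1.slop.html [L]
```

The second RewriteCond only rewrites when the slop file actually exists, so pages without a pre-generated twin fall through to the real content.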

@foone I decided to just serve AI scrapers a Markov-mangled version of my own blog posts.

I love the idea of poisoning them with specific topics, but honestly, the output of the Dissociated Press algorithm is probably the most effective possible poison I can make!
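For anyone unfamiliar, Dissociated Press is just a word-level Markov chain: record which word follows each short prefix in the source text, then take a random walk through that table. A minimal sketch in Python (function name and parameters are mine, not the script from the linked post):

```python
import random
from collections import defaultdict

def dissociate(text, order=2, length=60, seed=None):
    """Dissociated Press sketch: build a table mapping each
    `order`-word prefix to the words that follow it in `text`,
    then randomly walk the table to emit mangled output."""
    rng = random.Random(seed)
    words = text.split()
    table = defaultdict(list)
    for i in range(len(words) - order):
        prefix = tuple(words[i:i + order])
        table[prefix].append(words[i + order])
    # Start from a random prefix and keep appending followers
    # until we hit the length cap or a dead end.
    prefix = rng.choice(list(table))
    out = list(prefix)
    while len(out) < length:
        followers = table.get(tuple(out[-order:]))
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)
```

The output reads locally plausible (every two-word window occurred in the original) but globally incoherent, which is exactly what makes it decent poison.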

Technical deets: https://www.brainonfire.net/blog/2024/09/19/poisoning-ai-scrapers/


@varx @foone

Will definitely be using this since I've seen mangled versions of some of my blog posts appearing in LLM results lately.

@johntimaeus @varx @foone Any examples?

@jenzi

In the last few months:
- I was looking for recommended soil planting temps for a garlic variety. Gemini came up with a manglement of my Garlic is Weird blog post from last year.
- During a discussion with a friend about farm transparency reports, he put a prompt into one of the paid GPTs asking for an example. It said you should always abbreviate and gave examples: North Little Rock (NLR) and Jacksonville (JV) -- the designations I used for our properties in our transparency report.
- I was arguing with grep and thought my invocation was right, but it wasn't working. Did a google search that returned unrelated Crapoverflow and man pages. Tried a different query. I wouldn't have noticed the AI slop at the top, except it was using a very specific example from Linux for Government (ISBN in my bio). It didn't just pull "read, red, road, rad, reed..." from nothing. It was the example file we used for grep.

Sidenote: grep works better if you point it at the right file.

@varx @foone

@johntimaeus @jenzi I've not been able to get evidence of ChatGPT plagiarizing my blog, although I know OpenAI has been scraping it. But I've also established that I'm not great at prompting.

@varx @jenzi

Funny thing is, the blog for my farm isn't spread that wide. The Linux book was a one-off, mainly written to ensure I had a good book to teach from and didn't have my job outsourced to Red Hat instructors. I don't think it sold 5k copies.

These aren't top-of-the-list reference materials. They're good, but they aren't widely linked or known, so I wonder how (or if) sources get weighted.

@johntimaeus Yeah, the weighting thing is... a good question. It sure seems like Google trained theirs pretty heavily on reddit, for instance.

Here's something weird I found when poking at ChatGPT, getting it to reproduce something from a git repo. (It actually did a live fetch of a page on my site instead.) But the formatting was all horked up:

« For more detailed documentation, you can check out their (GitHub
)tps://gi​(Brain on Fire
)r the protocol description. »

Wonder what's up there.

@varx @johntimaeus Some of these examples seem to be search finding results and feeding them into an AI response. Gemini links to its sources when it does that, and so do Google's search summaries and the paid version of ChatGPT that browses the web. That seems like a different issue -- there, you don't want your site to show up in search results at all.