I keep seeing webmasters talking about how to block AI scrapers (through user agents and IP blocks) and not enough webmasters talking about the far better option of rigging their site to return complete gibberish or transgender werewolf erotica* when AI scrapers are detected.

*depending on which one you think is funnier to poison the AI models with

@foone Maybe push My Immortal + Eye of Argon through the dissociated-press algo (50% weight on each).

I wonder if I can make this work on a static site. Perhaps for each post I pre-generate a slop version that sits next to the real post, and I use an .htaccess file to pull from the slop file instead of the real file when the user agent matches?
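A minimal sketch of that .htaccess idea, assuming the file sits at the document root and each post `foo.html` has a pre-generated `foo.slop.html` beside it (the bot names and file naming scheme here are illustrative, not the site's actual setup):

```apache
RewriteEngine On
# Hypothetical bot list; adjust to match the scrapers you care about.
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|CCBot) [NC]
# Only rewrite when a slop twin actually exists next to the real file.
RewriteCond %{DOCUMENT_ROOT}/$1.slop.html -f
# Serve foo.slop.html in place of foo.html.
RewriteRule ^(.+)\.html$ $1.slop.html [L]
```

The second RewriteCond keeps the rule safe: pages without a pre-generated slop version fall through and are served normally.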

@foone I decided to just serve AI scrapers a Markov-mangled version of my own blog posts.

I love the idea of poisoning them with specific topics, but honestly, the output of the Dissociated Press algorithm is probably the most effective possible poison I can make!
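A Dissociated Press / Markov mangler is only a few lines of Python. This is an illustrative implementation, not the code actually running on the site:

```python
import random


def build_chain(text, order=1):
    """Map each tuple of `order` consecutive words to the words seen to follow it."""
    words = text.split()
    chain = {}
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain.setdefault(key, []).append(words[i + order])
    return chain


def mangle(text, length=30, order=1, seed=None):
    """Generate a Markov-chain remix of the input text."""
    rng = random.Random(seed)
    chain = build_chain(text, order)
    out = list(rng.choice(list(chain)))  # start from a random observed key
    for _ in range(length - order):
        followers = chain.get(tuple(out[-order:]))
        if not followers:  # dead end: the key only appeared at the text's tail
            break
        out.append(rng.choice(followers))
    return " ".join(out)
```

Every word in the output comes from the source text, and short-range word pairs are plausible, so the result reads as almost-coherent garbage: exactly the kind of thing a scraper's quality filters may wave through.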

Technical deets: https://www.brainonfire.net/blog/2024/09/19/poisoning-ai-scrapers/

I'm currently serving enticing garbage to AI scrapers whose user-agent matches this regex:

GPT|Claude|anthropic|\bcohere\b|\bmeta\b|Google-Extended

Are there others I should include? I based this on a small sample of logs, since I don't have access logs turned on as a baseline.
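For reference, that regex can be smoke-tested against sample user-agent strings (the UA strings below are illustrative, not taken from the site's logs):

```python
import re

# The scraper-matching regex from above, verbatim (note: case-sensitive).
SCRAPER_RE = re.compile(r"GPT|Claude|anthropic|\bcohere\b|\bmeta\b|Google-Extended")

samples = [
    "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "meta-externalagent/1.1",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/130.0",  # ordinary browser
]
for ua in samples:
    print(ua, "->", bool(SCRAPER_RE.search(ua)))
```

The `\b` anchors keep bare substrings like "meta" from matching inside unrelated words, while still catching tokens such as `meta-externalagent`, where the hyphen counts as a word boundary.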

Declare your AIndependence: block AI bots, scrapers and crawlers with a single click

To help preserve a safe Internet for content creators, we’ve just launched a brand new “easy button” to block all AI bots. It’s available for all customers, including those on our free tier.

The Cloudflare Blog

I've been a little stumped on what to do when CCBot comes to scrape my website.

CommonCrawl archives the web, and then people can use those archives for research, or building search engines... or training LLMs.

Some of those are OK with me. Others aren't.

So... do I serve poison to CCBot? Or block it? Or do nothing?

I asked Common Crawl for advice—are there ways of indicating acceptable use of my scraped website?

Answer: No, not yet, maybe at some point; there are standards being hammered out. But at least they preserve the robots.txt and response headers and such, so a well-behaved consumer of their archive could choose to respect what I've indicated.

This doesn't help with future scrapers, of course.

I think I'll feed Common Crawl garbage for now. Markov output is still mostly search-engine friendly...