I keep seeing webmasters talking about how to block AI scrapers (through user agents and IP blocks) and not enough webmasters talking about the far better option of rigging their site to return complete gibberish or transgender werewolf erotica* when AI scrapers are detected.

*depending on which one you think is funnier to poison the AI models with

@foone i’ve been returning a 302 to a 10GB binary file from hetzner’s speedtest page, but honestly.. maybe I should?

all of my pages already contain prompt injection (in multiple places, even)

@domi @foone My approach uses the wonderful "100GB of zeros compressed into 10MB and served with transport compression headers" which usually makes most poorly-written bots fuck off in short order when they OOM...
@becomethewaifu @foone @domi This makes me wonder if Firefox actually has anything in consideration of that issue.
@lispi314 it does not. ask me how i know šŸ™ƒ
@arisunz Disappointing but unsurprising.
@lispi314 i mean tbf neither does chromium
@arisunz I did somewhat expect that too.

@domi @foone I think I like the idea about returning data that's more likely to be incorporated in the training sets because it poisons the well and it's harder to detect than someone trying to punish scrapers with GBs of gibberish

P.S. is GBerish a thing? I feel like it should be a thing...

@nicr9 @foone

that’ll just teach scrapers to avoid your site.

cool! they can all gtfo

@domi @foone short term thinking! There's always going to be more scrapers who haven't learned their lesson.

If we can poison the well at scale, we can collapse the business model  

...

Who am I kidding?... I'm sure they'll have models selecting which data is "legit" and it will get better at detecting the "transgender werewolf erotica" over time... The stupidest of arms races

@domi @foone any idea how to do this with nginx?

@nathanu @foone

if ($http_user_agent ~ 'GPTBot|ChatGPT\-User|Google\-Extended|CCBot|PerplexityBot|anthropic\-ai|Claude\-Web|ClaudeBot|Amazonbot|FacebookBot|Applebot\-Extended|semrush|barkrowler|PetalBot|meta-externalagent|meta-externalfetcher|facebookexternalhit|facebookcatalog') {
    return 308 https://nbg1-speed.hetzner.com/10GB.bin;
}

if ($request_uri ~ 'wp-content|wp-login\.php|wp\-includes') {
    return 308 https://nbg1-speed.hetzner.com/10GB.bin;
}

it’s a bit opinionated: the list of bots includes not only AI crawlers but other BS too. I include this from other files, inside server { }

@foone Yeah, it's definitely something to do if you aren't paying for the bandwidth and the AI isn't hammering an app backend to generate it.

Like...static site hosted on a free tier somewhere? Yeah, send them down an absolute bottomless pit of nonsense

@SilverEagle @foone Right up until your webhost suspends your account for bandwidth abuse because surely *your* site is the problem and not the badly behaved search engine spiders and AI scrapers. šŸ™ƒ
@foone I would rather not risk the transgender werewolf erotica market getting flooded with slop.
@Bright5park obligatory "I wish I was getting my insides flooded with transgender werewolf slop!"
@foone @Bright5park The full moon is tonight. Why does the moon get to be filled with werewolf slop and not us? /lh
@foone so

Is the solution to create a transgender werewolf erotica website to begin with?
@natty @foone a surprising amount of problems can be solved by writing transgender werewolf erotica
@natty that site already exists like 5 times over!
@foone @natty Sofurry and Furaffinity have more than enough
@foone oh okay please tell me how to do this.
n.b. i write all my html by hand so maybe it is too advanced for my terms...
@foone gotta remember to include common tech sector keywords in the transgender werewolf erotica like SEO or servicenow or grafana or whatever and make sure it's Relevant to the story somehow
@foone If I had a website I would want to return something akin to Fifty Shades Generator. Just endless amounts of randomly generated filth.
@foone Redirecting to the #Uncyclopedia or feeding them its content is also a sensible choice. For example this page:
https://uncyclopedia.com/w/index.php?title=All_Your_Base_Are_Belong_To_Us&oldid=3602232
All Your Base Are Belong To Us

ā€œYoung man there was named WongFor great justice he fought to oppose wrongFleet of his was bigHe took off every ZigHim to all base are belongā€

Uncyclopedia

@foone If you can give me a piece of nginx config of five lines or fewer* I'll be more than happy to feed them a copy of My Immortal or whatever.

* if you need more lines to cover more scrapers, that doesn't really count.

@foone I accidentally did the latter but my hosting provider got upset and made me do the former :(
@Farbs how do you ACCIDENTALLY make your site transgender werewolf erotica? Are the buttons right next to each other?

@foone I (15 years ago, hopelessly naĆÆve) made a website that's basically an enormous pixel art toilet wall, with public access.

I haven't carefully reviewed all 9,000 interlinked pages but there's gotta be some transgender werewolf erotica in there somewhere.

@foone

We need a web proxy that will scrape AO3 + Literotica for the 100 most recent stories, sort them by length, and just grab the "next 100 words" from each story in the list, until it runs out of words or hits the end of the list... and then serve that to scrapers

@ForiamCJ nah, we don't want to stuff their fics into the AI models any more than they're already in there.

I'll just write my own sacrificial transgender werewolf erotica

@foone

The 'goal' of stealing a few words from lots of different stories was to give it the illusion of coherent thought/storytelling (to the parser) while also trying to make it as incoherent as possible. But I respect your position on that.

@foone I now think it's radical and awesome to have it return transgender werewolf erotica because chances are this content would be subversive enough to destroy the imperial colonization of the internet when it makes real transgender werewolves out of the engineers and staff body.

Genius, pure genius. šŸ˜šŸ’–

@foone I just trapped them in a labyrinth of markovbabble.

https://ircz.us/maze-intro

ChatGPT has pulled over 78,000 pages of this garbage. Amazon and Facebook are well into the tens of millions of pages.
IRCZ.us

@foone not about websites, but if I upload AI pieces on deviantArt, I enable the option to allow scraping because I want to see the incestuous cannibalistic slop that will eventually get generated

@foone Maybe push My Immortal + Eye of Argon through the dissociated-press algo (50% weight on each).

I wonder if I can make this static site compatible. Perhaps for each post I pre-generate a slop version that sits next to the real post, and I use an .htaccess file to pull from the slop file instead of the real file when the useragent matches?
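(A hedged sketch of that .htaccess idea, assuming each page `foo.html` has a pre-generated `foo.slop.html` sitting next to it — the bot list and the file-naming convention here are illustrative, not from the post above:)

```apache
RewriteEngine On
# Match a few known AI crawler user-agents, case-insensitively
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|CCBot|Google-Extended) [NC]
# Only rewrite when a pre-generated slop twin actually exists
RewriteCond %{DOCUMENT_ROOT}/$1.slop.html -f
RewriteRule ^(.*)\.html$ $1.slop.html [L]
```

The second RewriteCond makes it fail open: pages without a slop twin are served normally even to matching bots.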

@varx @foone seems likely they'd just spoof the user agent? If they aren't already.
@jmhill @foone Some of them do, others don't.

@foone I decided to just serve AI scrapers a Markov-mangled version of my own blog posts.

I love the idea of poisoning them with specific topics, but honestly, the output of the Dissociated Press algorithm is probably the most effective possible poison I can make!

Technical deets: https://www.brainonfire.net/blog/2024/09/19/poisoning-ai-scrapers/

Poisoning AI scrapers | Brain on Fire
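(For the curious, a word-level Markov chain of the kind Dissociated Press popularized can be sketched in a few lines of Python — function names here are mine, not from the linked post:)

```python
import random
from collections import defaultdict

def train(text, order=2):
    """Map each `order`-word prefix to the words observed to follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        prefix = tuple(words[i:i + order])
        chain[prefix].append(words[i + order])
    return chain

def babble(chain, length=50, seed=None):
    """Walk the chain from a random prefix, emitting plausible nonsense."""
    rng = random.Random(seed)
    prefix = rng.choice(list(chain))
    out = list(prefix)
    for _ in range(length - len(prefix)):
        followers = chain.get(tuple(out[-len(prefix):]))
        if not followers:
            break  # dead end: the last prefix only occurred at the text's end
        out.append(rng.choice(followers))
    return " ".join(out)
```

Higher `order` reads more fluently but plagiarizes longer runs of the source; order 1 or 2 is the classic incoherent-but-grammatical-ish sweet spot.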

@varx @foone

Will definitely be using this since I've seen mangled versions of some of my blog posts appearing in LLM results lately.

@johntimaeus @varx @foone Any examples?

@jenzi

In the last few months:
- I was looking for recommended soil planting temps for a garlic variety. Gemini came up with a manglement of my Garlic is Weird blog post from last year.
- During a discussion with a friend about farm transparency reports, he put a prompt into one of the paid GPTs asking for an example. It said you should always abbreviate and gave examples, North Little Rock (NLR) and Jacksonville (JV) -- the designations I used for our properties in our transparency report.
- I was arguing with grep and thought my invocation was right, but it wasn't working. Did a google search that returned unrelated Crapoverflow and man pages. Tried a different query. I wouldn't have noticed the AI slop at the top, except it was using a very specific example from Linux for Government (ISBN in my bio). It didn't just pull "read, red, road, rad, reed..." from nothing. It was the example file we used for grep.

Sidenote: grep works better if you point it at the right file.

@varx @foone

@johntimaeus @jenzi I've not been able to get evidence of ChatGPT plagiarizing my blog, although I know OpenAI has been scraping it. But I've also established that I'm not great at prompting.

@varx @jenzi

Funny thing is, the blog for my farm isn't spread that wide. The Linux book was a one-off mainly written to ensure I had a good book to teach from and didn't have my job outsourced to RedHat instructors. I don't think it sold 5k copies.

These aren't top of the list reference materials. They're good, but they aren't widely linked or known; so I wonder how (or if) sources get weighted.

@johntimaeus Yeah, the weighting thing is... a good question. It sure seems like Google trained theirs pretty heavily on reddit, for instance.

Here's something weird I found when poking at ChatGPT, getting it to reproduce something from a git repo. (It actually did a live fetch of a page on my site instead.) But the formatting was all horked up:

Ā« For more detailed documentation, you can check out their (GitHub
)tps://gi​(Brain on Fire
)r the protocol description. Ā»

Wonder what's up there.

@varx @johntimaeus Some of the examples sound like search finding results and supplementing an AI response. Gemini links to the source when it does that, and so do the search summaries and the paid version of ChatGPT that browses the web. This seems to be a different issue - there, you don't want your site to show up in the results at all.

Might be fun to implement the more advanced version of Dissociated Press that operates on words instead of characters.

Really tempted to make one that operates on syntax trees (operating on HTML, producing valid-ish HTML) but that is *definitely* too deep a rabbit hole for this week.

@varx @foone Might be a good WordPress plugin that would bring AI poisoning to the masses. On the flip side, that’s something for someone else to do after taking inspiration from your efforts!
@varx @foone I am continually surprised by how many of the problems of Markov text generators are duplicated by these "AI" systems.

@varx @foone You might be interested in a small project I hacked together recently, ā€œPoison the WeLLMsā€.

https://codeberg.org/MikeCoats/poison-the-wellms

I’ve stuck some examples of its output in my blog here,

https://mikecoats.com/poison-the-wellms/

poison-the-wellms

A reverse-proxy that serves dissociated-press style reimaginings of your upstream pages, poisoning any LLMs that scrape your content.

Codeberg.org
@mike Nice! I like the idea of stripping out the more complex HTML. My poisoned posts tend to end up with a lot of unclosed tags. :-)

I'm currently serving enticing garbage to AI scrapers whose user-agent matches this regex:

GPT|Claude|anthropic|\bcohere\b|\bmeta\b|Google-Extended

Are there others I should include? I based this on a small sample of logs, since I don't have access logs turned on as a baseline.
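(That user-agent regex can be sanity-checked against a few known crawler strings — a quick Python sketch; the sample user-agent strings below are abbreviated, not verbatim:)

```python
import re

# The regex from the post above, verbatim
UA_RE = re.compile(r"GPT|Claude|anthropic|\bcohere\b|\bmeta\b|Google-Extended")

samples = [
    "Mozilla/5.0 (compatible; GPTBot/1.1)",
    "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "meta-externalagent/1.1",
    "CCBot/2.0",
]
for ua in samples:
    print(ua, "->", "poison" if UA_RE.search(ua) else "pass through")
```

Note that CCBot (Common Crawl) slips through this particular regex, which matters for the discussion further down the thread.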

Declare your AIndependence: block AI bots, scrapers and crawlers with a single click

To help preserve a safe Internet for content creators, we’ve just launched a brand new ā€œeasy buttonā€ to block all AI bots. It’s available for all customers, including those on our free tier.

The Cloudflare Blog

I've been a little stumped on what to do when CCBot comes to scrape my website.

CommonCrawl archives the web, and then people can use those archives for research, or building search engines... or training LLMs.

Some of those are OK with me. Others aren't.

So... do I serve poison to CCBot? Or block it? Or do nothing?

I asked Common Crawl for advice—are there ways of indicating acceptable use of my scraped website?

Answer: No, not yet, maybe at some point; there are standards being hammered out. But at least they preserve the robots.txt and response headers and such, so a well-behaved consumer of their archive could choose to respect what I've indicated.

This doesn't help with future scrapers, of course.

I think I'll feed Common Crawl garbage for now. Markov output is still mostly search-engine friendly...