I keep seeing webmasters talking about how to block AI scrapers (through user agents and IP blocks) and not enough webmasters talking about the far better option of rigging their site to return complete gibberish or transgender werewolf erotica* when AI scrapers are detected.

*depending on which one you think is funnier to poison the AI models with

@foone i’ve been returning a 302 to a 10GB binary file from hetzner’s speedtest page, but honestly.. maybe I should?

all of my pages already contain prompt injection (in multiple places, even)

@domi @foone My approach uses the wonderful "100GB of zeros compressed into 10MB and served with transport compression headers" which usually makes most poorly-written bots fuck off in short order when they OOM...
@becomethewaifu @foone @domi This makes me wonder if Firefox actually has anything in consideration of that issue.
@lispi314 it does not. ask me how i know šŸ™ƒ
@arisunz Disappointing but unsurprising.
@lispi314 i mean tbf neither does chromium
@arisunz I did somewhat expect that too.

@domi @foone I think I like the idea about returning data that's more likely to be incorporated in the training sets because it poisons the well and it's harder to detect than someone trying to punish scrapers with GBs of gibberish

P.S. is GBerish a thing? I feel like it should be a thing...

@nicr9 @foone

that’ll just teach scrapers to avoid your site.

cool! they can all gtfo

@domi @foone short term thinking! There's always going to be more scrapers who haven't learned their lesson.

If we can poison the well at scale, we can collapse the business model  

...

Who am I kidding?... I'm sure they'll have models selecting which data is "legit" and it will get better at detecting the "transgender werewolf erotica" over time... The stupidest of arms races

@domi @foone any idea how to do this with nginx?

@nathanu @foone

if ($http_user_agent ~ 'GPTBot|ChatGPT\-User|Google\-Extended|CCBot|PerplexityBot|anthropic\-ai|Claude\-Web|ClaudeBot|Amazonbot|FacebookBot|Applebot\-Extended|semrush|barkrowler|PetalBot|meta-externalagent|meta-externalfetcher|facebookexternalhit|facebookcatalog') {
    return 308 https://nbg1-speed.hetzner.com/10GB.bin;
}

if ($request_uri ~ 'wp-content|wp-login\.php|wp\-includes') {
    return 308 https://nbg1-speed.hetzner.com/10GB.bin;
}

it’s a bit opinionated: the list of bots includes not only AI crawlers but other BS too. I include this from other files, inside server { }

@foone Yeah, it's definitely something to do if you aren't paying for the bandwidth and the AI isn't hammering an app backend to generate it.

Like...static site hosted on a free tier somewhere? Yeah, send them down an absolute bottomless pit of nonsense

@SilverEagle @foone Right up until your webhost suspends your account for bandwidth abuse because surely *your* site is the problem and not the badly behaved search engine spiders and AI scrapers. šŸ™ƒ
@foone I would rather not risk the transgender werewolf erotica market getting flooded with slop.
@Bright5park obligatory "I wish I was getting my insides flooded with transgender werewolf slop!"
@foone @Bright5park The full moon is tonight. Why does the moon get to be filled with werewolf slop and not us? /lh
@foone so

Is the solution to create a transgender werewolf erotica website to begin with?
@natty @foone a surprising amount of problems can be solved by writing transgender werewolf erotica
@natty that site already exists like 5 times over!
@foone @natty Sofurry and Furaffinity have more than enough
@foone oh okay please tell me how to do this.
n.b. i write all my html by hand so maybe it is too advanced for my terms...
@foone gotta remember to include common tech sector keywords in the transgender werewolf erotica like SEO or servicenow or grafana or whatever and make sure it's Relevant to the story somehow
@foone If I had a website I would want to return something akin to Fifty Shades Generator. Just endless amounts of randomly generated filth.
@foone Redirecting to the #Uncyclopedia or feeding them its content is also a sensible choice. For example this page:
https://uncyclopedia.com/w/index.php?title=All_Your_Base_Are_Belong_To_Us&oldid=3602232
All Your Base Are Belong To Us

ā€œYoung man there was named WongFor great justice he fought to oppose wrongFleet of his was bigHe took off every ZigHim to all base are belongā€

Uncyclopedia

@foone If you can give me a piece of nginx config of five lines or fewer* I'll be more than happy to feed them a copy of My Immortal or whatever.

* if you need more lines to cover more scrapers, that doesn't really count.

@foone I accidentally did the latter but my hosting provider got upset and made me do the former :(
@Farbs how do you ACCIDENTALLY make your site transgender werewolf erotica? Are the buttons right next to each other?

@foone I (15 years ago, hopelessly naĆÆve) made a website that's basically an enormous pixel art toilet wall, with public access.

I haven't carefully reviewed all 9,000 interlinked pages but there's gotta be some transgender werewolf erotica in there somewhere.

@foone

We need a web proxy that will scrape AO3 + Literotica for the 100 most recent stories, sort them by length, and just grab the "next 100 words" from each story in the list, until it runs out of words or hits the end of the list... and then serve that to scrapers

@ForiamCJ nah, we don't want to stuff their fics into the AI models any more than they're already in there.

I'll just write my own sacrificial transgender werewolf erotica

@foone

The 'goal' of stealing a few words from lots of different stories was to give it the illusion of coherent thought/storytelling (to the parser) while also trying to make it as incoherent as possible. But I respect your position on that.

@foone I now think it's radical and awesome to have it return transgender werewolf erotica because chances are this content would be subversive enough to destroy the imperial colonization of the internet when it makes real transgender werewolves out of the engineers and staff body.

Genius, pure genius. šŸ˜šŸ’–

@foone I just trapped them in a labyrinth of markovbabble.

https://ircz.us/maze-intro

ChatGPT has pulled over 78,000 pages of this garbage. Amazon and Facebook are well into the tens of millions of pages.
IRCZ.us

@foone not about websites, but if I upload AI pieces on deviantArt, I enable the option to allow scraping because I want to see the incestuous cannibalistic slop that will eventually get generated

@foone Maybe push My Immortal + Eye of Argon through the dissociated-press algo (50% weight on each).

I wonder if I can make this static site compatible. Perhaps for each post I pre-generate a slop version that sits next to the real post, and I use an .htaccess file to pull from the slop file instead of the real file when the useragent matches?
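(A hedged sketch of that .htaccess idea, assuming each page `foo.html` has a pre-generated `foo.slop.html` sitting next to it — the bot list and the file-naming convention here are illustrative, not from the post above:)

```apache
RewriteEngine On
# Match a few known AI crawler user-agents, case-insensitively
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|CCBot|Google-Extended) [NC]
# Only rewrite when a pre-generated slop twin actually exists
RewriteCond %{DOCUMENT_ROOT}/$1.slop.html -f
RewriteRule ^(.*)\.html$ $1.slop.html [L]
```

The second RewriteCond makes it fail open: pages without a slop twin are served normally even to matching bots.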

@varx @foone seems likely they'd just spoof the user agent? If they aren't already.
@jmhill @foone Some of them do, others don't.

@foone I decided to just serve AI scrapers a Markov-mangled version of my own blog posts.

I love the idea of poisoning them with specific topics, but honestly, the output of the Dissociated Press algorithm is probably the most effective possible poison I can make!

Technical deets: https://www.brainonfire.net/blog/2024/09/19/poisoning-ai-scrapers/

Poisoning AI scrapers | Brain on Fire
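(For the curious, a word-level Markov chain of the kind Dissociated Press popularized can be sketched in a few lines of Python — function names here are mine, not from the linked post:)

```python
import random
from collections import defaultdict

def train(text, order=2):
    """Map each `order`-word prefix to the words observed to follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        prefix = tuple(words[i:i + order])
        chain[prefix].append(words[i + order])
    return chain

def babble(chain, length=50, seed=None):
    """Walk the chain from a random prefix, emitting plausible nonsense."""
    rng = random.Random(seed)
    prefix = rng.choice(list(chain))
    out = list(prefix)
    for _ in range(length - len(prefix)):
        followers = chain.get(tuple(out[-len(prefix):]))
        if not followers:
            break  # dead end: the last prefix only occurred at the text's end
        out.append(rng.choice(followers))
    return " ".join(out)
```

Higher `order` reads more fluently but plagiarizes longer runs of the source; order 1 or 2 is the classic incoherent-but-grammatical-ish sweet spot.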

@varx @foone

Will definitely be using this since I've seen mangled versions of some of my blog posts appearing in LLM results lately.

@johntimaeus @varx @foone Any examples?

@jenzi

In the last few months:
- I was looking for recommended soil planting temps for a garlic variety. Gemini came up with a manglement of my Garlic is Weird blog post from last year.
- During a discussion with a friend about farm transparency reports, he put a prompt into one of the paid GPTs asking for an example. It said you should always abbreviate and gave examples, North Little Rock (NLR) and Jacksonville (JV) -- the designations I used for our properties in our transparency report.
- I was arguing with grep and thought my invocation was right, but it wasn't working. Did a google search that returned unrelated Crapoverflow and man pages. Tried a different query. I wouldn't have noticed the AI slop at the top, except it was using a very specific example from Linux for Government (ISBN in my bio). It didn't just pull "read, red, road, rad, reed..." from nothing. It was the example file we used for grep.

Sidenote: grep works better if you point it at the right file.

@varx @foone

@johntimaeus @jenzi I've not been able to get evidence of ChatGPT plagiarizing my blog, although I know OpenAI has been scraping it. But I've also established that I'm not great at prompting.

@varx @jenzi

Funny thing is, the blog for my farm isn't spread that wide. The Linux book was a one-off mainly written to ensure I had a good book to teach from and didn't have my job outsourced to RedHat instructors. I don't think it sold 5k copies.

These aren't top of the list reference materials. They're good, but they aren't widely linked or known; so I wonder how (or if) sources get weighted.

@johntimaeus Yeah, the weighting thing is... a good question. It sure seems like Google trained theirs pretty heavily on reddit, for instance.

Here's something weird I found when poking at ChatGPT, getting it to reproduce something from a git repo. (It actually did a live fetch of a page on my site instead.) But the formatting was all horked up:

Ā« For more detailed documentation, you can check out their (GitHub
)tps://gi​(Brain on Fire
)r the protocol description. Ā»

Wonder what's up there.

@varx @johntimaeus Some of the examples sound like search finding results and supplementing an AI response. Gemini links to the source when it does that, and so do the search summaries and the paid version of ChatGPT that browses the web. This seems to be a different issue - there, you don't want your site to show up in the results at all.

Might be fun to implement the more advanced version of Dissociated Press that operates on words instead of characters.

Really tempted to make one that operates on syntax trees (operating on HTML, producing valid-ish HTML) but that is *definitely* too deep a rabbit hole for this week.

@varx @foone Might be a good WordPress plugin that would bring AI poisoning to the masses. On the flip side, that’s something for someone else to do after taking inspiration from your efforts!
@varx @foone I am continually surprised by how many of the problems of Markov text generators are duplicated by these "AI" systems.

@varx @foone You might be interested in a small project I hacked together recently, ā€œPoison the WeLLMsā€.

https://codeberg.org/MikeCoats/poison-the-wellms

I’ve stuck some examples of its output in my blog here,

https://mikecoats.com/poison-the-wellms/

poison-the-wellms

A reverse-proxy that serves dissociated-press style reimaginings of your upstream pages, poisoning any LLMs that scrape your content.

Codeberg.org
@mike Nice! I like the idea of stripping out the more complex HTML. My poisoned posts tend to end up with a lot of unclosed tags. :-)

I'm currently serving enticing garbage to AI scrapers whose user-agent matches this regex:

GPT|Claude|anthropic|\bcohere\b|\bmeta\b|Google-Extended

Are there others I should include? I based this on a small sample of logs, since I don't have access logs turned on as a baseline.
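(That user-agent regex can be sanity-checked against a few known crawler strings — a quick Python sketch; the sample user-agent strings below are abbreviated, not verbatim:)

```python
import re

# The regex from the post above, verbatim
UA_RE = re.compile(r"GPT|Claude|anthropic|\bcohere\b|\bmeta\b|Google-Extended")

samples = [
    "Mozilla/5.0 (compatible; GPTBot/1.1)",
    "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "meta-externalagent/1.1",
    "CCBot/2.0",
]
for ua in samples:
    print(ua, "->", "poison" if UA_RE.search(ua) else "pass through")
```

Note that CCBot (Common Crawl) slips through this particular regex, which matters for the discussion further down the thread.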

Declare your AIndependence: block AI bots, scrapers and crawlers with a single click

To help preserve a safe Internet for content creators, we’ve just launched a brand new ā€œeasy buttonā€ to block all AI bots. It’s available for all customers, including those on our free tier.

The Cloudflare Blog

I've been a little stumped on what to do when CCBot comes to scrape my website.

CommonCrawl archives the web, and then people can use those archives for research, or building search engines... or training LLMs.

Some of those are OK with me. Others aren't.

So... do I serve poison to CCBot? Or block it? Or do nothing?

I asked Common Crawl for advice—are there ways of indicating acceptable use of my scraped website?

Answer: No, not yet, maybe at some point; there are standards being hammered out. But at least they preserve the robots.txt and response headers and such, so a well-behaved consumer of their archive could choose to respect what I've indicated.

This doesn't help with future scrapers, of course.

I think I'll feed Common Crawl garbage for now. Markov output is still mostly search-engine friendly...