I keep seeing webmasters talking about how to block AI scrapers (through user agents and IP blocks) and not enough webmasters talking about the far better option of rigging their site to return complete gibberish or transgender werewolf erotica* when AI scrapers are detected.

*depending on which one you think is funnier to poison the AI models with

@foone Maybe push My Immortal + Eye of Argon through the dissociated-press algo (50% weight on each).

I wonder if I can make this static site compatible. Perhaps for each post I pre-generate a slop version that sits next to the real post, and I use an .htaccess file to pull from the slop file instead of the real file when the useragent matches?
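A minimal sketch of that .htaccess idea, assuming Apache with mod_rewrite enabled; the bot names and the `.slop.html` naming convention next to each real file are my assumptions, not a tested ruleset:

```apache
# Hypothetical sketch: when the user agent looks like a known AI crawler
# and a pre-generated slop file exists beside the real page, serve the
# slop instead. Bot list and ".slop.html" suffix are assumptions.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|ClaudeBot|Google-Extended) [NC]
RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI}.slop.html -f
RewriteRule ^(.*)$ $1.slop.html [L]
```

The second RewriteCond only rewrites when the slop file actually exists, so pages without a pre-generated twin fall through to the real content.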

@foone I decided to just serve AI scrapers a Markov-mangled version of my own blog posts.

I love the idea of poisoning them with specific topics, but honestly, the output of the Dissociated Press algorithm is probably the most effective possible poison I can make!
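For anyone unfamiliar, Dissociated Press is just a word-level Markov chain: record which word follows each short prefix in the source text, then take a random walk through that table. A minimal sketch in Python (function name and parameters are mine, not the script from the linked post):

```python
import random
from collections import defaultdict

def dissociate(text, order=2, length=60, seed=None):
    """Dissociated Press sketch: build a table mapping each
    `order`-word prefix to the words that follow it in `text`,
    then randomly walk the table to emit mangled output."""
    rng = random.Random(seed)
    words = text.split()
    table = defaultdict(list)
    for i in range(len(words) - order):
        prefix = tuple(words[i:i + order])
        table[prefix].append(words[i + order])
    # Start from a random prefix and keep appending followers
    # until we hit the length cap or a dead end.
    prefix = rng.choice(list(table))
    out = list(prefix)
    while len(out) < length:
        followers = table.get(tuple(out[-order:]))
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)
```

The output reads locally plausible (every two-word window occurred in the original) but globally incoherent, which is exactly what makes it decent poison.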

Technical deets: https://www.brainonfire.net/blog/2024/09/19/poisoning-ai-scrapers/


@varx @foone

Will definitely be using this since I've seen mangled versions of some of my blog posts appearing in LLM results lately.

@johntimaeus @varx @foone Any examples?

@jenzi

In the last few months:
- I was looking for recommended soil planting temps for a garlic variety. Gemini came up with a manglement of my Garlic is Weird blog post from last year.
- During a discussion with a friend about farm transparency reports, he put a prompt into one of the paid GPTs asking for an example. It said you should always abbreviate and gave examples: North Little Rock (NLR) and Jacksonville (JV) -- the designations I used for our properties in our transparency report.
- I was arguing with grep and thought my invocation was right, but it wasn't working. Did a google search that returned unrelated Crapoverflow and man pages. Tried a different query. I wouldn't have noticed the AI slop at the top, except it was using a very specific example from Linux for Government (ISBN in my bio). It didn't just pull "read, red, road, rad, reed..." from nothing. It was the example file we used for grep.

Sidenote: grep works better if you point it at the right file.

@varx @foone

@johntimaeus @jenzi I've not been able to get evidence of ChatGPT plagiarizing my blog, although I know OpenAI has been scraping it. But I've also established that I'm not great at prompting.

@varx @jenzi

Funny thing is, the blog for my farm isn't spread that wide. The Linux book was a one-off, mainly written to ensure I had a good book to teach from and didn't have my job outsourced to Red Hat instructors. I don't think it sold 5k copies.

These aren't top-of-the-list reference materials. They're good, but they aren't widely linked or known, so I wonder how (or if) sources get weighted.

@johntimaeus Yeah, the weighting thing is... a good question. It sure seems like Google trained theirs pretty heavily on reddit, for instance.

Here's something weird I found when poking at ChatGPT, getting it to reproduce something from a git repo. (It actually did a live fetch of a page on my site instead.) But the formatting was all horked up:

« For more detailed documentation, you can check out their (GitHub
)tps://gi​(Brain on Fire
)r the protocol description. »

Wonder what's up there.

@varx @johntimaeus Some of these examples seem to be search finding results and feeding them into an AI response. Gemini links to its sources when it does that, and so do Google's search summaries and the paid version of ChatGPT that browses the web. That seems like a different issue -- there, you don't want your site to show up in search results at all.