If you want to ensure your content does not get indexed by big tech LLMs, just keep it in your robots.txt file.
@codepo8 That's indeed clearly the only file they aren't interested in.....
@codepo8
I doubt that it will work...
@codepo8 I wonder what would happen if one were to embed into the robots.txt file “X5O!P%@ap[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*”
@auroran @codepo8 @ap only one way to find out
Poisoning Well

An experimental strategy for contaminating Large Language Models

HeydonWorks
@words_number @codepo8 Chris is making a joke 😄 He recognises robots.txt is ignored!
@heydon @codepo8 I know, and a good one! I just wanted to lead the audience of that joke straight to your blog article ;)
@words_number @codepo8 @heydon Here's another possible solution: a simple proof-of-work blocker: https://xeiaso.net/blog/2025/anubis/
Block AI scrapers with Anubis

I got tired with all the AI scrapers that were bullying my git server, so I made a tool to stop them for good.
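Anubis-style blockers gate each request behind a small proof-of-work challenge: a real browser solves it once in a second or two, while a scraper hammering thousands of URLs pays the cost over and over. This is not Anubis's actual code, just a minimal Python sketch of the generic hashcash-style idea: find a nonce so that SHA-256(challenge + nonce) falls below a difficulty target.

```python
import hashlib
import itertools

def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: brute-force a nonce so the hash has the required leading zero bits."""
    target = 1 << (256 - difficulty_bits)  # hashes below this value qualify
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: checking costs one hash, no matter how hard the client worked."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))

nonce = solve("example-session-token", 12)  # ~4096 hashes on average
assert verify("example-session-token", nonce, 12)
```

The asymmetry is the whole trick: solving is expensive and scales with difficulty, verifying is one hash, so the server can raise the difficulty for suspicious traffic without hurting itself.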

my current method involves generating content that LLMs aren't interested in
@codepo8 this is a good one for this day!

@codepo8 in fact the right mechanism to refuse data mining is through TDM:

https://www.w3.org/community/reports/tdmrep/CG-FINAL-tdmrep-20240510/

But just like Meta and ByteDance officially refused to follow robots.txt, they also refused to follow TDM. Because both solutions rely on good faith from the scrapers, nothing can in fact stop them, and we are screwed

TDM Reservation Protocol (TDMRep)

This specification defines a simple and practical Web protocol, capable of expressing the reservation of rights relative to text & data mining (TDM) applied to lawfully accessible Web content, and to ease the discovery of TDM licensing policies associated with such content.
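For a sense of what honouring TDMRep would even take: per my reading of the spec, one of its mechanisms is a `tdm-reservation` HTTP response header, where the value `1` means mining rights are reserved (optionally with a `tdm-policy` header linking the licensing policy). A hedged sketch of the check a good-faith scraper could run; the parsing is my own simplification:

```python
def tdm_reserved(headers: dict[str, str]) -> bool:
    """True if the response headers reserve text & data mining rights.

    Per TDMRep, `tdm-reservation: 1` means rights are reserved and mining
    requires a licence, possibly linked via a `tdm-policy` header.
    Simplification: real HTTP header names are case-insensitive.
    """
    return headers.get("tdm-reservation", "0").strip() == "1"

# A well-behaved scraper would check this before mining a fetched page:
assert tdm_reserved({"tdm-reservation": "1"})
assert not tdm_reserved({})  # no header: no reservation expressed this way
```

The spec also defines other carriers for the same signal (a well-known JSON file and HTML metadata), so this header check alone wouldn't be full compliance. And as the post says, the scrapers that matter simply skip the check.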

@codepo8 They never request the robots.txt file
OpenAI, Anthropic ignore rule that prevents bots scraping web content

OpenAI and Anthropic have said publicly they respect robots.txt. But they are among the biggest tech companies ignoring the rule, BI has learned.

Business Insider
@codepo8 and this will be ignored pretty well. 😂
@codepo8 I hope that humans find my robots.txt more often than robots do. 😄
@codepo8 I have an even better idea.
keep it to yourself and don't host it on the internet!
just don't.
if you don't want it indexed, then simply get it off the internet and boom, no one else can index it

@adisonverlice AI scrapers are drastically increasing the costs of hosting things on the Internet by generating astronomical bills for site administrators. This antisocial behavior has driven many admins into the arms of Cloudflare, a solution that ruins websites for people on older devices, more limited devices, or who need assistive technology (such as screen readers). If AI scrapers don’t stop, it soon won’t matter if people have their stuff on the Internet or not because other people won’t be able to get to it. Those people who can’t afford either the attacks from AI scrapers or the “solutions” to fight them won’t be able to share their stuff on the Internet anyway.

@codepo8

@EveHasWords that is interesting.
a lot of the traffic on my site, i'd say at least 40%, is from scrapers.
all of my stuff is public knowledge to even robots, and I've never had a problem with screen reading tech on my site using Cloudflare R2.
I am a blind person myself, so I know how to make things accessible.
I don't have any protections like this enabled, and I will probably never have them enabled unless I absolutely have to.
as for expenses, Cloudflare isn't too bad.
maybe around $10/month.
sometimes lower.
I know R2 isn't a preferred hosting option, but it's certainly a good one.
the only exception where bots can't scrape is my protected instances that only allow certain IP addresses and certain accounts.
i'm not hosting any account services publicly either, so I think I have nothing to worry about. I'm just hosting things like we used to: no accounts, just some PDFs and some info about things. nothing major though.
i'm happy with deciding to let bots crawl.
also, there is a backup domain that keeps encrypted instances of backups.
now for that one, that is where protections are most proactively enabled.
I have it to where first, it challenges your computer using Cloudflare Turnstile. it's a simple browser check that often takes around 20-30 seconds to complete.
so in this context, absolutely!
but for things that are public, they can scrape until the cows come home

@codepo8 then how about we test that theory?
i'm gonna put up (for test purposes) a robots.txt file and i'll DM you the results. and if it's wrong, i'm gonna call you out on it. funny story bro, but that's not how robots.txt works.

and i'm gonna do an experiment to show this is true since you clearly think it is.
but read this
"The instructions in robots.txt files cannot enforce crawler behavior to your site; it's up to the crawler to obey them. While Googlebot and other respectable web crawlers obey the instructions in a robots.txt file, other crawlers might not."

https://developers.google.com/search/docs/crawling-indexing/robots/intro#:~:text=The%20instructions%20in%20robots.txt,file%2C%20other%20crawlers%20might%20not.

Robots.txt Introduction and Guide | Google Search Central  |  Documentation  |  Google for Developers

Robots.txt is used to manage crawler traffic. Explore this robots.txt introduction guide to learn what robots.txt files are and how to use them.

Google for Developers
which btw, Google Search might obey, but Gemini certainly doesn't
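Python's standard library makes that "it's up to the crawler to obey" point concrete: `urllib.robotparser` only answers the question "may I fetch this URL?", and nothing forces a client to ask in the first place. A quick sketch, with made-up rules for illustration:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Parse an in-memory robots.txt instead of fetching one.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A polite crawler asks before fetching...
print(rp.can_fetch("GPTBot", "https://example.com/private/data.html"))  # False
print(rp.can_fetch("GPTBot", "https://example.com/public/page.html"))   # True
# ...an impolite one just issues the GET anyway; nothing enforces the answer.
```

That's the whole mechanism: a text file plus voluntary cooperation, which is exactly why the experiment above is worth running.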
@adisonverlice I think you missed the joke. the point was that AI crawlers never look at robots.txt because scrapers don't care, so it'd be the perfect place to hide things. (Sarcastically, though; in reality they probably scrape that too)
@rootfake oh, I see.
I mean, if you really wanted to, you could hide the contents of the file by requiring auth.
also i only know if it's a joke if there is a content warning in front of the joke.
either way, i think i'm gonna conduct the experiment anyway, and it would be fun to see it in action.
@adisonverlice definitely sounds like a fun one. Might be worth throwing a directory in there that's not listed anywhere else, see if they're scraping robots.txt for targeting data. And yeah, I figured that was likely the case, I have people in my life who have a hard time with jokes (including me, sometimes), and it seemed like that. (Cause taken literally, that joke would be an absolutely absurd statement)
@rootfake hmmm. I plan to start it either tomorrow or this afternoon.
@codepo8 Now there's a zinger.
@codepo8 there are some forbidden words for LLMs, but I forgot what they are.
The idea is to sprinkle your document with them.
@gunstick Swearing works. Fucking A
@codepo8 I'm sorry but I have to take a screenshot of this toot and use it in my, yet to be written, blog post about LLMs and how badly they f-up things.

@codepo8

I am very concerned!
@nlnet -
an EU-funded institution shared this joke without explaining it!!!
cc @YlvaJohansson @EUCommission