If you want to ensure your content does not get indexed by big tech LLMs, just keep it in your robots.txt file.
@codepo8 That's indeed clearly the only file they aren't interested in.....
@codepo8
I doubt that it will work...
@codepo8 I wonder what would happen if one were to embed into the robots.txt file “X5O!P%@ap[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*”
@auroran @codepo8 @ap only one way to find out
Poisoning Well

An experimental strategy for contaminating Large Language Models

HeydonWorks
@words_number @codepo8 Chris is making a joke 😄 He recognises robots.txt is ignored!
@heydon @codepo8 I know, and a good one! I just wanted to lead the audience of that joke straight to your blog article ;)
@words_number @codepo8 @heydon Here's another possible solution: a simple proof-of-work blocker: https://xeiaso.net/blog/2025/anubis/
Block AI scrapers with Anubis

I got tired with all the AI scrapers that were bullying my git server, so I made a tool to stop them for good.
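Anubis-style blockers gate each request behind a small proof-of-work challenge: a real browser solves it once in a second or two, while a scraper hammering thousands of URLs pays the cost over and over. This is not Anubis's actual code, just a minimal Python sketch of the generic hashcash-style idea: find a nonce so that SHA-256(challenge + nonce) falls below a difficulty target.

```python
import hashlib
import itertools

def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: brute-force a nonce so the hash has the required leading zero bits."""
    target = 1 << (256 - difficulty_bits)  # hashes below this value qualify
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: checking costs one hash, no matter how hard the client worked."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))

nonce = solve("example-session-token", 12)  # ~4096 hashes on average
assert verify("example-session-token", nonce, 12)
```

The asymmetry is the whole trick: solving is expensive and scales with difficulty, verifying is one hash, so the server can raise the difficulty for suspicious traffic without hurting itself.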

my current method involves generating content that LLMs aren't interested in
@codepo8 this is a good one for this day!

@codepo8 in fact the right mechanism to refuse data mining is through TDM:

https://www.w3.org/community/reports/tdmrep/CG-FINAL-tdmrep-20240510/

But just like Meta and ByteDance officially refused to follow robots.txt, they also refused to follow TDM. Because both solutions rely on good faith from the scrapers, nothing can in fact stop them, and we are screwed

TDM Reservation Protocol (TDMRep)

This specification defines a simple and practical Web protocol, capable of expressing the reservation of rights relative to text & data mining (TDM) applied to lawfully accessible Web content, and to ease the discovery of TDM licensing policies associated with such content.
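For a sense of what honouring TDMRep would even take: per my reading of the spec, one of its mechanisms is a `tdm-reservation` HTTP response header, where the value `1` means mining rights are reserved (optionally with a `tdm-policy` header linking the licensing policy). A hedged sketch of the check a good-faith scraper could run; the parsing is my own simplification:

```python
def tdm_reserved(headers: dict[str, str]) -> bool:
    """True if the response headers reserve text & data mining rights.

    Per TDMRep, `tdm-reservation: 1` means rights are reserved and mining
    requires a licence, possibly linked via a `tdm-policy` header.
    Simplification: real HTTP header names are case-insensitive.
    """
    return headers.get("tdm-reservation", "0").strip() == "1"

# A well-behaved scraper would check this before mining a fetched page:
assert tdm_reserved({"tdm-reservation": "1"})
assert not tdm_reserved({})  # no header: no reservation expressed this way
```

The spec also defines other carriers for the same signal (a well-known JSON file and HTML metadata), so this header check alone wouldn't be full compliance. And as the post says, the scrapers that matter simply skip the check.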

@codepo8 They never request the robots.txt file
OpenAI, Anthropic ignore rule that prevents bots scraping web content

OpenAI and Anthropic have said publicly they respect robots.txt. But they are among the biggest tech companies ignoring the rule, BI has learned.

Business Insider
@codepo8 and this will be ignored pretty well. 😂
@codepo8 I hope that humans find my robots.txt more often than robots do. 😄
@codepo8 I have an even better idea.
keep it to yourself and don't host it on the internet!
just don't.
if you don't want it indexed, then simply get it off the internet and boom, no one else can index it

@adisonverlice AI scrapers are drastically increasing the costs of hosting things on the Internet by generating astronomical bills for site administrators. This antisocial behavior has driven many admins into the arms of Cloudflare, a solution that ruins websites for people on older devices, more limited devices, or who need assistive technology (such as screen readers). If AI scrapers don’t stop, it soon won’t matter if people have their stuff on the Internet or not because other people won’t be able to get to it. Those people who can’t afford either the attacks from AI scrapers or the “solutions” to fight them won’t be able to share their stuff on the Internet anyway.

@codepo8

@EveHasWords that is interesting.
a lot of the traffic on my site, i'd say at least 40%, is from scrapers.
all of my stuff is public knowledge to even robots, and I've never had a problem with screen reading tech on my site using Cloudflare R2.
I am a blind person myself, so I know how to make things accessible.
I don't have any protections like this enabled, and I will probably never have them enabled unless I absolutely have to.
as for expenses, Cloudflare isn't too bad.
maybe around $10/month.
sometimes lower.
I know R2 isn't a preferred hosting option, but it's certainly a good one.
the only exception where bots can't scrape is my protected instances that only allow certain IP addresses and certain accounts.
i'm not hosting any account services publicly either, so I think I have nothing to worry about. I'm just hosting things like we used to: no accounts, just some PDFs and some info about things. nothing major though.
i'm happy with deciding to let bots crawl.
also, there is a backup domain that keeps encrypted instances of backups.
now for that one, that is where protections are most proactively enabled.
I have it to where first, it challenges your computer using Cloudflare Turnstile. it's a simple browser check that often takes around 20-30 seconds to complete.
so in this context, absolutely!
but for things that are public, they can scrape until the cows come home

@codepo8 then how about we test that theory?
i'm gonna put up (for test purposes) a robots.txt file and i'll DM you the results. and if it's wrong, i'm gonna call you out on it. funny story bro, but that's not how robots.txt works.

and i'm gonna do an experiment to show this is true since you clearly think it is.
but read this
"The instructions in robots.txt files cannot enforce crawler behavior to your site; it's up to the crawler to obey them. While Googlebot and other respectable web crawlers obey the instructions in a robots.txt file, other crawlers might not."

https://developers.google.com/search/docs/crawling-indexing/robots/intro#:~:text=The%20instructions%20in%20robots.txt,file%2C%20other%20crawlers%20might%20not.

Robots.txt Introduction and Guide | Google Search Central  |  Documentation  |  Google for Developers

Robots.txt is used to manage crawler traffic. Explore this robots.txt introduction guide to learn what robots.txt files are and how to use them.

Google for Developers
which btw, Google Search might obey, but Gemini certainly doesn't
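Python's standard library makes that "it's up to the crawler to obey" point concrete: `urllib.robotparser` only answers the question "may I fetch this URL?", and nothing forces a client to ask in the first place. A quick sketch, with made-up rules for illustration:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Parse an in-memory robots.txt instead of fetching one.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A polite crawler asks before fetching...
print(rp.can_fetch("GPTBot", "https://example.com/private/data.html"))  # False
print(rp.can_fetch("GPTBot", "https://example.com/public/page.html"))   # True
# ...an impolite one just issues the GET anyway; nothing enforces the answer.
```

That's the whole mechanism: a text file plus voluntary cooperation, which is exactly why the experiment above is worth running.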
@adisonverlice I think you missed the joke. the point was that AI crawlers never look at robots.txt because scrapers don't care, so it'd be the perfect place to hide things. (Sarcastically, though; in reality they probably scrape that too)
@rootfake oh, I see.
I mean, if you really wanted to, you could hide the contents of the file by requiring auth.
also i only know if it's a joke if there is a content warning in front of the joke.
either way, i think i'm gonna conduct the experiment anyway, and it would be fun to see it in action.
@adisonverlice definitely sounds like a fun one. Might be worth throwing a directory in there that's not listed anywhere else, see if they're scraping robots.txt for targeting data. And yeah, I figured that was likely the case, I have people in my life who have a hard time with jokes (including me, sometimes), and it seemed like that. (Cause taken literally, that joke would be an absolutely absurd statement)
@rootfake hmmm. I plan to start it either tomorrow or this afternoon.
@codepo8 Now there's a zinger.
@codepo8 there are some forbidden words for LLMs, but I forgot what they are.
The idea is to sprinkle your document with them.
@gunstick Swearing works. Fucking A
@codepo8 I'm sorry but I have to take a screenshot of this toot and use it in my, yet to be written, blog post about LLMs and how badly they f-up things.

@codepo8

I am very concerned!
@nlnet -
an EU-funded institution shared this joke without explaining it!!!
cc @YlvaJohansson @EUCommission