Tried to use /robots.txt to tell bots to stay out. The bots' response: "Your rules are adorable. Now, where's the content? Hmm, what's this 'disallow' thing? Looks like a suggestion for a really fast crawl!" *om nom om nom*

PS: Per Stack Overflow's robots.txt file, Google, Yahoo, DuckDuckGo, Bing, and LLM/AI bots (in fact, everyone) should not index SO content, yet they all ignore it. Moral of the story for developers: robots.txt is useless these days. Crawlers don't follow the rules.

@nixCraft
Fascists don't follow rules either.
I'm wondering whether there's a connection…?
@nixCraft if you have a crawler honey trap, including it in robots.txt would ensure only ill-behaved crawlers would get trapped.
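A sketch of what that might look like (the trap path is made up): well-behaved crawlers honor the Disallow line and never touch the path, so any request that hits it is, by definition, a bot that ignored robots.txt.

```
User-agent: *
Disallow: /trap/

# Well-behaved crawlers never fetch /trap/.
# Any IP that requests it ignored this file; log it and ban it.
```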
@nixCraft If this worked, Stack Overflow would be dead by tomorrow. I wouldn't use it because it wouldn't appear in the search results.
@nixCraft I wonder when they started doing this. It looks like a recent thing. Search engines of course take some time to de-index pages, but AI bots don't care anyway. https://web.archive.org/web/20250228102832/https://stackoverflow.com/robots.txt

@nixCraft @Matti_Vuori It has always been useless. So, nothing has changed.

@nixCraft “The solution to surveillance is pollution.” It’s the uniqueness of our likeness, attributes, behavior, style, or content that is used against us.

For AI bots, feed them wrong/garbage content and tell them it’s good. Waste their time and resources. Slow them down. Send them on a wild goose chase.

@nixCraft robots.txt is like asking a bully not to bully you 🙃
@nixCraft I had to use Anubis and IPFire to block LLM scrapers. Works for the moment
@nixCraft actually, the lesser-known Sitemap directive encourages crawling, a relic from when crawlers were a good thing
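For reference, the Sitemap directive is a single line in robots.txt pointing crawlers at a full list of your pages (example.com is a placeholder):

```
User-agent: *
Disallow:

# Invite crawlers in and hand them a map of everything worth indexing.
Sitemap: https://example.com/sitemap.xml
```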
@nixCraft Man, if only we could have a far more accurate version of the CAPTCHA that actually STOPS these bots.
@nixCraft put a trap in place, like disallowing /list_of_politicians_that_received_bribes_from_pharma_corps_for_covid_vaccines.docx, and apply fail2ban to every IP address that accesses it.
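A minimal fail2ban sketch along those lines, assuming nginx's default access-log format (the filter name and ban times are made up; adjust to taste):

```
# /etc/fail2ban/filter.d/robots-trap.conf
[Definition]
failregex = ^<HOST> .* "GET /list_of_politicians_.* HTTP

# /etc/fail2ban/jail.d/robots-trap.conf
[robots-trap]
enabled  = true
port     = http,https
filter   = robots-trap
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 86400
```

With maxretry = 1, a single request for the trap file is enough to get the IP banned for a day.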

@nixCraft

Obeying your rules is optional? Well, guess what?

@nixCraft can we just start adding ridiculous terms of service to robots.txt so if a bot scrapes my site we can go to court over how their bot agreed to terms and they owe me $10 million?
@nixCraft that is how SO’s robots.txt looks today. It was different a few months ago, and different again in 2024 and before: https://web.archive.org/web/20250331163653/https://stackoverflow.com/robots.txt I didn’t do any research, but if the problem is scraping question-and-answer content, I wouldn’t be too surprised if it was possible via some URL they forgot to add there.

@nixCraft major search engines have other ways to get content. One of them is to have sites "push" changes to them instead of crawling pages. This is how Wikipedia has had its content indexed by Google for several years. Bing and Yandex support it too. https://searchengineland.com/indexnow-new-initiative-by-microsoft-and-yandex-to-push-content-to-search-engines-375247

Google currently does not seem to be participating in this initiative.

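The IndexNow push mentioned above is just an HTTP GET against a public endpoint. A minimal sketch, assuming the api.indexnow.org aggregator endpoint (the key and page URL are placeholders; you generate your own key and host it at the root of your site):

```shell
#!/bin/sh
# Build an IndexNow ping URL announcing a changed page.
# KEY is a placeholder: generate your own and serve it at https://example.com/<KEY>.txt
KEY="your-indexnow-key"
PAGE="https://example.com/new-post"
PING="https://api.indexnow.org/indexnow?url=${PAGE}&key=${KEY}"
echo "$PING"
# To actually submit the change notification:
#   curl -s "$PING"
```

Participating engines (Bing, Yandex, and others) share submissions with each other, so one ping is enough.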