Tried to use /robots.txt to tell bots to stay out. The bots' response: "Your rules are adorable. Now, where's the content? Hmm, what's this 'disallow' thing? Looks like a suggestion for a really fast crawl!" *om nom om nom*

PS: Per Stack Overflow's robots.txt file, Google, Yahoo, DuckDuckGo, Bing, and LLM/AI bots (in fact, everyone) should not index SO content, yet they all ignore it. Moral of the story for developers: robots.txt is useless these days. Crawlers don't follow the rules.

@nixCraft
Fascists don't follow rules either.
I'm wondering whether there's a connection…?
@nixCraft if you have a crawler honey trap, including it in robots.txt would ensure only ill-behaved crawlers would get trapped.
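A sketch of what that might look like (the trap path is made up): well-behaved crawlers honor the Disallow line and never touch the path, so any request that hits it is, by definition, a bot that ignored robots.txt.

```
User-agent: *
Disallow: /trap/

# Well-behaved crawlers never fetch /trap/.
# Any IP that requests it ignored this file; log it and ban it.
```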
@nixCraft If this worked, Stack Overflow would be dead by tomorrow. I wouldn't use it because it wouldn't appear in the search results.
@nixCraft I wonder when they started doing this. It looks like a recent thing. Search engines of course take some time to de-index pages, but AI bots don't care anyway. https://web.archive.org/web/20250228102832/https://stackoverflow.com/robots.txt

@nixCraft @Matti_Vuori It has always been useless. So, nothing has changed.

@nixCraft “The solution to surveillance is pollution.” It’s the uniqueness of our likeness, attributes, behavior, style, or content that is used against us.

For AI bots, feed them wrong/garbage content and tell them it’s good. Waste their time and resources. Slow them down. Send them on a wild goose chase.

@nixCraft robots.txt is like asking a bully not to bully you 🙃
@nixCraft I had to use Anubis and IPFire to block LLM scrapers. Works for the moment
@nixCraft actually, the lesser-known Sitemap directive encourages crawling, a relic from when crawlers were a good thing
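For reference, the Sitemap directive is a single line in robots.txt pointing crawlers at a full list of your pages (example.com is a placeholder):

```
User-agent: *
Disallow:

# Invite crawlers in and hand them a map of everything worth indexing.
Sitemap: https://example.com/sitemap.xml
```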
@nixCraft Man, if only we could have a far more accurate version of the CAPTCHA that actually STOPS these bots.
@nixCraft put a trap in place, like disallowing /list_of_politicians_that_received_bribes_from_pharma_corps_for_covid_vaccines.docx, and apply fail2ban to every IP address that accesses it.
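A minimal fail2ban sketch along those lines, assuming nginx's default access-log format (the filter name and ban times are made up; adjust to taste):

```
# /etc/fail2ban/filter.d/robots-trap.conf
[Definition]
failregex = ^<HOST> .* "GET /list_of_politicians_.* HTTP

# /etc/fail2ban/jail.d/robots-trap.conf
[robots-trap]
enabled  = true
port     = http,https
filter   = robots-trap
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 86400
```

With maxretry = 1, a single request for the trap file is enough to get the IP banned for a day.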

@nixCraft

Obeying your rules is optional? Well, guess what?

@nixCraft can we just start adding ridiculous terms of service to robots.txt so if a bot scrapes my site we can go to court over how their bot agreed to terms and they owe me $10 million?
@nixCraft that is how SO’s robots.txt looks today. It was different a few months ago, and different again in 2024 and before: https://web.archive.org/web/20250331163653/https://stackoverflow.com/robots.txt I didn’t do any research, but if the problem is scraping question-and-answer content, I wouldn’t be too surprised if it was possible via some URL they forgot to add there.

@nixCraft major search engines have other ways to get content. One of them is to have sites "push" changes to them instead of crawling pages. This is how Wikipedia has had its content indexed by Google for several years. Bing and Yandex support it too. https://searchengineland.com/indexnow-new-initiative-by-microsoft-and-yandex-to-push-content-to-search-engines-375247

Google currently does not seem to be participating in this initiative.

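The IndexNow push mentioned above is just an HTTP GET against a public endpoint. A minimal sketch, assuming the api.indexnow.org aggregator endpoint (the key and page URL are placeholders; you generate your own key and host it at the root of your site):

```shell
#!/bin/sh
# Build an IndexNow ping URL announcing a changed page.
# KEY is a placeholder: generate your own and serve it at https://example.com/<KEY>.txt
KEY="your-indexnow-key"
PAGE="https://example.com/new-post"
PING="https://api.indexnow.org/indexnow?url=${PAGE}&key=${KEY}"
echo "$PING"
# To actually submit the change notification:
#   curl -s "$PING"
```

Participating engines (Bing, Yandex, and others) share submissions with each other, so one ping is enough.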