List of AI bots to add to robots.txt (though they may not obey it -- you may need to throw them in the bit bucket and 404 or 444 them instead). In addition to these, you may have to block specific user-agent strings spoofing random browser versions for the most aggressive bots that ignore robots.txt.
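For illustration, here's what a few entries from that list look like in robots.txt form (GPTBot, CCBot, and ClaudeBot are real crawler agent names; the full list is far longer):

```text
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```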

https://github.com/ai-robots-txt/ai.robots.txt/blob/main/robots.txt

#AI #scrapers #LLMs
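For the crawlers that ignore robots.txt, the 444 approach can be done in nginx by matching the User-Agent header. A minimal sketch -- the map/return syntax is real nginx, but the hostname and bot names here are just examples:

```text
# Goes in the http {} context: flag requests whose
# User-Agent matches a known AI crawler (case-insensitive).
map $http_user_agent $ai_bot {
    default     0;
    ~*GPTBot    1;
    ~*CCBot     1;
    ~*ClaudeBot 1;
}

server {
    listen 80;
    server_name example.com;  # placeholder hostname

    # 444 is nginx-specific: close the connection
    # without sending any response at all.
    if ($ai_bot) {
        return 444;
    }
}
```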

ai.robots.txt/robots.txt at main · ai-robots-txt/ai.robots.txt

A list of AI agents and robots to block.

@ai6yr whacking Scrapy seems a bit heavy-handed -- that's a whole-ass scraping and crawling framework, not a specific bot.

https://scrapy.org

Scrapy | A Fast and Powerful Scraping and Web Crawling Framework

@draeath Well, I mean, if you don't want any scraping of your instance/website...

@ai6yr well, not all use is scraping, despite the name. Think of it like blocking webkit (if you had a means to know about it?)

If all you need/care about is users that come in via traditional browsers though, it's probably a fair move.

(Also, Scrapy users can change the user agent, I believe, as could any of these abusive LLM scrapers. Most ignore robots.txt anyway, did you know? Meaning putting this into your robots.txt may be doing less than you hope.)
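On the "can change the user agent" point: in a Scrapy project, both the crawler's identity and its robots.txt compliance are plain settings. A sketch -- `USER_AGENT` and `ROBOTSTXT_OBEY` are real Scrapy settings, though the values here are illustrative:

```python
# settings.py in a Scrapy project (illustrative values)

# Masquerade as an ordinary browser instead of the default "Scrapy/x.y" agent.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"

# The template generated by `scrapy startproject` turns robots.txt
# compliance on; flipping this one line makes the crawler ignore it.
ROBOTSTXT_OBEY = False
```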

@draeath Correct, robots.txt only blocks the companies that honor it. But since my users only come in via web browsers and Mastodon clients, this reduces the load on the web server significantly.
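The "only blocks the companies that honor it" part is visible right in the standard library: robots.txt enforcement happens entirely in the client, which merely asks whether it is allowed. A sketch using Python's urllib.robotparser:

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that disallows one AI crawler by name.
rules = """\
User-agent: GPTBot
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A well-behaved GPTBot checks and stays away...
print(rp.can_fetch("GPTBot", "https://example.com/post/1"))        # False

# ...but the check is voluntary: any other (or spoofed) agent passes,
# and a rude client can simply never call can_fetch() at all.
print(rp.can_fetch("SomeOtherBot", "https://example.com/post/1"))  # True
```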

@ai6yr

since robots.txt is widely ignored, new approaches are needed:
https://anubis.techaro.lol/

Anubis: Web AI Firewall Utility | Anubis

Weigh the soul of incoming HTTP requests to protect your website!
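Anubis's "weighing" is a proof-of-work challenge: the client burns a little CPU before the page is served, which is cheap for one human reader but expensive for a crawler making millions of requests. A generic sketch of the idea (not Anubis's actual protocol), assuming a SHA-256 leading-zeros difficulty check:

```python
import hashlib

def verify_pow(challenge: str, nonce: int, difficulty: int = 4) -> bool:
    # Accept the nonce only if SHA-256(challenge + nonce) starts with
    # `difficulty` zero hex digits.
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

def solve_pow(challenge: str, difficulty: int = 4) -> int:
    # Brute-force the smallest passing nonce -- the work the client's
    # browser does before being let in.
    nonce = 0
    while not verify_pow(challenge, nonce, difficulty):
        nonce += 1
    return nonce

# The server verifies in one hash what cost the client many tries:
nonce = solve_pow("request-id-123", difficulty=3)
print(verify_pow("request-id-123", nonce, difficulty=3))  # True
```

The asymmetry is the point: verification is a single hash, while solving scales exponentially with the difficulty the site operator chooses.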