Ugh, I might need to put part of my website behind something like Anubis after all 
@dressupgeekout i think we all will, eventually. Is it AI scraper bots? Far from a permanent solution, but so far I've gotten by with blocking user agents and IPs.
@gordoooo_z Yes, it's ClaudeBot et al. super-aggressively scraping my new cgit instance. The host runs NetBSD, so I've also been looking into blacklistd(8).

@dressupgeekout
If you haven't tried it yet, there are regularly updated blocklists for many webservers available here:
https://github.com/ai-robots-txt/ai.robots.txt

I have this on my webserver and I can see from the logs that the server is responding with a few thousand 403s every day to crawlers and bots, so it's helping a bit at least.
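To give an idea of what that blocking looks like, here's a minimal nginx sketch (the agent names are just a few illustrative examples; the actual list in the repo is much longer and kept up to date):

```nginx
# Goes in the http{} context: flag requests whose User-Agent
# matches a known AI crawler. Agent names here are examples only.
map $http_user_agent $ai_bot {
    default     0;
    ~*ClaudeBot 1;
    ~*GPTBot    1;
    ~*CCBot     1;
}

server {
    listen 80;
    server_name example.org;

    # Refuse flagged crawlers with a 403, as seen in the logs above.
    if ($ai_bot) {
        return 403;
    }
}
```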

@82mhz @gordoooo_z This looks to be a valuable resource. Thank you for sharing. What a hassle. I've never needed to take these kinds of measures before... I guess this will force me to get better at server administration, heh

@dressupgeekout
It's infuriating. And there is no guarantee that this will keep them out as they are actively working on finding ways around blocks. But at least we can make it a little harder. Anubis is good, but has the downside that it annoys the users too, and sometimes even makes the site inaccessible, which is too much collateral damage for me.

Anyway, good luck implementing it, I hope it helps a little!

@gordoooo_z

@82mhz Thank you for sharing ai.robots.txt with me, I think it's made a difference on my website!
@dressupgeekout
That's awesome, happy to hear it! 😊
@dressupgeekout According to at least one source, ClaudeBot respects robots.txt. I'm not prepared to take one random source's word for it, but it's simple enough to implement and find out. Assuming you haven't already, that is?
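For what it's worth, you can simulate how crawlers are supposed to interpret a robots.txt using Python's stdlib before deploying it (a sketch; the rules below are an assumed example, not anyone's real config):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that shuts out ClaudeBot but allows everyone else.
rules = """\
User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A well-behaved ClaudeBot should now stay out of the cgit paths,
# while ordinary browsers remain unaffected.
print(rp.can_fetch("ClaudeBot", "/cgit/repo.git"))             # False
print(rp.can_fetch("Mozilla/5.0 (compatible)", "/cgit/repo.git"))  # True
```

Of course this only tells you what a compliant crawler would do; the server logs are the real test.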