To whoever praises #Claude #LLM:

ClaudeBot has made 20k requests to bugs.gentoo.org today. 15k of them were repeatedly fetching robots.txt. That surely is a sign of great code quality.

#AI

@mgorny I guess... at least it asks. Eugh.
@mgorny Claude has a new /22 I see… 😾☠️🤬
@mgorny robots.txt about to become more and more disrespected, eh?
@mgorny Claude, the anxious AI: "Maybe robots.txt has changed this time? Surely it has changed. It must have!"
@obsurveyor @mgorny
is this the crawler equivalent of checking the fridge?
@mgorny It also seems to use only IPv4 to perform fetches, from what I can tell.

This got me wondering if there is a way to tell a crawler that crawling this site is permitted, but only if you use IPv6.

Simply serving different versions of robots.txt depending on address family won’t achieve that since the crawler will silently assume the version of robots.txt it received applies in both cases.
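For illustration only, here is roughly what serving a different robots.txt per address family could look like, as a stdlib Python sketch (port, file contents and class names are made up). As noted above, it would not actually confine a crawler to IPv6, because the crawler assumes whichever robots.txt it received applies to the whole site.

```python
# Hypothetical sketch: answer /robots.txt differently for IPv4 and IPv6 clients.
# As noted above, this does NOT restrict crawling to IPv6 in practice; a crawler
# will assume whichever robots.txt it happened to fetch applies to the whole site.
import socket
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

ROBOTS_FOR_V6 = b"User-agent: *\nAllow: /\n"      # permissive version
ROBOTS_FOR_V4 = b"User-agent: *\nDisallow: /\n"   # restrictive version

class RobotsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/robots.txt":
            self.send_error(404)
            return
        # On a dual-stack (AF_INET6) listener, IPv4 clients show up as
        # IPv4-mapped addresses like ::ffff:192.0.2.1, so a "." in the
        # peer address means the request came in over IPv4.
        came_over_ipv4 = "." in self.client_address[0]
        body = ROBOTS_FOR_V4 if came_over_ipv4 else ROBOTS_FOR_V6
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

class DualStackHTTPServer(ThreadingHTTPServer):
    # Bind an IPv6 socket; on most systems it also accepts IPv4-mapped clients.
    address_family = socket.AF_INET6

if __name__ == "__main__":
    DualStackHTTPServer(("::", 8080), RobotsHandler).serve_forever()
```

Detecting the address family server-side is the easy part; the real problem, as pointed out, is that robots.txt has no way to say "allowed, but only over IPv6", so the crawler just generalises whichever answer it saw.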

@mgorny They may use Claude Code.

I’d not be surprised if we see a decline in software and service quality over the next few years in general. Once all seniors are retired or laid off, this may be the new normal.

@mgorny you never know, it might have changed in the 500 milliseconds between one request and the next!
@mgorny Probably an edge function is spawned for each "web search". That function fetches the robots.txt and then some pages.
@mgorny
maybe whoever wrote the code felt denial of service was more important than training an LLM.
@mgorny you've got to laugh…

I am guessing they load robots.txt before each intended fetch, to verify that the URL they are about to fetch is permitted. If they primarily want resources that are not permitted, that would explain why they fetch robots.txt more often than anything else.

Of course caching robots.txt would be better. The only problem with that is that you may end up fetching a URL which is no longer permitted because you used an outdated version of robots.txt.

If you want a crawler to be extra well behaved, you could take this approach (sketched in code below):

  • If your cached robots.txt is older than 24 hours, or you haven’t cached it at all, retrieve robots.txt.
  • If your cached robots.txt is less than 24 hours old and doesn’t permit the desired URL, don’t retrieve anything.
  • If your cached robots.txt is between 1 minute and 24 hours old and does permit the URL you intend to fetch, fetch robots.txt again to ensure the desired URL is still permitted.
  • If your cached robots.txt is less than 1 minute old and does permit the desired URL, trust the cache.

But I think that’s probably a bit too advanced for an AI company to work out.
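For what it’s worth, that policy fits in a few lines of Python using the standard library’s urllib.robotparser. This is purely an illustrative sketch (the class name, constants and structure are mine), not anything the actual crawler does:

```python
import time
import urllib.robotparser

# Illustrative sketch of the caching policy described above; not real crawler code.
CACHE_MAX_AGE = 24 * 60 * 60   # trust nothing older than 24 hours
RECHECK_AFTER = 60             # re-verify a permissive answer after 1 minute

class PoliteRobotsCache:
    def __init__(self, robots_url):
        self.robots_url = robots_url
        self.parser = None
        self.fetched_at = 0.0

    def _refresh(self):
        # One actual request for robots.txt.
        parser = urllib.robotparser.RobotFileParser(self.robots_url)
        parser.read()
        self.parser = parser
        self.fetched_at = time.time()

    def may_fetch(self, user_agent, url):
        age = time.time() - self.fetched_at
        if self.parser is None or age > CACHE_MAX_AGE:
            # No cache, or cache older than 24 hours: retrieve robots.txt.
            self._refresh()
        elif not self.parser.can_fetch(user_agent, url):
            # Cache is recent enough and already says "no": retrieve nothing.
            return False
        elif age > RECHECK_AFTER:
            # Cache is between 1 minute and 24 hours old and says "yes":
            # fetch robots.txt again to make sure the URL is still permitted.
            self._refresh()
        # Cache is under 1 minute old (or was just refreshed): trust it.
        return self.parser.can_fetch(user_agent, url)

# Example use:
# robots = PoliteRobotsCache("https://bugs.gentoo.org/robots.txt")
# if robots.may_fetch("ExampleBot", "https://bugs.gentoo.org/show_bug.cgi?id=1"):
#     ...  # go ahead and fetch the page
```

With something like that, robots.txt gets fetched at most about once a minute per site, instead of once per intended page.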

The ./play.it server gets between 200 000 and 500 000 requests from the Claude scraping bot every day! We’re talking about a server maintained by a lone human, hosted on a desktop computer dual-classing as a regular system for daily use (like writing this message)…

Well, there is a trick: we actually want that bot (and its friends) to spend as much time as possible scanning that server: https://notes.vv221.fr/blackhole.xhtml
Claude, more like clod