To whoever praises #Claude #LLM:

ClaudeBot has made 20k requests to bugs.gentoo.org today. 15k of them were repeatedly fetching robots.txt. That surely is a sign of great code quality.

#AI

@mgorny I guess... at least it asks. Eugh.
@mgorny Claude has a new /22 I see… 😾☠️🤬
@mgorny robots.txt about to become more and more disrespected, eh?
@mgorny Claude, the anxious AI: "Maybe robots.txt has changed this time? Surely it has changed. It must have!"
@obsurveyor @mgorny
is this the crawler equivalent of checking the fridge?
@mgorny It also seems to use only IPv4 to perform fetches, from what I can tell.

This got me wondering if there is a way to tell a crawler that crawling this site is permitted, but only if you use IPv6.

Simply serving different versions of robots.txt depending on address family won’t achieve that since the crawler will silently assume the version of robots.txt it received applies in both cases.
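For illustration only, here is roughly what serving a different robots.txt per address family could look like, as a stdlib Python sketch (port, file contents and class names are made up). As noted above, it would not actually confine a crawler to IPv6, because the crawler assumes whichever robots.txt it received applies to the whole site.

```python
# Hypothetical sketch: answer /robots.txt differently for IPv4 and IPv6 clients.
# As noted above, this does NOT restrict crawling to IPv6 in practice; a crawler
# will assume whichever robots.txt it happened to fetch applies to the whole site.
import socket
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

ROBOTS_FOR_V6 = b"User-agent: *\nAllow: /\n"      # permissive version
ROBOTS_FOR_V4 = b"User-agent: *\nDisallow: /\n"   # restrictive version

class RobotsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/robots.txt":
            self.send_error(404)
            return
        # On a dual-stack (AF_INET6) listener, IPv4 clients show up as
        # IPv4-mapped addresses like ::ffff:192.0.2.1, so a "." in the
        # peer address means the request came in over IPv4.
        came_over_ipv4 = "." in self.client_address[0]
        body = ROBOTS_FOR_V4 if came_over_ipv4 else ROBOTS_FOR_V6
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

class DualStackHTTPServer(ThreadingHTTPServer):
    # Bind an IPv6 socket; on most systems it also accepts IPv4-mapped clients.
    address_family = socket.AF_INET6

if __name__ == "__main__":
    DualStackHTTPServer(("::", 8080), RobotsHandler).serve_forever()
```

Detecting the address family server-side is the easy part; the real problem, as pointed out, is that robots.txt has no way to say "allowed, but only over IPv6", so the crawler just generalises whichever answer it saw.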

@mgorny They may use Claude Code.

I’d not be surprised if we see a decline in software and service quality over the next few years in general. Once all seniors are retired or laid off, this may be the new normal.

@mgorny you never know, it might have changed in the 500 milliseconds between one request and the next!
@mgorny Probably an edge function is spawned for each "web search". That function fetches the robots.txt and then some pages.
@mgorny
maybe whoever wrote the code felt denial of service was more important than training an LLM.
@mgorny you've got to laugh…

I am guessing they load robots.txt before each intended fetch, to verify that the URL they are about to fetch is permitted. If they primarily want resources that are not permitted, that would explain why they fetch robots.txt more often than anything else.

Of course caching robots.txt would be better. The only problem with that is that you may end up fetching a URL which is no longer permitted because you used an outdated version of robots.txt.

If you want a crawler to be extra well behaved, you could take this approach (sketched in code below):

  • If your cached robots.txt is older than 24 hours, or you haven’t cached it at all, retrieve robots.txt.
  • If your cached robots.txt is less than 24 hours old and doesn’t permit the desired URL, don’t retrieve anything.
  • If your cached robots.txt is between 1 minute and 24 hours old and does permit the URL you intend to fetch, fetch robots.txt again to ensure the desired URL is still permitted.
  • If your cached robots.txt is less than 1 minute old and does permit the desired URL, trust the cache.

But I think that’s probably a bit too advanced for an AI company to work out.
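For what it’s worth, that policy fits in a few lines of Python using the standard library’s urllib.robotparser. This is purely an illustrative sketch (the class name, constants and structure are mine), not anything the actual crawler does:

```python
import time
import urllib.robotparser

# Illustrative sketch of the caching policy described above; not real crawler code.
CACHE_MAX_AGE = 24 * 60 * 60   # trust nothing older than 24 hours
RECHECK_AFTER = 60             # re-verify a permissive answer after 1 minute

class PoliteRobotsCache:
    def __init__(self, robots_url):
        self.robots_url = robots_url
        self.parser = None
        self.fetched_at = 0.0

    def _refresh(self):
        # One actual request for robots.txt.
        parser = urllib.robotparser.RobotFileParser(self.robots_url)
        parser.read()
        self.parser = parser
        self.fetched_at = time.time()

    def may_fetch(self, user_agent, url):
        age = time.time() - self.fetched_at
        if self.parser is None or age > CACHE_MAX_AGE:
            # No cache, or cache older than 24 hours: retrieve robots.txt.
            self._refresh()
        elif not self.parser.can_fetch(user_agent, url):
            # Cache is recent enough and already says "no": retrieve nothing.
            return False
        elif age > RECHECK_AFTER:
            # Cache is between 1 minute and 24 hours old and says "yes":
            # fetch robots.txt again to make sure the URL is still permitted.
            self._refresh()
        # Cache is under 1 minute old (or was just refreshed): trust it.
        return self.parser.can_fetch(user_agent, url)

# Example use:
# robots = PoliteRobotsCache("https://bugs.gentoo.org/robots.txt")
# if robots.may_fetch("ExampleBot", "https://bugs.gentoo.org/show_bug.cgi?id=1"):
#     ...  # go ahead and fetch the page
```

With something like that, robots.txt gets fetched at most about once a minute per site, instead of once per intended page.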

The ./play.it server gets between 200 000 and 500 000 requests from the Claude scraping bot every day! We’re talking about a server maintained by a lone human, hosted on a desktop computer dual-classing as a regular system for daily use (like writing this message)…

Well, there is a trick: we actually want that bot (and its friends) to spend as much time as possible scanning that server: https://notes.vv221.fr/blackhole.xhtml
Claude, more like clod