This got me wondering if there is a way to tell a crawler that crawling this site is permitted, but only if you use IPv6.
Simply serving different versions of robots.txt depending on address family won’t achieve that, since the crawler will silently assume whichever version it received applies in both cases.
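For what it’s worth, serving per-family responses is the easy part. Here is a minimal sketch using Python’s http.server on a dual-stack socket; the port and the two robots.txt bodies are invented for illustration:

```python
import socket
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

ALLOW = b"User-agent: *\nAllow: /\n"    # served to IPv6 clients
DENY = b"User-agent: *\nDisallow: /\n"  # served to IPv4 clients

class DualStackServer(ThreadingHTTPServer):
    address_family = socket.AF_INET6

    def server_bind(self):
        # Accept IPv4 clients on the same socket; they show up as
        # IPv4-mapped IPv6 addresses like ::ffff:203.0.113.7
        self.socket.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
        super().server_bind()

class RobotsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/robots.txt":
            self.send_error(404)
            return
        over_ipv4 = self.client_address[0].startswith("::ffff:")
        body = DENY if over_ipv4 else ALLOW
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    DualStackServer(("::", 8080), RobotsHandler).serve_forever()
```

But a crawler that read this over IPv4 has no way to know the IPv6 answer differs.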
@mgorny They may use Claude Code.
I’d not be surprised if we see a general decline in software and service quality over the next few years. Once all the seniors are retired or laid off, this may be the new normal.
I am guessing they load robots.txt before each fetch to verify that the URL they are about to request is permitted. If they primarily want resources that are not permitted, that would explain why they fetch robots.txt more often than anything else.
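Concretely, the pattern I am imagining would look something like this sketch in Python with the standard library’s urllib.robotparser (the user-agent string is hypothetical):

```python
import urllib.request
from urllib import robotparser
from urllib.parse import urlsplit

def naive_fetch(url, user_agent="HypotheticalBot"):
    # Re-download robots.txt before every single request: the guessed-at
    # behaviour that makes robots.txt the most-fetched URL on the site.
    parts = urlsplit(url)
    rp = robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    if not rp.can_fetch(user_agent, url):
        return None  # disallowed: robots.txt was fetched, the page was not
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Every disallowed URL then costs one robots.txt download and zero page downloads, which would make robots.txt dominate the access logs.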
Of course, caching robots.txt would be better. The only problem is that you may end up fetching a URL that is no longer permitted because your cached copy of robots.txt is outdated.
If you want a crawler to be extra well behaved, you could take this approach: cache robots.txt, but revalidate the cached copy with a conditional request (If-None-Match / If-Modified-Since) before each fetch, so an unchanged file costs only a tiny 304 Not Modified response instead of a full download.
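A minimal sketch of that revalidation pattern, again in standard-library Python (the class and method names are mine, not from any real crawler):

```python
import urllib.request
from urllib.error import HTTPError

class RobotsCache:
    """Cache robots.txt, revalidating with a conditional GET before use."""

    def __init__(self, robots_url):
        self.robots_url = robots_url
        self.body = b""
        self.etag = None
        self.last_modified = None

    def refresh(self):
        headers = {}
        if self.etag:
            headers["If-None-Match"] = self.etag
        if self.last_modified:
            headers["If-Modified-Since"] = self.last_modified
        req = urllib.request.Request(self.robots_url, headers=headers)
        try:
            with urllib.request.urlopen(req) as resp:
                self.body = resp.read()
                self.etag = resp.headers.get("ETag")
                self.last_modified = resp.headers.get("Last-Modified")
        except HTTPError as e:
            if e.code != 304:  # 304 Not Modified: cached copy is current
                raise
        return self.body

# e.g. feed the validated copy to urllib.robotparser before each fetch:
#   rp = robotparser.RobotFileParser()
#   rp.parse(cache.refresh().decode("utf-8").splitlines())
```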
But I think that’s probably a bit too advanced for an AI company to work out.