Current status: hacking extremely discount per-IP request ratelimits into my techblog, because yet another person turned something loose on it that made 9,000+ requests today, and everyone even vaguely like that can lose hard in the future. Also, this lets me ratelimit Googlebot and other known crawlers by User-Agent, because.
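For the curious, the "extremely discount" version of this is usually a per-key token bucket, keyed by client IP (or User-Agent for known crawlers). This is a minimal sketch of the idea, not the actual implementation:

```python
import time

class TokenBucket:
    """Per-key rate limiter: each key (an IP or a User-Agent) gets
    `rate` requests per second with a burst allowance of `burst`."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.buckets = {}  # key -> (tokens_remaining, last_seen_time)

    def allow(self, key, now=None):
        """True if this request is allowed; False means answer with HTTP 429."""
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets.get(key, (self.burst, now))
        # Refill tokens for the time elapsed since we last saw this key.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1.0:
            self.buckets[key] = (tokens, now)
            return False
        self.buckets[key] = (tokens - 1.0, now)
        return True
```

A real deployment would also expire idle keys so the dictionary doesn't grow forever.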

Also someone out there in the world helpfully decided to do an unrequested security scan of the entire web server today, to the tune of 28k requests or so.

Lately Googlebot has decided that it wants to do a few thousand requests a day to my techblog to fetch content (often several times in one day) that hasn't changed for years.

Sometimes it's extremely tempting to give Googlebot and other prolific crawlers perpetual HTTP 429s, or at least HTTP 429s for all but one day of the week or something. If they're not going to show people actual search results anyway, only LLM summaries of them...

So far almost all of what my techblog's ratelimiting has rate-limited has been Googlebot and Applebot (I think Bing also got ratelimited once). This is my surprised face, really. Also this is my surprised face that they generally keep crawling when they get HTTP 429s, rather than slowing down.

Perhaps I should put these crawlers all back on timeout for a week or so (where they get perpetual HTTP 429s). Is Applebot even doing anything for me with all this crawling?
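In a CGI, putting a crawler "on timeout" is just emitting a 429 status with a long Retry-After (which, as noted above, they mostly ignore anyway). A hedged sketch, with names of my own invention:

```python
def cgi_429(retry_after):
    """Build a CGI response telling the client to back off.
    `retry_after` is in seconds, per the HTTP Retry-After header."""
    body = "Too many requests; slow down.\n"
    return (
        "Status: 429 Too Many Requests\r\n"
        f"Retry-After: {retry_after}\r\n"
        "Content-Type: text/plain\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
        + body
    )

# The week-long timeout mused about above would be:
# sys.stdout.write(cgi_429(7 * 24 * 3600))
```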

Today my techblog's HTTP ratelimiting worked exactly as I wanted it to, turning about 1400 scrape attempts into 61 successful ones and then 429'ing the rest. Then I actively blocked that source and it came back 8,600 times to get HTTP 403s. Take a bow, 212.56.54.138, ideally right into the deep sea.
@cks here's another potential solution that's all the rage lately: https://github.com/TecharoHQ/anubis

@eru For my sins, I run my techblog as a CGI. One could commit terrible hacks¹ to get a proxy like Anubis in the picture and someday I may, but so far I've gotten by. (And I think I'd want ratelimiting anyway for various reasons. Certainly syndication feed ratelimiting...)

¹ https://utcc.utoronto.ca/~cks/space/blog/web/OutsourcingClientChecking but I wouldn't use OIDC, I'd do something even more terrible.


@cks an easier way is probably to use something like uWSGI: have it bind to HTTP for Anubis, and configure it to run the CGI script (https://uwsgi-docs.readthedocs.io/en/latest/CGI.html).
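For the record, the uWSGI side of that might look roughly like this (the port and CGI directory here are made up for illustration; Anubis would sit in front and proxy to the HTTP port):

```shell
# Serve CGI programs from a directory over plain HTTP on localhost,
# using uWSGI's cgi plugin; Anubis proxies public traffic to this port.
uwsgi --plugin cgi --http 127.0.0.1:9000 --cgi /var/www/cgi-bin
```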

@cks FWIW GoogleBot is what some scrapers have chosen to impersonate in the hope they won’t get blocked. And those impersonating scrapers are rather less polite.

If it’s not coming from Google IP blocks I’d be wary of assuming it’s from Google, despite the GoogleBot claims.
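For anyone wanting to check that, Google's documented verification method is reverse DNS on the IP followed by a forward lookup of the resulting name. A Python sketch (the network half naturally needs working DNS):

```python
import socket

def hostname_is_google(host):
    """Pure check on the reverse-DNS name: real Googlebot hosts live
    under googlebot.com or google.com."""
    return host.endswith((".googlebot.com", ".google.com"))

def is_real_googlebot(ip):
    """Reverse-DNS the IP, check the domain, then confirm the name
    resolves back to the same IP (the forward-confirmed reverse DNS check)."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not hostname_is_google(host):
        return False
    try:
        addrs = {ai[4][0] for ai in socket.getaddrinfo(host, None)}
    except OSError:
        return False
    return ip in addrs
```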

@cks

Is it doing straightforward GETs? Or is it doing HEAD? Or using If-Modified-Since?

#HTTP #GoogleBot #httpd

@JdeBP All GETs and only trace amounts of conditional GETs in the form of HTTP 304 responses (but there are a couple a day, to my surprise).
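For reference, answering a conditional GET with a 304 comes down to comparing the client's If-Modified-Since header against the page's last-modified time. A sketch (the function name is mine, not from any of the software discussed here):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def can_answer_304(if_modified_since, last_modified):
    """True if a conditional GET can be answered with HTTP 304 Not Modified.
    `if_modified_since` is the raw header value (or None);
    `last_modified` is the resource's mtime as an aware datetime."""
    if not if_modified_since:
        return False
    try:
        threshold = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return False  # unparseable header: fall back to a full response
    if threshold.tzinfo is None:
        threshold = threshold.replace(tzinfo=timezone.utc)
    return last_modified <= threshold
```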

@cks

It makes me think that there's one well-behaved 'bot drowned in a sea of ill-behaved ones.

I'm just instrumenting #djbwares httpd to log GET and HEAD differently. I wonder what I'll see.

#HTTP #httpd #GoogleBot

@cks

Early results are not promising. I've had a handful of HEAD requests in the past day. Only 2 appear legitimate, in that they hit genuine page URLs. The others were attempts to exploit WordPress vulnerabilities.

#HTTP #httpd #GoogleBot #djbwares #WordPress