Current status: hacking extremely discount per-IP request ratelimits into my techblog, because yet another person turned something loose on it that made 9,000+ requests today, and everyone even vaguely like that can lose hard in the future. Also, this allows me to ratelimit Googlebot and other known crawlers by User-Agent because.
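
For anyone wondering what "extremely discount" ratelimiting might look like, here is a minimal sketch, not the actual code on the techblog. It assumes a fixed-window counter keyed by client IP, with known crawlers sharing one bucket per crawler name matched in the User-Agent. The names `FixedWindowLimiter`, `limiter_key`, `status_for`, and the `KNOWN_CRAWLERS` list are all made up for illustration.

```python
import time

# Hypothetical list of crawler names to match in the User-Agent header.
KNOWN_CRAWLERS = ("Googlebot", "bingbot")


class FixedWindowLimiter:
    """Allow at most `limit` requests per key in each `window_seconds` window."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.counts = {}  # key -> (window_start, request_count)

    def allow(self, key):
        now = self.clock()
        start, count = self.counts.get(key, (now, 0))
        if now - start >= self.window:
            # The old window has expired; start a fresh one.
            start, count = now, 0
        count += 1
        self.counts[key] = (start, count)
        return count <= self.limit


def limiter_key(remote_ip, user_agent):
    """Bucket known crawlers by name, everyone else by IP address."""
    lowered = user_agent.lower()
    for name in KNOWN_CRAWLERS:
        if name.lower() in lowered:
            return "crawler:" + name
    return "ip:" + remote_ip


def status_for(limiter, remote_ip, user_agent):
    """Return the HTTP status to send: 200 if allowed, 429 if over the limit."""
    return 200 if limiter.allow(limiter_key(remote_ip, user_agent)) else 429
```

A fixed window is about the cheapest thing that works; it lets a client burst up to twice the limit across a window boundary, which is probably acceptable when the goal is only to stop the 9,000-requests-a-day crowd. The `counts` dictionary also never shrinks here, so a real deployment would want to expire old keys.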

Also someone out there in the world helpfully decided to do an unrequested security scan of the entire web server today, to the tune of 28k requests or so.

Lately Googlebot has decided that it wants to do a few thousand requests a day to my techblog to fetch content (often several times in one day) that hasn't changed for years.

Sometimes it's extremely tempting to give Googlebot and other prolific crawlers perpetual HTTP 429s, or at least HTTP 429s for all but one day of the week or something. If they're not going to show people actual search results anyway, only LLM summaries of them...

@cks

Is it doing straightforward GETs? Or is it doing HEAD? Or using If-Modified-Since?

#HTTP #GoogleBot #httpd

@JdeBP All GETs and only trace amounts of conditional GETs in the form of HTTP 304 responses (but there are a couple a day, to my surprise).
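
For reference, the server side of a conditional GET is small. This is a rough sketch under assumptions, not how any particular server does it: the function name `conditional_status` is invented, only `If-Modified-Since` is handled (no `ETag`/`If-None-Match`), and `last_modified` is assumed to be a timezone-aware UTC datetime.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_status(if_modified_since, last_modified):
    """Return 304 if the client's copy is still current, otherwise 200.

    `if_modified_since` is the raw header value (or None if absent);
    `last_modified` is assumed to be a timezone-aware datetime.
    """
    if not if_modified_since:
        return 200  # plain GET, no validator supplied
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: ignore it and send the full page
    if since.tzinfo is None:
        # A "-0000" zone parses as naive; treat it as UTC.
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop sub-second precision.
    return 304 if last_modified.replace(microsecond=0) <= since else 200
```

A crawler that sent `If-Modified-Since` for pages that haven't changed in years would get a body-less 304 nearly every time, which is why plain GETs for that content are so wasteful.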

@cks

It makes me think that there's one well-behaved 'bot drowned in a sea of ill-behaved ones.

I'm just instrumenting #djbwares httpd to log GET and HEAD differently. I wonder what I'll see.

#HTTP #httpd #GoogleBot

@cks

Early results are not promising. I've had a handful of HEAD requests in the past day. Only 2 appear legitimate, in that they hit genuine page URLs. The others were attempts to exploit WordPress vulnerabilities.

#HTTP #httpd #GoogleBot #djbwares #WordPress
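
The sorting of HEAD requests into "genuine page URL" versus "WordPress probe" can be roughed out like this. It is only an illustrative sketch: `classify_head` and the `known_pages` set are hypothetical, and the marker list is just a few well-known WordPress paths, not an exhaustive signature set.

```python
# A few well-known WordPress paths that probes commonly target.
WORDPRESS_MARKERS = (
    "/wp-login.php",
    "/xmlrpc.php",
    "/wp-admin/",
    "/wp-content/",
    "/wp-includes/",
)


def classify_head(path, known_pages):
    """Label a requested path as a real page, a WordPress probe, or unknown."""
    if path in known_pages:
        return "legitimate"
    lowered = path.lower()
    if any(marker in lowered for marker in WORDPRESS_MARKERS):
        return "wordpress-probe"
    return "unknown"
```

On a site that does not run WordPress at all, anything in the second bucket is noise by definition.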