AI crawler attacks are out of control.

Yesterday a site I still help with saw over 20M requests from tens of thousands of IP addresses across hundreds of ASNs. Each bot only made just over 1k requests, and the most from any single ASN was 45k. Almost everything I could block on either looks legitimate and would trigger false positives, or is randomized garbage; even allow listing won't work, as there would still be false positives.

What is everyone doing with these?

I already have:

  • Nginx blocking countries
  • Nginx blocking ASNs
  • Nginx allow listing only known URLs
  • Heavy caching layers
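For context, the country/ASN blocking plus URL allow listing can be sketched in nginx. This is a minimal illustration only, assuming the third-party geoip2 module and MaxMind databases; the country code, ASN, file paths, and upstream name are all placeholders, not what I actually run:

```nginx
# Sketch — assumes ngx_http_geoip2_module and GeoLite2 databases.
# Country code, ASN, and upstream below are placeholders.
geoip2 /etc/nginx/GeoLite2-Country.mmdb {
    $geoip2_country_code country iso_code;
}
geoip2 /etc/nginx/GeoLite2-ASN.mmdb {
    $geoip2_asn autonomous_system_number;
}

map $geoip2_country_code $blocked_country {
    default 0;
    XX      1;   # placeholder country code
}
map $geoip2_asn $blocked_asn {
    default 0;
    64496   1;   # placeholder ASN (reserved for documentation)
}

server {
    listen 443 ssl;

    if ($blocked_country) { return 403; }
    if ($blocked_asn)     { return 403; }

    # Allow list known URL prefixes only; everything else 404s.
    location = /           { proxy_pass http://app_backend; }
    location ^~ /articles/ { proxy_pass http://app_backend; }
    location /             { return 404; }
}
```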

Yet requests for real, correct URLs from allowed countries and ASNs are still enough to effectively pull the whole site in a couple of hours.

These bots obvs don't honour robots.txt, and Cloudflare and such don't report them as AI crawls.

The only thing I can really think of doing is raising the cost of crawling... using an allow list of TLS client fingerprints matching recent browsers, and saying that's the only thing you can use. But that's exclusive and shitty to do.

The problem is this churns all my caches, which then means a higher compute load as nothing is in cache, which after a while will knock the site offline.

Wish I could force Cloudflare's TCP turtle on them (a TCP tarpit for HTTP/2 requests; I don't believe it was ever used, as it was built as an internal experiment, but it would be perfect for this).
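For anyone curious, a tarpit in that spirit is easy to sketch yourself. A hypothetical asyncio version (the port, delay, and payload are all illustrative, and this is plain TCP/HTTP-1-ish rather than H2) just drips a response one byte at a time so the crawler's connection slot stays occupied:

```python
# Sketch of a TCP tarpit in the spirit of Cloudflare's "turtle": accept the
# connection, then dribble a response one byte at a time with long pauses,
# tying up the client's connection. Parameters are illustrative, not tuned.
import asyncio

async def tarpit(reader: asyncio.StreamReader,
                 writer: asyncio.StreamWriter,
                 delay: float = 10.0) -> None:
    """Drip an HTTP-ish response one byte at a time, `delay` seconds apart."""
    payload = b"HTTP/1.1 200 OK\r\nContent-Length: 1000000\r\n\r\n"
    try:
        for byte in payload:
            writer.write(bytes([byte]))
            await writer.drain()
            await asyncio.sleep(delay)  # the crawler waits... and waits
    except ConnectionError:
        pass  # client gave up; mission accomplished
    finally:
        writer.close()

async def main(host: str = "127.0.0.1", port: int = 8081) -> None:
    server = await asyncio.start_server(tarpit, host, port)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```

The whole point is that it costs the server almost nothing (one sleeping coroutine per victim) while the bot's connection pool slowly fills with stuck sockets.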

I can't even rate limit effectively: 1k requests per IP over a 30 minute window flies under the radar, and any limit tight enough to catch it would false positive on real users.
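To put numbers on that, here's a toy per-IP sliding-window limiter (thresholds and IP are made up for illustration). A bot doing 1k requests evenly over 30 minutes is one request every 1.8 seconds, roughly 33/minute, so a 60/minute limit never fires on it at all:

```python
# Toy per-IP sliding-window rate limiter, to show why 1k requests per IP
# over 30 minutes slips through. Thresholds here are illustrative only.
from collections import deque

class SlidingWindowLimiter:
    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits: dict[str, deque] = {}  # ip -> timestamps in window

    def allow(self, ip: str, now: float) -> bool:
        q = self.hits.setdefault(ip, deque())
        while q and now - q[0] >= self.window:
            q.popleft()                  # expire old hits
        if len(q) >= self.max_requests:
            return False                 # over the limit
        q.append(now)
        return True

# A bot doing 1000 requests evenly over 30 minutes = 1 request per 1.8 s,
# i.e. ~33 requests in any 60 s window — under a 60/minute limit.
limiter = SlidingWindowLimiter(max_requests=60, window_seconds=60.0)
blocked = sum(
    0 if limiter.allow("198.51.100.7", t) else 1
    for t in (i * 1.8 for i in range(1000))
)
print(blocked)  # → 0: the bot never trips the limit
```

Any threshold low enough to catch ~33 requests/minute would also catch a human clicking around an image-heavy page.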

@dee "This churns all my caches" is an excellent statement of indignation.