AI crawler attacks are out of control.

Yesterday a site I still help with saw over 20M requests from tens of thousands of IP addresses across hundreds of ASNs. Each bot made just over 1k requests, and the most from any single ASN was 45k. Almost everything I could block on either looks legitimate and would trigger false positives, or is randomized garbage; even allow listing won't work, as there would still be false positives.

What is everyone doing with these?

I already have:

  • Nginx blocking countries
  • Nginx blocking ASNs
  • Nginx allow listing only known URLs
  • Heavy caching layers
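For reference, the country/ASN blocking can be sketched roughly like this with the third-party ngx_http_geoip2_module (database paths, the country code, the AS number, and the upstream name below are placeholders, not what I actually run):

```nginx
# http {} context: map client IPs to country / ASN via MaxMind databases
geoip2 /etc/nginx/GeoLite2-Country.mmdb {
    $geoip2_country_code country iso_code;
}
geoip2 /etc/nginx/GeoLite2-ASN.mmdb {
    $geoip2_asn autonomous_system_number;
}

map $geoip2_country_code $block_country {
    default 0;
    XX      1;   # placeholder country code
}
map $geoip2_asn $block_asn {
    default 0;
    64496   1;   # placeholder AS number (reserved for documentation)
}

server {
    # deny blocked countries/ASNs up front
    if ($block_country) { return 403; }
    if ($block_asn)     { return 403; }

    # allow list only known URLs: default deny, explicit locations pass
    location / { return 404; }
    location = /known-page { proxy_pass http://backend; }
}
```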

Yet the traffic to URLs that do exist, from countries and ASNs I allow, is still enough to effectively pull the whole site in a couple of hours.

These bots obvs don't honour robots.txt, and Cloudflare and such don't report them as AI crawls.

The only thing I can really think of doing is raising the cost of crawling... Using an allow list of the known TLS fingerprints of recent browsers, and saying that that's the only thing you can use. But that's exclusive and shitty to do.

The problem is that this churns all my caches, which means a higher compute load since nothing is cached, and after a while that knocks the site offline.

Wish I could force Cloudflare's TCP turtle on them (a TCP tarpit for HTTP/2 requests; I don't believe it was ever used, as it was made as an internal experiment, but it would be perfect for this).

I can't even rate limit effectively: 1k requests per IP over a 30-minute window flies under the radar, and any limit tight enough to catch it would false positive on real users.
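To make that concrete: 1,000 requests over 30 minutes is only ~0.56 requests/second per IP, so even a fairly strict nginx limit_req setup never fires on it (a sketch; the zone size, rate, and upstream name are illustrative):

```nginx
# 1r/s per client IP with a small burst: tighter than most real-user
# traffic tolerates, yet a bot doing 1,000 requests over 30 minutes
# (~0.56 r/s) sails straight through.
limit_req_zone $binary_remote_addr zone=perip:50m rate=1r/s;

server {
    location / {
        limit_req zone=perip burst=20 nodelay;
        proxy_pass http://backend;   # placeholder upstream
    }
}
```

Dropping the rate low enough to catch the bots would start rejecting real users loading asset-heavy pages.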

@dee "This churns all my caches" is an excellent statement of indignation.
@dee in terms of making it harder, something like Anubis would be an option?

@aoanla hmmm, possibly.

Might try when I'm back from vacation. @cadey based on the above description do you think Anubis would make a difference?

I guess I could try when I'm back from travel, and see where it works or doesn't.

Quick look says: it should help with human-facing web endpoints but will do nothing for API endpoints... That may be OK, so long as I secure the API to known clients beforehand rather than leaving it public.
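Locking the API to known clients could be as simple as a pre-shared-key header check in nginx (a sketch; the header name, key, path, and upstream are all made up for illustration):

```nginx
# Deny API requests that don't carry the shared secret;
# browsers going through Anubis never hit these paths.
map $http_x_api_key $api_ok {
    default                          0;
    "REPLACE-WITH-A-LONG-RANDOM-KEY" 1;
}

server {
    location /api/ {
        if ($api_ok = 0) { return 401; }
        proxy_pass http://backend;   # placeholder upstream
    }
}
```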

@dee what tools are you using to be informed about these attacks? i’m a little lost when it comes to log analysis

@bri7 from nginx I'm logging every TCP, HTTP, and TLS field that is exposed according to the docs... Literally everything.

I'm formatting that as JSON lines in the log output.

I'm then sending that to Loki.

And analysing via a Grafana dashboard.
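With the JSON fields extracted in Loki, a LogQL query along these lines (assuming a job label of nginx and an asn field in the JSON; adjust to your labels) surfaces the top ASNs over a window:

```logql
sum by (asn) (
  count_over_time({job="nginx"} | json | __error__ = "" [30m])
)
```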

I needed some plugins for nginx too, the geoip2 one for getting ASNs for direct traffic. Some sites are behind Cloudflare, so for those I also log Cloudflare's headers, which give me the ASN and country as well.
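The logging side looks roughly like this (a trimmed sketch, the real config logs far more fields; $geoip2_asn assumes the geoip2 setup above, and the cf_* fields are Cloudflare's standard CF-IPCountry / CF-Connecting-IP headers):

```nginx
# JSON-lines access log: one self-describing object per request.
log_format json_all escape=json '{'
    '"time":"$time_iso8601",'
    '"remote_addr":"$remote_addr",'
    '"request":"$request",'
    '"status":"$status",'
    '"user_agent":"$http_user_agent",'
    '"ssl_protocol":"$ssl_protocol",'
    '"ssl_cipher":"$ssl_cipher",'
    '"asn":"$geoip2_asn",'
    '"cf_country":"$http_cf_ipcountry",'
    '"cf_connecting_ip":"$http_cf_connecting_ip"'
'}';
access_log /var/log/nginx/access.jsonl json_all;
```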

In the early days of the attacks, ASNs were useful as blocks. But now I see a lot of consumer ISPs involved, so ASN blocking is less useful. Countries aren't that useful to block at all... and the geolocation is often wrong, e.g. Chinese traffic shows up as Marseille or Singapore based on where the submarine cables land.

@dee thank you
@bri7 when I'm back from my travels I'm happy to share the logging config with you if you run a similar setup and are interested
@dee i’ll give it a crack on my own and then i will have better questions

@dee I've had to deploy Nepenthes (zadzmo.org) as an active countermeasure on my little personal site, but as a result I've pretty much dedicated one CPU core to running the tarpit. I've put its prefix in my robots.txt, so only rude scrapers get stuck.
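The robots.txt part is just a disallow on the tarpit's prefix, so well-behaved crawlers never wander in (the path here is a placeholder, not my actual prefix):

```
User-agent: *
Disallow: /nepenthes/
```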

Current stats:
Addresses: 601979
User-agents: 1890
Bytes: 7.8GB

It's insane...

@dee unplug my web server for a week
Iocaine - defense mechanism against unwanted scrapers - LinuxLinks

Iocaine is a defense mechanism against unwanted scrapers, sitting between upstream resources and the fronting reverse proxy.