Jeff Starr's web server firewalls prove to be very useful in this regard.
Why is #twitter not properly identifying itself as a bot when trying to scrape my website? (69.12.56.0/21 belongs to AS63179, which is Twitter.)
Could it be because they're a malicious party training an #aibot?
(This is extremely low-intensity, but based on the combination of this specific UA and the pages they're trying to reach, I've seen them before, coming in from residential proxies.)
The funny thing is that bots identifying as bots and observing robots.txt would actually be allowed to reach those particular pages.
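For anyone who wants to repeat the range check from the post above against their own access logs, here is a minimal sketch using Python's standard `ipaddress` module. The CIDR range is the one quoted in the post; the function name `in_range` and the sample addresses are purely illustrative, and the attribution of the range to Twitter is the post's claim, not something the code verifies.

```python
import ipaddress

# Range quoted in the post; who operates it is the post's claim, not checked here.
SUSPECT_RANGE = ipaddress.ip_network("69.12.56.0/21")


def in_range(ip: str, network=SUSPECT_RANGE) -> bool:
    """Return True if the given IPv4/IPv6 address string falls inside the network."""
    return ipaddress.ip_address(ip) in network


if __name__ == "__main__":
    # Hypothetical client addresses, e.g. pulled from an access log.
    for client in ("69.12.60.1", "69.12.64.1"):
        print(client, in_range(client))
```

A /21 spans eight consecutive /24 blocks, so this one covers 69.12.56.0 through 69.12.63.255.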
After 615 requests over almost exactly 24 hours, the #aiscraper abusing #residentialproxies to repeatedly request one particular page on #GameSieve has finally disappeared. It got through 18 times before I noticed it was stuck in a loop and added another block rule. However, its final request was successful and is worrying, as it came through fetch.tunnel.googlezip.net, which apparently is #Google's Chrome Prefetch Proxy.
I've noticed requests from that range before, but always assumed that was legitimate. Do I now have to think about blocking that bit of infrastructure as well, as #scrapers have found a way to piggyback on it? Urgh!
I guess I'll start by blocking prefetching via .well-known/traffic-advice and see what that does...
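A rough sketch of what that traffic-advice response might look like, generated from Python. To the best of my understanding of Chrome's prefetch proxy documentation, the file is a JSON array served with the `application/trafficadvice+json` content type, and an entry for the `prefetch-proxy` user agent with `"disallow": true` opts the site out. Treat the exact field names and content type as assumptions to check against the current documentation; the helper name `traffic_advice_body` is made up for this example.

```python
import json

# Path from the post; the content type is an assumption based on Chrome's docs.
TRAFFIC_ADVICE_PATH = "/.well-known/traffic-advice"
TRAFFIC_ADVICE_TYPE = "application/trafficadvice+json"


def traffic_advice_body(disallow: bool = True) -> str:
    """Build the JSON body that asks the prefetch proxy to stay away (assumed format)."""
    advice = [{"user_agent": "prefetch-proxy", "disallow": disallow}]
    return json.dumps(advice)


if __name__ == "__main__":
    print(f"Serve at {TRAFFIC_ADVICE_PATH} with Content-Type: {TRAFFIC_ADVICE_TYPE}")
    print(traffic_advice_body())
```

The output can be saved as a static file, as long as the web server is configured to send the right content type for it.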
Iocaine and my custom solution aren't good enough.
I'm considering adding a login to my website rewrite as protection against bots.
I would always offer an anonymous session after completing a proof of work (which is also available without JS).
Do you think this is okay? Please don't hesitate to reply!
#website #personalBlog #PersonalSites #indieweb #spam #spamprotection #scrapers #selfhosting #iocaine
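To make the proof-of-work idea above a bit more concrete, here is a minimal hashcash-style sketch. It is not the poster's implementation: the hash function (SHA-256), the `challenge:nonce` encoding, the default difficulty, and all function names are assumptions chosen for illustration. The server hands out a random challenge, the visitor searches for a nonce whose hash has enough leading zero bits, and the server verifies with a single hash.

```python
import hashlib
import secrets


def make_challenge() -> str:
    """Server side: issue a random, single-use challenge string."""
    return secrets.token_hex(16)


def verify(challenge: str, nonce: int, difficulty_bits: int = 16) -> bool:
    """Server side: accept if SHA-256(challenge:nonce) starts with enough zero bits."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0


def solve(challenge: str, difficulty_bits: int = 16) -> int:
    """Client side: brute-force the smallest nonce that passes verify()."""
    nonce = 0
    while not verify(challenge, nonce, difficulty_bits):
        nonce += 1
    return nonce


if __name__ == "__main__":
    challenge = make_challenge()
    nonce = solve(challenge)
    print(challenge, nonce, verify(challenge, nonce))
```

Because the solver is just a loop over hashes, it could also run as a small command-line tool, which is one way to keep the check usable without JS. Each extra difficulty bit roughly doubles the expected work for the visitor while verification stays a single hash.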
A couple of new #scrapers to block that I haven't seen on robotstxt.com:
* Amzn-SearchBot is the search engine for Alexa and Rufus. Amazon claims on https://developer.amazon.com/amazonbot that it doesn't do AI training, but it still hammered our sites the past two days.
* SleepBot I haven't found much on, but it was requesting URLs for files that were submitted in a document upload spam attack we had a few months ago. Very sus.
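For server-side blocking of the two bots listed above, a simple user-agent check might look like the sketch below. The tokens are taken from the bot names in the list; the full User-Agent strings these bots actually send are not given in the post, so the case-insensitive substring match and the sample strings are assumptions.

```python
# Lowercased tokens derived from the bot names in the list above.
BLOCKED_UA_TOKENS = ("amzn-searchbot", "sleepbot")


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent header contains any blocked token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_UA_TOKENS)


if __name__ == "__main__":
    # Hypothetical header values for illustration only.
    for ua in ("Mozilla/5.0 (compatible; Amzn-SearchBot/0.1)", "Mozilla/5.0 Firefox/128.0"):
        print(ua, "->", "block" if is_blocked(ua) else "allow")
```

This only catches bots that announce themselves honestly, so it complements rather than replaces IP-based rules.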