My recent project is stopping (LLM training) crawlers from copyright-thefting my website. I find these bots either with a hidden-link tarpit or by looking for single-access events (one page fetched, no CSS), and I ban them if they come from a cloud server. So far I have learned:
• Amazon AWS, Google Cloud, Microsoft Azure, and Chinese telecom companies are pretty easy to block. These were the early heavy hitters.
• Huawei has little cloud server farms all over the world. I still seem to find about one a day.
• Some mysterious entity rents servers all over the world and crawls by sniping one page at a time. The snipes come in clusters, so all these bots appear to be running the same crawler, with some but not complete coordination between them. The cloud companies it uses most include OVHcloud, EGI Hosting, Web2Objects, Host Royale, DigitalOcean, Cloud Innovation, and more.
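For anyone who wants to try the single-access check, here is a minimal sketch in Python. It is an illustration only, not the rules in the repository linked below. It assumes the access log is in Apache's common or combined format, and the function name `find_snipers`, the sample log lines, and the "cloud" ranges (reserved documentation networks, not real providers) are all made up for the example.

```python
import ipaddress
import re
from collections import defaultdict

# Assumption: Apache common/combined log format. Only the client address
# and the request path are extracted; everything else is ignored.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|HEAD|POST) (\S+)[^"]*"')


def find_snipers(log_lines, cloud_networks):
    """Return sorted IPs that made exactly one request, fetched no CSS,
    and fall inside one of the given cloud CIDR ranges."""
    paths_by_ip = defaultdict(list)
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match:
            ip, path = match.groups()
            paths_by_ip[ip].append(path)

    networks = [ipaddress.ip_network(cidr) for cidr in cloud_networks]
    suspects = []
    for ip, paths in paths_by_ip.items():
        fetched_css = any(p.split("?")[0].endswith(".css") for p in paths)
        if len(paths) != 1 or fetched_css:
            continue  # a normal browser loads the page plus its stylesheet
        try:
            address = ipaddress.ip_address(ip)
        except ValueError:
            continue  # the log recorded a hostname instead of an address
        if any(address in net for net in networks):
            suspects.append(ip)
    return sorted(suspects)


# Hypothetical log lines using reserved documentation addresses.
sample = [
    '203.0.113.7 - - [01/Jan/2025:00:00:01 +0000] "GET /article.html HTTP/1.1" 200 5120',
    '198.51.100.4 - - [01/Jan/2025:00:00:02 +0000] "GET /index.html HTTP/1.1" 200 2048',
    '198.51.100.4 - - [01/Jan/2025:00:00:03 +0000] "GET /style.css HTTP/1.1" 200 512',
    '192.0.2.9 - - [01/Jan/2025:00:00:04 +0000] "GET /about.html HTTP/1.1" 200 1024',
]
print(find_snipers(sample, ["203.0.113.0/24", "198.51.100.0/24"]))
# prints ['203.0.113.7']
```

In the sample, 198.51.100.4 is skipped because it also fetched a stylesheet, and 192.0.2.9 is skipped because it is not in a listed cloud range, so only the lone cloud-hosted request is flagged. A real deployment would need actual provider ranges and some tolerance for legitimate single hits such as feed readers.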
#apache #litespeed #htaccess #crawlers #botfarms
https://codeberg.org/skewray/htaccess