If you write about the messy reality behind "free" internet services: we're seeing #OpenStreetMap hammered by scrapers hiding behind residential proxy/embedded-SDK networks. We're a volunteer-run service and the costs are real. We'd love to talk to a journalist about what we're seeing + how we're responding. #AI #Bots #Abuse
@osm_tech I wonder if there's a way to fail2ban requests coming in faster than typically found in human requests.
@BalooUriza We use fail2ban with custom rules to handle some of this, but fail2ban itself becomes a bottleneck once the ban list grows past roughly 100,000 IP addresses.
@osm_tech @BalooUriza For IPv4, a bitmask of the entire address space is a viable "efficient" implementation of blocking. I wonder if there are tools that can do it that way rather than needing a gigantic list.
@osm_tech @BalooUriza Like, a bitmask of IPv4 space is several times smaller than a Chrome instance. 🙃 🤡
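The bitmask idea above is easy to sketch: one bit per IPv4 address, so the full 2^32-address space costs 2^32 / 8 bytes = 512 MiB, with O(1) lookups and no per-IP rule list. A minimal illustration in Python (not a production tool, and not how any named firewall actually implements this):

```python
import ipaddress

class IPv4Bitmask:
    """One bit per IPv4 address: the whole 2**32 space fits in 512 MiB."""

    def __init__(self, num_addresses: int = 2**32):
        # calloc-backed zeroed buffer; untouched pages cost nothing on Linux
        self.bits = bytearray(num_addresses // 8)

    def _locate(self, ip: str):
        n = int(ipaddress.IPv4Address(ip))
        return n >> 3, 1 << (n & 7)   # (byte index, bit within that byte)

    def ban(self, ip: str) -> None:
        byte, bit = self._locate(ip)
        self.bits[byte] |= bit

    def unban(self, ip: str) -> None:
        byte, bit = self._locate(ip)
        self.bits[byte] &= ~bit

    def is_banned(self, ip: str) -> bool:
        byte, bit = self._locate(ip)
        return bool(self.bits[byte] & bit)
```

In practice, Linux `ipset`/nftables sets achieve a similar effect with hash tables rather than a flat bitmap, which is why they scale where per-IP iptables rules do not.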
@dalias @osm_tech @BalooUriza we have a very efficient implementation in #vinylcache (formerly #varnishcache)
@dalias @BalooUriza But that is one of the points @osm_tech are making in their post. These crawlers resort to using massive amounts of "scrapers hiding behind residential proxy/embedded-SDK networks" - meaning they are using Adware-infested phones all over the world for their scraping attaks. So banning IP ranges won't help much. Playing cat-and-mouse with these scrapers is resource intensive, which is increasingly hard for FOSS projects and is also driving up cost for commercial offerings.
@magezwitscher @BalooUriza @osm_tech Not ranges. Just the single IP, and a short-lived ban. All you need to do is get them down from thousands of requests per minute to one request per hour (because they get banned for an hour each time they start again).
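The short-lived single-IP ban described above can be sketched as a per-IP counter plus an expiry map. This is a toy illustration, not fail2ban's actual mechanism; the threshold and window values are made up:

```python
import time

BAN_SECONDS = 3600      # one-hour ban, as suggested in the post
MAX_PER_MINUTE = 60     # hypothetical "faster than human" threshold

_counts = {}   # ip -> (window_start, requests_in_window)
_banned = {}   # ip -> ban_expiry_timestamp

def allow(ip, now=None):
    """Return True if the request may proceed, False if the IP is banned."""
    if now is None:
        now = time.time()
    expiry = _banned.get(ip)
    if expiry is not None:
        if now < expiry:
            return False                  # ban still active
        del _banned[ip]                   # ban expired, forget it
    start, n = _counts.get(ip, (now, 0))
    if now - start >= 60:
        start, n = now, 0                 # roll over to a new 1-minute window
    n += 1
    _counts[ip] = (start, n)
    if n > MAX_PER_MINUTE:
        _banned[ip] = now + BAN_SECONDS   # short-lived single-IP ban
        return False
    return True
```

An IP that floods at thousands of requests per minute trips the threshold almost immediately and is then held to at most one successful burst per hour, which is the rate reduction the post is arguing for.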
@dalias If the botnet has two million computers, that is still two million requests per hour. What I want is a blocking tool that ISPs can run on their DNS resolvers to cut off the proxy network's backbone, so the clients can no longer receive commands at all.