If you write about the messy reality behind "free" internet services: we're seeing #OpenStreetMap hammered by scrapers hiding behind residential proxy/embedded-SDK networks. We're a volunteer-run service and the costs are real. We'd love to talk to a journalist about what we're seeing + how we're responding. #AI #Bots #Abuse
@osm_tech ugh.... why don't they use the exports...

@pietervdvn Because that would involve a human using their brains or having a shred of conscience and those both go against the basic principles of the companies doing this.


@InsertUser @pietervdvn @osm_tech It goes against their whole ideology. The ideology says trust the machine to do what it copied from scraped Stack Overflow posts. If you try to intervene to make it do better, you're not trusting it.
@InsertUser @pietervdvn @osm_tech One of the most maddening things about all this is I go out of my way on all of my sites to provide a detailed sitemap and all of the link traversal hints to tell bots what links to not bother with because they give nothing of added value, and then the AI scrapers just go ahead and hammer everything en masse anyway, all to try to extract one last shred of additional information from my sites. I'm so sick of it.
@osm_tech Have to tag @evawolfangel here yet again.
@osm_tech @josephcox Open source maps project dealing with AI scrapers, requesting journalists who might be interested ☝️
@osm_tech The proxy SDK providers need to be treated like the DDOS providers they are and prosecuted.
@InsertUser @osm_tech Pulling them from app stores and banning developers of the SDKs would be a good start. Save the criminal charges for after the damage control is done.
@azonenberg @InsertUser @osm_tech Given who controls the app stores, courts may be more willing _and_ faster.
@osm_tech Hey. Sorry to hear about that. Drop me a line on Signal? username: briankrebs.07. Thanks!
@osm_tech I'm administering a web server for a client that has about 50 web sites. Every few days they get hammered by residential proxy IPs for a few hours, so I finally installed Anubis.
@osm_tech Oh, that sounds interesting! If you want, I'd be interested in talking about this.

@osm_tech
Decades ago when I read Dune I thought the Butlerian Jihad against computers was the silliest thing in it.

Suddenly it makes sense. The sooner the LLM AI bubble bursts the better!

@osm_tech this sounds right up 404 Media's alley.

They all have contacts and have reported on museums and Wikipedia having similar issues.

https://www.404media.co/about/

@osm_tech You guys are heroes, nonetheless!! ✊
@osm_tech hey @emanuelmaiberg this sounds interesting and might add yet another part to your reporting on museums and others being scraped.
@osm_tech You are definitely not alone: https://lwn.net/Articles/1008897/ The situation is not sustainable but I'm not sure what we do about it beyond waiting for the AI bubble to burst.
Fighting the AI scraperbot scourge

@corbet @osm_tech I don't have answers either but I hope something emerges because waiting for the bubble to burst still may face the "the market can remain irrational longer than you can remain solvent" problem.
@osm_tech what is an embedded-SDK network?

@utf_7 @osm_tech

App developers can embed some "Sdk" into their apps or games.
The developer receives money.
The "Sdk"-provider proxies requests through these apps and games, to gain residential IPs.
And scrapers can buy these services, to tunnel their requests from residential IPs.

@utf_7 @osm_tech

This gets ugly really fast, if you want to see the full extent: <https://alternativeto.net/software/netnut-proxy-network/> for a list of _known_ residential proxy-providers.

@AliveDevil @utf_7 @osm_tech So ridiculous that Google and Apple won't just permaban any developer embedding one of these "SDKs".

@dalias I wish they'd enforce policies, but they get ad and IAP revenue, so why bother?

Also, these "Sdks" probably have kill-switches (or rather, delayed activation) built-in, to not immediately contact their C&C servers.

@AliveDevil Yes but they could still be banned when caught. A few devs getting banned would be a big deterrent for others to ship this malware.

The right *technical* defense, however, is not to allow apps arbitrary network access unless they're declared in the manifest as a "browser" or other "client software" that the user can use with any service they want (like IRC clients, mail clients, Mastodon clients, etc.).

Instead, the manifest should declare a single domain the app can contact, or multiple if the developer is willing to pay for more intensive vetting of them, and only allow network access to the declared domain(s).
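As a minimal sketch of that proposed rule (the manifest format and field names here are invented for illustration; no real app store enforces this today):

```python
# Hypothetical manifest-based network allowlist, as proposed above.
# The manifest schema and the example app ID are made up for this sketch.
from urllib.parse import urlparse

MANIFEST = {
    "app_id": "com.example.puzzlegame",       # assumed example app
    "declared_domains": ["api.example.com"],  # the only host it may contact
}

def connection_allowed(manifest: dict, url: str) -> bool:
    """Allow a connection only if the URL's host is declared in the manifest."""
    host = urlparse(url).hostname
    return host in manifest["declared_domains"]
```

Under such a rule, an embedded proxy SDK trying to reach its own infrastructure from inside the app would simply be refused at the OS level.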

@dalias @AliveDevil dafuq? if so, "software development kit" sounds wrong in that context. this is plain malware.

imagine using an app and someone downloads child porn or a regular torrent over your connection. how will you prove you're innocent?

@AliveDevil @utf_7 @osm_tech how do I know if any of my installed apps is doing this crap? Could that be just buried in terms of use? Or would I need to give it explicit consent?

@luksfarris @utf_7

Probably terms of use, but this is so shady that I doubt anyone would even bother disclosing it.
Best you can do: Monitor network traffic, and use DNS block lists for these known proxy services.

They definitely won't ask you for consent.
The only way to know an app _doesn't_ use these services is checking for the "requires internet access"-flag in AppStores, but that is basically futile, as most apps require internet access for … something.

2025: Servers on Fire: Keeping OpenStreetMap Online

@osm_tech so I recently read a couple writeups from @briankrebs about malware and residential proxies
@osm_tech @dangoodin maybe this might be up your alley? If not, you may know someone appropriate.
@osm_tech You're absolutely not alone, we wish you good luck! ~n
@osm_tech 404 media might be interested in this - they've been doing a lot of pieces about the impact of AI
@ClaireH @osm_tech @404mediaco here's a good story idea, OSM is having issues with AI scrapers
@osm_tech
@leah just recently discussed the same topic from the perspective of a web hosting provider in the @chaosradio podcast.
@osm_tech @geerlingguy I would 100% watch a video about that. Just throwing it out there :)
@LMieldazis @geerlingguy oooh do we get to show him our out-of-band (remote access) Raspberry Pi with dual power feeds, 4G modem and loads of serial connections? Saved our skin a good few times.
@osm_tech @LMieldazis would love to talk map ops! I've seen many projects pulling in map data and adding scripts to download entire regions
@osm_tech I wonder if there's a way to fail2ban requests coming in faster than typically found in human requests.
@BalooUriza We use fail2ban to handle some of this with custom rules, but eventually fail2ban becomes a bottleneck after 100,000 IP addresses.
@osm_tech @BalooUriza For IPv4, a bitmask of the entire address space is a viable "efficient" implementation of blocking. I wonder if there are tools that can do it that way rather than needing a gigantic list.
@osm_tech @BalooUriza Like, a bitmask of IPv4 space is several times smaller than a Chrome instance. 🙃 🤡
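For illustration, a flat bitmap over the full IPv4 space really does fit in 512 MiB (2^32 addresses, one bit each). A minimal Python sketch, assuming a plain in-process bytearray rather than a kernel-side structure like an ipset:

```python
import ipaddress

class V4Bitmap:
    """One bit per IPv4 address: 2**32 bits = 512 MiB total.

    Bans and membership tests are O(1), regardless of how many
    addresses are blocked -- no giant list to scan.
    """
    def __init__(self):
        self.bits = bytearray(2**32 // 8)  # 512 MiB, zero-initialised

    def _index(self, ip: str):
        n = int(ipaddress.IPv4Address(ip))
        return n >> 3, 1 << (n & 7)        # byte offset, bit mask

    def ban(self, ip: str):
        byte, mask = self._index(ip)
        self.bits[byte] |= mask

    def banned(self, ip: str) -> bool:
        byte, mask = self._index(ip)
        return bool(self.bits[byte] & mask)
```

In practice you'd want this in the kernel path (nftables/ipset both use hash or interval structures for the same effect), but the memory math checks out.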
@dalias @osm_tech @BalooUriza we have a very efficient implementation in #vinylcache (formerly #varnishcache )
@dalias @BalooUriza But that is one of the points @osm_tech are making in their post. These crawlers resort to using massive amounts of "scrapers hiding behind residential proxy/embedded-SDK networks" - meaning they are using adware-infested phones all over the world for their scraping attacks. So banning IP ranges won't help much. Playing cat-and-mouse with these scrapers is resource intensive, which is increasingly hard for FOSS projects and is also driving up cost for commercial offerings.
@magezwitscher @BalooUriza @osm_tech Not ranges. Just the single IP, and a short-lived ban. All you need to do is get them down from thousands of requests per minute to one request per hour (because they get banned for an hour each time they start again).
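A sketch of that per-IP, short-lived ban (the thresholds are made up, and a real deployment would use nftables/ipset set elements with timeouts rather than an in-process dict):

```python
# Illustrative rate-limit-then-ban logic: an IP that exceeds
# RATE_LIMIT requests within WINDOW seconds is refused for
# BAN_SECONDS, dropping its effective rate to one burst per hour.
BAN_SECONDS = 3600   # assumed one-hour ban
RATE_LIMIT = 100     # assumed requests allowed per window
WINDOW = 60          # assumed window length in seconds

hits = {}            # ip -> recent request timestamps
banned_until = {}    # ip -> unix time the ban expires

def allow(ip: str, now: float) -> bool:
    if banned_until.get(ip, 0) > now:
        return False                                  # still banned
    recent = [t for t in hits.get(ip, []) if t > now - WINDOW]
    recent.append(now)
    hits[ip] = recent
    if len(recent) > RATE_LIMIT:
        banned_until[ip] = now + BAN_SECONDS          # start the ban
        return False
    return True
```

The appeal of the short ban is exactly what @dalias describes: you don't need to identify the botnet, only to make each address nearly useless after its first burst.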
@dalias If the botnet has two million computers, that is still two million requests per hour. I want a block tool for the ISP to run on their DNS that blocks the backbone of the proxy network so the clients won't get commands any more.
@osm_tech @BalooUriza is it using ipset hashsets, or default rule-per-ip rules? raw namespace or? I don't know the details of implementation, but if it is L7 load that is problematic (instead of pure bandwidth DDoS), it might be worth to consider whitelisting instead. I.e. whitelist addresses (or /24s) that have *not* had excessive requests lately, and put them in priority network bucket, and the rest (which is not blacklisted) goes in best-effort bucket (to maybe migrate to whitelist later)
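The whitelist idea above might look something like this (a sketch under assumed thresholds, not anyone's actual config): track request counts per /24, and route subnets that have stayed polite into a priority bucket while everything else gets best-effort service.

```python
import ipaddress

THRESHOLD = 1000  # assumed max requests per /24 per window to stay "polite"

def bucket(ip: str, counts: dict) -> str:
    """Classify a client into a traffic bucket by its /24's recent volume.

    counts maps "/24 network" strings to request totals for the last window.
    """
    net = str(ipaddress.ip_network(f"{ip}/24", strict=False))
    if counts.get(net, 0) <= THRESHOLD:
        return "priority"    # well-behaved subnet: full-speed bucket
    return "best-effort"     # noisy subnet: throttled, may be promoted later
```

The nice property, as the post notes, is that legitimate users are never hard-blocked; abusive subnets just stop competing for the fast lane.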

@BalooUriza @osm_tech

Cycling to new IPs is trivial: I ban a few thousand IPs and CIDR ranges in my WAF, and I’ll see 75% of them show up the next time the scraper hits. After that, most don’t show up again and the next scrape comes from a mostly new set of IPs.

I’ve seen a few instances where they will cycle IPs during the same scraping event if some of them are blocked.

I’ve got scrapers that will send every request from a unique IP.

There is a lot of money to be made right now offering hard to block scraping services or tools to enable them.

@BalooUriza The problem is: who do you ban, when the requests keep changing IPs and user agents?