If you write about the messy reality behind "free" internet services: we're seeing #OpenStreetMap hammered by scrapers hiding behind residential proxy/embedded-SDK networks. We're a volunteer-run service and the costs are real. We'd love to talk to a journalist about what we're seeing + how we're responding. #AI #Bots #Abuse
@osm_tech ugh.... why don't they use the exports...

@pietervdvn Because that would involve a human using their brains or having a shred of conscience and those both go against the basic principles of the companies doing this.


@InsertUser @pietervdvn @osm_tech It goes against their whole ideology. The ideology says trust the machine to do what it copied from scraped Stack Overflow posts. If you try to intervene to make it do better, you're not trusting it.
@InsertUser @pietervdvn @osm_tech One of the most maddening things about all this is I go out of my way on all of my sites to provide a detailed sitemap and all of the link traversal hints to tell bots what links to not bother with because they give nothing of added value, and then the AI scrapers just go ahead and hammer everything en masse anyway, all to try to extract one last shred of additional information from my sites. I'm so sick of it.
@osm_tech Have to tag @evawolfangel here yet again.
@osm_tech @josephcox Open source maps project dealing with AI scrapers, requesting journalists who might be interested ☝️
@osm_tech The proxy SDK providers need to be treated like the DDOS providers they are and prosecuted.
@InsertUser @osm_tech Pulling them from app stores and banning developers of the SDKs would be a good start. Save the criminal charges for after the damage control is done.
@azonenberg @InsertUser @osm_tech Given who controls the app stores, courts may be more willing _and_ faster.
@osm_tech Hey. Sorry to hear about that. Drop me a line on Signal? username: briankrebs.07. Thanks!
@osm_tech I'm administering a web server for a client that has about 50 web sites. Every few days they get hammered by residential proxy IPs for a few hours, so I finally installed Anubis.
@osm_tech Oh, that sounds interesting! If you want, I'd be interested in talking about this.

@osm_tech
Decades ago when I read Dune I thought the Butlerian Jihad against computers was the silliest thing in it.

Suddenly it makes sense. The sooner the LLM AI bubble bursts the better!

@osm_tech this sounds right up 404 Media's alley.

They all have contacts and have reported on museums and Wikipedia having similar issues.

https://www.404media.co/about/

@osm_tech You guys are heroes, nonetheless!! ✊
@osm_tech hey @emanuelmaiberg this sounds interesting and might add yet another part to your reporting on museums and others being scraped.
@osm_tech You are definitely not alone: https://lwn.net/Articles/1008897/ The situation is not sustainable but I'm not sure what we do about it beyond waiting for the AI bubble to burst.
Fighting the AI scraperbot scourge

@corbet @osm_tech I don't have answers either but I hope something emerges because waiting for the bubble to burst still may face the "the market can remain irrational longer than you can remain solvent" problem.
@osm_tech what is an embedded-SDK network?

@utf_7 @osm_tech

App developers can embed some "Sdk" into their apps or games.
The developer receives money.
The "Sdk"-provider proxies requests through these apps and games, to gain residential IPs.
And scrapers can buy these services, to tunnel their requests from residential IPs.

@utf_7 @osm_tech

This gets ugly really fast, if you want to see the full extent: <https://alternativeto.net/software/netnut-proxy-network/> for a list of _known_ residential proxy-providers.

@AliveDevil @utf_7 @osm_tech So ridiculous that Google and Apple won't just permaban any developer embedding one of these "SDKs".

@dalias I wish they'd enforce policies, but they get ad and IAP revenue, so why bother?

Also, these "Sdks" probably have kill-switches (or rather, delayed activation) built-in, to not immediately contact their C&C servers.

@AliveDevil Yes but they could still be banned when caught. A few devs getting banned would be a big deterrent for others to ship this malware.

The right *technical* defense, however, is not to allow apps arbitrary network access unless they're declared in the manifest as a "browser" or other "client software" that the user can use with any service they want (like IRC clients, mail clients, Mastodon clients, etc.).

Instead, the manifest should declare a single domain the app can contact, or multiple if the developer is willing to pay for more intensive vetting of them, and only allow network access to the declared domain(s).
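As a minimal sketch of that proposed rule (the manifest format and field names here are invented for illustration; no real app store enforces this today):

```python
# Hypothetical manifest-based network allowlist, as proposed above.
# The manifest schema and the example app ID are made up for this sketch.
from urllib.parse import urlparse

MANIFEST = {
    "app_id": "com.example.puzzlegame",       # assumed example app
    "declared_domains": ["api.example.com"],  # the only host it may contact
}

def connection_allowed(manifest: dict, url: str) -> bool:
    """Allow a connection only if the URL's host is declared in the manifest."""
    host = urlparse(url).hostname
    return host in manifest["declared_domains"]
```

Under such a rule, an embedded proxy SDK trying to reach its own infrastructure from inside the app would simply be refused at the OS level.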

@dalias @AliveDevil dafuq? if so, "software development kit" sounds wrong in that context. this is plain malware.

imagine using an app and someone downloads child porn or a regular torrent over your connection. how will you prove you're innocent?

@AliveDevil @utf_7 @osm_tech how do I know if any of my installed apps is doing this crap? Could that be just buried in terms of use? Or would I need to give it explicit consent?

@luksfarris @utf_7

Probably terms of use, but this is so shady that I doubt anyone would even bother disclosing it.
Best you can do: Monitor network traffic, and use DNS block lists for these known proxy services.

They definitely won't ask you for consent.
The only way to know an app _doesn't_ use these services is checking for the "requires internet access"-flag in AppStores, but that is basically futile, as most apps require internet access for … something.

2025: Servers on Fire: Keeping OpenStreetMap Online

@osm_tech so I recently read a couple writeups from @briankrebs about malware and residential proxies
@osm_tech @dangoodin maybe this might be up your alley? If not, you may know someone appropriate.
@osm_tech You're absolutely not alone, we wish you good luck! ~n
@osm_tech 404 media might be interested in this - they've been doing a lot of pieces about the impact of AI
@ClaireH @osm_tech @404mediaco here's a good story idea, OSM is having issues with AI scrapers
@osm_tech
@leah just recently discussed the same topic from the perspective of a web hosting provider in the @chaosradio podcast.
@osm_tech @geerlingguy I would 100% watch a video about that. Just throwing it out there :)
@LMieldazis @geerlingguy oooh do we get to show him our out-of-band (remote access) Raspberry Pi with dual power feeds, 4G modem and loads of serial connections? Saved our skin a good few times.
@osm_tech @LMieldazis would love to talk map ops! I've seen many projects pulling in map data and adding scripts to download entire regions
@osm_tech I wonder if there's a way to fail2ban requests coming in faster than typically found in human requests.
@BalooUriza We use fail2ban to handle some of this with custom rules, but eventually fail2ban becomes a bottleneck after 100,000 IP addresses.
@osm_tech @BalooUriza For IPv4, a bitmask of the entire address space is a viable "efficient" implementation of blocking. I wonder if there are tools that can do it that way rather than needing a gigantic list.
@osm_tech @BalooUriza Like, a bitmask of IPv4 space is several times smaller than a Chrome instance. 🙃 🤡
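For illustration, a flat bitmap over the full IPv4 space really does fit in 512 MiB (2^32 addresses, one bit each). A minimal Python sketch, assuming a plain in-process bytearray rather than a kernel-side structure like an ipset:

```python
import ipaddress

class V4Bitmap:
    """One bit per IPv4 address: 2**32 bits = 512 MiB total.

    Bans and membership tests are O(1), regardless of how many
    addresses are blocked -- no giant list to scan.
    """
    def __init__(self):
        self.bits = bytearray(2**32 // 8)  # 512 MiB, zero-initialised

    def _index(self, ip: str):
        n = int(ipaddress.IPv4Address(ip))
        return n >> 3, 1 << (n & 7)        # byte offset, bit mask

    def ban(self, ip: str):
        byte, mask = self._index(ip)
        self.bits[byte] |= mask

    def banned(self, ip: str) -> bool:
        byte, mask = self._index(ip)
        return bool(self.bits[byte] & mask)
```

In practice you'd want this in the kernel path (nftables/ipset both use hash or interval structures for the same effect), but the memory math checks out.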
@dalias @osm_tech @BalooUriza we have a very efficient implementation in #vinylcache (formerly #varnishcache )
@dalias @BalooUriza But that is one of the points @osm_tech are making in their post. These crawlers resort to using massive amounts of "scrapers hiding behind residential proxy/embedded-SDK networks" - meaning they are using adware-infested phones all over the world for their scraping attacks. So banning IP ranges won't help much. Playing cat-and-mouse with these scrapers is resource intensive, which is increasingly hard for FOSS projects and is also driving up cost for commercial offerings.
@magezwitscher @BalooUriza @osm_tech Not ranges. Just the single IP, and a short-lived ban. All you need to do is get them down from thousands of requests per minute to one request per hour (because they get banned for an hour each time they start again).
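A sketch of that per-IP, short-lived ban (the thresholds are made up, and a real deployment would use nftables/ipset set elements with timeouts rather than an in-process dict):

```python
# Illustrative rate-limit-then-ban logic: an IP that exceeds
# RATE_LIMIT requests within WINDOW seconds is refused for
# BAN_SECONDS, dropping its effective rate to one burst per hour.
BAN_SECONDS = 3600   # assumed one-hour ban
RATE_LIMIT = 100     # assumed requests allowed per window
WINDOW = 60          # assumed window length in seconds

hits = {}            # ip -> recent request timestamps
banned_until = {}    # ip -> unix time the ban expires

def allow(ip: str, now: float) -> bool:
    if banned_until.get(ip, 0) > now:
        return False                                  # still banned
    recent = [t for t in hits.get(ip, []) if t > now - WINDOW]
    recent.append(now)
    hits[ip] = recent
    if len(recent) > RATE_LIMIT:
        banned_until[ip] = now + BAN_SECONDS          # start the ban
        return False
    return True
```

The appeal of the short ban is exactly what @dalias describes: you don't need to identify the botnet, only to make each address nearly useless after its first burst.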
@dalias If the botnet has two million computers, that is still two million requests per hour. I want a block tool for the ISP to run on their DNS that blocks the backbone of the proxy network so the clients won't get commands any more.
@osm_tech @BalooUriza is it using ipset hashsets, or default rule-per-ip rules? raw namespace or? I don't know the details of implementation, but if it is L7 load that is problematic (instead of pure bandwidth DDoS), it might be worth to consider whitelisting instead. I.e. whitelist addresses (or /24s) that have *not* had excessive requests lately, and put them in priority network bucket, and the rest (which is not blacklisted) goes in best-effort bucket (to maybe migrate to whitelist later)
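The whitelist idea above might look something like this (a sketch under assumed thresholds, not anyone's actual config): track request counts per /24, and route subnets that have stayed polite into a priority bucket while everything else gets best-effort service.

```python
import ipaddress

THRESHOLD = 1000  # assumed max requests per /24 per window to stay "polite"

def bucket(ip: str, counts: dict) -> str:
    """Classify a client into a traffic bucket by its /24's recent volume.

    counts maps "/24 network" strings to request totals for the last window.
    """
    net = str(ipaddress.ip_network(f"{ip}/24", strict=False))
    if counts.get(net, 0) <= THRESHOLD:
        return "priority"    # well-behaved subnet: full-speed bucket
    return "best-effort"     # noisy subnet: throttled, may be promoted later
```

The nice property, as the post notes, is that legitimate users are never hard-blocked; abusive subnets just stop competing for the fast lane.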

@BalooUriza @osm_tech

Cycling to new IPs is trivial: I ban a few thousand IPs and CIDR ranges in my WAF, and I’ll see 75% of them show up the next time the scraper hits. After that, most don’t show up again and the next scrape comes from a mostly new set of IPs.

I’ve seen a few instances where they will cycle IPs during the same scraping event if some of them are blocked.

I’ve got scrapers that will send every request from a unique IP.

There is a lot of money to be made right now offering hard to block scraping services or tools to enable them.

@BalooUriza The problem is: who do you ban, when the requests keep changing IPs and user agents?