Dear AI Companies, instead of sneakily scraping OpenStreetMap.org, how about a tiny $10,000 donation? We'll even throw in a shiny new download link to our entire planet's geo data! Who knew it was that easy? Start here: https://supporting.openstreetmap.org/donate/ #win #ai #bots #OpenStreetMap 🌍 🤖 🤑

But wait, there's more: for a $50,000 donation we'll even provide live minutely streaming updates direct from OpenStreetMap.org. #WIN #AI #Bots #OpenStreetMap
@[email protected] I think when your target audience sees these prices they think it is a scam, because it is way too cheap for such a service.
@bart I suspect doing it the 'sensible' way would damage their disruptive image (worth 100s of millions $).
@bart (for the unaware: you can get the planet data for free and streaming updates for free-to-cheap, OSM doesn't really sell its data, it's freely available)
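For anyone curious how those free streaming updates work: OSM publishes diff files under a path derived from a replication sequence number, zero-padded to nine digits and split into three directory levels. A minimal sketch of that convention (the base URL shown in the comment is the public replication endpoint; double-check the live `state.txt` before relying on this):

```python
def replication_path(sequence_number: int) -> str:
    """Map an OSM replication sequence number to its diff path.

    Replication servers zero-pad the sequence number to nine digits
    and split it into three directory levels, e.g. 6123456 maps to
    "006/123/456".
    """
    padded = f"{sequence_number:09d}"
    return f"{padded[0:3]}/{padded[3:6]}/{padded[6:9]}"


# A minutely diff would then live at, for example:
# https://planet.openstreetmap.org/replication/minute/006/123/456.osc.gz
```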
@[email protected] Scraping OpenStreetMap. 🤦 First time I see that combination of words.
@bart Unfortunately it is extremely common. Sometimes 100s of req/s hitting expensive API endpoints. Multiple IPs, faked UAs.
@Firefishy @bart why scrap when there is planet.osm available?
@wikiyu @bart Sssssh! Don't give away our secret s̶a̶u̶c̶e̶ source. Honestly, no idea. The full planet.osm data would be a lot easier to use than painfully slow scraped data.

@Firefishy @bart
Exactly, and there are also incremental diffs with the changes... and per-continent extracts and...
Oh god, I cannot imagine ANY reason to scrape it from the website.

Or maybe... it was an AI's terrible answer to "download whole OSM"

@wikiyu @Firefishy @[email protected] probably asked an ai for a program to get all the data...
@Firefishy @wikiyu @bart that's expecting people behind those companies to be, you know, actually competent
@SRAZKVT @Firefishy @bart you won that conversation ;-)
@wikiyu @Firefishy @bart Because that’s what Copilot told their junior dev to do when they asked for boilerplate code.
Probably.
@Firefishy @bart Similar thing to what's happening to @readthedocs then https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/ (although luckily they were not being subjected to faked UAs, AFAIK...)
@Firefishy @bart Interesting challenge. Care to share how you overcome these nuisances? Do you block them by headers or just let them wreak havoc?
@Firefishy My guess would be that a contractor in a certain country was tasked with downloading OSM, and since in their culture one never says no or questions instructions, they did exactly that with the scraping tools they had already built for other sites.
@Firefishy why would they bother to pay when OSM mappers do not care to enforce even its own license requirements? https://www.openstreetmap.org/user/laznik/diary/404381
@leadingzero I created https://github.com/openstreetmap/tile-attribution and the automation that runs it. I care about attribution. OSM cannot do everything, but the people who make up the project do try. Remember OSM is just people like you and me working together.
@Firefishy The problem is that most OSM mappers do not care about the attribution requirement to be enforced. The key word is "enforced". Think about it — would you cast your vote to make the OSMF go to court to defend the license if the offending party ignores the love letters?
@leadingzero @Firefishy we have done that. Recently we settled out of court in Germany.
@grischard, in my blog post I asked the OSMF to state what the criteria are for taking legal action against license violators, as it is clear that most cases are NOT pursued.
@leadingzero discussing this in public, letting some violators know that they’re at the bottom of our priorities list, would be self-sabotage.
Wrong answer @grischard. The correct answer should be "we are striving for 100% compliance". We, the OSM mappers, granted the OSMF the exclusive right to defend the license, and with that comes the duty to do so to the maximum extent. If you have a threshold, then explain to the OSM community why it exists.
@leadingzero I feel like you’re putting words in my mouth and trying to pick an argument here. I never said we have a threshold, but that we triage and prioritise.
@grischard sorry - my bad. However, there is little difference between triage and a threshold as both mean that OSMF is not striving to achieve 100% compliance, which is my point here and in my blog post as well. So let me repeat the question — why not resolve every case and do so in court if necessary?
@Firefishy how about OSM just blanket blocks AI crawlers
@Jessica Unfortunately it's not easy. User-Agents are often library defaults (eg: python-requests/2.26.0) or faked (browsers, "googlebot", or similar). Honouring robots.txt is treated as optional. When blocked, they change IP or User-Agent.
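For illustration, the naive user-agent blocklist people usually suggest looks something like this (the crawler names are examples, and the second comment spells out the problem described above: faked UAs sail straight through):

```python
# Hypothetical blocklist of crawler User-Agent substrings; a real
# deployment would keep this in config and update it constantly.
CRAWLER_UA_SUBSTRINGS = (
    "gptbot",
    "bytespider",
    "ccbot",
    "claudebot",
)


def is_blocked_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent matches a known crawler pattern.

    Trivially evaded: a scraper that sends a browser UA string, or
    rotates UAs per request, will never match.
    """
    ua = user_agent.lower()
    return any(pattern in ua for pattern in CRAWLER_UA_SUBSTRINGS)
```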
@Firefishy most AI crawlers do have their own user-agent so the big offenders can be blocked, like Bytespider and such.
@Firefishy I'm not saying block every single IP that has ever datamined you for AI, I'm saying block the ones that truly cause damage, or you could set a really generous rate limit that only people who would cause trouble would go over.
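The "really generous rate limit" idea could be sketched as a per-client token bucket; the rates and burst size below are made up for illustration, and a real deployment would key buckets by IP or API token:

```python
import time


class TokenBucket:
    """Per-client token bucket: allow short bursts, cap the sustained rate.

    Illustrative numbers only: 10 req/s sustained with a burst of 100
    is far more than any human user needs, so only abusive clients
    would ever be refused.
    """

    def __init__(self, rate_per_sec: float = 10.0, burst: int = 100,
                 clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock          # injectable for testing
        self.last = clock()

    def allow(self) -> bool:
        """Consume one token if available; refill based on elapsed time."""
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```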

@Firefishy @Jessica Is it possible to reply to unauthorized crawlers with data spiked with canaries? So when you take someone to court, you can show the judge logs saying "after notifying them that they were in violation, we altered our responses to add these custom changes only to this query, from this IP, at this timestamp".

I recall mapmakers of yore would include non-existent features on printed maps to help protect their copyright.
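The canary idea could look something like this: derive a deterministic per-client marker and embed it in the response, so that if the marker later surfaces in a scraped dataset or model output, server logs tie it back to a specific client and time. This is entirely a hypothetical sketch (the field name `meta_rev` and the key handling are invented for illustration), not anything OSM actually does:

```python
import hashlib
import hmac

# Hypothetical server-side secret; in practice this would come from
# a secrets store, never from source control.
SECRET_KEY = b"server-side secret"


def canary_tag(client_ip: str, timestamp: int) -> str:
    """Deterministic marker for one client at one point in time.

    HMAC keeps the marker unforgeable without the secret, so the
    (ip, timestamp) pair it encodes can be proven from logs later.
    """
    msg = f"{client_ip}|{timestamp}".encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()[:12]


def spike_response(payload: dict, client_ip: str, timestamp: int) -> dict:
    """Attach the canary as an innocuous-looking extra field."""
    spiked = dict(payload)
    spiked["meta_rev"] = canary_tag(client_ip, timestamp)  # hypothetical field name
    return spiked
```

Much like the fictitious "trap streets" on printed maps mentioned above, the marker only has to survive copying to do its job.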

404 Media: "Websites are Blocking the Wrong AI Scrapers (Because AI Companies Keep Making New Ones)". Hundreds of sites have put old Anthropic scrapers on their blocklist, while leaving a new one unblocked.
@Firefishy If they scrape your data, then surely they're obliged to be an openAI.

@Firefishy The theft is the point.

They all want to be Steve Jobs mixed with Rumpelstiltskin, and other people’s data is the straw. Can’t be a business genius if you *pay* for data, when you can get it free.

@metaning @Firefishy OSM's data is already free, so there is nothing to pay *for*:

https://planet.openstreetmap.org/

@derickr @metaning @Firefishy Yeah, it is free. And if it's available in full, why are they scraping? Talk about bad engineering: building AI with help from AI.
@Firefishy they'd rather pay a developer $20,000 to program and maintain the scraper.
@Firefishy They have no morals; they only know how to steal.