Dear AI Companies, instead of sneakily scraping OpenStreetMap.org, how about a tiny $10,000 donation? We'll even throw in a shiny new download link to our entire planet's geo data! Who knew it was that easy? Start here: https://supporting.openstreetmap.org/donate/ #win #ai #bots #OpenStreetMap 🌍 🤖 🤑

But wait, there's more: for a $50,000 donation we'll even provide live minutely streaming updates direct from OpenStreetMap.org. #WIN #AI #Bots #OpenStreetMap
@[email protected] I think when your target audience sees these prices they think it is a scam, because it is way too cheap for such a service.
@bart I suspect doing it the 'sensible' way would damage their disruptive image (worth 100s of millions $).
@bart (for the unaware: you can get the planet data for free and streaming updates for free-to-cheap, OSM doesn't really sell its data, it's freely available)
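For anyone curious how those free streaming updates work: OSM publishes diff files under a path derived from a replication sequence number, zero-padded to nine digits and split into three directory levels. A minimal sketch of that convention (the base URL shown in the comment is the public replication endpoint; double-check the live `state.txt` before relying on this):

```python
def replication_path(sequence_number: int) -> str:
    """Map an OSM replication sequence number to its diff path.

    Replication servers zero-pad the sequence number to nine digits
    and split it into three directory levels, e.g. 6123456 maps to
    "006/123/456".
    """
    padded = f"{sequence_number:09d}"
    return f"{padded[0:3]}/{padded[3:6]}/{padded[6:9]}"


# A minutely diff would then live at, for example:
# https://planet.openstreetmap.org/replication/minute/006/123/456.osc.gz
```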
@[email protected] Scraping OpenStreetMap. 🤦 First time I see that combination of words.
@bart Unfortunately it is extremely common. Sometimes 100s of req/s hitting expensive API endpoints. Multiple IPs, faked UAs.
@Firefishy @bart why scrap when there is planet.osm available?
@wikiyu @bart Sssssh! Don't give away our secret s̶a̶u̶c̶e̶ source. Honestly, no idea. The full planet.osm data would be a lot easier to use than painfully slow scraped data.

@Firefishy @bart
Exactly, and there are also incremental diffs with the changes... and per-continent extracts and...
Oh god, I cannot imagine ANY reason to scrape it from the website.

Or maybe... it was an AI's terrible answer to "download whole OSM"

@wikiyu @Firefishy @[email protected] probably asked an ai for a program to get all the data...
@Firefishy @wikiyu @bart that's expecting people behind those companies to be, you know, actually competent
@SRAZKVT @Firefishy @bart you won that conversation ;-)
@wikiyu @Firefishy @bart Because that’s what Copilot told their junior dev to do when they asked for boilerplate code.
Probably.
@Firefishy @bart Similar thing to what's happening to @readthedocs then https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/ (although luckily they were not being subjected to faked UAs, AFAIK...)
@Firefishy @bart Interesting challenge. Care to share how you overcome these nuisances? Do you block them by headers or just let them wreak havoc?
@Firefishy My guess would be that a contractor in a certain country was tasked with downloading OSM, and since in their culture one never says no or questions instructions, they did exactly that with the scraping tools they had already built for other sites.
@Firefishy why would they bother to pay when OSM mappers do not care to enforce even its own license requirements? https://www.openstreetmap.org/user/laznik/diary/404381
@leadingzero I created https://github.com/openstreetmap/tile-attribution and the automation that runs it. I care about attribution. OSM cannot do everything, but the people who make up the project do try. Remember OSM is just people like you and me working together.
@Firefishy The problem is that most OSM mappers do not care about the attribution requirement to be enforced. The key word is "enforced". Think about it — would you cast your vote to make the OSMF go to court to defend the license if the offending party ignores the love letters?
@leadingzero @Firefishy we have done that. Recently we settled out of court in Germany.
@grischard, in my blog post I asked the OSMF to state what the criteria are for taking legal action against license violators, as it is clear that most cases are NOT pursued.
@leadingzero discussing this in public, letting some violators know that they’re at the bottom of our priorities list, would be self-sabotage.
Wrong answer @grischard. The correct answer should be "we are striving for 100% compliance". We, the OSM mappers, granted the OSMF the exclusive right to defend the license, and with that comes the duty to do so to the maximum extent. If you have a threshold, then explain to the OSM community why it exists.
@leadingzero I feel like you’re putting words in my mouth and trying to pick an argument here. I never said we have a threshold, but that we triage and prioritise.
@grischard sorry - my bad. However, there is little difference between triage and a threshold as both mean that OSMF is not striving to achieve 100% compliance, which is my point here and in my blog post as well. So let me repeat the question — why not resolve every case and do so in court if necessary?
@Firefishy how about OSM just blanket blocks AI crawlers
@Jessica Unfortunately it's not easy. User-Agents are often library defaults (eg: python-requests/2.26.0) or faked (browsers, "googlebot", or similar). Honouring robots.txt is treated as optional. When blocked, they change IP or User-Agent.
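For illustration, the naive user-agent blocklist people usually suggest looks something like this (the crawler names are examples, and the second comment spells out the problem described above: faked UAs sail straight through):

```python
# Hypothetical blocklist of crawler User-Agent substrings; a real
# deployment would keep this in config and update it constantly.
CRAWLER_UA_SUBSTRINGS = (
    "gptbot",
    "bytespider",
    "ccbot",
    "claudebot",
)


def is_blocked_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent matches a known crawler pattern.

    Trivially evaded: a scraper that sends a browser UA string, or
    rotates UAs per request, will never match.
    """
    ua = user_agent.lower()
    return any(pattern in ua for pattern in CRAWLER_UA_SUBSTRINGS)
```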
@Firefishy most AI crawlers do have their own user-agent so the big offenders can be blocked, like Bytespider and such.
@Firefishy I'm not saying block every single IP that has ever datamined you for AI, I'm saying block the ones that truly cause damage, or you could set a really generous rate limit that only people who would cause trouble would go over.
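The "really generous rate limit" idea could be sketched as a per-client token bucket; the rates and burst size below are made up for illustration, and a real deployment would key buckets by IP or API token:

```python
import time


class TokenBucket:
    """Per-client token bucket: allow short bursts, cap the sustained rate.

    Illustrative numbers only: 10 req/s sustained with a burst of 100
    is far more than any human user needs, so only abusive clients
    would ever be refused.
    """

    def __init__(self, rate_per_sec: float = 10.0, burst: int = 100,
                 clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock          # injectable for testing
        self.last = clock()

    def allow(self) -> bool:
        """Consume one token if available; refill based on elapsed time."""
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```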

@Firefishy @Jessica Is it possible to reply to unauthorized crawlers with data spiked with canaries? So when you take someone to court, you can show the judge logs saying "after notifying them that they were in violation, we altered our responses to add these custom changes only to this query, from this IP, at this timestamp".

I recall mapmakers of yore would include non-existent features on printed maps to help protect their copyright.
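The canary idea could look something like this: derive a deterministic per-client marker and embed it in the response, so that if the marker later surfaces in a scraped dataset or model output, server logs tie it back to a specific client and time. This is entirely a hypothetical sketch (the field name `meta_rev` and the key handling are invented for illustration), not anything OSM actually does:

```python
import hashlib
import hmac

# Hypothetical server-side secret; in practice this would come from
# a secrets store, never from source control.
SECRET_KEY = b"server-side secret"


def canary_tag(client_ip: str, timestamp: int) -> str:
    """Deterministic marker for one client at one point in time.

    HMAC keeps the marker unforgeable without the secret, so the
    (ip, timestamp) pair it encodes can be proven from logs later.
    """
    msg = f"{client_ip}|{timestamp}".encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()[:12]


def spike_response(payload: dict, client_ip: str, timestamp: int) -> dict:
    """Attach the canary as an innocuous-looking extra field."""
    spiked = dict(payload)
    spiked["meta_rev"] = canary_tag(client_ip, timestamp)  # hypothetical field name
    return spiked
```

Much like the fictitious "trap streets" on printed maps mentioned above, the marker only has to survive copying to do its job.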

404 Media: "Websites are Blocking the Wrong AI Scrapers (Because AI Companies Keep Making New Ones)". Hundreds of sites have put old Anthropic scrapers on their blocklist, while leaving a new one unblocked.
@Firefishy If they scrape your data, then surely they're obliged to be an openAI.

@Firefishy The theft is the point.

They all want to be Steve Jobs mixed with Rumpelstiltskin, and other people’s data is the straw. Can’t be a business genius if you *pay* for data, when you can get it free.

@metaning @Firefishy OSM's data is already free, so there is nothing to pay *for*:

https://planet.openstreetmap.org/

@derickr @metaning @Firefishy Yeah, it is free. And if it's available in full, why are they scraping? Talk about bad engineering: building AI with help from AI.
@Firefishy they'd rather pay a developer $20,000 to program and maintain the scraper.
@Firefishy They have no morals; they only know how to steal.