We need something beyond robots.txt. I want my content indexed by search engines. I do not want my content used to train AI. https://searchengineland.com/gptbot-openais-new-web-crawler-430360
GPTBot - OpenAI's new web crawler

You can now disallow ChatGPT from crawling your website and webpages.

Search Engine Land
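For reference, a minimal sketch of what such a robots.txt rule looks like and how a well-behaved crawler would interpret it. The `GPTBot` token comes from the linked article's title; the rule set, the `is_allowed` helper, and the example.com URLs are illustrative assumptions. It only affects crawlers that choose to honor robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: refuse GPTBot, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed("GPTBot", "https://example.com/post/1"))
    print(is_allowed("Googlebot", "https://example.com/post/1"))
```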

@gallaugher You do realize that BOTH ARE THE SAME?

You literally can't differentiate them!

The only winning move is to literally block the ASNs of the networks being used to collect training data, which means you have to ban all of GAFAM from your systems.

Good luck with that!

Not to mention that any #AiBan isn't legally enforceable...
https://felixreda.eu/2021/07/github-copilot-is-not-infringing-your-copyright/

GitHub Copilot is not infringing your copyright

Felix Reda

@kkarhan @gallaugher Many crawlers abide by robots.txt, others do not.

All fediverse apps need to be able to go beyond robots.txt and also implement user-agent filtering at the request level for the crawlers that do not. (This won’t stop bad actors.)

It’s the GETs that you have to stay on top of to catch the buggers.

Last week ByteDance used 50+ different IPs and ignored the robots.txt file.
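A rough sketch of what request-level user-agent filtering could look like, written as a small WSGI middleware. The token list is an assumption for illustration (`Bytespider` is assumed here to be the ByteDance crawler's token), and `should_block` and `ua_filter` are hypothetical names. As noted above, it does nothing against a crawler that sends a fake User-Agent.

```python
from typing import Optional

# Assumed, illustrative list of crawler tokens to refuse (lowercase).
BLOCKED_AGENT_TOKENS = ("gptbot", "bytespider", "ccbot")


def should_block(user_agent: Optional[str]) -> bool:
    """Return True if the User-Agent header contains a blocked token."""
    if not user_agent:
        return False  # a missing header is not treated as a bot here
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


def ua_filter(app):
    """Wrap a WSGI app so that blocked user agents get a 403 on every request."""
    def middleware(environ, start_response):
        if should_block(environ.get("HTTP_USER_AGENT")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware
```

This matches known bot tokens instead of allowlisting mainstream browsers, so unusual but legitimate clients are not rejected.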

@dogriley @gallaugher That doesn't work, as it will also prevent people from being able to use #Accessibility browsers like #LynxBrowser...

What you can do is literally block entire ASNs of those companies and write a salty #AbuseReport, CC'ing everyone that is interconnecting the attacker with your network, and demand they handle said rogue traffic.
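A minimal sketch of the matching step behind ASN-level blocking, assuming the prefixes announced by the unwanted ASNs have already been collected from some routing data source. The prefixes below are reserved documentation ranges used as placeholders, not real crawler networks, and `is_blocked` is a hypothetical helper name.

```python
import ipaddress

# Placeholder prefixes (documentation ranges); a real list would hold the
# prefixes announced by the ASNs you decided to block.
BLOCKED_PREFIXES = [
    ipaddress.ip_network(prefix)
    for prefix in ("192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32")
]


def is_blocked(remote_addr: str) -> bool:
    """Return True if the client address falls inside a blocked prefix."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # not a parseable IP address; leave the decision to other layers
    return any(
        addr.version == net.version and addr in net
        for net in BLOCKED_PREFIXES
    )
```

In practice this kind of check usually lives in the firewall or reverse proxy, not in the application, but the logic is the same.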

@dogriley @gallaugher

That being said, there are some resources to reduce traffic if that's what you want...

https://www.keycdn.com/blog/web-crawlers

https://github.com/monperrus/crawler-user-agents

Web Crawlers - Top 10 Most Popular - KeyCDN

Web crawlers can play a vital part in getting your content indexed. Check out our list of the top 10 web crawlers to ensure you're handling them correctly.

KeyCDN

@dogriley @gallaugher That being said, your mileage may vary greatly.

For example, #crawlers can't be banned in Germany if they act with "legitimate interest" [i.e. price comparison systems]...

There was a court case of an airline trying to ban crawlers from accessing their site, and said airline lost against the comparison site.
https://www.internetworld.de/digitaler-handel/rechtstipp/screen-scraping-erlaubt-473348.html

#NotLegalAdvice OFC!

Is "screen scraping" always allowed?

Online flight-booking portals draw on data from various airlines. But what if an airline's terms and conditions prohibit extracting that data? By Stefan Michel

iwb