Mastodawn

@gallaugher You do realize that BOTH ARE THE SAME?

You literally can't differentiate them!

The only winning move is to literally block the ASNs of the networks being used to collect training data, which means you'd have to ban all the GAFAMs from your systems.

Good luck with that!

Not to mention that any #AiBan isn't legally enforceable...
https://felixreda.eu/2021/07/github-copilot-is-not-infringing-your-copyright/

GitHub Copilot is not infringing your copyright

Felix Reda

Sean Riley

@kkarhan @gallaugher Many crawlers abide by robots.txt; others do not.

All fediverse apps need to go beyond robots.txt and also implement agent-level filtering at the request level for the crawlers that do not. (This won't stop bad actors.)

It’s the GETs that you have to stay on top of to catch the buggers.

Last week ByteDance used 50+ different IPs and ignored the robots.txt file.
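
A minimal sketch of what agent-level filtering at the request level could look like, written as a WSGI middleware in Python. The token list and the names `BLOCKED_AGENT_TOKENS`, `is_blocked_agent`, and `AgentFilter` are assumptions made for illustration, not part of any fediverse app. It only stops crawlers that send an honest User-Agent header, and it matches specific crawler tokens rather than guessing at "non-browser" agents, so text browsers such as Lynx still get through.

```python
from typing import Callable, Iterable

# Assumed, illustrative list of crawler User-Agent substrings; not exhaustive.
BLOCKED_AGENT_TOKENS = ("gptbot", "bytespider", "ccbot")


def is_blocked_agent(user_agent: str, tokens: Iterable[str] = BLOCKED_AGENT_TOKENS) -> bool:
    """Return True when the User-Agent header contains a blocked token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in tokens)


class AgentFilter:
    """WSGI middleware that answers 403 to blocked agents before the app runs."""

    def __init__(self, app: Callable, tokens: Iterable[str] = BLOCKED_AGENT_TOKENS):
        self.app = app
        self.tokens = tuple(t.lower() for t in tokens)

    def __call__(self, environ, start_response):
        if is_blocked_agent(environ.get("HTTP_USER_AGENT", ""), self.tokens):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everything else is passed through to the wrapped application.
        return self.app(environ, start_response)
```

Wrapping an existing WSGI app would be `app = AgentFilter(app)`. A crawler that spoofs a browser User-Agent passes this check untouched.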

@dogriley @gallaugher That doesn't work, as this will either prevent people from being able to use #Accessibility browsers like #LynxBrowser...

What you can do is literally block entire ASNs of those companies and write a salty #AbuseReport, CC'ing everyone who interconnects the attacker with your network, and demand that they handle said rogue traffic.
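
In practice, blocking an ASN means blocking the IP prefixes that ASN announces. A rough Python sketch of the lookup side follows, assuming the prefix list has already been obtained from some routing or registry data source. The prefixes below are documentation ranges used as placeholders, and `BLOCKED_PREFIXES` and `is_blocked_ip` are hypothetical names.

```python
import ipaddress

# Hypothetical prefixes standing in for the routes one ASN announces.
# These are reserved documentation ranges, not real crawler networks.
BLOCKED_PREFIXES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_blocked_ip(client_ip: str, prefixes=BLOCKED_PREFIXES) -> bool:
    """Return True when client_ip falls inside any blocked prefix."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        # Unparseable addresses are not matched here; handle them separately.
        return False
    # Compare only against prefixes of the same IP version (v4 vs v6).
    return any(addr.version == net.version and addr in net for net in prefixes)
```

A linear scan like this is fine for a handful of prefixes. For the thousands of routes a large network announces, the block would more plausibly live in the firewall or reverse proxy than in application code.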

@dogriley @gallaugher After all, every IP address block has a WHOIS!
https://tausibs.org/display/66e18145-1364-d162-3fc5-a79101940885
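
For the WHOIS step, here is a hedged Python sketch of finding an abuse contact for an address block. WHOIS is a plain TCP exchange on port 43: send the query followed by CRLF, then read until the server closes the connection. The helper names are hypothetical. `whois.iana.org` is used as a starting point and typically answers with a referral to the responsible registry, so a real lookup needs a second query there. Abuse fields are named differently per registry (for example `OrgAbuseEmail` or `abuse-mailbox`), so the parser just collects e-mail addresses from any line that mentions "abuse".

```python
import re
import socket
from typing import List

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def whois_query(query: str, server: str = "whois.iana.org", timeout: float = 10.0) -> str:
    """Send one WHOIS query over TCP port 43 and return the raw text reply."""
    with socket.create_connection((server, 43), timeout=timeout) as sock:
        sock.sendall(query.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def extract_abuse_contacts(whois_text: str) -> List[str]:
    """Collect unique e-mail addresses from lines that mention 'abuse'."""
    found: List[str] = []
    for line in whois_text.splitlines():
        if "abuse" in line.lower():
            for email in EMAIL_RE.findall(line):
                if email not in found:
                    found.append(email)
    return found
```

Usage would be along the lines of `extract_abuse_contacts(whois_query("192.0.2.1", server=...))`, with the server taken from the referral in the first reply.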

gallaugher

We need something beyond robots.txt. I want my content indexed by search engines. I do not want my content used to train AI. https://searchengineland.com/gptbot...

https://www.keycdn.com/blog/web-crawlers

@dogriley @gallaugher

That being said, there are some resources to reduce traffic if that's what you want...

https://github.com/monperrus/crawler-user-agents
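
A small Python sketch of how a maintained pattern list like that one might be consumed. It assumes the list is a JSON array of objects that each carry a `pattern` regular expression. That shape, the inline sample entries, and the function names are assumptions for illustration, so check the repository's actual file before relying on it.

```python
import json
import re

# Inline sample in the assumed shape: a JSON array of objects, each with a
# "pattern" regular expression. The entries are illustrative only.
SAMPLE_JSON = '[{"pattern": "GPTBot"}, {"pattern": "Bytespider"}, {"pattern": "[wW]get"}]'


def compile_crawler_patterns(raw_json: str):
    """Parse the JSON list and compile every entry that has a pattern."""
    entries = json.loads(raw_json)
    return [re.compile(entry["pattern"]) for entry in entries if "pattern" in entry]


def matches_known_crawler(user_agent: str, patterns) -> bool:
    """Return True when any compiled pattern is found in the User-Agent."""
    return any(p.search(user_agent) for p in patterns)
```

Compiling the patterns once at startup and reusing them per request keeps the check cheap. As with any User-Agent matching, it only catches crawlers that identify themselves.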

Web Crawlers - Top 10 Most Popular - KeyCDN

Web crawlers can play a vital part in getting your content indexed. Check out our list of the top 10 web crawlers to ensure you're handling them correctly.

KeyCDN