@gallaugher You do realize that BOTH ARE THE SAME?
You literally can't differentiate them!
The only winning move is to literally block the ASNs of the networks being used to collect training data, which means you have to ban all the GAFAMs from your systems.
Good luck with that!
Not to mention that any #AiBan isn't legally enforceable...
https://felixreda.eu/2021/07/github-copilot-is-not-infringing-your-copyright/
@kkarhan @gallaugher Many crawlers abide by robots.txt, others do not.
All fediverse apps need to be able to go beyond robots.txt and also implement agent-level filtering at the request level for those that do not. (Won’t stop bad actors.)
It’s the GETs that you have to stay on top of to catch the buggers.
Last week ByteDance used 50+ different IPs and ignored the robots file.
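A minimal sketch of what agent-level filtering at the request level could look like, in Python. The token list and the function names `is_blocked_agent` and `status_for_request` are made up for illustration; a real deployment would maintain its own list. As said above, this only catches crawlers that announce themselves honestly.

```python
# Hypothetical denylist of User-Agent substrings (illustrative, not exhaustive).
BLOCKED_AGENT_TOKENS = ("bytespider", "gptbot", "ccbot")


def is_blocked_agent(user_agent):
    """Return True if the User-Agent header contains a denylisted token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


def status_for_request(user_agent):
    """Pick an HTTP status for an incoming request: 403 if denylisted, else 200."""
    return 403 if is_blocked_agent(user_agent) else 200
```

Because this is a denylist rather than an allowlist, a request with an unfamiliar or missing User-Agent still gets through; only the listed tokens are refused.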
@dogriley @gallaugher That doesn't work, as this will either prevent people from being able to use #Accessibility Browsers like #LynxBrowser...
What you can do is literally block entire ASNs of those companies and write a salty #AbuseReport, CC'ing everyone that is interconnecting the attacker with your network and demanding that they handle said rogue traffic.
That being said, there are some resources to reduce traffic if that's what you want...
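A rough sketch of the ASN-blocking idea, in Python: once you have looked up the prefixes an ASN announces (that lookup is not shown here), you can refuse any client address that falls inside them. The prefixes below are reserved documentation ranges used as placeholders, and `is_blocked_ip` is a hypothetical helper name. In practice this would more likely live in a firewall than in application code.

```python
import ipaddress

# Placeholder prefixes (reserved documentation ranges), standing in for
# whatever prefixes the ASNs you want to block actually announce.
BLOCKED_PREFIXES = [
    ipaddress.ip_network(prefix)
    for prefix in ("192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32")
]


def is_blocked_ip(remote_addr):
    """Return True if the client address falls inside any blocked prefix."""
    addr = ipaddress.ip_address(remote_addr)
    return any(
        addr.version == net.version and addr in net
        for net in BLOCKED_PREFIXES
    )
```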
@dogriley @gallaugher That being said, your mileage will vary greatly.
For example, #crawlers can't be banned in Germany if they act with "legitimate interest" [e.g. price comparison systems]...
There was a court case of an airline trying to ban crawlers from accessing their site, and said airline lost against the comparison site.
https://www.internetworld.de/digitaler-handel/rechtstipp/screen-scraping-erlaubt-473348.html
#NotLegalAdvice OFC!