Those huge models are trained by stealing.
Stealing books, stealing posts, stealing code, stealing music, stealing private pictures, stealing everything they can without remorse.
Running a Forgejo instance myself, I've been flooded with bots too. And yes, I've also noticed that those bots went to the effort of bypassing the Anubis checks. Btw, they download VERY heavy pages too (like git blame and git diff) without bothering to throttle their requests or to respect robots.txt. I'm basically running my server's CPU at 100% just to let some greedy guys with way more resources than us exploit us to train their AI models.
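For anyone fighting the same thing, here's a minimal sketch of how one could mine an access log for the worst offenders before feeding them into a blocklist. The log path, the nginx "combined" log format and the exact set of heavy Forgejo endpoints are assumptions; adjust them to your own setup.

```python
#!/usr/bin/env python3
"""Count requests per client IP for expensive Forgejo endpoints
(blame / compare / commit / raw) to spot crawlers that never throttle
and ignore robots.txt. Assumes the nginx "combined" log format."""

import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"               # assumption: default nginx location
HEAVY = re.compile(r"/(blame|compare|commit|raw)/")  # endpoints that are costly to render

hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        # combined format: IP - - [time] "METHOD /path HTTP/x" status size ...
        m = re.match(r'(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+)', line)
        if m and HEAVY.search(m.group(2)):
            hits[m.group(1)] += 1

# Print the top offenders; thousands of blame/diff hits per day is not a human.
for ip, count in hits.most_common(20):
    print(f"{count:8d}  {ip}")
```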
And that's not all: I also run a Wikipedia frontend (Wikiless), a YouTube frontend (Invidious), an X frontend (Nitter) and a Reddit frontend (Redlib).
All of them have been suspended at least once in the past few months because of excessive requests. And guess why? Just two weeks ago I had to make my Invidious instance accessible from my VPN only, because somebody on the Alibaba network flooded it for days with 25 req/sec to random YouTube videos (I guess DeepSeek needs multimedia to train a new model?).
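The "VPN only" part is nothing fancy; the real gate sits in the firewall and reverse proxy, but the idea boils down to an allow-list on the client address, roughly like this sketch (the 10.8.0.0/24 subnet is a hypothetical WireGuard range, not my actual one):

```python
import ipaddress

# Hypothetical VPN subnet; substitute your own WireGuard/OpenVPN range.
VPN_NET = ipaddress.ip_network("10.8.0.0/24")

def allowed(client_ip: str) -> bool:
    """Only clients coming in over the VPN get through."""
    return ipaddress.ip_address(client_ip) in VPN_NET

print(allowed("10.8.0.42"))    # True: a VPN peer
print(allowed("47.242.10.13")) # False: a random public address
```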
I believe that legal grounds for lawsuits against such abuse must be established, as well as commercial deals that allow both parties to profit if they want to. But the current state of things isn't sustainable, and it's hitting small self-hosting enthusiasts like me the hardest.
https://social.anoxinon.de/@Codeberg/115435661014427222
Codeberg (@Codeberg@social.anoxinon.de)
We apologize for the long performance degradation today. Finally, we identified all of the 'tricks' that AI crawlers found today. They no longer bypass the anubis proof of work challenges. A novelty for us was that AI crawlers seem to not only crawl URLs that are actually presented to them by our frontend, but they converted the URLs into a format that bypassed our filter rules. By the way, you can track the changes we have been doing via https://codeberg.org/Codeberg-Infrastructure/scripted-configuration/compare/51618~1..e4aac