The AI bots that desperately need OSS for code training are now slowly killing OSS by overloading every site.

The curl website is now at 77TB/month, or 8GB every five minutes.

https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/

Open source devs say AI crawlers dominate traffic, forcing blocks on entire countries

AI bots hungry for data are taking down FOSS sites by accident, but humans are fighting back.

Ars Technica
I think users (like GitHub/MS and friends) have a responsibility to push back on the AI companies they lean so heavily on and demand they behave. But I have no expectation they will.
@bagder can we find a way to give them code, but randomly generated code?
(I know some Mastodon instances that do the same with messages 😉)

@benjamin @bagder

There's an effort by some to either waste their time with a bogus maze of junk pages (Cloudflare is doing this), or to deliberately poison the dataset with junk once you realize it's an AI scraper.

The scrapers causing the issue ignore robots.txt, don't rate limit, and hit the same content frequently (like every hour).
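For illustration only (this is a rough Python sketch of the junk-maze idea, not how iocaine or Cloudflare actually implement it), a tiny server that answers every path with gibberish plus links to more random paths, and does so slowly, might look like this:

import random
import string
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

def babble(n):
    # n pseudo-words for the crawler to chew on
    return " ".join(
        "".join(random.choices(string.ascii_lowercase, k=random.randint(3, 10)))
        for _ in range(n)
    )

class Maze(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(2)  # cheap for us, slow for them
        links = " ".join(f'<a href="/{babble(1)}">{babble(2)}</a>' for _ in range(20))
        body = f"<html><body><p>{babble(200)}</p>{links}</body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), Maze).serve_forever()

Every link leads to another page of the same junk, so a scraper that ignores robots.txt can wander around in it forever without touching real content.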

@Namnatulco @tbortels @bagder @asrg
the ones I know are using iocaine, and they are happily serving ~60K pages every day to GPTBot :D
@bagder i mean Microsoft are tied quite heavily into the AI business themselves, with $10+ billion invested in OpenAI. financially they're incentivised to get more training data at all costs
@bagder (not that i believe this is a good thing, frankly i wish they'd all burn down)

@bagder

There's always iocaine et al.

(https://algorithmic-sabotage.github.io/asrg/trapping-ai/)

Putting something like that trained on source code behind your robots.txt will save you a lot of CPU (and maybe bandwidth). And if they complain, you can point to the robots.txt.
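For example (the /maze/ path is made up here), a robots.txt entry like this tells well-behaved crawlers to stay out of the trap, so only the scrapers that ignore robots.txt end up wandering around in it:

User-agent: *
Disallow: /maze/

Anything that still shows up under /maze/ has, by definition, ignored your robots.txt.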

Trapping AI

This is a methodically structured poisoning mechanism designed to feed nonsensical data to persistent bots and aggressive “AI” scrapers that circumvent robots.txt directives.

ASRG

@bagder

Time to start defending, beginning with the blog posts. Give 'em invalid data.

Wonder if an AI crawler will catch that, or if it simply strips the tags away to read the text fully.

<div>
  Lorem ipsum <span>and Trump and the AI bros have no balls</span> parvus principio...
</div>

<style>
  span { display: none; }
</style>

@bagder

and identify them and give 'em back gibberish with slow latency, so that a simple request lasts a minute or so.

Vegetative electron microscopy – David Bradley

@bagder This is what happens when there is competition where there should be cooperation. AI research and development could be, _should_ be a collaborative project, not owned by anybody and open to everybody, but instead it's a bunch of corporations trying to outrun each other.
The Tragedy of the Commons only exists when there is competition instead of cooperation. Competition is how we ruin everything by trying to grab it all before anybody else does. Cooperation is how we can give everybody whatever they need for free and still have enough for all of us.
Why train so many machine learning models that aren't all that different, which are owned and run by private enterprises, when we could instead train much fewer models that aren't owned by anybody and can be used for free?

@bagder
I hope your project can survive.

You may remember how, back in the 2000s, thousands of DSL routers all verified their time against a few NTP servers, which was reported as "a thunderclap of traffic at the peak of every hour", so organisations were pushed to set up more local NTP servers.

If the AI engineers were nice people, they could set up one of their own machines as a local relay/mirror of your site and poll that every second **much faster**, so the end users would get a better service.

@bagder honestly I'm surprised no-one has suggested litigation. On the civil side the damage is clear, and you could make a criminal case, which would be enough to ID the people behind this. Standard IANAL disclaimer
@bagder corporately owned organizations have all bought into the AI marketing, so there is very little chance they'll do anything about it. But I would like to see AI countermeasures developed, because this is out of control.

@bagder What is the point of them hammering the website over and over again? They do the same to the Fedora wiki... It is not like they need to be near real-time.

Are you considering an IP block?

@gbraad we specifically don't have logs, so I can't tell exactly where they come from, but I have read others' analyses of the problem and from what I hear they are quite hard to block properly. We are fortunate to have Fastly hosting the site, so they are the ones handling the onslaught.
@gbraad @bagder I've seen people mention that the bots don't do consecutive hits from the same IP and use a mix of bot and legitimate user agents. I think they use their clients' machines to do the HTTP requests: you prompt ChatGPT, and it fetches a random URL at the same time.
@gkrnours @gbraad @bagder That could be. I sent GPT to explain my own code on GitHub to see if it could (not really, no). I just gave it a URL. I don't know what it did with that and didn't consider it (it's GitHub anyway).
@crazyeddie @gbraad @bagder Maybe my hypothesis wasn't clear. There is a client that sends requests to ChatGPT. That client could be used to do more than send requests to ChatGPT, like fetching the pages they want to scrape for them. From your computer. With enough traffic, this would give them enough IPs to evade most attempts at blocking them.

@gkrnours @gbraad @bagder Yeah, it would be a really weird way to do it, but if it bypasses filters they'd probably do it.

You can also run ChatGPT on your own computer. I don't know to what degree you can do actual training on it, but you can chat with it. I had deepseek and one other running the other day. ChatGPT is heavier but can also be done.

@gbraad @bagder You'd have to block entire data centers and many of those are also used for public hosting. So blocking the IP ranges is often not an option if you want legitimate users to be able to access your site.

At least the big ones have proper user agents which you can black hole if they don't respect robots.txt. Honestly most of them do. But even a year ago I didn't have enough traffic from crawlers that it was even worth looking into.
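For instance (assuming nginx, and the bot list is just an example), black-holing by user agent can be as simple as adding something like this inside the server block:

if ($http_user_agent ~* "(GPTBot|ClaudeBot|Bytespider)") {
    return 403;
}

Of course this only helps against crawlers that are honest about their user agent in the first place.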

@truh this is actually happening in the opposite direction by one of the actors (on a national level). Even if blocked, they will eventually acquire VMs across the pond and continue. It is more about their behaviour and the how.

@gbraad I should have clarified that I'm only talking from my limited experience. I'm sure others experience far nastier things than what I get on my little home lab that isn't even really linked anywhere.

Even if you can do something it's just so tiring that you even have to do something. This is not what I want to spend my evenings on.

@truh I have been in hosting for 'decades' (dang, that sounds bad). And yes, I have seen increased traffic, especially from a specific geo... though some of that has moved as a spike to known cloud providers. So far, the CDNs I use have not complained and have taken the brunt, as also mentioned by the OP. Ugh... I just hope they would respect robots.txt.
@gbraad I don't really want Cloudflare to have my traffic.
The 8 Best Cloudflare CDN Alternatives in 2025

In this post, we have provided a list of the best Cloudflare CDN alternatives for websites.

RunCloud Blog
@gbraad Cloudflare and its competitors all decrypt your traffic... I don't really understand why people think that's ok.
@gbraad It's the infrastructure equivalent of inviting Jeffrey Goldberg into your Signal group.
@truh @gbraad not "all", depends what sort of protection you're using. I've used Akamai Prolexic before which does not.

@bracken @truh

Right, I also do not terminate (re-encrypt) at them... this is not as effective, but allows me to use my own certs, and have them merely be the entry point.

@gbraad @bagder most of them are probably written by AI
@oha @gbraad @bagder Is this the “singularity” they talk about?

@gbraad @bagder They don't store the data. These bots are usually training AI straight from the web because scraping is cheaper than storing.

So if you block them, they just lose that "knowledge".

At least as far as I understand.

@sheogorath @bagder

That is interesting...

but yeah, as @rejzor also stated, they do claim to have 'knowledge' up to 2022 or so.

I tried some of those in Asia, and they mostly state things verbatim.

@gbraad @bagder

I couldn't stop thinking about this. The maths for this are insane. Without any fancy stuff, on AWS it costs you $0.08 per month to store 1 GB of data.

But for network traffic you only pay for the traffic you send out ($0.09/GB). On my website, the latest article fetched with Firefox is ~1 kB for the request and 5 kB (14 kB after decompression) for the response.

Or in other words: for text, pulling stuff over the network instead of storing it is roughly 5-14 times more cost-effective for the scraper. With images we're talking >1:1000.
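A rough back-of-the-envelope version of that comparison in Python, using only the numbers from this post (they are examples, not universal figures):

# compare re-fetching a page with storing its text for a month
STORAGE_PER_GB_MONTH = 0.08   # $/GB/month to keep data around
EGRESS_PER_GB = 0.09          # $/GB for the traffic you send out
REQUEST_KB = 1                # outbound HTTP request
RESPONSE_KB_TEXT = 14         # decompressed text you would otherwise store

KB_PER_GB = 1024 * 1024

fetch_cost = (REQUEST_KB / KB_PER_GB) * EGRESS_PER_GB
store_cost = (RESPONSE_KB_TEXT / KB_PER_GB) * STORAGE_PER_GB_MONTH

print(f"fetch once:    ${fetch_cost:.10f}")
print(f"store a month: ${store_cost:.10f}")
print(f"storing costs ~{store_cost / fetch_cost:.0f}x one extra fetch")

With these numbers, keeping the text around for a month costs roughly 12 times as much as just fetching the page again, which is why scraping on demand beats storing.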

@gbraad @bagder And when you ask a chatbot it tells you their cutoff date is November 2022. Wtf?
@rejzor @gbraad @bagder actually they may have _learned_ that from chatbot outputs posted on the web.
@bagder what was the traffic before that?

@skaverat three years ago we were at less than 20TB/month, but there is no clear cut-off date, nor do I know exactly how much of this traffic is AI bots and how much is not

(edit: I meant TB, not GB)

@bagder @skaverat sorry, curl’s traffic went up 4000x in 3 years? Or did you mean 20TB/month here?

@bagder @skaverat An increase in traffic from 20 TB/month to 80 TB/month in 3 years seems normal to me.

Why would an AI crawl your site more than any search engine would? They all just want one copy of your site, once a month or so. If an AI visits your site on demand in real time, then that's really just a visit from a biological human who asked their AI to fetch current data instead of relying on its memory of old data.

Sure, one biological human will cause more traffic than before AI. As expected.

@bagder time for a tarpit ...
@jimfuller @bagder I have a git forge and a slow connection. That's basically a tar pit right?
@bagder i've had people tell me it's the project's fault for not using github/gitlab for their infra and centralizing it all, like that's a reasonable thing to do in 100% of cases >.>
@pearl @bagder we should do exactly the opposite, putting all our keys in one company's hand is probably the worst idea in the world
@SRAZKVT @pearl it also doesn't help. curl for example already hosts its git repositories on GitHub, but we still have a website that gets bogged down by the bots
@pearl @bagder > like that's a reasonable thing to do in 100% of cases >.>

Or in any cases.

We should ideally be using content-addressed distributed networks with updatable datasets. That would prevent the entire issue.
@bagder And it largely seems like bad implementations, which will not make the LLMs better (rather the opposite). https://go-to-hellman.blogspot.com/2025/03/ai-bots-are-destroying-open-access.html
AI bots are destroying Open Access

There's a war going on on the Internet. AI companies with billions to burn are hard at work destroying the websites of libraries, archives, ...

@bagder More about this problem (LLMs are used to create bad bots, but people may not always run them for reasons related to LLM training). https://gbilder.com/posts/2025-04-02-bots-behaving-badly/
Bots Behaving Badly

@bagder We noticed this on some @weblate servers as well. Used aggressive blocking to combat that, but I'm pretty sure it blocked legitimate users as well.

@bagder I see the article says "cycling through residential IP addresses as proxies"

Does this imply that these AI crawlers are using botnets?

@AndyK1970 @bagder maybe, but there are also some p2p vpn solutions that claim to use your connection for other (random) users while letting you use some stranger's internet connection.

@AndyK1970 @bagder

Yes, they are.
They work in two phases:

- First they hammer some content provider, so all the traffic from blockable bots gets blocked
- Then they sell the data they collected for a big price.

At least that's their plan.

@bagder

We collected 470K IPv4s from a botnet that was trying to get all the content from our social network; it was behaving in such a way that we could track every single request it made. Since we blocked it, the server has been working much better; it hasn't been running with such a low load for at least a year.

https://seenthis.net/messages/1105923

https://framapiaf.org/@biggrizzly/114227612269042897
