Challenges In Stopping Unauthorized AI Data Scraping

Data scrapers used to train LLMs can be evasive. Our recent view of unauthorized AI data scraping attempts against Kasada customers.

Kasada

Traffic sources to my #SelfHosted #Gitea instance. You can clearly see where the real visits are and where the AI scrapers are. Last time I checked, they weren’t triggering any analytic events. They are definitely improving.

#aiscrapers #ai #llm #LLMs #aislop #homelab #selfhost #selfhosting

@grumpybozo

30% of web search traffic goes through #AI now.

The same folk who pontificate about lost web traffic will gleefully tell you they are blocking "#Aiscrapers"

#AIBots may lead to the end of the internet as we know it

In recent weeks, #OpenDemocracy’s website has been repeatedly brought down by an army of bots. We’re not the only ones

Matthew Linares
20 February 2026

Excerpt: "Slater explained that 'the traffic often arrives through anonymous residential IPs', referring to residential proxy networks that route internet traffic through intermediary servers using IP addresses assigned by internet service providers to real homeowners. This, he said, makes it 'hard to distinguish ‘normal users’ from automated collection'. [That's not right and needs to be changed!!!]

"'We're being forced into permanent defence mode. #ResidentialProxyNetworks let #AIScrapers hide in plain sight, rotate identities, and extract data at scale. That shifts real costs onto projects that exist to serve people, not feed training pipelines.'"

Read more:
https://www.opendemocracy.net/en/ai-chatbots-scraper-bots-chatgpt-website-offline-change-internet/

#AISucks #AI #DataMining #Internet #Websites #TechNews #AI #ArtificialIntelligence #BigTech #TechBros

AI bots’ attacks may end the internet as we know it

openDemocracy

Happy to see some updates on AI.ROBOTS.TXT : « A list of AI agents and robots to block » 🤖 🚫

https://github.com/ai-robots-txt/ai.robots.txt

#NoAI #DNS #AIScrapers
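
For anyone who wants to see what such a block list amounts to in practice, here is a minimal sketch, not the repository's own tooling. It assumes a hand-picked list of three crawler names (illustrative only; the maintained list lives in the repository) and a hypothetical helper `build_robots_txt`. It builds a robots.txt group that disallows everything for those agents and checks it with Python's standard-library parser. Keep in mind that robots.txt is advisory: it only helps against crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Illustrative crawler names only; the maintained list lives in the repository.
AI_AGENTS = ["GPTBot", "CCBot", "ClaudeBot"]


def build_robots_txt(agents):
    """Return robots.txt text that disallows every path for the given agents."""
    lines = [f"User-agent: {name}" for name in agents]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(AI_AGENTS)
print(robots_txt)

# Sanity-check the generated rules with the standard-library parser.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("GPTBot", "/any/page"))       # a listed agent
print(parser.can_fetch("Mozilla/5.0", "/any/page"))  # an unlisted agent
```

The listed agent should come back as not allowed and the unlisted one as allowed, since the group names only the agents in the list and leaves everyone else alone.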

Webspace Invaders - Matthias Ott

(…) In their hunger for data to train their large language models, companies from all over the world are systematically harvesting every word I’ve ever published, feeding it into their language models to keep them fresh – and the side effect, the collateral damage, is that Kevin in Montreal now can’t read my articles because my hosting provider decided the solution was to block Canada and half the rest of the world.
I sat there staring at those logs for a while. The irony wasn’t lost on me. This is my little corner of the web. My writing. With my weird little style mixer up there in the top right. And now it is simultaneously being strip-mined by AI companies and effectively made inaccessible to actual humans around the world who might want to read it.
This is where we are in 2026. (…) Yes, the AI companies need to do better. They actually should throttle their scraping to reasonable levels. They actually should respect the limited resources of small sites. They actually should develop industry standards that don’t externalize costs onto individuals who are just trying to share their work. (…) matthiasott.com

I can't help getting really, really angry about all this and what it does to the web I used to love.

#ai #aiScrapers #collateraldamage #exploitation #otemporaomores #Web

https://webrocker.de/?p=29765

Webspace Invaders · Matthias Ott

There’s something happening on the Web at the moment that almost feels like watching that old arcade game Space Invaders play out across our servers. Bots and scrapers marching in formation, attacking our servers wave after wave, systematically requesting page after page, relentlessly filling their data stores while we watch our access logs fill up.

Matthias Ott – Web Design Engineer

Should I be suspicious of all these requests that say they're on Android and using Safari? 🙃 Or is that a thing people can do now? I guess they look like search bots, but Bing was supposed to be blocked 🤨 (because they use the same bot for search indexing and AI scraping).

#askFedi #webHosting #serverAdmin #botBlocking #botScrapers #aiScrapers

Edit: I'm aware people can change their user agents to scrape; that's why I'm suspicious of these ones, because why else would they be changed to this?

Looks like a new player leads the #ipv4games leaderboard. Who else but a #proxy provider, closely followed by another one of its kind. Not sure this is the best advertisement for a company that would like to appear "legit".

#IPv4 #threatintel #DDoS #AIScrapers

Hmm, ChatGPT seems to keep the connection to my website alive for a veeeery long time. By comparison, visits from Copilot and Perplexity normally last <1 sec.

#webstats #AIscrapers #weird #danskertrut

This is the proportion of my nginx access logs from my Forgejo (which is now closed off) that come from a Meta, Amazon, or Google #AIscrapers bot.
This host has a total deny robots.txt.
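
A rough way to compute a share like this yourself, as a sketch under stated assumptions rather than the poster's actual method: it assumes nginx's "combined" log format, where the user agent is the last quoted field, and a hypothetical set of user-agent markers for the crawlers named above. The helper name `bot_share` and the sample lines are made up for illustration.

```python
import re

# Assumed markers (matched case-insensitively); adjust to what your logs show.
BOT_MARKERS = ("meta-externalagent", "amazonbot", "googlebot", "googleother")

# In nginx's "combined" format the user agent is the last quoted field.
LAST_QUOTED = re.compile(r'"([^"]*)"\s*$')


def bot_share(lines):
    """Return the fraction of log lines whose user agent contains a marker."""
    total = 0
    bots = 0
    for line in lines:
        match = LAST_QUOTED.search(line)
        if match is None:
            continue  # skip lines that do not end in a quoted field
        total += 1
        agent = match.group(1).lower()
        if any(marker in agent for marker in BOT_MARKERS):
            bots += 1
    return bots / total if total else 0.0


# Synthetic lines for demonstration only, not real traffic.
SAMPLE = [
    '192.0.2.1 - - [01/Jan/2026:00:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Amazonbot/0.1)"',
    '192.0.2.2 - - [01/Jan/2026:00:00:02 +0000] "GET /b HTTP/1.1" 200 512 "-" "meta-externalagent/1.1"',
    '192.0.2.3 - - [01/Jan/2026:00:00:03 +0000] "GET /c HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"',
    '192.0.2.4 - - [01/Jan/2026:00:00:04 +0000] "GET /d HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
]

print(f"{bot_share(SAMPLE):.0%}")
```

To run it against a real log, pass an open file object instead of `SAMPLE`; the log path depends on your setup. Matching on the user agent only counts bots that identify themselves, so the result is a lower bound on automated traffic.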