AI scrapers are driving up your hosting costs while real users are left waiting in the digital lobby 🤖

It is time to take the pressure off your infrastructure by using robots.txt and cache normalization to manage those thirsty bots 💡

We are sharing how to set sane application limits so your site stays fast for humans and does not turn into a villain story for your budget.

👉 https://developer.upsun.com/posts/insights/the-not-so-hidden-cost-of-ai-scrapers

#WebPerformance #DevOps #AIScrapers #TechInsights

The (not so) hidden cost of AI scrapers - Upsun Developer

AI scrapers drive up your hosting cost while real users wait. Use robots.txt, cache normalization, and sane application limits to take the pressure off.

Upsun Developer
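A minimal sketch of what "cache normalization" could look like, assuming it means collapsing equivalent URLs onto one cache key so that bots cycling through query-string variants keep hitting the cache instead of the application. The function name and the ignored-parameter list are hypothetical, not taken from the linked article.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameters assumed not to change the response.
IGNORED_PARAMS = {"fbclid", "gclid"}
IGNORED_PREFIXES = ("utm_",)


def normalize_cache_key(url: str) -> str:
    """Return a canonical form of `url` to use as a cache key."""
    parts = urlsplit(url)
    # Keep only parameters that are assumed to affect the response.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS and not key.startswith(IGNORED_PREFIXES)
    ]
    # Sort so that ?a=1&b=2 and ?b=2&a=1 share one cache entry.
    kept.sort()
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", urlencode(kept), "")
    )
```

With this sketch, `https://Example.com/list?b=2&utm_source=x&a=1` and `https://example.com/list?a=1&b=2&fbclid=abc` both map to the same key. Which parameters are safe to drop depends entirely on the application.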

RE: https://mastodon.social/@gamingonlinux/116560908559455886

#RPCS3, open-source PlayStation 3 emulator and debugger, says on Twitter:

> PSA: #Tencent is aggressively scraping the Internet to build yet another AI slop chatbot, DDoSing many websites in the process.
>
> We've found that, as of last week, their scraping bots can now solve Cloudflare challenges and behave like real users while ignoring robots.txt. In the last 24 hours alone, our website received more than 3 million successful requests from Tencent bot IP addresses, plus another 1 million that were blocked by Cloudflare challenges.
>
> These recurring DDoS attacks from Tencent have been going on for over a year, and we have been constantly adjusting our firewall rules to filter them while trying not to impact Tencent's real users. Because that is no longer possible, we're now fully blocking Tencent IP addresses, starting with ASN 132203. We recommend other sysadmins do the same.

https://x.com/rpcs3/status/2061946000734888017

#DDoS #AI #AIChatBot #AIScraping #AIScraper #AIScrapers #PlayStation #PlayStation3 #PS3
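For anyone wanting to follow the "block the whole ASN" advice at the application level, here is a rough sketch using only the Python standard library. The prefixes below are placeholders from the reserved documentation ranges, not the actual prefixes announced by ASN 132203; the real list would have to come from a routing or ASN data source, and `is_blocked` is a hypothetical helper name.

```python
import ipaddress

# Placeholder prefixes (documentation ranges), NOT the real prefixes of any ASN.
BLOCKED_PREFIXES = [
    ipaddress.ip_network(prefix)
    for prefix in ("203.0.113.0/24", "2001:db8::/32")
]


def is_blocked(client_ip: str) -> bool:
    """Return True if `client_ip` falls inside any blocked prefix."""
    addr = ipaddress.ip_address(client_ip)
    return any(
        addr.version == net.version and addr in net
        for net in BLOCKED_PREFIXES
    )
```

In practice a firewall or CDN rule is the better place for this kind of block, since the request then never reaches the application at all.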

I need to actually read the report in depth, but this is a pretty strong summary:

"Amnesty International finds that standalone generative AI systems, based on unlawful web scraping, depend on mass invasions of privacy by design, and are fundamentally incompatible with IHRL. As such, Amnesty International is calling for a prohibition of such systems."

https://www.amnesty.org/en/documents/pol40/0996/2026/en/

I wonder if this will make anyone working on this type of scraping pause and reflect?

#noAI #AmnestyInternational #aiscrapers

Unlawful by design: Exposing the human rights costs of generative AI - Amnesty International

This briefing examines how standalone generative AI systems, based on unlawful web scraping, are in conflict with international human rights law (IHRL) and standards through their design, development and deployment. While these technologies promise sophisticated automation and efficiency, they rely on data collection and model training practices that abuse privacy rights, enable discrimination, and threaten […]

Amnesty International

Why is #twitter not properly identifying itself as a bot when trying to scrape my website? (69.12.56.0/21 belongs to AS63179, which is Twitter.)

Could it be because they're a malicious party training an #aibot?

(This is extremely low-intensity, but based on the combination of this specific UA and the pages they're trying to reach, I've seen them before, coming in from residential proxies.)

The funny thing is that bots identifying as bots and observing robots.txt would actually be allowed to reach those particular pages.

#aiscrapers #scrapers
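For reference, "observing robots.txt" is not much work on the crawler side. A small sketch with Python's standard `urllib.robotparser`, assuming a made-up robots.txt and a made-up crawler name (`ExampleAIBot`); neither is the actual configuration of the site discussed above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is shut out, everyone else is allowed.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str) -> bool:
    """Check a URL against the rules above, as a well-behaved bot would."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

A crawler that sends an honest user agent and runs a check like this before each request would know which pages it may fetch. One that hides behind a browser user agent never asks.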

After 615 requests over pretty much exactly 24 hours, the #aiscraper abusing #residentialproxies to try and repeatedly request one particular page on #GameSieve - 18 times successfully, before I noticed it being stuck in a loop and added another block rule - finally disappeared. However, its final request was successful and is worrying, as it came through fetch.tunnel.googlezip.net - which apparently is #Google's Chrome Prefetch Proxy.

I've noticed requests from that range before, but always assumed that was legitimate. Do I now have to think about blocking that bit of infrastructure as well, as #scrapers have found a way to piggyback on it? Urgh!

I guess I'll start by blocking prefetching via .well-known/traffic-advice and see what that does...

#aiscrapers #aibots
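A sketch of what serving that opt-out could look like, based on my understanding of Chrome's private prefetch proxy documentation: a JSON array at `/.well-known/traffic-advice` with a `prefetch-proxy` entry set to `disallow`, served as `application/trafficadvice+json`. Treat the field names and content type as assumptions to verify against the current documentation; the function name is hypothetical.

```python
import json

TRAFFIC_ADVICE_PATH = "/.well-known/traffic-advice"


def traffic_advice_response() -> tuple[int, dict[str, str], str]:
    """Build status, headers and body that ask the prefetch proxy to stay away."""
    # Assumed format: a list of per-agent entries, here disallowing the proxy.
    body = json.dumps([{"user_agent": "prefetch-proxy", "disallow": True}])
    headers = {"Content-Type": "application/trafficadvice+json"}
    return 200, headers, body
```

A static file with the same body and content type would do the same job. Whether scrapers piggybacking on the proxy respect it is a separate question.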

Sunday Somewhat Funday

Occasionally there can be joy

https://mmn.ca/~/English/Sunday%20Somewhat%20Funday/

Czech publishers get new robots.txt shield against AI scrapers: SPIR on March 19 updated its standard for Czech online publishers to opt out of AI text and data mining, adding real-time response crawlers to the scope of the robots.txt framework. https://ppc.land/czech-publishers-get-new-robots-txt-shield-against-ai-scrapers/ #CzechPublishing #AIScrapers #RobotsTxt #DataMining #OnlinePrivacy
Challenges In Stopping Unauthorized AI Data Scraping

Data scrapers used to train LLMs can be evasive. Our recent view of unauthorized AI data scraping attempts against Kasada customers.

Kasada

Traffic sources to my #SelfHosted #Gitea instance. You can clearly see where the real visits are and where the AI scrapers are. Last time I checked, they weren’t triggering any analytics events. They are definitely improving.

#aiscrapers #ai #llm #LLMs #aislop #homelab #selfhost #selfhosting

@grumpybozo

30% of web search traffic goes through #AI now.

The same folk who pontificate about lost web traffic will gleefully tell you they are blocking "#Aiscrapers"