Mastodawn

@nixCraft In my experience it's not enough to merely #block said #AIscrapers; it is literally necessary to fight back by sending them *malicious data* with *EVERY REQUEST* whilst rate limiting them to a crawl to combat their literal DDoS attacks!

(max. 1 connection at 75 bit/s per IP and request, max. 1 request per IP, 120 s crawl-delay enforced, redirecting them to the EICAR "malware" test file every time they violate said limits, and blackholing them at upstream / IX level)
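The crawl-delay part of those limits can be sketched roughly as follows. This is a minimal in-memory illustration, not the poster's actual setup: the class name `CrawlDelayLimiter` and its `allow` method are hypothetical, and what the server does with a rejected request (an error status, a redirect, a blackhole route) is left to the operator.

```python
import time
from typing import Dict, Optional


class CrawlDelayLimiter:
    """Allow at most one request per client IP every `min_interval` seconds.

    Hypothetical helper for illustration; the 120 s default mirrors the
    crawl-delay mentioned in the post.
    """

    def __init__(self, min_interval: float = 120.0) -> None:
        self.min_interval = min_interval
        # Maps client IP -> timestamp of the last request that was allowed.
        self._last_allowed: Dict[str, float] = {}

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        """Return True if the request may proceed, False if it breaks the delay."""
        if now is None:
            now = time.monotonic()
        last = self._last_allowed.get(ip)
        if last is not None and now - last < self.min_interval:
            # Too soon: the caller decides how to punish or reject the client.
            return False
        self._last_allowed[ip] = now
        return True


limiter = CrawlDelayLimiter()
print(limiter.allow("203.0.113.7", now=0.0))    # first request from this IP
print(limiter.allow("203.0.113.7", now=30.0))   # 30 s later, inside the delay
print(limiter.allow("203.0.113.7", now=120.0))  # delay has elapsed
```

A rejected request does not reset the timer in this sketch, so a client that keeps hammering the server is still let through once every 120 seconds. A production version would also need to expire old entries so the table does not grow without bound.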

PPC Land Jun 19

FYI: Fox buys Roku, Publicis and TTD end feud, UK publishers sue AI scrapers: Fox's $22bn Roku deal reshapes CTV, Publicis and The Trade Desk end dispute, UK publishers bill AI scrapers £500 per scraped article through county courts. https://ppc.land/fox-buys-roku-publicis-and-ttd-end-feud-uk-publishers-sue-ai-scrapers/ #FoxBuysRoku #CTV #Publicis #TheTradeDesk #AIscrapers

shom Jun 15

My PineNote blog posts were linked to from the main body of a discussion thread on Hacker News but due to the consistent and prolific AI scraping it's barely a blip on my traffic. Remember when humans used to hug sites to death? In the old days this would absolutely stick out.

The good news is a human emailed me and asked me for my experience one year in and that was nice (which is also how I found out about the HN post). I'll share my response below until I get around to fleshing out a blog post from it:

I should absolutely do a write-up a year in. I'll give you the highlights:

I lost my original PineNote and after a few months of trying to get it back, I ended up getting a replacement, which should say something all in itself.
I wish the device was a little bit smaller and lighter, but I have gotten used to it.
I use it as a single function device at a time, mostly writing or reading, very occasional computing.
However, I am on a month-long trip and I only brought along my work laptop and the PineNote instead of bringing a personal laptop. I can do all the personal computing that I need on the PineNote and my phone. See my setup here: https://shom.dev/posts/20250406_a-pinenote-only-5-day-weekend/
I wish Pine64 would update their images with the kernel optimization created by the community member hrdl; they do link to it from their official docs.
There is also an interesting project called QuillOS, which provides a nice interface to the whole system, but it's not ready for primetime on PineNote.

#PineNote #HackerNews #AIScrapers #AIBots

Upsun Jun 11

AI scrapers are driving up your hosting costs while real users are left waiting in the digital lobby 🤖

It is time to take the pressure off your infrastructure by using robots.txt and cache normalization to manage those thirsty bots 💡

We are sharing how to set sane application limits so your site stays fast for humans and does not turn into a villain story for your budget.

👉 https://developer.upsun.com/posts/insights/the-not-so-hidden-cost-of-ai-scrapers

#WebPerformance #DevOps #AIScrapers #TechInsights

The (not so) hidden cost of AI scrapers - Upsun Developer

AI scrapers drive up your hosting cost while real users wait. Use robots.txt, cache normalization, and sane application limits to take the pressure off.

Upsun Developer
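One common reading of "cache normalization" is collapsing URL variants onto a single cache key, so that scrapers requesting the same page with shuffled or junk query parameters hit the cache instead of the application. The sketch below assumes that reading; it is not taken from the Upsun article, and the function name `normalize_cache_key` and the `IGNORED_PARAMS` set are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameters assumed not to change the response.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}


def normalize_cache_key(url: str) -> str:
    """Return a canonical form of `url` suitable for use as a cache key."""
    parts = urlsplit(url)
    # Keep only parameters that can affect the response.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    # Sort so that ?a=1&b=2 and ?b=2&a=1 share one cache entry.
    query.sort()
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        urlencode(query),
        "",  # drop the fragment; browsers never send it to the server
    ))


print(normalize_cache_key("https://Example.com/posts?b=2&utm_source=x&a=1"))
```

Which parameters are safe to ignore depends on the application, so the ignore list should be reviewed before it is used in front of a real cache.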

Vint Prox Jun 3

RE: https://mastodon.social/@gamingonlinux/116560908559455886

#RPCS3, open-source PlayStation 3 emulator and debugger, says on Twitter:

> PSA: #Tencent is aggressively scraping the Internet to build yet another AI slop chatbot, DDoSing many websites in the process.
>
> We've found that, as of last week, their scraping bots can now solve Cloudflare challenges and behave like real users while ignoring robots.txt. In the last 24 hours alone, our website received more than 3 million successful requests from Tencent bot IP addresses, plus another 1 million that were blocked by Cloudflare challenges.
>
> These recurring DDoS attacks from Tencent have been going on for over a year, and we have been constantly adjusting our firewall rules to filter them while trying not to impact Tencent's real users. Because that is no longer possible, we're now fully blocking Tencent IP addresses, starting with ASN 132203. We recommend other sysadmins do the same.

https://x.com/rpcs3/status/2061946000734888017

#DDoS #AI #AIChatBot #AIScraping #AIScraper #AIScrapers #PlayStation #PlayStation3 #PS3

GameSieve May 31

I need to actually read the report in depth, but this is a pretty strong summary:

"Amnesty International finds that standalone generative AI systems, based on unlawful web scraping, depend on mass invasions of privacy by design, and are fundamentally incompatible with IHRL. As such, Amnesty International is calling for a prohibition of such systems."

https://www.amnesty.org/en/documents/pol40/0996/2026/en/

I wonder if this will make anyone working on this type of scraping pause and reflect?

#noAI #AmnestyInternational #aiscrapers

Unlawful by design: Exposing the human rights costs of generative AI - Amnesty International

This briefing examines how standalone generative AI systems, based on unlawful web scraping, are in conflict with international human rights law (IHRL) and standards through their design, development and deployment. While these technologies promise sophisticated automation and efficiency, they rely on data collection and model training practices that abuse privacy rights, enable discrimination, and threaten […]

Amnesty International

GameSieve May 26

Why is #twitter not properly identifying itself as a bot when trying to scrape my website? (69.12.56.0/21 belongs to AS63179, which is Twitter.)

Could it be because they're a malicious party training an #aibot?

(This is extremely low-intensity, but based on the combination of this specific UA and the pages they're trying to reach, I've seen them before, coming in from residential proxies.)

The funny thing is that bots identifying as bots and observing robots.txt would actually be allowed to reach those particular pages.

#aiscrapers #scrapers
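Checking whether a client address falls inside a range like the one quoted above takes only the standard library. This is a minimal sketch: the attribution of 69.12.56.0/21 to AS63179 comes from the post and is not verified here, and the names `SUSPECT_RANGE` and `in_range` are hypothetical.

```python
import ipaddress

# Range quoted in the post; its ownership is the poster's claim, not verified here.
SUSPECT_RANGE = ipaddress.ip_network("69.12.56.0/21")


def in_range(client_ip: str, network=SUSPECT_RANGE) -> bool:
    """Return True if `client_ip` lies inside `network`."""
    return ipaddress.ip_address(client_ip) in network


print(in_range("69.12.60.17"))  # inside 69.12.56.0 - 69.12.63.255
print(in_range("69.12.64.1"))   # just past the end of the /21
```

The same check can be run over access-log lines to count how many requests with a browser-like user agent actually originate from a known corporate range.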

GameSieve May 21

After 615 requests over pretty much exactly 24 hours, the #aiscraper abusing #residentialproxies to try and repeatedly request one particular page on #GameSieve - 18 times successfully, before I noticed it being stuck in a loop and added another block rule - finally disappeared. However, its final request was successful and is worrying, as it came through fetch.tunnel.googlezip.net - which apparently is #Google's Chrome Prefetch Proxy.

I've noticed requests from that range before, but always assumed that was legitimate. Do I now have to think about blocking that bit of infrastructure as well, as #scrapers have found a way to piggyback on it? Urgh!

I guess I'll start by blocking prefetching via .well-known/traffic-advice and see what that does...
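For reference, a traffic-advice response that opts out of proxied prefetching might be built like this. It is a sketch based on my understanding of Chrome's private prefetch proxy documentation: the `prefetch-proxy` user agent, the `disallow` field, and the `application/trafficadvice+json` content type should all be checked against the current documentation before deployment. The function name `traffic_advice_response` is hypothetical.

```python
import json
from typing import Tuple

# Path named in the post; the proxy is expected to request it from the origin.
TRAFFIC_ADVICE_PATH = "/.well-known/traffic-advice"
# Content type as I understand the documentation; treat as an assumption.
TRAFFIC_ADVICE_TYPE = "application/trafficadvice+json"


def traffic_advice_response() -> Tuple[str, str]:
    """Return (content_type, body) asking the prefetch proxy to stay away."""
    advice = [{"user_agent": "prefetch-proxy", "disallow": True}]
    return TRAFFIC_ADVICE_TYPE, json.dumps(advice)


content_type, body = traffic_advice_response()
print(TRAFFIC_ADVICE_PATH, content_type, body)
```

This only addresses traffic that arrives through the proxy and honors the advice; a scraper that reaches the site by other routes is unaffected.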

#aiscrapers #aibots

Kévin Apr 26

Sunday Somewhat Funday

Occasionally there can be joy

https://mmn.ca/~/English/Sunday%20Somewhat%20Funday/

PPC Land Mar 22

Czech publishers get new robots.txt shield against AI scrapers: SPIR on March 19 updated its standard for Czech online publishers to opt out of AI text and data mining, adding real-time response crawlers to the scope of the robots.txt framework. https://ppc.land/czech-publishers-get-new-robots-txt-shield-against-ai-scrapers/ #CzechPublishing #AIScrapers #RobotsTxt #DataMining #OnlinePrivacy
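How a well-behaved crawler would interpret a robots.txt opt-out can be checked with Python's standard `urllib.robotparser`. The rules below are a generic illustration and not the SPIR standard itself; `ExampleAIBot` is a made-up user agent and `may_fetch` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only: block one (made-up) AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if `robots_txt` permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(may_fetch("ExampleAIBot", "https://example.com/article"))
print(may_fetch("Mozilla/5.0", "https://example.com/article"))
```

robots.txt is advisory: it expresses the publisher's opt-out, but it only has an effect on crawlers that choose to read and obey it.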
