I really can't overemphasize how destructive this is to the open web. The web is built just as much on a consensus about how to do things fairly as it is on the technical measures that keep things going, and AI scrapers have blown that up. https://go-to-hellman.blogspot.com/2025/03/ai-bots-are-destroying-open-access.html

I know Molly White wrote a passionate post recently asking people not to throw away open access just to stop AI scrapers, but many people *don't have a choice* anymore. If the old deal is broken, some people may be forced to take things private.

I don't like it and I want to preserve the open web, but things can't keep going like this. Something has to change to stop or slow these companies down so that people are *able* to keep things open and accessible. And it's the smaller publishers and individuals who are being hurt the most, driven offline or onto more closed platforms.
@misty Integer-overflowing session IDs is fucking bleak, goddamn.
@misty the team I work on recently became responsible for a web crawler, so I'm starting to work on it, and it sure is telling that basically every robots.txt I've looked at while investigating issues disallows at least 5 LLM-company crawlers
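(For reference, here's roughly what those robots.txt rules look like, checked with Python's stdlib parser. The user agents are real LLM-crawler names; the site URL is a placeholder.)

```python
# A sample robots.txt group blocking common LLM crawlers, verified with
# urllib.robotparser from the standard library.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""
User-agent: GPTBot
User-agent: CCBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: Bytespider
Disallow: /
""".splitlines())

print(rp.can_fetch("GPTBot", "https://example.test/post"))          # False
print(rp.can_fetch("SomeArchiveBot", "https://example.test/post"))  # True
```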
@bcj Yeah... that sounds about right. Oof.

@misty

On one side you have the ravenous AI parasites, on the other side you have internet safety bills in places like the UK causing small sites to be shuttered for fear of massive liability due to some random commenter's post.

Not to toot #Cloudflare's horn - especially since internet infrastructure shouldn't be concentrated in a single company (seize the means of computing) - but if they can prevent #AI from scraping the entire #web, that would be nice.

It's not a good idea to let that happen. I mean, at some point it's going to be hard to differentiate a regular user from a bot, but still.

Cloudflare’s Free AI Labyrinth Distracts Crawlers That Could Steal Website Content to Feed AI
https://www.eweek.com/news/cloudflare-ai-labyrinth-generative-ai-content/

@misty pragmatically speaking, are CDNs a solution?
@kepeken If you take a look at the article, they mention it's not *just* the content itself. There was an example of one site with dynamic content that the crawlers were interacting with so much it caused the 32-bit session ID to roll over.
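(For anyone who hasn't run into this: here's a toy sketch, mine and not from the article, of what that rollover looks like when session IDs live in a signed 32-bit counter.)

```python
# Toy illustration of a signed 32-bit session counter wrapping around.
INT32_MAX = 2**31 - 1

def next_session_id(current: int) -> int:
    nxt = (current + 1) & 0xFFFFFFFF                 # truncate to 32 bits
    return nxt - 2**32 if nxt > INT32_MAX else nxt   # reinterpret as signed

print(next_session_id(INT32_MAX))  # -2147483648: over two billion sessions burned
```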
@misty cloudflare (and others) handle this by fingerprinting your browser across enough activity to build up a profile that identifies you, and also identifies you as human. i guess the important question is whether this can be done in a privacy-respecting way. on first glance that sounds contradictory, but cryptography can do a lot of things that sound impossible until you hear how they are done.
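(One concrete example of the "sounds impossible" kind is a blind signature, the primitive behind schemes like Privacy Pass: a site can vouch for you without being able to link the credential back to your visit. A toy textbook-RSA sketch, illustrative only and absolutely not production crypto:)

```python
# Textbook-RSA blind signature demo: the attester signs a token it never
# sees, so the resulting credential can't be linked to the signing session.
import hashlib
import secrets
from cryptography.hazmat.primitives.asymmetric import rsa

key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
priv = key.private_numbers()
n, e, d = priv.public_numbers.n, priv.public_numbers.e, priv.d

# Client: hash a random token, then blind it with a random factor r.
token = secrets.token_bytes(32)
m = int.from_bytes(hashlib.sha256(token).digest(), "big")
r = secrets.randbelow(n - 2) + 2        # gcd(r, n) == 1 with overwhelming odds
blinded = (m * pow(r, e, n)) % n

# Attester: signs the blinded value without ever learning m or the token.
blind_sig = pow(blinded, d, n)

# Client: strip the blinding factor; (token, sig) is now an unlinkable credential.
sig = (blind_sig * pow(r, -1, n)) % n

# Any verifier holding the public key can check the credential later.
assert pow(sig, e, n) == m
print("credential verified without linking it to the signing session")
```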
@kepeken I'd think cost is the much greater problem. If this becomes de facto mandatory to operate a webpage, a lot of people won't be able to afford to do that anymore.

@misty how about this protocol?

if you connect to a website and browse normally, they offer to sign your public key. if you want to connect to a sensitive service (like a forum that needs to query a database for every page), you send them a public key that has been signed many times. the signatures have dates and expire, so you can't sell your key to a scraper.
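(A minimal sketch of what that could look like with Ed25519, assuming a JSON payload carrying the visitor's key and an expiry date; the function names here are invented for illustration.)

```python
# Sketch: a site endorses a visitor's public key with an expiring signature;
# a sensitive service later verifies the endorsement before serving pages.
import json
import time
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

def make_endorsement(site_key, visitor_pub: bytes, ttl_days: int = 30):
    """A site that has seen normal browsing signs the visitor's public key."""
    payload = json.dumps({
        "visitor_pub": visitor_pub.hex(),
        "expires": int(time.time()) + ttl_days * 86400,
    }).encode()
    return payload, site_key.sign(payload)

def check_endorsement(site_pub, payload: bytes, sig: bytes) -> bool:
    """Verify the signature and reject endorsements that have expired."""
    try:
        site_pub.verify(sig, payload)
    except InvalidSignature:
        return False
    return json.loads(payload)["expires"] > time.time()

# Demo: one site endorses a visitor; another site checks the endorsement.
site_key = Ed25519PrivateKey.generate()
visitor_key = Ed25519PrivateKey.generate()
visitor_pub = visitor_key.public_key().public_bytes(Encoding.Raw, PublicFormat.Raw)
payload, sig = make_endorsement(site_key, visitor_pub)
print(check_endorsement(site_key.public_key(), payload, sig))  # True
```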

@kepeken @misty This seems unnecessary, have you heard of TLS client certificates? Those already exist. https://blog.cloudflare.com/introducing-tls-client-auth/ In fact, RedHat uses them in combination with subscription-manager to give you access to their repositories when you have a license. The problem is that in order to issue such a cert you still need to verify the recipient is human.
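(Server-side, requiring a client cert is only a few lines with Python's ssl module; the file paths below are placeholders.)

```python
# Sketch: a TLS server context that refuses clients without a certificate
# signed by our chosen CA.
import ssl

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain("server.crt", "server.key")  # the server's own identity
ctx.verify_mode = ssl.CERT_REQUIRED              # reject cert-less clients
ctx.load_verify_locations("client-ca.crt")       # CA that issues client certs

# Sockets wrapped with ctx.wrap_socket(..., server_side=True) now complete
# the handshake only for clients presenting a cert from client-ca.crt.
```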
@puppygirlhornypost2 @misty the idea is that websites would give you signatures that you could (voluntarily) present to other websites.
@misty @kepeken Not to mention that captchas are mostly becoming obsolete as they can be easily automated. I believe we're talking about Cloudflare's turnstile. It primarily operates by tracking your behavior across the web but it does sometimes fall back to a captcha. I maintain a collection of captchas that are difficult for even sighted people to figure out. They're inaccessible as shit. It's also just plain creepy to have sites tracking my behavior across the net.
@puppygirlhornypost2 @misty @kepeken Turnstile never shows a puzzle to solve; the ones you have must be other CAPTCHAs. But it seems some companies do sell services for breaking it.

@misty

I recall reading an article in Byte magazine back in the early 90s or thereabouts. Even then they were predicting that the internet would end up like a cable TV service, or perhaps a library.
To get any kind of quality, we would need to subscribe to a package of curated services/websites, walled off from other services and the wider internet jungle.

@sleepy62 @misty

That's what Gates et al were expecting too. The AOL/CompuServe/Prodigy model. The early/mid 90s was before commercial use of the internet really took off, so nobody knew what to expect, and people assumed the existing models would hold.

@jonhendry @misty

Yes, that is how much of it started out. Geocities also comes to mind. Maybe the open internet was always a dream that could never survive the onslaught of bad actors and AI bots.

We are going back to the future...

@misty This is a real ugly choice in academia, because many institutions require publications to be OA. Journals are basically being told “either license your content for AI or we will just take it, it’s up to you.”
@misty maybe these jerks could scale back or cache things? I mean, it doesn’t seem like it’s necessary for every AI trainer to go out and scan a blog post from three years ago hourly to see if it has changed. Also, write some forge-specific code to give people’s gitea instances and whatever a break!
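(The plumbing for scaling back already exists in HTTP: conditional requests. A sketch of what a polite recrawl looks like; the URL is a placeholder.)

```python
# Sketch of a polite recrawl: a conditional GET lets the server answer
# 304 Not Modified instead of re-sending the whole page.
import requests

first = requests.get("https://example-blog.test/post")
etag = first.headers.get("ETag")

recheck = requests.get(
    "https://example-blog.test/post",
    headers={"If-None-Match": etag} if etag else {},
)
print(recheck.status_code)  # 304 means unchanged, nothing re-downloaded
```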
@misty
Can larger libraries sue AI companies for breaching their terms of service?
@misty I run an open library service, and those AI bots are continually feasting on it. I can't afford fancy protection services, so I have to live with my users finding a strained service, if it's available at all. And even when my users can't get through, I still have to foot the data bill.
@misty do we have... any viable p2p-based solutions to this problem? maybe?

@SoniEx2 @misty

What about only providing torrent/magnet links to the full publications?

Provide an abstract, yes, but if you want the full papers you have to use torrenting. :D

While this would change the way that things are accessed, it would also directly increase the LLM companies' bandwidth bills. :D

Do it enough that they become more unprofitable to train. :D
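(A sketch of what the abstract page could serve instead of the PDF; the info hash, filename, and tracker here are placeholders, not a real torrent.)

```python
# Sketch: build the magnet URI an abstract page would link to instead of
# hosting the full PDF itself.
import urllib.parse

def magnet_for(info_hash: str, name: str, trackers: list) -> str:
    uri = f"magnet:?xt=urn:btih:{info_hash}&dn={urllib.parse.quote(name)}"
    return uri + "".join(f"&tr={urllib.parse.quote(t)}" for t in trackers)

print(magnet_for(
    "0123456789abcdef0123456789abcdef01234567",
    "paper-fulltext.pdf",
    ["udp://tracker.example.test:6969/announce"],
))
```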

@BillySmith @misty that would be a good start. but then there's git, etc, and we don't think we have any viable options for those yet?
@misty I have seen a number of websites start using "proof of work" holding pages that delay continuing for five seconds or so before giving access.

It holds up the bots, wasting compute cycles, but likely also acts as a method to detect and block them, e.g. a human will wait while a bot may have multiple sessions attempting to hit different URLs.
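(That's essentially hashcash: the browser burns CPU finding a nonce, and the server verifies with a single hash. A minimal sketch; the difficulty and names are mine.)

```python
# Hashcash-style proof of work: the client must find a nonce whose SHA-256
# hash falls below a target; verifying costs the server one hash.
import hashlib
import itertools

def solve(challenge: bytes, difficulty_bits: int = 20) -> int:
    """Brute-force a nonce; ~2**difficulty_bits hashes on average."""
    target = 1 << (256 - difficulty_bits)
    for nonce in itertools.count():
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(challenge: bytes, nonce: int, difficulty_bits: int = 20) -> bool:
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))

nonce = solve(b"session-abc123")         # a few seconds of client CPU
print(verify(b"session-abc123", nonce))  # True, checked instantly
```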

@misty to paraphrase what I said in another thread only a few minutes ago: AI has put us in a vicious cycle. It is flooding the web with useless data that we have to filter out, while also being the only way to easily filter through all that data.

If the systems were not so damn trusting, maybe we'd have a better environment or something on here, but that is a very shallow estimation.

@misty I agree, but there's also the human cost: a lot of these data workers are in what is now the Global South, get paid horribly if at all, and are exposed to traumatic things.

We need digital human rights that protect individuals' data and stop this exploitation

@misty do we even know who are the offenders?
@misty Hi, is it not possible to 'fool' these AI crawlers? I remember a solution for contact lists being stolen and used for email bombing: you create your first contact using only numbers in all of its fields.

@misty In the film Terminator, dogs were used to identify terminators.

I wonder if we could build something similar but with cats.