I really can't overemphasize how destructive this is to the open web. The web is built just as much on a consensus about how to do things fairly as it is on the technical measures that keep things going, and AI scrapers have blown that up. https://go-to-hellman.blogspot.com/2025/03/ai-bots-are-destroying-open-access.html

I know Molly White wrote a passionate post recently asking people not to throw away open access just to stop AI scrapers, but many people *don't have a choice* anymore. If the old deal is broken, some people may be forced to take things private.

I don't like it and I want to preserve the open web, but things can't keep going like this. Something has to change to stop or slow these companies down so that people are *able* to keep things open and accessible. And it's the smaller publishers and individuals who are being hurt the most, driven offline or onto more closed platforms.
@misty Integer-overflowing session IDs is fucking bleak, goddamn.
@misty the team I work on recently became responsible for a web crawler, so I'm starting to work on it, and it sure is telling that basically every robots.txt I've looked at while investigating issues disallows at least 5 LLM-company crawlers
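(For reference, here's roughly what those robots.txt rules look like, checked with Python's stdlib parser. The user agents are real LLM-crawler names; the site URL is a placeholder.)

```python
# A sample robots.txt group blocking common LLM crawlers, verified with
# urllib.robotparser from the standard library.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""
User-agent: GPTBot
User-agent: CCBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: Bytespider
Disallow: /
""".splitlines())

print(rp.can_fetch("GPTBot", "https://example.test/post"))          # False
print(rp.can_fetch("SomeArchiveBot", "https://example.test/post"))  # True
```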
@bcj Yeah... that sounds about right. Oof.

@misty

On one side you have the ravenous AI parasites, on the other side you have internet safety bills in places like the UK causing small sites to be shuttered for fear of massive liability due to some random commenter's post.

Not to toot #Cloudflare's horn - especially since internet infrastructure shouldn't be concentrated in a single company (seize the means of computing) - but if they can prevent #AI from scraping the entire #web, that would be nice.

It's not a good idea to let that happen. I mean, at some point it's going to be hard to differentiate a regular user from a bot, but still.

Cloudflare’s Free AI Labyrinth Distracts Crawlers That Could Steal Website Content to Feed AI
https://www.eweek.com/news/cloudflare-ai-labyrinth-generative-ai-content/

@misty pragmatically speaking, are CDNs a solution?
@kepeken If you take a look at the article, they mention it's not *just* the content itself. There was an example of one site with dynamic content that the crawlers were interacting with so much it caused the 32-bit session ID to roll over.
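(For anyone who hasn't run into this: here's a toy sketch, mine and not from the article, of what that rollover looks like when session IDs live in a signed 32-bit counter.)

```python
# Toy illustration of a signed 32-bit session counter wrapping around.
INT32_MAX = 2**31 - 1

def next_session_id(current: int) -> int:
    nxt = (current + 1) & 0xFFFFFFFF                 # truncate to 32 bits
    return nxt - 2**32 if nxt > INT32_MAX else nxt   # reinterpret as signed

print(next_session_id(INT32_MAX))  # -2147483648: over two billion sessions burned
```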
@misty cloudflare (and others) handle this by fingerprinting your browser across enough activity to build up a profile that identifies you, and also identifies you as human. i guess the important question is whether this can be done in a privacy-respecting way. on first glance that sounds contradictory, but cryptography can do a lot of things that sound impossible until you hear how they are done.
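(One concrete example of the "sounds impossible" kind is a blind signature, the primitive behind schemes like Privacy Pass: a site can vouch for you without being able to link the credential back to your visit. A toy textbook-RSA sketch, illustrative only and absolutely not production crypto:)

```python
# Textbook-RSA blind signature demo: the attester signs a token it never
# sees, so the resulting credential can't be linked to the signing session.
import hashlib
import secrets
from cryptography.hazmat.primitives.asymmetric import rsa

key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
priv = key.private_numbers()
n, e, d = priv.public_numbers.n, priv.public_numbers.e, priv.d

# Client: hash a random token, then blind it with a random factor r.
token = secrets.token_bytes(32)
m = int.from_bytes(hashlib.sha256(token).digest(), "big")
r = secrets.randbelow(n - 2) + 2        # gcd(r, n) == 1 with overwhelming odds
blinded = (m * pow(r, e, n)) % n

# Attester: signs the blinded value without ever learning m or the token.
blind_sig = pow(blinded, d, n)

# Client: strip the blinding factor; (token, sig) is now an unlinkable credential.
sig = (blind_sig * pow(r, -1, n)) % n

# Any verifier holding the public key can check the credential later.
assert pow(sig, e, n) == m
print("credential verified without linking it to the signing session")
```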
@kepeken I'd think cost is the much greater problem. If this becomes de facto mandatory to operate a webpage, a lot of people won't be able to afford to do that anymore.

@misty how about this protocol?

if you connect to a website and browse normally, they offer to sign your public key. if you want to connect to a sensitive service (like a forum that needs to query a database for every page), you send them a public key that has been signed many times. the signatures have dates and expire, so you can't sell your key to a scraper.
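(A minimal sketch of what that could look like with Ed25519, assuming a JSON payload carrying the visitor's key and an expiry date; the function names here are invented for illustration.)

```python
# Sketch: a site endorses a visitor's public key with an expiring signature;
# a sensitive service later verifies the endorsement before serving pages.
import json
import time
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

def make_endorsement(site_key, visitor_pub: bytes, ttl_days: int = 30):
    """A site that has seen normal browsing signs the visitor's public key."""
    payload = json.dumps({
        "visitor_pub": visitor_pub.hex(),
        "expires": int(time.time()) + ttl_days * 86400,
    }).encode()
    return payload, site_key.sign(payload)

def check_endorsement(site_pub, payload: bytes, sig: bytes) -> bool:
    """Verify the signature and reject endorsements that have expired."""
    try:
        site_pub.verify(sig, payload)
    except InvalidSignature:
        return False
    return json.loads(payload)["expires"] > time.time()

# Demo: one site endorses a visitor; another site checks the endorsement.
site_key = Ed25519PrivateKey.generate()
visitor_key = Ed25519PrivateKey.generate()
visitor_pub = visitor_key.public_key().public_bytes(Encoding.Raw, PublicFormat.Raw)
payload, sig = make_endorsement(site_key, visitor_pub)
print(check_endorsement(site_key.public_key(), payload, sig))  # True
```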

@kepeken @misty This seems unnecessary, have you heard of TLS client certificates? Those already exist. https://blog.cloudflare.com/introducing-tls-client-auth/ In fact, RedHat uses them in combination with subscription-manager to give you access to their repositories when you have a license. The problem is that in order to issue such a cert you still need to verify the recipient is human.
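(Server-side, requiring a client cert is only a few lines with Python's ssl module; the file paths below are placeholders.)

```python
# Sketch: a TLS server context that refuses clients without a certificate
# signed by our chosen CA.
import ssl

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain("server.crt", "server.key")  # the server's own identity
ctx.verify_mode = ssl.CERT_REQUIRED              # reject cert-less clients
ctx.load_verify_locations("client-ca.crt")       # CA that issues client certs

# Sockets wrapped with ctx.wrap_socket(..., server_side=True) now complete
# the handshake only for clients presenting a cert from client-ca.crt.
```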
@puppygirlhornypost2 @misty the idea is that websites would give you signatures that you could (voluntarily) present to other websites.
@misty @kepeken Not to mention that captchas are mostly becoming obsolete as they can be easily automated. I believe we're talking about Cloudflare's turnstile. It primarily operates by tracking your behavior across the web but it does sometimes fall back to a captcha. I maintain a collection of captchas that are difficult for even sighted people to figure out. They're inaccessible as shit. It's also just plain creepy to have sites tracking my behavior across the net.
@puppygirlhornypost2 @misty @kepeken Turnstile never shows a puzzle to solve; the ones you have must be other CAPTCHAs. But it seems some companies do sell services for breaking it.

@misty

I recall reading an article in Byte magazine back in the early 90s or thereabouts. Even then they were predicting that the internet would end up like a cable TV service, or perhaps a library.
To get any kind of quality, we would need to subscribe to a package of curated services/websites, walled off from other services and the wider internet jungle.

@sleepy62 @misty

That's what Gates et al were expecting too. The AOL/CompuServe/Prodigy model. The early/mid 90s was before commercial use of the internet really took off, so nobody knew what to expect, and people assumed the existing models would hold.

@jonhendry @misty

Yes, that is how much of it started out. Geocities also comes to mind. Maybe the open internet was always a dream that could never survive the onslaught of bad actors and AI bots.

We are going back to the future...

@misty This is a real ugly choice in academia, because many institutions require publications to be OA. Journals are basically being told “either license your content for AI or we will just take it, it’s up to you.”
@misty maybe these jerks could scale back or cache things? I mean, it doesn’t seem like it’s necessary for every AI trainer to go out and scan a blog post from three years ago hourly to see if it has changed. Also, write some forge-specific code to give people’s gitea instances and whatever a break!
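(The plumbing for scaling back already exists in HTTP: conditional requests. A sketch of what a polite recrawl looks like; the URL is a placeholder.)

```python
# Sketch of a polite recrawl: a conditional GET lets the server answer
# 304 Not Modified instead of re-sending the whole page.
import requests

first = requests.get("https://example-blog.test/post")
etag = first.headers.get("ETag")

recheck = requests.get(
    "https://example-blog.test/post",
    headers={"If-None-Match": etag} if etag else {},
)
print(recheck.status_code)  # 304 means unchanged, nothing re-downloaded
```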
@misty
Can larger libraries sue AI companies for breaching their terms of service?
@misty I run an open library service, and those AI bots are continually feasting on it. I can't afford fancy protection services, so I have to live with my users finding a strained service, if it's available at all. And even when my users can't get through, I still have to foot the data bill.
@misty do we have... any viable p2p-based solutions to this problem? maybe?

@SoniEx2 @misty

What about only providing torrent/magnet links to the full publications?

Provide an abstract, yes, but if you want the full papers you have to use torrenting. :D

While this would change the way that things are accessed, it would also directly increase the LLM companies' bandwidth bills. :D

Do it enough that they become more unprofitable to train. :D
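(A sketch of what the abstract page could serve instead of the PDF; the info hash, filename, and tracker here are placeholders, not a real torrent.)

```python
# Sketch: build the magnet URI an abstract page would link to instead of
# hosting the full PDF itself.
import urllib.parse

def magnet_for(info_hash: str, name: str, trackers: list) -> str:
    uri = f"magnet:?xt=urn:btih:{info_hash}&dn={urllib.parse.quote(name)}"
    return uri + "".join(f"&tr={urllib.parse.quote(t)}" for t in trackers)

print(magnet_for(
    "0123456789abcdef0123456789abcdef01234567",
    "paper-fulltext.pdf",
    ["udp://tracker.example.test:6969/announce"],
))
```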

@BillySmith @misty that would be a good start. but then there's git, etc, and we don't think we have any viable options for those yet?
@misty I have seen a number of websites start using "proof of work" holding pages that delay continuing for five seconds or so before giving access.

It holds up the bots, wasting compute cycles, but likely also acts as a method to detect and block them, e.g. a human will wait while a bot may have multiple sessions attempting to hit different URLs.
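(That's essentially hashcash: the browser burns CPU finding a nonce, and the server verifies with a single hash. A minimal sketch; the difficulty and names are mine.)

```python
# Hashcash-style proof of work: the client must find a nonce whose SHA-256
# hash falls below a target; verifying costs the server one hash.
import hashlib
import itertools

def solve(challenge: bytes, difficulty_bits: int = 20) -> int:
    """Brute-force a nonce; ~2**difficulty_bits hashes on average."""
    target = 1 << (256 - difficulty_bits)
    for nonce in itertools.count():
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(challenge: bytes, nonce: int, difficulty_bits: int = 20) -> bool:
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))

nonce = solve(b"session-abc123")         # a few seconds of client CPU
print(verify(b"session-abc123", nonce))  # True, checked instantly
```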

@misty to paraphrase what I said in another thread only a few minutes ago: AI has put us in a vicious cycle. It is flooding the web with useless data that we have to filter out, while also being the only way to easily filter through all that data.

If the systems were not so damn trusting, maybe we'd have a better environment or something on here, but that is a very shallow estimation.

@misty I agree, but there's also the human cost: a lot of these data workers are in what is now the Global South, get paid horribly if at all, and are exposed to traumatic things.

We need digital human rights that protect individuals' data and stop this exploitation

@misty do we even know who are the offenders?
@misty Hi, is it not possible to 'fool' these AI crawlers? I remember a solution for contact lists being stolen and used for email bombing: you create your first contact using only numbers in all of its fields.

@misty In the film Terminator, dogs were used to identify terminators.

I wonder if we could build something similar but with cats.