I really can't overemphasize how destructive this is to the open web. The web is built just as much on a consensus on how to fairly do things as it is on technical measures to keep things going, and AI scrapers have blown that up. https://go-to-hellman.blogspot.com/2025/03/ai-bots-are-destroying-open-access.html

I know Molly White wrote a passionate post recently asking people not to throw away open access just to stop AI scrapers, but many people *don't have a choice* anymore. If the old deal is broken, some people may be forced to take things private.

@misty pragmatically speaking, are CDNs a solution?
@kepeken If you take a look at the article, they mention it's not *just* the content itself. There was an example of one site with dynamic content that the crawlers were interacting with so much it caused the 32-bit session ID to roll over.
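For a sense of scale, here is a rough back-of-the-envelope sketch of how fast a 32-bit counter wraps. The rates are hypothetical and picked only for illustration (the article doesn't give one), and it assumes every request creates a fresh session ID.

```python
# A 32-bit unsigned counter holds 2**32 distinct values before wrapping to 0.
ID_SPACE = 2 ** 32  # 4,294,967,296

SECONDS_PER_DAY = 86_400


def days_until_rollover(new_sessions_per_second: float) -> float:
    """Days until a 32-bit session counter wraps at a steady rate."""
    return ID_SPACE / new_sessions_per_second / SECONDS_PER_DAY


# Hypothetical rates, assuming each request starts a new session.
for rate in (10, 1_000, 50_000):
    print(f"{rate:>6} sessions/s -> {days_until_rollover(rate):,.1f} days")
```

At a sustained 1,000 new sessions per second the counter would wrap in a bit under 50 days; ordinary human traffic on a small site would likely take many years to get there.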
@misty cloudflare (and others) handle this by fingerprinting your browser across enough activity to build up a profile that identifies you, and also identifies you as human. i guess the important question is whether this can be done in a privacy-respecting way. on first glance that sounds contradictory, but cryptography can do a lot of things that sound impossible until you hear how they are done.
@kepeken I'd think cost is the much greater problem. If this becomes de facto mandatory to operate a webpage, a lot of people won't be able to afford to do that anymore.

@misty how about this protocol?

if you connect to a website and browse normally, they offer to sign your public key. if you want to connect to a sensitive service (like a forum that needs to query a database for every page), you send them a public key that has been signed many times. the signatures have dates and expire, so you can't sell your key to a scraper.
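a minimal sketch of the checking side, to make the idea concrete. everything here is an assumption for illustration: the names, the 30-day expiry, and the threshold of 3 signers are made up, and HMAC with a per-site secret stands in for a real public-key signature (a real design would need something like Ed25519 so that other sites can verify without sharing secrets).

```python
import hashlib
import hmac
from dataclasses import dataclass

MAX_AGE_SECONDS = 30 * 24 * 3600  # hypothetical: signatures expire after 30 days
MIN_SIGNERS = 3                   # hypothetical threshold for "signed many times"


@dataclass(frozen=True)
class Attestation:
    issuer: str       # site that vouched for the client key
    issued_at: int    # Unix timestamp covered by the signature
    signature: bytes


def _payload(issuer: str, client_key: bytes, issued_at: int) -> bytes:
    # The date is part of the signed data, so it cannot be altered afterwards.
    return b"|".join([issuer.encode(), client_key.hex().encode(), str(issued_at).encode()])


def attest(issuer: str, issuer_secret: bytes, client_key: bytes, issued_at: int) -> Attestation:
    # STAND-IN: HMAC instead of a real public-key signature (assumption).
    sig = hmac.new(issuer_secret, _payload(issuer, client_key, issued_at), hashlib.sha256).digest()
    return Attestation(issuer, issued_at, sig)


def valid_signers(client_key: bytes, attestations: list, issuer_secrets: dict, now: int) -> set:
    """Return the distinct issuers with a genuine, unexpired signature on client_key."""
    signers = set()
    for a in attestations:
        secret = issuer_secrets.get(a.issuer)
        if secret is None:
            continue  # unknown issuer
        if not 0 <= now - a.issued_at <= MAX_AGE_SECONDS:
            continue  # expired, or dated in the future
        expected = hmac.new(secret, _payload(a.issuer, client_key, a.issued_at), hashlib.sha256).digest()
        if hmac.compare_digest(expected, a.signature):
            signers.add(a.issuer)
    return signers


def may_access(client_key: bytes, attestations: list, issuer_secrets: dict, now: int) -> bool:
    return len(valid_signers(client_key, attestations, issuer_secrets, now)) >= MIN_SIGNERS
```

counting distinct issuers rather than raw signatures means one friendly site can't vouch for a key many times over. the expiry check is what makes a sold key go stale.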

@kepeken @misty This seems unnecessary. Have you heard of TLS client certificates? Those already exist. https://blog.cloudflare.com/introducing-tls-client-auth/ In fact, Red Hat uses them in combination with subscription-manager to give you access to their repositories when you have a license. The problem is that in order to issue such a cert, you still need to verify the recipient is human.
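For reference, a minimal sketch of the server side using Python's standard `ssl` module. The function name and file-path parameters are hypothetical; this is not how Cloudflare or Red Hat configure it, just the general shape of requiring a client certificate.

```python
import ssl


def make_mtls_server_context(server_cert=None, server_key=None, client_ca=None):
    """Build a server-side TLS context that rejects clients without a valid cert.

    All three arguments are hypothetical file paths (PEM); they are optional
    here only so the sketch can be constructed without real files on disk.
    """
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # Require the client to present a certificate during the handshake.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if server_cert:
        ctx.load_cert_chain(certfile=server_cert, keyfile=server_key)
    if client_ca:
        # Only client certs issued by this CA will be accepted.
        ctx.load_verify_locations(cafile=client_ca)
    return ctx
```

The TLS part is the easy bit. Deciding who gets a certificate from that CA is still the open question.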
@puppygirlhornypost2 @misty the idea is that websites would give you signatures that you could (voluntarily) present to other websites.
@misty @kepeken Not to mention that captchas are mostly becoming obsolete, as they can be easily automated. I believe we're talking about Cloudflare's Turnstile. It primarily operates by tracking your behavior across the web, but it does sometimes fall back to a captcha. I maintain a collection of captchas that are difficult for even sighted people to figure out. They're inaccessible as shit. It's also just plain creepy to have sites tracking my behavior across the net.
@puppygirlhornypost2 @misty @kepeken Turnstile never shows a puzzle to solve, so the ones you have must be from other CAPTCHAs. But it seems some companies sell services for breaking it.