We apologize for today's prolonged performance degradation.
We have finally identified all of the 'tricks' the AI crawlers found today; they can no longer bypass the Anubis proof-of-work challenges.

What was new to us: the AI crawlers not only crawl URLs that are actually presented to them by our frontend, they also rewrote those URLs into a format that bypassed our filter rules.

By the way, you can track the changes we have been making via

https://codeberg.org/Codeberg-Infrastructure/scripted-configuration/compare/51618~1..e4aac

@Codeberg Ah I did notice a super long push time this morning. That explains it. Sucks that we have to deal with that crap. But thanks for the transparency.
@Codeberg fucking parasites.
@Codeberg Malware has the same behaviour.

@Codeberg Now OpenAI is giving real users its Atlas browser, so it can scrape while users bypass the security measures and hand over their logins.

Disgusting.

@gimulnautti @Codeberg damn I did not think about that. Let's call it a trojan.
@gimulnautti @Codeberg hmm, I wonder if it has a legit browser ID string that I can block. My guess is it will just spoof Chrome.
@Codeberg Was it only today? I felt like it has been slow for the past few weeks. It's really fast now though! Thank you, once again, for the hard work.
@Codeberg do you really have to be using anubis and not go-away or iocaine cuz anubis is kinda uncomfy with its corporateness
@soop @Codeberg could you elaborate on Anubis’s corporate-ness? I thought it was just a lone intrepid dev.

@tmaher @Codeberg

  • “we’ll make our open version intentionally shit, fuck you pay us”
  • paywalling features, generally
  • being centered around the overkill javascript PoW wall, and promoting that where a non-invasive transparent JS-less check would suffice. I know codeberg disables that, but i still get a sour taste in my mouth from it
  • the author is/was a smelly blanket mdni-er (ageist)

this is a summary of why i have a distaste for it

@soop @Codeberg

Thank you for explaining

@Codeberg Already witnessing that the harm and impact of AI on everyday life is way beyond its benefits!
@Codeberg Are crawlers that fail the test getting sent to a maze of procedurally-generated junk, or are they just told they failed?

@Codeberg AI companies crawl our websites.

We ask them to stop, via the industry-standard robots.txt.

AI companies ignore those rules.

We start blocking the companies themselves with conventional tools like IP rules.

AI companies start working around those blocks.

We invent ways to specifically make life harder for their crawlers (stuff like Anubis).

AI companies put considerable resources into circumventing that, too.

This industry seriously needs to implode. Fast.
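(As an aside, for readers who haven't seen one: the robots.txt mentioned above is just a plain-text file served at the site root. A minimal sketch — GPTBot and CCBot are the publicly documented crawler tokens of OpenAI and Common Crawl; adjust to taste:)

```
# /robots.txt — a request, not an enforcement mechanism.
# Compliant crawlers honor it; the crawlers in this thread ignore it.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```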

As a next step, AI companies are now offering "their" browser (read: Chromium ever so slightly themed with some company bullshit built in)

In part, this is certainly done to have yet another way to crawl the web, but this time user-directed and indistinguishable from actual human requests.

@claudius so as a natural next response, imo we should keep trying to recognize these 'browsers' regardless of the people misled into using them as browsers, and serve them some random trash
@claudius Frankly, I'm kind of half-OK with that one: There's still the troubling copyright aspect, but being the browser and loading nothing but user-viewed content at least gets their load off our servers.
@chrysn second step: DDoS. If they are on the computer anyway, why not deputize them for crawling?
@claudius If there are actual page-consuming users behind every single request, it'd take a colossal effort to pull off a DDoS. Cloudflare (whose business interest admittedly is to over-report DoS attacks) clocks even 2010-level attacks at 600k requests per second, so even with low-attention-span users (maybe 5 s/page), that'd take 3 million humans for the duration of the attack. If someone can somehow convince 3M people to constantly click through slow-loading pages, we have bigger issues than DoS.
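(A quick sanity check of the arithmetic above — all figures are the post's assumptions, not measurements:)

```python
# Assumptions from the post: ~600k requests/second attack rate,
# ~5 seconds of attention per page per user.
attack_rate_rps = 600_000
seconds_per_page = 5

# Each user contributes 1 request every `seconds_per_page` seconds,
# so sustaining the attack needs rate * seconds_per_page users.
users_needed = attack_rate_rps * seconds_per_page
print(users_needed)  # 3000000 — three million concurrent humans
```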
@claudius Of course, if their browsers load content *beyond* what the viewed page is including and the explicit preload links, then those users turned their hosts into part of a botnet willingly, and need to expect blocking like any other botnet.
@chrysn that's what I'm expecting. Most AI venture-capital-backed companies seem to care very little about the social contracts we have. So turning a browser into a remote-controlled crawler would not surprise me in the least.
@claudius I'd hope that their users sooner or later notice that when using that browser, all of a sudden they frequently face error pages rather than content. As soon as it's not user driven any more, we're "just" back to the usual DDoS whack-a-mole, and whenever we win that for some time, their user experience suffers.
@claudius @chrysn I am pretty sure a lot of the AI apps for phone already do that. I saw a lot of AI crawl traffic coming from mobile provider IP addresses.
@mike805 @claudius @chrysn there's an entire industry of "scraper reverse proxies" that take your web request and route it through some unsuspecting person's mobile device. A bunch of mobile advertising SDKs include a client for one of these proxy services, so app developers don't even have to have heard of this industry to participate in & profit from it.

@claudius @chrysn

Yeah, put some zip bombs and infinite directory structures in some directories only crawlers find. Give them a taste of their own medicine.

@chrysn @claudius Nothing but user viewed content? It'll load 1 page for the user and 99 for the company.
@claudius fortunately, any humans using that browser chose to use the slop machine and I don't really care if they can see my website
I get how it makes a hard choice for some people though
If anyone sees this and works at an "AI" company, please do more sabotage lol
@claudius Maybe we need to start blocking the Chrome user agent… because we can't actually tell the difference between original non-AI Chrome, and the malware-infested clones that use the same user agent.
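(A sketch of just how blunt that blocklist would be — this hypothetical filter matches the `Chrome/<version>` token that essentially every Chromium derivative sends, so it hits legitimate Chrome users and AI-company clones alike:)

```python
import re

# Hypothetical user-agent filter: block anything claiming to be Chrome.
# Chromium derivatives (Edge, Brave, and presumably AI-company browsers)
# all carry a "Chrome/<version>" token, so this cannot tell them apart.
CHROME_TOKEN = re.compile(r"\bChrome/\d+")

def is_blocked(user_agent: str) -> bool:
    """Return True if the UA string claims to be Chrome (or a derivative)."""
    return bool(CHROME_TOKEN.search(user_agent))

print(is_blocked("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                 "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"))  # True
print(is_blocked("Mozilla/5.0 (X11; Linux x86_64; rv:128.0) "
                 "Gecko/20100101 Firefox/128.0"))  # False
```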
@claudius @Codeberg Have you tried using Firecrawl as a test for your blocking? It seems to be a popular site that centralizes crawling technology.

@claudius @Codeberg

I feel like we are working towards a point where you have to redesign the whole web to account for AI ignoring rules.

New browsers, new protocols, etc.

@YourShadowDani

I look back at the good old days, when one day a client asked me to bulletproof their websites and computers so nothing could ever be stolen, and I went under the desk and unplugged their first computer.
They learned.

But now with AI it's a whole other level.

@claudius @Codeberg

@YourShadowDani @claudius @Codeberg

Yeah I was thinking about much more strict user agent strings.

Seems like opaqueness suits these AI entities.

@claudius @Codeberg At some point we're going to start paying a lawyer a few dollars to send the AI companies a registered return-receipt-requested letter saying "You are denied access to my web site. I have taken every step possible to prevent you from accessing it. If you continue to circumvent these measures and access my site anyway, you will be billed $1000/access. This fee will take effect 14 days after you receive this notice."

Then start sending bills.

@tknarr @claudius @Codeberg I like your idea. Tencent and ByteDance hit our sites badly and they deserve a massive bill from us.
Definition: circumvent a technological measure from 17 USC § 1201(a)(3) | LII / Legal Information Institute

@alex @tknarr @Codeberg my guess: they don't think copyright applies to them. They claim "fair use" (which is totally ridiculous).

@claudius @alex @tknarr @Codeberg

As I recall, this was one of the most hideous bits of the DMCA: the anti-circumvention parts were decoupled from the copyright bits, and you didn’t need a valid copyright claim to enforce the anti-circumvention bits.

If that’s not true, the UK’s Computer Misuse Act has explicit terms about bypassing access control. It includes prison time as the recommended sentence.

I think the tough part is figuring out who’s doing it, though I could be wrong. Lots of scrapers will fake their user agent and use sketchy residential proxies to get around IP bans, so it’s quite hard to figure out the origin.
@flammableengineering @Codeberg @tknarr this. And I have so many better things to do.
@tknarr @claudius @Codeberg I would expect them to start using proxy companies (registered in, say, Russia) to hide who is really doing the scraping.
@claudius @Codeberg IANAL but I'm wondering if adding explicit copyright text to every page they consume might at least provide some future ability to claim against them, particularly if it included specific fees for use without permission?

@claudius @Codeberg

More folks need to begin adopting ... unorthodox solutions for those groups which have been so wonderful as to ignore robots.txt. Disguised petabyte ZIP bombs. Poisoned pages. Image folders chock full of Nightshade.

The legal argument to be made and adopted here is that if the companies weren't willfully breaking the law, then they wouldn't have subjected themselves to those attacks. It certainly doesn't even fall under entrapment in most cases.

@claudius @Codeberg don't fight them circumventing it. Feed them garbage. I let the crawlers download gigabytes of randomness every day, generated from Discworld books on my instance of Iocaine by @algernon :

https://olyfjan.blomi.is


@claudius @Codeberg

"They" do not appear to be very good with Captcha and similar mechanisms... but it would appear a lot of people don't like them either!

@Owen_G_Richards
The captcha thing they outsourced to humans by creating a browser (Atlas)...

@claudius @Codeberg

@src_esther @claudius @Codeberg

Got my own version of a captcha - which won't stop real peeps from getting through, but the crawlers don't like it - approx 30 per day get bounced away from my website.

@Owen_G_Richards
Well real peeps use the Atlas browser... So how do you know it isn't sending your website info to OpenAI?

@claudius @Codeberg

@src_esther @claudius @Codeberg

Honestly... I have never heard of the Atlas browser.

As for knowing whether they scrape my data or not: if they don't get past the captcha, they won't reach the landing page. I only see whether they passed the captcha or failed... and they fail.

Atlas - OpenAI... I see... perhaps I'll need to find a way to block those "real peeps" that use that browser.

@src_esther @claudius @Codeberg

Could be blocking a lot of people then...

How about I just simply give the whole idea the bird and pack my bags and walk away?

Seems like AI is the invasive degenerative disease of the internet and the only real way to avoid it is isolation/quarantine.

@Owen_G_Richards
Definitely the invasive degenerative disease of the internet. I think that is the correct description.

@claudius @Codeberg