We apologize for today's prolonged performance degradation.
We have finally identified all of the 'tricks' the AI crawlers found today; they can no longer bypass the Anubis proof-of-work challenges.

What was new to us: the AI crawlers not only crawl URLs that are actually presented to them by our frontend, they also rewrote those URLs into a format that bypassed our filter rules.

By the way, you can track the changes we have been making via

https://codeberg.org/Codeberg-Infrastructure/scripted-configuration/compare/51618~1..e4aac

@Codeberg Ah I did notice a super long push time this morning. That explains it. Sucks that we have to deal with that crap. But thanks for the transparency.
@Codeberg fucking parasites.
@Codeberg Malware has the same behaviour.

@Codeberg Now OpenAI is giving real users its Atlas browser, so it can scrape while users bypass the security measures and hand over their logins.

Disgusting.

@gimulnautti @Codeberg damn I did not think about that. Let's call it a trojan.
@gimulnautti @Codeberg hmm, I wonder if it has a legit browser ID string that I can block. My guess is it will just spoof Chrome.
@Codeberg Was it only today? I felt like it has been slow for the past few weeks. It's really fast now though! Thank you, once again, for the hard work.
@Codeberg do you really have to be using anubis and not go-away or iocaine cuz anubis is kinda uncomfy with its corporateness
@soop @Codeberg could you elaborate on Anubis’s corporate-ness? I thought it was just a lone intrepid dev.

@tmaher @Codeberg

  • “we’ll make our open version intentionally shit, fuck you pay us”
  • paywalling features, generally
  • being centered around the overkill javascript PoW wall, and promoting that where a non-invasive transparent JS-less check would suffice. I know codeberg disables that, but i still get a sour taste in my mouth from it
  • the author is/was a smelly blanket mdni-er (ageist)

this is a summary of why i have a distaste for it

@soop @Codeberg

Thank you for explaining

@Codeberg Already witnessing that the harm and impact of AI on everyday life is way beyond its benefits!
@Codeberg Are crawlers that fail the test getting sent to a maze of procedurally-generated junk, or are they just told they failed?

@Codeberg AI companies crawl our websites.

We ask them to stop, via the industry-standard robots.txt.

AI companies ignore those rules.

We start blocking the companies themselves with conventional tools like IP rules.

AI companies start working around those blocks.

We invent ways to specifically make life harder for their crawlers (stuff like Anubis).

AI companies put considerable resources into circumventing that, too.

This industry seriously needs to implode. Fast.
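(As an aside, for readers who haven't seen one: the robots.txt mentioned above is just a plain-text file served at the site root. A minimal sketch — GPTBot and CCBot are the publicly documented crawler tokens of OpenAI and Common Crawl; adjust to taste:)

```
# /robots.txt — a request, not an enforcement mechanism.
# Compliant crawlers honor it; the crawlers in this thread ignore it.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```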

As a next step, AI companies are now offering "their" browser (read: Chromium ever so slightly themed with some company bullshit built in)

In part, this is certainly done to have yet another way to crawl the web, but this time user-directed and indistinguishable from actual human requests.

@claudius so as a natural next response, imo we should keep trying to recognize these 'browsers' regardless of the people misled into using them as browsers, and serve them some random trash
@claudius Frankly, I'm kind of half-OK with that one: There's still the troubling copyright aspect, but being the browser and loading nothing but user-viewed content at least gets their load off our servers.
@chrysn second step: DDoS. If they are on the computer anyway, why not deputize them for crawling?
@claudius If there are actual page-consuming users behind every single request, it'd take a colossal effort to pull off a DDoS. Cloudflare (whose business interest admittedly is to over-report DoS attacks) clocks even 2010-level attacks at 600k requests per second, so even with low-attention-span users (maybe 5 s/page), that'd take 3 million humans for the duration of the attack. If someone can somehow convince 3M people to constantly click through slow-loading pages, we have bigger issues than DoS.
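(A quick sanity check of the arithmetic above — all figures are the post's assumptions, not measurements:)

```python
# Assumptions from the post: ~600k requests/second attack rate,
# ~5 seconds of attention per page per user.
attack_rate_rps = 600_000
seconds_per_page = 5

# Each user contributes 1 request every `seconds_per_page` seconds,
# so sustaining the attack needs rate * seconds_per_page users.
users_needed = attack_rate_rps * seconds_per_page
print(users_needed)  # 3000000 — three million concurrent humans
```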
@claudius Of course, if their browsers load content *beyond* what the viewed page is including and the explicit preload links, then those users turned their hosts into part of a botnet willingly, and need to expect blocking like any other botnet.
@chrysn that's what I'm expecting. Most AI venture-capital-backed companies seem to care very little about the social contracts we have. So turning a browser into a remote-controlled crawler would not surprise me in the least.
@claudius I'd hope that their users sooner or later notice that when using that browser, all of a sudden they frequently face error pages rather than content. As soon as it's not user driven any more, we're "just" back to the usual DDoS whack-a-mole, and whenever we win that for some time, their user experience suffers.
@claudius @chrysn I am pretty sure a lot of the AI apps for phone already do that. I saw a lot of AI crawl traffic coming from mobile provider IP addresses.
@mike805 @claudius @chrysn there's an entire industry of "scraper reverse proxies" that take your web request and route it through some unsuspecting person's mobile device. A bunch of mobile advertising SDKs include a client for one of these proxy services, so app developers don't even have to have heard of this industry to participate in & profit from it.

@claudius @chrysn

Yeah, put some zip bombs and infinite directory structures in some directories only crawlers find. Give them a taste of their own medicine.

@chrysn @claudius Nothing but user viewed content? It'll load 1 page for the user and 99 for the company.
@claudius fortunately, any humans using that browser chose to use the slop machine and I don't really care if they can see my website
I get how it makes a hard choice for some people though
If anyone sees this and works at an "AI" company, please do more sabotage lol
@claudius Maybe we need to start blocking the Chrome user agent… because we can't actually tell the difference between original non-AI Chrome, and the malware-infested clones that use the same user agent.
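(A sketch of just how blunt that blocklist would be — this hypothetical filter matches the `Chrome/<version>` token that essentially every Chromium derivative sends, so it hits legitimate Chrome users and AI-company clones alike:)

```python
import re

# Hypothetical user-agent filter: block anything claiming to be Chrome.
# Chromium derivatives (Edge, Brave, and presumably AI-company browsers)
# all carry a "Chrome/<version>" token, so this cannot tell them apart.
CHROME_TOKEN = re.compile(r"\bChrome/\d+")

def is_blocked(user_agent: str) -> bool:
    """Return True if the UA string claims to be Chrome (or a derivative)."""
    return bool(CHROME_TOKEN.search(user_agent))

print(is_blocked("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                 "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"))  # True
print(is_blocked("Mozilla/5.0 (X11; Linux x86_64; rv:128.0) "
                 "Gecko/20100101 Firefox/128.0"))  # False
```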
@claudius @Codeberg Have you tried using Firecrawl as a test for your blocking? It seems to be a popular site that centralizes crawling technology.

@claudius @Codeberg

I feel like we are working towards a point where you have to redesign the whole web to account for AI ignoring rules.

New browsers, new protocols, etc.

@YourShadowDani

I look back at the good old days, when one day a client asked me to bulletproof their websites and computers so nothing could ever be stolen, and I went under the desk and unplugged their first computer.
They learned.

But now with AI it's a whole other level.

@claudius @Codeberg

@YourShadowDani @claudius @Codeberg

Yeah I was thinking about much more strict user agent strings.

Seems like opaqueness suits these AI entities.

@claudius @Codeberg At some point we're going to start paying a lawyer a few dollars to send the AI companies a registered return-receipt-requested letter saying "You are denied access to my web site. I have taken every step possible to prevent you from accessing it. If you continue to circumvent these measures and access my site anyway, you will be billed $1000/access. This fee will take effect 14 days after you receive this notice."

Then start sending bills.

@tknarr @claudius @Codeberg I like your idea. Tencent and ByteDance hit our sites badly and they deserve a massive bill from us.
Definition: circumvent a technological measure from 17 USC § 1201(a)(3) | LII / Legal Information Institute

@alex @tknarr @Codeberg my guess: they don't think copyright applies to them. They claim "fair use" (which is totally ridiculous).

@claudius @alex @tknarr @Codeberg

As I recall, this was one of the most hideous bits of the DMCA: the anti-circumvention parts were decoupled from the copyright bits, and you didn’t need a valid copyright claim to enforce the anti-circumvention bits.

If that’s not true, the UK’s Computer Misuse Act has explicit terms about bypassing access control. It includes prison time as the recommended sentence.

I think the tough part is figuring out who’s doing it, though I could be wrong. Lots of scrapers will fake their user agent and use sketchy residential proxies to get around IP bans, so it’s quite hard to figure out the origin.
@flammableengineering @Codeberg @tknarr this. And I have so many better things to do.
@tknarr @claudius @Codeberg I would expect them to start using proxy companies (registered in, say, Russia) to hide who is really doing the scraping.
@claudius @Codeberg IANAL but I'm wondering if adding explicit copyright text to every page they consume might at least provide some future ability to claim against them, particularly if it included specific fees for use without permission?

@claudius @Codeberg

More folks need to begin adopting ... unorthodox solutions for those groups which have been so wonderful as to ignore robots.txt. Disguised petabyte ZIP bombs. Poisoned pages. Image folders chock full of Nightshade.

The legal argument to be made and adopted here is that if the companies weren't willfully breaking the law, then they wouldn't have subjected themselves to those attacks. It certainly doesn't even fall under entrapment in most cases.

@claudius @Codeberg don't fight them circumventing it. Feed them garbage. I let the crawlers download gigabytes of randomness every day, generated from Discworld books on my instance of Iocaine by @algernon :

https://olyfjan.blomi.is


@claudius @Codeberg

"They" do not appear to be very good with Captcha and similar mechanisms... but it would appear a lot of people don't like them either!

@Owen_G_Richards
The captcha thing they outsourced to humans by creating a browser (Atlas)...

@claudius @Codeberg

@src_esther @claudius @Codeberg

Got my own version of a captcha - which won't stop real peeps from getting through, but the crawlers don't like it - approx 30 per day get bounced away from my website.

@Owen_G_Richards
Well real peeps use the Atlas browser... So how do you know it isn't sending your website info to OpenAI?

@claudius @Codeberg

@src_esther @claudius @Codeberg

Honestly... I have never heard of the Atlas browser.

As for knowing whether they scrape my data or not: if they don't get past the captcha, they won't reach the landing page. I only see whether they passed the captcha or failed... and they fail.

Atlas - OpenAI... I see... perhaps I'll need to find a way to block those "real peeps" that use that browser.

@src_esther @claudius @Codeberg

Could be blocking a lot of people then...

How about I just simply give the whole idea the bird and pack my bags and walk away?

Seems like AI is the invasive degenerative disease of the internet and the only real way to avoid it is isolation/quarantine.

@Owen_G_Richards
Definitely the invasive degenerative disease of the internet. I think that is the correct description.

@claudius @Codeberg