Mastodawn

jslakro 2d ago

Moving from GitHub to Codeberg, for lazy people

https://unterwaditzer.net/2025/codeberg.html

Moving from GitHub to Codeberg, for lazy people - Markus Unterwaditzer

Show thread

INTPenis

Lazy has nothing to do with it, codeberg simply doesn't work.

Most of my friends who use codeberg are staunch cloudflare-opponents, but cloudflare is what keeps Gitlab alive. Fact of life is that they're being attacked non-stop, and need some sort of DDoS filter.

Codeberg has that anubis thing now I guess? But they still have downtime, and the worst thing ever for me as a developer is having the urge to code and not being able to access my remote. That is what murders the impression of a product like codeberg.

Sorry, just being frank. I want all competitors to large monopolies to succeed, but I also want to be able to do my job/passion.

Show thread

freedomben 2d ago

I've had the same experience.

Philosophically I think it's terrible that Cloudflare has become a middleman in a huge and important swath of the internet. As a user, it largely makes my life much worse. It limits my browser, my ability to protect myself via VPNs, etc, and I am just browsing normally, not attacking anything. Pragmatically though, as a webmaster/admin/whatever you want to call it nowadays, Cloudflare is basically a necessity. I've started putting things behind it because if I don't, 99%+ of my traffic is bots, and often bots clearly scanning for vulnerabilities (I run mostly zero PHP sites, yet my traffic logs are often filled with requests like /admin.php and /wp-admin.php and all the wordpress things, and constant crawls from clearly not search engines that download everything and use robots.txt as a guide of what to crawl rather than what not to crawl. I haven't been DDoSed yet, but I've had images and PDFs and things downloaded so many times by these things that it costs me money. For some things where I or my family are the only legitimate users, I can just firewall-cmd all IPs except my own, but even then it's maintenance work I don't want to have to do.

I've tried many of the alternatives, and they often fail even on legitimate usecases. I've been blocked more by the alternatives than I have by Cloudflare, especially that one that does a proof of work. It works about 80% of the time, but that 20% is really, really annoying to the point that when I see that scren pop up I just browse away.

It's really a disheartening state we find ourselves in. I don't think my principles/values have been tested more in the real world than the last few years.

Show thread

rglullis 2d ago

Either I am very lucky or what I am doing has zero value to bots, because I've been running servers online for at least 15 years, and never had any issue that couldn't be solved with basic security hygiene. I use cloudflare as my DNS for some servers, but I always disable any of their paid features. To me they could go out of business tomorrow and my servers would be chugging along just fine.

Show thread

j16sdiz 2d ago

Sometime it is not security , it could be just bandwidth or CPU.

I have website small enough not to attract too many bot, but sometime, something very innocent can bring my website down.

For example, I put a php ical viewer.. and some crawler start loading the calendar page, taking up all the cpu cycle.

Show thread

dspillett 2d ago

> and use robots.txt as a guide of what to crawl rather than what not to crawl

Mental note, make sure my robots.txt files contain a few references to slowly returning pages full of almost nonsense that link back to each other endlessly…

Not complete nonsense, that would be reasonably easy to detect and ignore. Perhaps repeats of your other content with every 5th word swapped with a random one from elsewhere in the content, every 4th word randomly misspelt, every seventh word reversed, every seventh sentence reversed, add a random sprinkling of famous names (Sir John Major, Arc de Triomphe, Sarah Jane Smith, Viltvodle VI) that make little sense in context, etc. Not enough change that automatic crap detection sees it as an obvious trap, but more than enough that ingesting data from your site into any model has enough detrimental effect to token weightings to at least undo any beneficial effect it might have had otherwise.

And when setting traps like this, make sure the response is slow enough that it won't use much bandwidth, and the serving process is very lightweight, and just in case that isn't enough make sure it aborts and errors out if any load metric goes above a given level.

Show thread

matrss 2d ago

So, basically iocaine (https://iocaine.madhouse-project.org/). It has indeed been very useful to get the AI scraper load on a server I maintain down to a reasonable level, even with its not so strict default configuration.

iocaine - the deadliest poison known to AI

Show thread

willx86 2d ago

https://blog.cloudflare.com/ai-labyrinth/

A bit like this? ( iocaine is newer)

Trapping misbehaving bots in an AI Labyrinth

How Cloudflare uses generative AI to slow down, confuse, and waste the resources of AI Crawlers and other bots that don’t respect “no crawl” directives.

The Cloudflare Blog

Show thread

dwedge 2d ago

While I sympathise, I disagree with your stance. Cloudflare handle a large % of the Internet now because of people putting sites that, as you admitted, don't need to be behind it there.

[dead]

My own git server has been hit severely by scrapers. They're scraping everything. Commits, comparisons between commits, api calls for files, everything.

And pretty much all of them, ByteDance, OpenAI, AWS, Claude, various I couldn't recognize. I basically just had to block all of them to get reasonable performance for a server running on a mini-pc.

I was going to move to codeberg at some point, but they had downtime when I was considering it, I'd rather deal with that myself then.

Show thread

marginalia_nu 2d ago

Anyone actually scraping git repos would probably just do a 'git clone'. Crawling git hosts is extremely expensive, as git servers have always been inadvertent crawler traps.

They generate a URL for every version of every file on every commit and every branch and tag, and if that wasn't enough, n(n+1)/2 git diffs for every file on every commit it has exited on. Even a relatively small git repo with a few hundred files and commit explodes into millions of URLs in the crawl frontier. Server side many of these are very expensive to generate as well so it's really not a fantastic interaction, crawler and git host.

If you run a web crawler, you need to add git host detection to actively avoid walking into them.

Show thread

Tharre 2d ago

And yet, it's exactly what all the AI companies are doing. However much it costs them in server costs and good will seems to be worth less to them then the engineering time to special case the major git web UIs.

Show thread

frevib 2d ago

OP is about Github. Have you seen the Github uptime monitor? It’s at 90% [1] for the last 90 days. I use both Codeberg and Github a lot and Github has, by far, more problems than Codeberg. Sometimes I notice slowdowns on Codeberg, but that’s it.

[1] https://mrshu.github.io/github-statuses/

The Missing GitHub Status Page

Historical GitHub uptime reconstructed from archived status data.

Show thread

kevinfiol 2d ago

To be fair, Github has several magnitudes higher of users running on it than Codeberg. I'm also a Codeberg user, but I don't think anyone has seen a Forgejo/Gitea instance working at the scale of Github yet.

Show thread

z3t4 2d ago

[delayed]