Mastodawn

grebedoc had its highest share yet of serving garbage requests yesterday (a wave peaking at 150 req/sec)

these waves are getting bigger and bigger which is somewhat concerning. it's nowhere near the hardware capacity yet but i'm hitting some software bottlenecks that i've never thought would be relevant

Show thread

✧✦Catherine✦✧

git-pages has a sophisticated multilayer cache system which fails to perform well in exactly one case: if someone sends a lot of requests to domains that don't even have valid sites deployed

because i figured that nobody would do this. certainly that nobody would do it regularly and at incredibly high speed

well. fucking scrapers

Show thread

✧✦Catherine✦✧1d ago

i'm going to have to add a Bloom filter and another cache invalidation mechanism which i'm not enthusiastic about but it seems prudent to do it before it results in an outage (grebedoc has never had a scraper-induced outage so far, and neither had the codeberg git-pages instance)

Show thread

Eashwar 1d ago

@whitequark Is the additional cache invalidation to handle removals from the bloom filter? Are you just planning to rebuild the bloom filter periodically or ...?

Show thread

✧✦Catherine✦✧1d ago

@e_nomem rebuild whenever a domain is added or removed (or on a superset of those operations, ideally a small superset to avoid waste of resources) but not more often than e.g. 60s

Show thread

Taneb 1d ago

@whitequark do you need to rebuild when a domain is removed? Given that there'll be false positives anyway... (and inserting a domain should be relatively cheap with a Bloom filter, until the false positive rate gets higher than you want it to)

Show thread

✧✦Catherine✦✧1d ago

@Taneb yeah now that you mention it, not really

Show thread

Chris 1d ago

@whitequark bloom filter was also the first thing that came to my mind when reading this thread. It’s the Wild West out there, apparently.

Show thread

yopp 15h ago

@whitequark nginx has amazing limit_req_module that can easily throttle IP’s that do some nasty shit, like doing a lot of 404 request. You can just tell it to spit 1bbps to connections that fall in given zone.

It’ll cost you having open connections, but otherwise cheap way to solve this without doing another layer of caching.

But given you have caddy for tls provision, it’s not immediately obvious how to front it with nginx

Module ngx_http_limit_req_module

Show thread

Andrew 1d ago

@whitequark now that I actually look at git pages I realize it’s exactly the thing I wanted to make a few months back to replace ReadTheDocs, I’ll have to try it out, thanks

Show thread

✧✦Catherine✦✧1d ago

@cinebox oh nice, how much replacing are we talking about? like for your own needs or as a service for others?

Show thread

Andrew 1d ago

@whitequark just for myself and some community projects

Show thread

✧✦Catherine✦✧1d ago

@cinebox oh yeah that's basically how i started grebedoc

one thing git-pages intentionally omits is any sort of "run user-provided code in a container" because i believe that most of the solutions here cannot be left unattended if you expect to not be compromised by malware at some point. maybe firecracker vms would work but this still has a lot of issues. so i just let people use forgejo actions or something if they need processing

Show thread

Andrew 1d ago

@whitequark yeah I already have forgejo actions for that. I just needed a solution for deploying the resulting html.

Show thread

✧✦Catherine✦✧1d ago

@cinebox yeah https://codeberg.org/git-pages/action

action

Forgejo Action for uploading a directory to a git-pages site

Codeberg.org

Show thread

truh 1d ago

@whitequark how does one even mess up scraping that badly?

@truh beats me

@truh @whitequark Believing that LLMs are in fact AI.

I help manage a site where 'deep' URLs follow obvious patterns. The elements are obvious & one can build millions of possible URLs for the site using public info, mostly of which don’t exist.

The so-called "AI Scrapers" have been asking for thousands of such invented URLs at the site all at once, with most of those which could be correct taking a few seconds to construct from mostly-archived data. The scrapers don’t even wait.