git-pages has a sophisticated multilayer cache system which fails to perform well in exactly one case: if someone sends a lot of requests to domains that don't even have valid sites deployed
because i figured that nobody would do this. certainly that nobody would do it regularly and at incredibly high speed
well. fucking scrapers
@whitequark nginx has an amazing limit_req_module that can easily throttle IPs that do some nasty shit, like sending a lot of 404 requests. You can just tell it to spit 1 bps to connections that fall in a given zone.
It’ll cost you having open connections, but otherwise it’s a cheap way to solve this without adding another layer of caching.
But given you have caddy for TLS provisioning, it’s not immediately obvious how to front it with nginx
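As a rough sketch of the limit_req approach suggested above — the zone name, rates, and upstream address here are illustrative, not from the thread:

```nginx
# track clients by IP in a 10 MB shared zone, allowing 1 request/second each
limit_req_zone $binary_remote_addr zone=abusers:10m rate=1r/s;

server {
    listen 80;

    location / {
        # allow short bursts, answer the rest with 429 instead of queueing
        limit_req zone=abusers burst=5 nodelay;
        limit_req_status 429;

        # to actually trickle bytes in the spirit of the "1 bps" idea,
        # limit_rate (in bytes/second) could be enabled for a flagged zone:
        # limit_rate 1;

        proxy_pass http://127.0.0.1:8080;  # hypothetical upstream
    }
}
```

limit_req rejects or delays excess requests outright, which is usually cheaper than holding throttled connections open with limit_rate.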
@cinebox oh yeah that's basically how i started grebedoc
one thing git-pages intentionally omits is any sort of "run user-provided code in a container" feature, because i believe that most of the solutions here cannot be left unattended if you expect to not be compromised by malware at some point. maybe firecracker vms would work, but that still has a lot of issues. so i just let people use forgejo actions or something if they need processing
@truh @whitequark Believing that LLMs are in fact AI.
I help manage a site where 'deep' URLs follow obvious patterns. The elements are obvious & one can build millions of possible URLs for the site using public info, most of which don’t exist.
The so-called "AI scrapers" have been requesting thousands of such invented URLs at the site all at once, and most of the ones that could be valid take a few seconds to construct from mostly-archived data. The scrapers don’t even wait.
@dpk with these waves it's pretty easy because they send requests with Host: headers whose A/AAAA records don't point to grebedoc and never did (it's something like every domain from the Cloudflare top-1M list, the last time we investigated)
requests that have no business ever being sent to grebedoc
some others try to brute-force wp-admin paths and such
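that kind of Host-header filtering can be sketched in a few lines — a sketch only, not grebedoc's actual code; the address set and function name here are made up:

```python
import socket
from functools import lru_cache

# addresses this server is actually reachable at (hypothetical values)
OUR_ADDRS = frozenset({"198.51.100.7", "2001:db8::7"})

@lru_cache(maxsize=65536)
def host_points_here(host: str, our_addrs: frozenset = OUR_ADDRS) -> bool:
    """True if any A/AAAA record for `host` points at one of our addresses."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False  # NXDOMAIN and friends: definitely not ours
    # infos entries are (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the textual IP address
    return any(info[4][0] in our_addrs for info in infos)
```

the lru_cache makes repeated garbage Host headers cheap after the first lookup; a real deployment would want TTL-bounded caching rather than caching forever.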
@avioletheart.me @solonovamax the current plan is:
domain.tld or domain.tld/directory, you know, like github lets you have a main site and project-specific ones on the same domain
@whitequark @avioletheart.me please use cuckoo or xor filters
they're (to my understanding) simply better than bloom filters
@whitequark @avioletheart.me it seems like there are packages for cuckoo, xor, ribbon, and binary fuse filters
I'm not sure which out of all of those is best, but afaik any of them will be significantly better than bloom filters.
though idk if those packages are good quality, but realistically making your own implementation is relatively straightforward, as these filters are not really doing a lot. they're just a bit of math and then some lookups into a big byte array
@whitequark @avioletheart.me that's fair
libraries exist for basically all of them, and then a quick search comparing cuckoo, xor, and (maybe) binary fuse filters should give you a good enough idea of which one to choose that you don't need to think about it further
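to illustrate the "bit of math plus lookups" point, here is a toy cuckoo filter using partial-key cuckoo hashing — a sketch under simplifying assumptions (Python lists instead of a packed byte array, 8-bit fingerprints), not a production implementation:

```python
import hashlib
import random

class CuckooFilter:
    """Toy cuckoo filter: approximate set membership, small false-positive rate."""

    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500):
        assert num_buckets & (num_buckets - 1) == 0, "power of two for index math"
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _fingerprint(self, item: str) -> int:
        # 8-bit fingerprint; never zero so it can't look like an empty slot
        return hashlib.sha256(item.encode()).digest()[0] or 1

    def _index(self, item: str) -> int:
        h = hashlib.sha256(item.encode()).digest()
        return int.from_bytes(h[1:5], "big") & (self.num_buckets - 1)

    def _alt_index(self, index: int, fp: int) -> int:
        # partial-key cuckoo hashing: the alternate bucket depends only on
        # the fingerprint, so entries can be relocated without the original key
        h = int.from_bytes(hashlib.sha256(bytes([fp])).digest()[:4], "big")
        return index ^ (h & (self.num_buckets - 1))

    def insert(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # both buckets full: evict a random resident and relocate it
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            j = random.randrange(self.bucket_size)
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # table is effectively full

    def contains(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]
```

a lookup touches at most two buckets; false positives come only from fingerprint collisions, and unlike a bloom filter you can delete an item by removing its fingerprint.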
@garthk @solonovamax i knew what i was getting into, that's why git-pages has so many layers of defense woven into it from the start
i just hadn't expected people to send millions of requests to domains that don't even resolve to grebedoc