as some of you may know i made sth called bombai (the name comes from bomba + ai) thats similar in purpose to anubis or iocaine. those solutions are kind of “fine” already, but anubis sucks for regular users too and isnt always effective, and iocaine usually relies on blocklists or similar things.

scrapers unfortunately come in all shapes and sizes, with new or hidden user agents appearing all the time. after using anubis for a while, my forgejo got downed again, so i went looking. iocaine seems like a good idea, but i want something that is sure to stop my git from going down even if i dont maintain it or the lists are incomplete.

what i made now does the following:

  • very configurable detection entirely based on behaviour, without modifying site content
    • request counting
    • one fail = timeout, continuously resetting as long as attempts continue (this combines excellently with trap paths)
    • weighted by path and such
    • blobbing entire subnets together if desired (needed for alibaba’s bot for example)
    • allows setting up “trap paths” that instantly flag someone for timeout upon visit
  • customizable response
    • redirection to iocaine or other trap
    • zip bombs (usually small ones, since most scrapers are smart enough not to fully decompress big ones - and small ones are cheaper on bandwidth anyway)
    • a maze similar to, but less sophisticated than, iocaine's
    • plain http or html response from file
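to illustrate the detection side, here is a minimal sketch of the one-fail-timeout idea with continuous resetting and trap paths. this is illustrative python, not bombai's actual code - the timeout value and the trap paths are made-up:

```python
TIMEOUT = 600                            # made-up: seconds of quiet before an ip is unblocked
TRAP_PATHS = {"/trap", "/secret-admin"}  # made-up example trap paths

last_bad = {}  # ip -> timestamp of that ip's most recent bad request

def is_blocked(ip: str, path: str, now: float) -> bool:
    """one fail starts a timeout; any request during the timeout resets the timer."""
    if path in TRAP_PATHS:
        last_bad[ip] = now  # visiting a trap path instantly flags the ip
        return True
    t = last_bad.get(ip)
    if t is not None and now - t < TIMEOUT:
        last_bad[ip] = now  # still hammering during the timeout -> timer resets
        return True
    return False
```

the key property is that a scraper which keeps retrying never gets unblocked, while an ip that stays quiet for the full timeout is forgiven.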
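and to illustrate why small zip bombs are cheap on bandwidth, a rough sketch (again just illustrative python, the sizes are made-up numbers): a long run of zeros compresses to almost nothing, so it costs little to send but wastes the scraper's cpu and ram if it inflates the response.

```python
import gzip
import io

def small_zip_bomb(uncompressed_mb: int = 10) -> bytes:
    """gzip a run of zero bytes; ~10 MB of zeros shrinks to roughly 10 KB."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as f:
        f.write(b"\0" * (uncompressed_mb * 1024 * 1024))
    # serve this with "Content-Encoding: gzip" so clients decompress it themselves
    return buf.getvalue()
```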

it is worth mentioning explicitly that paths which are expensive for the server to provide (in storage or otherwise) can thus be limited extremely well - and of course most scrapers are blocked long before they first request such a path.

if youre interested and need help setting it up, or just want to chat about it, DO IT I WOULD LOVE THAT.

if a bot that is capable of doing any harm makes it through, i immediately consider that a bug. make an issue and ill debug either your config or the program itself.

ive been using this myself for a while now, and it works excellently with my forgejo.

#opensource #forgejo

@tudbut having outright caddy support is great, i'll be sure to set this up for my instance soon (i don't think i've had an issue with AI bots scraping my server despite it running for over a year now, but this'll still be nice to have)
@EeveeEuphoria it took a few months for me to get issues with it, but eventually scrapers downloaded a ton of /…/…/archive/… files. these get cached on disk for each commit they download, so it filled up my disk and nuked my DB.
@EeveeEuphoria this has happened twice btw. i should probably set up sth that stops forgejo when my disk gets full, but i prefer also going beyond such peaceful solutions
@tudbut@social.tudbut.de I can't even open the website on my regular-ass iPhone using Safari. Are you blocking normal iPhones?
@ulveon very much not. what is it saying?
@tudbut@social.tudbut.de it is not loading at all

I see the progress bar and nothing else

Switching to NordVPN made the website load
shrug

I come from a domestic NL connection, nothing fishy, not a server farm.
@ulveon i did get a fail for User-Agent: Misskey/2025.4.4 (https://derg.social/). the post is kinda being requested a lot atm so false positives for very common ip ranges may happen. ill see what i can do tho one sec
@ulveon and no safari has been blocked in at least 10 mins
@tudbut@social.tudbut.de aha that must be it
@ulveon allow/block ratio is currently 2921 allowed / 280 blocked, with the blocks looking like this, so it seems to have been pretty accurate for the most part xd
@ulveon found the issue: your server is right next to that of SemrushBot lol
@tudbut@social.tudbut.de FWIW it is not mine, I am not an admin or even a moderator here. Do you mean it shares IP with another bad server?
@ulveon it shares a datacenter it seems - im being very strict in my config because of things alibaba did

WHAT THE FUCK

@ulveon

Handling request from 185.191.171.2 (185.188.0.0) for /failure/1766059719494135.html.
is in continuous failure
Request is not OK. Sending you to the gallows.
User-Agent: SemrushBot/7~bl; +http://www.semrush.com/bot.html
i am constructor of zip bomb
done

...

Handling request from 185.189.148.195 (185.188.0.0) for /favicon.ico.
matched costly directive 3
Request is not OK. Sending you to the gallows.
User-Agent: Misskey/2025.4.4 (https://derg.social/)
i am constructor of zip bomb
done
@tudbut@social.tudbut.de who is 185.189.148.195? That is not my IP
@ulveon its the ip of derg.social
@tudbut@social.tudbut.de that makes sense yeah
@ulveon okay nope it isnt apparently?? i have no idea then tbh.
@tudbut thanks, I'm bookmarking this to give it a look as soon as I have time. I currently have a bunch of things set up via fail2ban, but it's not as effective as I would like so I'm always looking for new tools.
@oblomov if you need any help setting it up, hmu. its a single binary that just needs to run in the background (systemd service or whatever), then caddy needs to be configured to route through it, which is like 5 lines of config
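for a rough idea, a hypothetical sketch of what such a caddy config could look like - the port and the assumption that bombai listens locally and proxies on to forgejo are made-up, check bombai's actual docs for the real setup:

```caddyfile
# hypothetical sketch - port 8080 for bombai's listener is an assumption,
# not bombai's documented interface
git.example.com {
	reverse_proxy localhost:8080
}
```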
@tudbut thanks for the offer. I'll try to set it up as soon as I find th time (how soon will that be remains a huge question mark), and if I have issues I'll let you know.
@tudbut love the name. Thank you for creating this!