wait, how often does the whole region disappear anyway?
that was never a concern multiple employers ago when i got to help out at the datacenter
they did redundant everything inside the rack, regular cable-yank failover tests and everything, but no geographical redundancy iirc
maybe i'll inquire about a vm on another host within the same rack when i get closer to dragging clients on board and just forget about higher availability than that for now
embarrassed to admit that i've today taken one halfhearted step toward learning wtf snmp is by way of (re)reading the rrdtool tutorial
no, not smtp the email sending thing. snmp the monitoring of hardware status thing
all because i want to put up some pretty charts of computer doing inscrutable computer thing
(accuracy? that's like number seven or twelve down the list of nice-to-haves)
well, actually,
my ipv4-only client
-> colocataires' ipv4-to-ipv6 sniproxy on port 443
-> my ipv6-only vm
-> haproxy to unwrap proxy protocol
-> prosody xmpp server
... experiment is not working ✨✨
so:
did it never work and i mistakenly thought that it did?
or
did it work at first but i broke it?
an easy fix would be to get an ipv4 address which obviates the need for sniproxy. but dammit before i do that i want answers: is this setup possible? if so, what'd i mess up?
(maniacal cackling)
i have finally got iocaine installed. wasn't even hard, just needed to sit down and do the steps and brain is real good at not that sometimes
hooked it up to the apt-installable anarchism faq for its markov corpus and the biggest canadian flavored apt-installable wordlist i could get
feels good. like the invulnerability you get from your favorite winter gloves and jacket before going out to play in the blizzard
now it's safe to blog again
i um only just now noticed that the apt-installable anarchism faq, in uncompressed markdown format, which i fed to iocaine for its markov corpus,
is twelve megabytes. of text.
almost 1.9 million words.
iocaine seems to be doing just fine so far
accidentally set caddy to syslog every request sent to iocaine 3 and oh gosh my website is pumping so much poison markov trash into chatgpt and claude rn
and it's using less cpu and memory than systemd-journald to do so
might need to look into setting bandwidth limiters on this thing
i'm still casting around for anti-cloud(flare) mechanisms of regional failover. like if the cable to the datacenter i use gets cut, or there's political upheaval, how to automatically shunt traffic to a different datacenter faster than a dns update would propagate through caches
i'm vaguely aware of this technology called anycast but i don't know much
https://grebedoc.dev/ uses https://rage4.com/ to do it
yeah eat it, ai scraper assholes
(gradually improving my monitoring, iocaine stats newly added to my collectd/rrdtool dashboard)
tiddlywiki doesn't come with a basic to-do feature, to make checkboxes and tick them off without having to tediously edit the page and type some [x]s
but it does have a plugin mechanism. found two plugins (both by the same author) that do checklists: Kara and Todolist
installation instructions made me nervous though, since i'm using tiddlyPWA that is rather different on the backend...
i haven't put any rate limiters on here yet (i definitely will), but seems like claude and chatgpt limit themselves to 25 requests per second to my websites. i wonder how they picked that number, and if they'll ramp it up. and if i ratelimit, will they send more requests from other ip addresses. etc.
feels so good to know these assholes' language models are chugging down low-effort ungrammatical poison after ignoring my robots.txt
should i do traffic shaping using tc, haproxy, or shove yet another plugin into caddy?
should i slow the response down to a trickle for all the llm scrapers, or randomly drop their connections?
despite it being part of linux since version 2.2, which is about as long as i've been daily-driving it, i hadn't heard of tc until this past month. that's "traffic control," a tool to control the kernel's network traffic limiting, smoothing, and prioritization
and for a command with such a tiny name wow it's a lot
i only want to restrict the bandwidth of one process so i think i'll look for easier mechanisms before i attempt to swallow this whole burrito
til: trickle, a lightweight userspace bandwidth shaper
could i just wrap iocaine with this and be done?
... except trickle doesn't work on statically linked executables, like iocaine. womp womp
i guess i could do a trick like wrap socat with it, then talk to iocaine through that,
but that feels more complicated than just switching back to haproxy and using its builtin traffic shaping features
what bits of haproxy, lighttpd, nginx, caddy, static-web-server should i string together?
requirements:
dang i gotta draw up a feature matrix or something
it's pretty weird that it took me this long to actually do but
tonight i have set up for the first time a program running on a computer inside my home, that people may access like a normal website, without learning my cable modem's ip address in the process, and if someone starts ddosing me i can just unplug and let the household continue watching videos unaware
(i'm having my @colocataires vps proxy traffic through a tailscale vpn to my closet fileserver)
safe(r) home-hosting by reverse proxy from a little computer in a datacenter is one of those things that seems like complex esoteric engineering from afar
but once you've experienced it, and then again when you've set it up yourself, all of a sudden it makes sense and is totally normal and a whole mess of possibilities for what you can cheaply and casually build on the internet blasts wide open
like the first time you experience nerd astral projection
llm scrapers ignoring my robots.txt and pounding on my small website 28 times per second, 24/7. 600kbps of my available bandwidth wasted just on markov trash
it's easy to imagine how they'll ddos any service that does a bit of compute on each request
it's not super exciting but if you're the kind of weirdo who wants to look at my vm's gauges, they are viewable here:
https://telemetry.orbital.rodeo/
i have been cobbling it together using collectd, rrdtool, and scripts instead of the far more reasonable and popular prometheus / grafana combo. because it might be more lightweight? haven't measured
for now it updates only when i run the command, so don't sit there wondering
no light mode or explanatory text (yet) soz
oof, it's somewhat heavy though. went from about 3% avg cpu use to about 6%
the throttling that haproxy does just gets buffered up by caddy in front of it and the result is a long initial delay before a fast transmission of data. like latency
which could probably be implemented more simply with a sleep statement somewhere
i wonder what other strategies i can use to slow down crawlers. thinking random connection drops or http errors 429, 402, and 451
problem: once detected, how best to slap back at ai scrapers? return poison quickly? tarpit? throttled poison drip? drop their ip's packets at the firewall?
idea: drop packets during business hours to free up bandwidth for legit visitors; fast poison otherwise to collect ip addresses for next day's ip ban. "party all night sleep all day" strategy
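a minimal sketch of that "party all night sleep all day" switch, in python. the 9-to-5 window and the names `BUSINESS_START`, `BUSINESS_END`, and `scraper_action` are made up for illustration; the real decision would live wherever the firewall or proxy rules get generated.

```python
from datetime import time

# assumed business hours; the post doesn't say which hours count
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def scraper_action(now: time) -> str:
    """Pick what to do with a detected scraper at local time `now`.

    'drop'   -> discard packets at the firewall, saving bandwidth for visitors
    'poison' -> serve markov trash quickly and log the address for tomorrow's ban
    """
    if BUSINESS_START <= now < BUSINESS_END:
        return "drop"
    return "poison"


if __name__ == "__main__":
    for t in (time(8, 59), time(12, 0), time(17, 0), time(23, 30)):
        print(t, scraper_action(t))
```

this only picks the policy; collecting the night's addresses and turning them into the next day's ban list would be a separate step.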
lots of bot traffic hitting port 80 (http) on my vm just to get redirected to port 443 (https) where they get a "go away, bot" error
who am i keeping port 80 open for?
who types in "orbital.rodeo," lands on http, and doesn't know to or can't try https instead?
many people use hsts and abandon 80
caddy auto-magically puts a redirect on 80 for my sites but i'm increasingly annoyed by its magic. wanna go back to haproxy
think i'll shut it
oh, whoopsie
if an ipv4->ipv6 proxy is telling my webserver the clients' ipv4 addresses using proxy protocol, blocking those ipv4 addresses at the webserver firewall isn't going to do much 🤦
i have a v4 address now, i was just lazy about reconfiguring dns to send v4 web traffic direct to my vm instead of through the v4-v6 proxy. time to get on that
i wonder how many innocents i'll accidentally shut out if i adopt a policy of, "any /24 prefix with 3 or more scrapers within it dooms the lot"?
i could set up a "pls let me back in" automation. tell me my biceps are eleven out of ten in this web-form and you get added to an inclusion list that takes effect before the block list
i could implement both of those defense mechanisms
reduce bookkeeping on my part by being a bit overeager about blocking whole prefixes instead of individual ip addresses
definitely want to do something like @alex's butlerian jihad where i block all networks from any ASN abusing my sites
but also, have a cooldown that sends traffic from blocked prefixes to a "let me back in" form that allowlists individual addresses
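a rough python sketch of the "3 or more scrapers dooms the /24" policy, with the allowlist checked before the block list. ipv4 only; `doomed_prefixes` and `is_blocked` are hypothetical names, and in practice the result would be pushed into firewall sets rather than checked in python.

```python
from collections import Counter
from ipaddress import ip_address, ip_network


def doomed_prefixes(scraper_ips, threshold=3, prefix_len=24):
    """Return the prefixes containing at least `threshold` distinct scraper IPs."""
    counts = Counter(
        ip_network(f"{ip}/{prefix_len}", strict=False)
        for ip in set(scraper_ips)  # de-duplicate so repeat offenders count once
    )
    return {net for net, n in counts.items() if n >= threshold}


def is_blocked(ip, doomed, allowlist):
    """Allowlisted addresses win; otherwise block anything inside a doomed prefix."""
    addr = ip_address(ip)
    if addr in allowlist:
        return False
    return any(addr in net for net in doomed)


if __name__ == "__main__":
    scrapers = ["192.0.2.10", "192.0.2.77", "192.0.2.200", "198.51.100.5"]
    doomed = doomed_prefixes(scrapers)
    allow = {ip_address("192.0.2.50")}  # someone who filled out the form
    print(doomed)
    print(is_blocked("192.0.2.50", doomed, allow))
    print(is_blocked("192.0.2.51", doomed, allow))
```

the threshold and prefix length are knobs; a lower threshold means less bookkeeping but more innocents sent to the form.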
haha oops i accidentally banned my own ip. fixed it but guessing i'll have to flush the ban lists and rebuild in case i caught any more i shouldn't have
one super nice thing i'm doing this time around is using a wireguard-based vpn for all my ssh'ing. so even when i blocked my own ip address my ssh session was unaffected and i could fix it. and zero log spam from vulnerability scanners constantly trying the door
i want to block any requests from google and facebook; also i want to block any isp who would tolerate scrapers
the database of ip range ("prefix") assignments is downloadable but it's big. 590 entries just for as32934 (facebook). too big to just dump into the firewall
but there's often nothing between multiple records for any given asn. maybe i could treat that as a single range, which would let me express the set of ranges to block more concisely
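a small python sketch of the conservative version of that idea: python's standard `ipaddress.collapse_addresses` merges prefixes that overlap or sit exactly side by side. it does not bridge unassigned gaps between records, which is the more aggressive thing the post is wondering about. `merge_prefixes` is a made-up helper name and the prefixes are documentation ranges, not real asn data.

```python
from ipaddress import collapse_addresses, ip_network


def merge_prefixes(prefixes):
    """Collapse overlapping or adjacent prefixes into the fewest CIDR blocks.

    All prefixes must be the same IP version, or collapse_addresses raises TypeError.
    """
    nets = [ip_network(p) for p in prefixes]
    return list(collapse_addresses(nets))


if __name__ == "__main__":
    records = [
        "192.0.2.0/25",
        "192.0.2.128/25",    # adjacent to the one above, merges into a /24
        "198.51.100.0/24",
        "198.51.100.64/26",  # already covered by the /24 above
    ]
    for net in merge_prefixes(records):
        print(net)
```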
"too big to just dump into the firewall"
whoopsie, that was a wrong assumption on my part based on a bad time i had with way too many iptables firewall rules created by fail2ban many years ago
these days i'm using nftables and its set structure to hold ip addresses, which uses radix trees just like the routing tables do, and you can dump addresses in there all day long, it will manage merging them into ranges and auto expiring them if you want, works great
so upthread i was surprised that you can just shovel truckloads of ip addresses into nftables' "set" structure, for blockin' purposes
but i want to do stuff like detect if several addresses within some autonomous system's range are coordinating for shenanigans, and block the whole damn asn
this example, on the nftables wiki itself, loads a whole ass maxmind geoip db into nftables' "map" structure and my first reaction was "surely not"
https://wiki.nftables.org/wiki-nftables/index.php/GeoIP_matching
woah cool i just learned about the nftables feature concatenations
i'm already 🤩 about nftables' very fast sets and maps but today i learned that you can store essentially tuples of data in them
which in some cases can let you test multiple conditions at once, replacing multiple rules with a fast set-membership check
about 1.5 days after asking iocaine to not just poison but also block ai scrapers masquerading as browsers, i have about 36000 ip addresses blocked at the firewall
this is for a site that is not advertised anywhere, disliked by search engines, and contains maybe 10 blog posts that rarely change. AND which preemptively blocks several whole gafam corporate ASNs so not even counting them
so i expect more popular sites are seeing many multiples of this traffic
anyway, thinking again about how to analyze this ever growing set of blocked ai scraper addresses, most of which are probably "residential ips."
calculate for each asn the percentage of its ip range that i've blocked, and above a certain threshold block the whole range? (that would be more efficient than recording every single bad address)
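a sketch of that per-asn calculation in python. it assumes you already have each asn's prefix list from the downloadable database and that the prefixes don't overlap (otherwise the total gets double counted). `blocked_fraction` and `should_block_asn` are hypothetical names, and the 1% threshold is a placeholder, not a recommendation.

```python
from ipaddress import ip_address, ip_network


def blocked_fraction(asn_prefixes, blocked_ips):
    """Fraction of an ASN's address space that is currently blocked."""
    nets = [ip_network(p) for p in asn_prefixes]
    total = sum(n.num_addresses for n in nets)
    if total == 0:
        return 0.0
    blocked = {ip_address(b) for b in blocked_ips}  # distinct addresses only
    hits = sum(1 for ip in blocked if any(ip in n for n in nets))
    return hits / total


def should_block_asn(fraction, threshold=0.01):
    """True once the blocked share of the ASN reaches the (assumed) threshold."""
    return fraction >= threshold


if __name__ == "__main__":
    prefixes = ["192.0.2.0/24"]  # pretend this is one ASN's whole range
    blocked = [f"192.0.2.{i}" for i in range(64)] + ["198.51.100.1"]
    frac = blocked_fraction(prefixes, blocked)
    print(frac, should_block_asn(frac))
```

scanning every prefix for every address is slow for big lists; it's fine for a sketch, but a real version would want a sorted or tree-based lookup.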
ideas contd.:
have an unblocked subdomain where a legit user of a blocked ip might fill out a form and click a "let me back in" button to get onto an allow-list
double extra forever-ban anybody that uses the "let me back in" button then starts snarfing down poison again
also, at some point, i want to bring iocaine to work. i'm on easy mode now because idgaf about my site's visibility to search engines
but what do i do when boss suggests that, when customers ask their ai bullshit to order from our website on their behalf, maybe i shouldn't reply with an HTTP redirect into the fucking sun with gigabytes of foul invective zip bomb for the response body
ideas contd.:
live-updated status page listing all the ip addresses i've blocked, in nice formats for easy import into firewalls, tools for consuming and contributing to said databases
serious looking landing page for blocked addresses "your ip address is sending malicious traffic to this domain and has been reported. check for compromise immediately."
live-updated ASN leaderboard naming and shaming those with the most ip addresses used by ai scrapers
ideas contd.:
undo my mild mitigation against syn flood, crank the synack retries back up, and collect the ip addresses guilty of doing it. for blocking
make caddy 'abort' the connection after one ioproxy poison reply, which closes the socket and blocks ip addresses faster
ideas contd.:
i'm extremely doubtful that most isps will give any shits at all about complaints that llm bots are using their network to destroy websites
i was thinking upthread about an error message to show to legit users of residential ips who get blocked from services; showing them a scolding message like "your ip has been sending malicious traffic"
but maybe more effective will be to direct them, with contact info, to their own isp's abuse line
ideas cont'd.:
poison url generator that encodes the spider's address, so when the headless browsers on residential ips begin scraping them we know which big tech cos are buying access to residential ip address proxies to disguise themselves
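one way that could look, sketched in python with only the standard library: pack the spider's address into the url together with a short hmac tag, so a later request for that url can be traced back to whoever first received the link, and tokens can't be forged. the `/maze/` path, `SECRET`, `poison_path`, and `spider_from_path` are all invented here; this is not how iocaine generates its urls.

```python
import base64
import hashlib
import hmac
from ipaddress import ip_address

SECRET = b"change-me"  # hypothetical server-side key
TAG_LEN = 8            # bytes of HMAC kept in the token


def poison_path(spider_ip: str) -> str:
    """Build a poison URL path that carries the requesting spider's address."""
    packed = ip_address(spider_ip).packed  # 4 bytes for v4, 16 for v6
    tag = hmac.new(SECRET, packed, hashlib.sha256).digest()[:TAG_LEN]
    token = base64.urlsafe_b64encode(packed + tag).rstrip(b"=").decode()
    return f"/maze/{token}"


def spider_from_path(path: str):
    """Recover the original spider address, or None if the token is invalid."""
    token = path.rsplit("/", 1)[-1]
    try:
        raw = base64.urlsafe_b64decode(token + "=" * (-len(token) % 4))
    except ValueError:
        return None
    packed, tag = raw[:-TAG_LEN], raw[-TAG_LEN:]
    if len(packed) not in (4, 16):
        return None
    expected = hmac.new(SECRET, packed, hashlib.sha256).digest()[:TAG_LEN]
    if not hmac.compare_digest(tag, expected):
        return None
    return str(ip_address(packed))


if __name__ == "__main__":
    path = poison_path("203.0.113.7")
    print(path, "->", spider_from_path(path))
```

note the address is only base64-encoded, not encrypted, so anyone who decodes the token can read it; the hmac just stops people minting fake ones.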
there are so many! 133k addresses in my firewall now. starting to wonder if maybe ip blocklists are untenable and i need a blocked-by-default-policy with a request-access mechanism instead
i've got iocaine set to block only new connections when an ip requests a poison page. that keeps their ip from returning, but doesn't kick them off my server immediately
i tried adding an abort to the end of iocaine's handle_response block in caddy, (and rebooted)
i think what i'm seeing now is scrapers successfully getting kicked out at the first request, but their sockets now get stuck in fin-wait-1 state until they time out
sudo sysctl net.ipv4.tcp_orphan_retries=1 clears out those FIN-WAIT-1 sockets nice and quick
i wonder what the downsides are. if any. maybe those retries are useful on shittier connections but this is a vps in a datacenter not a mobile phone on a subway car
isps that support ipv6 typically allocate for each of their customers a /64
therefore,
if i'm going to block a scraper's ipv6 address at the firewall, probably safe to block their whole /64
right?
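if the /64-per-customer assumption holds (it won't everywhere; some providers hand out bigger or smaller allocations), widening an offending ipv6 address to its /64 is a one-liner with python's `ipaddress` module. `block_target` is a hypothetical helper that leaves ipv4 addresses alone.

```python
from ipaddress import ip_address, ip_network


def block_target(ip: str) -> str:
    """Return what to put in the blocklist for a scraper address.

    IPv6 addresses are widened to their enclosing /64 (assumed to be one
    customer); IPv4 addresses are returned unchanged.
    """
    addr = ip_address(ip)
    if addr.version == 6:
        return str(ip_network(f"{addr}/64", strict=False))
    return str(addr)


if __name__ == "__main__":
    print(block_target("2001:db8:1:2:aaaa:bbbb:cccc:dddd"))
    print(block_target("192.0.2.9"))
```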
hell yeah, iocaine with firewalling
usually ai llm scraper traffic starts and stops abruptly; this gradual fall-off conforms to my expectation of what should happen over time as each ip address used to scrape gets blocked
also nice: cpu overhead is about 1% with 120k blocks so far. memory requirement is tiny (proportional to poison training corpus)
a nice quiet environment is the perfect place to host some prose and programs, and this motivates me to do so again!
accomplishment: got most of my web thingies, still protected by iocaine, back under haproxy instead of caddy, like i wanted
scraper bot traffic to my sites remains nice and low. i have only 130k addresses blocked which isn't the high water mark so i think they must also have backed off of my ip a bit
they'll probably be back once i start updating the sites and reopen the poison maze to the big tech corpo network crawlers
ideas cont'd:
blocking millions of ip addresses used by ai scrapers feels good, but how sure am i that it is effective? maybe the scrapers make each request with the least-recently-used stolen ip that they have access to and my blocklist won't be effective until they loop around
to examine this maybe i can accept everything for 30 seconds to examine the spike, if any, in bots served during that time. don't worry, blocked or not, they receive only markov trash
idea implemented: stuck a counter on iocaine's firewall rule banning ips that request poisoned urls, and graphed it
now i can quantify how effective firewalling is:
at the moment, with less than a million ips blocked, it is shielding me from about 75% of the wave of bots (the rest get served poison and added to the block list)
and that doesn't even include attempts from corpo networks to run their crawlers! i should graph that too
love to see it
uh oh. what if my graph of successfully blocked requests is just bots that keep trying at the moment iocaine blocks them?
then, i'm not actually avoiding 75% of bot traffic and iocaine's firewall feature isn't doing much
well, i can test this: copy the blocked ips to another set that takes effect before the one that iocaine is actively updating
if iocaine's firewall is working as intended, i should see the number that reach my graph decrease, at least a little:
occasionally i pull the caddy request logs down to my local machine for analysis
wow, the past 7 days' worth is up to 3 gigabytes. of just logs
which seems like a lot but i get nowhere near the amount of traffic other sites do
ai bros currently sustaining ~120 requests per second to my site for the past 12 hours. effortlessly dropping ~100 at the firewall, serving markov-generated nonsense to the remaining 20. per second
fuck ai, what a stupid waste
i sort of eyeballed previously that iocaine's firewall was letting me avoid 75% of bot attempts to scrape my site. so for every bot i served trash to, there were about 3 more connection attempts blocked at the firewall from known-bad ip addresses, during each unit of time
hmm maybe i can graph that
🤩 yesss i love it, now blocking 16:1, 94%
tweaked my graph to calculate a nice percentage instead
this graph shows the result of iocaine firewalling off ip addresses that are used to scrape my site: about 55% to 95% (as you can now see in the chart) of scraping attempts are callously, effortlessly, ignored at the firewall. and the rest get markov trash!
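the ratio-to-percentage arithmetic behind those numbers, as a tiny python sketch: if `dropped` attempts die at the firewall for every `served` poison reply, the blocked share is dropped / (dropped + served). `blocked_percent` is just an illustrative name; the actual graph is computed from the collectd/rrdtool data.

```python
def blocked_percent(dropped: float, served: float) -> float:
    """Percentage of scraping attempts stopped at the firewall."""
    total = dropped + served
    if total == 0:
        return 0.0
    return 100.0 * dropped / total


if __name__ == "__main__":
    print(blocked_percent(3, 1))           # the earlier 3:1 eyeball
    print(round(blocked_percent(16, 1)))   # the 16:1 reading
    print(round(blocked_percent(100, 20))) # ~100 dropped vs ~20 served per second
```

so 3:1 works out to 75%, and 16:1 to roughly 94%, matching the figures in the posts.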
hell yeah iocaine
now that fortifications are deployed it's time to build fantastic stuff safely behind the walls
@pho4cexa You might want to try the current head of the iocaine-3.x branch, that has a fix that addresses the FIN WAIT state: adding a ct state vmap { established : accept, related : accept, invalid : drop } rule to the start of the blocking chain.
(You can add something similar manually too, but iocaine will drop & recreate the chain on restart)
@algernon running iocaine compiled from there (hash 7bb5447) is having trouble starting for me
when I run:
sudo nft destroy table inet iocaine
sudo iocaine -c /opt/iocaine/etc/iocaine
it successfully builds its table, chain filter with rules, and four sets but then quits with:
[...] failed to initialize firewall options="VaccineSpecs { [...] }" error="nftables already initialized"
Error: : init script not found, at [...]/means_of_production/mod.rs:291:24
@pho4cexa Hrm. Interesting! Are you using the built in script, or NSoE?
I suspect it is a race condition, but not sure yet.
enable. everything else is defaults
@pho4cexa Mmmmh. Do you have multiple http-handlers that use a script with firewall enabled?
If so, I can reproduce, and a fix will be coming shortly.
@pho4cexa Yep, that's it then. If you're comfortable with patching:
diff --git a/iocaine-powder/src/vaccine/linux.rs b/iocaine-powder/src/vaccine/linux.rs
index a8a9792..03a9388 100644
--- a/iocaine-powder/src/vaccine/linux.rs
+++ b/iocaine-powder/src/vaccine/linux.rs
@@ -44,10 +44,6 @@ static BLOCK_METRICS: LazyLock<IntCounterVec> = LazyLock::new(|| {
 impl Vaccine {
     fn init_nftables(options: &VaccineSpecs) -> Result<()> {
-        if TABLE_NAME.get().is_some() {
-            return Err(VibeCodedError::message("nftables already initialized").into());
-        }
-
         let mut nft = Nftables::new();
         command(&mut nft, format!("add table inet {}", options.table_name))?;
@@ -136,6 +132,10 @@ impl Vaccine {
     }
     pub fn init(options: &VaccineSpecs) -> Result<()> {
+        if TABLE_NAME.get().is_some() {
+            return Ok(())
+        }
+
         Self::init_nftables(options)?;
         let (queue_tx, mut queue_rx) = mpsc::unbounded_channel::<IpAddr>();
This addresses the issue. The fix I will commit will likely end up being a bit different (it will have some sanity checks instead of blindly returning if already initialized), but it should unblock you until I get around to fixing it in git.
@pho4cexa Sorry about that! I dropped a critical if FIREWALL_BLOCK_RULE_HITS.matches(ruleset) during a rebase :(
The issue should be fixed on current iocaine-3.x - which also changed a bit how the firewall is enabled: it's no longer part of the handler configuration, but a top-level firewall node in the KDL config:
firewall {
    enable
}