(maniacal cackling)

i have finally got iocaine installed. wasn't even hard, just needed to sit down and do the steps and brain is real good at not that sometimes

hooked it up to the apt-installable anarchism faq for its markov corpus and the biggest canadian flavored apt-installable wordlist i could get

feels good. like the invulnerability you get from your favorite winter gloves and jacket before going out to play in the blizzard

now it's safe to blog again 🎉

i um only just now noticed that the apt-installable anarchism faq, in uncompressed markdown format, which i fed to iocaine for its markov corpus,

is twelve megabytes. of text.

almost 1.9 million words.

iocaine seems to be doing just fine so far

accidentally set caddy to syslog every request sent to iocaine 3 and oh gosh my website is pumping so much poison markov trash into chatgpt and claude rn 😍 💕

and it's using less cpu and memory than systemd-journald to do so

might need to look into setting bandwidth limiters on this thing

i'm still casting around for anti-cloud(flare) mechanisms of regional failover. like if the cable to the datacenter i use gets cut, or there's political upheaval, how to automatically shunt traffic to a different datacenter faster than a dns update would propagate through caches

i'm vaguely aware of this technology called anycast but i don't know much

https://grebedoc.dev/ uses https://rage4.com/ to do it

https://en.wikipedia.org/wiki/Anycast

yeah eat it, ai scraper assholes

(gradually improving my monitoring, iocaine stats newly added to my collectd/rrdtool dashboard)

tiddlywiki doesn't come with a basic to-do feature, to make checkboxes and tick them off without having to tediously edit the page and type some [x]s

but it does have a plugin mechanism. found two plugins (both by the same author) that do checklists: Kara and Todolist

installation instructions made me nervous though, since i'm using tiddlyPWA, which is rather different on the backend...

Kara 0.9.7

In tiddler plain checklist and interstitial journaling plugin

turns out though, Todolist is super easy to install on a self-hosted TiddlyPWA! drag and drop, click the button to reupload the wiki file to the server, and done.

entirely through the browser, way easier than messing around with directories. woah tiddlypwa is easier than stock tiddlywiki!?

Todolist Plugin 1.5.0

Organize, prioritize, and plan your work

i haven't put any rate limiters on here yet (i definitely will), but seems like claude and chatgpt limit themselves to 25 requests per second to my websites. i wonder how they picked that number, and if they'll ramp it up. and if i ratelimit, will they send more requests from other ip addresses. etc.

feels so good to know these assholes' language models are chugging down low-effort ungrammatical poison after ignoring my robots.txt

should i do traffic shaping using tc, haproxy, or shove yet another plugin into caddy?

should i slow the response down to a trickle for all the llm scrapers, or randomly drop their connections? 😈

despite it being part of linux since version 2.2, which is about as long as i've been daily-driving it, i hadn't heard of tc until this past month. that's "traffic control," the tool for configuring the kernel's network traffic limiting, smoothing, and prioritization (its queueing disciplines)

and for a command with such a tiny name wow it's a lot

i only want to restrict the bandwidth of one process so i think i'll look for easier mechanisms before i attempt to swallow this whole burrito
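
(for a taste, here's the classic token-bucket example, adapted from the tc man pages. it shapes a whole network interface, not a single process, which is part of why i kept looking; interface name and numbers are just illustrative:)

# throttle all egress on eth0 to 56kbit using a token bucket filter
tc qdisc add dev eth0 root tbf rate 56kbit burst 16kbit latency 400ms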

til: trickle, a lightweight userspace bandwidth shaper

could i just wrap iocaine with this and be done?

... except trickle doesn't work on statically linked executables, like iocaine (it works by LD_PRELOADing a shim library, and static binaries never load it). womp womp

i guess i could do a trick like wrap socat with it, then talk to iocaine through that,

but that feels more complicated than just switching back to haproxy and using its builtin traffic shaping features
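
(if i did go the socat route, i imagine it'd look roughly like this; the port numbers are invented, and 7 KB/s is about 56kbps:)

# trickle LD_PRELOADs into dynamically-linked socat,
# which then relays to iocaine's real port
trickle -s -u 7 socat TCP-LISTEN:8080,fork,reuseaddr TCP:127.0.0.1:42069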

https://github.com/mariusae/trickle (trickle is a userland bandwidth shaper for Unix-like systems)

what bits of haproxy, lighttpd, nginx, caddy, static-web-server should i string together?

requirements:

  • iocaine can plug in somewhere
  • can control the bandwidth of iocaine's garbage generator
  • static web server
    • for multiple domains ("virtual hosts")
    • uses sendfile() for speed
    • precompressed files trick
  • haproxy: has traffic shaping, proxy, fastcgi. no static file server of its own, so i'd have to proxy to one. which?
    • lighttpd: small, sendfile(), correct webdav. can do reverse proxy itself, makes haproxy redundant? can i plug in iocaine?
    • nginx: i quit it because it was segfaulting when i tried to configure too many features. but if i'm only using it for static files maybe it's ok
    • static-web-server: i like rust. don't like a copy paste chunk of toml per configured domain
  • caddy: proxy, fastcgi, builtin static fileserver, traffic shaping requires a module that doesn't do quite what i want. getting tired of guessing my way around caddyfile syntax. don't need its magic certificate management.
    • use 'tc' for traffic shaping? big learning curve
    • front it with haproxy? lots of redundant features, feels heavy

dang i gotta draw up a feature matrix or something

it's pretty weird that it took me this long to actually do but

tonight i have set up for the first time a program running on a computer inside my home, that people may access like a normal website, without learning my cable modem's ip address in the process, and if someone starts ddosing me i can just unplug and let the household continue watching videos unaware

(i'm having my @colocataires vps proxy traffic through a tailscale vpn to my closet fileserver)
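
(the vps side is just an ordinary reverse proxy aimed at the fileserver's tailnet address. a minimal caddy sketch, with an invented hostname and a made-up tailscale ip:)

files.example.org {
    reverse_proxy 100.101.102.103:8080
}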

safe(r) home-hosting by reverse proxy from a little computer in a datacenter is one of those things that seems like complex esoteric engineering from afar

but once you've experienced it, and then again when you've set it up yourself, all of a sudden it makes sense and is totally normal and a whole mess of possibilities for what you can cheaply and casually build on the internet blasts wide open

like the first time you experience nerd astral projection

llm scrapers ignoring my robots.txt and pounding on my small website 28 times per second, 24/7. 600kbps of my available bandwidth wasted just on markov trash

it's easy to imagine how they'll ddos any service that does a bit of compute on each request

it's not super exciting but if you're the kind of weirdo who wants to look at my vm's gauges, they are viewable here:

https://telemetry.orbital.rodeo/

i have been cobbling it together using collectd, rrdtool, and scripts instead of the far more reasonable and popular prometheus / grafana combo. because it might be more lightweight? haven't measured

for now it updates only when i run the command, so don't sit there wondering

no light mode or explanatory text (yet) soz

i learned how to make haproxy throttle iocaine's output so the scrapers continue to download delicious poison but now only at 56kbps (down from 600kbps) 🎉
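
(for the curious: the haproxy feature is its bandwidth limitation filter, added iirc around 2.7. a minimal sketch with invented names and port, not my exact config:)

backend iocaine
    # ~7KB/s = 56kbps, applied per response stream
    filter bwlim-out poison_drip default-limit 7k default-period 1s
    http-response set-bandwidth-limit poison_drip
    server garbage 127.0.0.1:42069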

oof, it's somewhat heavy though. went from about 3% avg cpu use to about 6%

the throttling that haproxy does just gets buffered up by caddy in front of it and the result is a long initial delay before a fast transmission of data. like latency

which could probably be implemented more simply with a sleep statement somewhere

i wonder what other strategies i can use to slow down crawlers. thinking random connection drops or http errors 429, 402, and 451

somewhere on my to-do list: look into how yunohost compares to just installing various bits of software on your vps. is it heavier, easier, less customizable, can you put other stuff alongside it, etc.

problem: once detected, how best to slap back at ai scrapers? return poison quickly? tarpit? throttled poison drip? drop their ip's packets at the firewall?

idea: drop packets during business hours to free up bandwidth for legit visitors; fast poison otherwise to collect ip addresses for next day's ip ban. "party all night sleep all day" strategy

lots of bot traffic hitting port 80 (http) on my vm just to get redirected to port 443 (https) where they get a "go away, bot" error

who am i keeping port 80 open for?

who types in "orbital.rodeo," lands on http, and doesn't know to or can't try https instead?

many people use hsts and abandon 80

caddy auto-magically puts a redirect on 80 for my sites but i'm increasingly annoyed by its magic. wanna go back to haproxy

think i'll shut it

https://en.wikipedia.org/wiki/HTTP_Strict_Transport_Security

oh, whoopsie

if an ipv4->ipv6 proxy is telling my webserver the clients' ipv4 addresses via the proxy protocol, blocking those ipv4 addresses at the webserver firewall isn't going to do much, because the firewall only ever sees the proxy's own address 🤦

i have a v4 address now, i was just lazy about reconfiguring dns to send v4 web traffic direct to my vm instead of through the v4-v6 proxy. time to get on that

oh that's more like it
after blocking ai scrapers at the firewall, cut my cpu use down to 1% and bot traffic to almost nothing
you love to see it!

🤔 i wonder how many innocents i'll accidentally shut out if i adopt a policy of, "any /24 prefix with 3 or more scrapers within it dooms the lot"?

🤔 i could set up a "pls let me back in" automation. tell me my biceps are eleven out of ten in this web-form and you get added to an inclusion list that takes effect before the block list

i could implement both of those defense mechanisms

reduce bookkeeping on my part by being a bit overeager about blocking whole prefixes instead of individual ip addresses

definitely want to do something like @alex's butlerian jihad where i block all networks from any ASN abusing my sites

but also, have a cooldown that sends traffic from blocked prefixes to a "let me back in" form that allowlists individual addresses
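
(in nftables terms the ordering part is easy to state. a sketch with invented set names, not my actual ruleset:)

table inet filter {
    set letmein { type ipv4_addr; }                     # form graduates
    set miscreants { type ipv4_addr; flags interval; }  # blocked prefixes
    chain input {
        type filter hook input priority filter; policy accept;
        ip saddr @letmein accept    # inclusion list checked first
        ip saddr @miscreants drop
    }
}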

oh cool, while i wasn't paying attention anubis has grown dataset poisoning features like what iocaine does and a (paid) collaborative reputation database mechanism

haha oops i accidentally banned my own ip. fixed it but guessing i'll have to flush the ban lists and rebuild in case i caught any more i shouldn't have

one super nice thing i'm doing this time around is using a wireguard-based vpn for all my ssh'ing. so even when i blocked my own ip address my ssh session was unaffected and i could fix it. and zero log spam from vulnerability scanners constantly trying the door 😌

i want to block any requests from google and facebook; also i want to block any isp who would tolerate scrapers

the database of ip range ("prefix") assignments is downloadable but it's big. 590 entries just for as32934 (facebook). too big to just dump into the firewall

but often there's no gap at all between consecutive prefixes belonging to the same asn. maybe i could merge those into a single range, which would let me express the set of ranges to block more concisely 🤔

ooo python's builtin ipaddress library has collapse_addresses and address_exclude functions, and pyasn uses those. if i study those functions i think i should be able to come up with a "collapse_addresses" variant that absorbs unallocated gaps between allocated subnets for a more concise specification

https://github.com/python/cpython/blob/3.14/Lib/ipaddress.py#L304
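
(a rough cut at that variant, leaning on collapse_addresses plus summarize_address_range, a sibling helper in the same module. the max_gap knob and the function name are mine, and this is untested sketch-ware:)

import ipaddress

def collapse_with_gaps(nets, max_gap=256):
    # like ipaddress.collapse_addresses, but also absorbs unallocated
    # gaps of up to max_gap addresses between consecutive networks
    spans = []
    for n in sorted(nets, key=lambda n: int(n.network_address)):
        lo, hi = int(n.network_address), int(n.broadcast_address)
        if spans and lo - spans[-1][1] - 1 <= max_gap:
            spans[-1][1] = max(spans[-1][1], hi)  # absorb the gap
        else:
            spans.append([lo, hi])
    out = []
    for lo, hi in spans:
        out.extend(ipaddress.summarize_address_range(
            ipaddress.IPv4Address(lo), ipaddress.IPv4Address(hi)))
    return list(ipaddress.collapse_addresses(out))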

haha while searching ASNs for "AWS" to block i learned of the existence of AS214513 EEPYPAWS and AS401962 CUDDLE-PAWS

i wonder what other fun autonomous system names are out there, and what they're doing

when it often feels like the internet is like six giant websites consuming everything, it's great to feel lost in a massive database of tiny organizations doing a niche highly technical thing like registering an autonomous system for internet shenanigans

Q.: is communicating to an xmpp server's "direct tls" aka "xmpps" port the same (in terms of protocol) as communicating through a tls tunnel / reverse proxy to its plaintext xmpp port?

so, iiuc, i shouldn't be blocking crawl bots from google, facebook, etc. network ranges, because then i can't feed them poison urls, whereupon i won't be able to identify the more carefully-disguised requests from residential botnets masquerading as browsers

but i do want to very much limit the bandwidth they may consume

so instead of

ip saddr @miscreants drop

let's try

ip saddr @miscreants limit rate over 1/second drop
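
(one subtlety: written like that, the 1/second budget is shared across the entire set, all miscreants collectively. a per-source-address version needs a stateful meter, something like this sketch:)

# each source address gets its own 1/second budget
ip saddr @miscreants meter poison_meter { ip saddr limit rate over 1/second } drop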

update: rate limiting packets at the firewall from networks controlled by my biggest bot offenders (facebook, microsoft, google, apple) has accomplished exactly what i wanted: i continue logging and feeding them poison, but their bandwidth is greatly reduced

my implementation might be causing their connections to close mid-request but i don't mind very much. lots of sockets (50) in SYN-RECV state compared to ESTAB (3) rn which could eventually be a problem

oh wait whoops. lots of sockets in SYN-RECV wasn't the fault of my inexpert ratelimiting. an asn outside my "big tech" filtered set was sending me SYN packets and not following up with--

oh my gosh, was i being syn-flooded? was someone angrily trying to deny service to the maybe 3 legit people that want to see my website??

anyway i added them to the limiter so now they're holding ~3 sockets in SYN-RECV state instead of ~40

graph: sockets in SYN-RECV state. if i understand correctly, this occurs when somebody says "hey let me connect" (SYN) and my server says "ok you can connect" (SYN-ACK) and then they just never reply with the final ACK

eventually it stops waiting but until then the socket is in use. so if some miscreant fires off a ton of "hey let me connect" without replying they can clog up the pipes

in previous experiments this line averaged either 40 or 0

love being able to just turn the firehose off

i've seen some example nftables rules that monitor how quickly any one ip address is opening new connections, and if it exceeds some threshold instantly blocks them. going to study that and get it working for mine too
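
(the pattern, as best i understand it, uses a dynamic set with a timeout. invented names and thresholds, not yet my running config:)

table inet filter {
    set synflooders {
        type ipv4_addr
        flags dynamic, timeout
        timeout 10m
    }
    chain input {
        type filter hook input priority filter; policy accept;
        # remember anyone opening connections faster than 20/second...
        tcp flags syn tcp dport { 80, 443 } add @synflooders { ip saddr limit rate over 20/second }
        # ...and drop their packets for the next 10 minutes
        ip saddr @synflooders drop
    }
}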

setting sysctl tcp_synack_retries=1 or even 2 down from the default of 5 seems to have significantly cut down on the sockets sitting around in SYN-RECV state.

5 means the server will spend about a full minute retransmitting the SYN-ACK to a client that sent it a SYN. 2 gives up after only a few seconds and closes a socket that doesn't complete the TCP handshake
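
(to make it stick across reboots, the conventional spot is a sysctl.d drop-in; the filename is arbitrary:)

# /etc/sysctl.d/90-synflood.conf
net.ipv4.tcp_synack_retries = 2

then sysctl --system applies it without a reboot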

i'm kind of playing on easy mode wrt the slopmachine scrapers: since i utterly despise microsoft, amazon, google, facebook, and (slightly less?) apple, easy step zero is to permanently block their corporate ip ranges at the firewall

their spiders were the majority of my bot traffic back in december and it's gone, no drawbacks

but if i had business brainworms such that i wanted to appear in google search results etc, i might have a harder time of it

short term project goal:

automate creation of incus containers in my vps to hold each new cursed project that lands on my list; make it easy to stand em up and knock em down on a whim

run one script, get subdomain, ipv6, debian container

no docker, no kubernetes

i think it's totally possible, maybe even easy, if only i were better versed in incus, networks, bridges, etc.
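
(a first sketch of that script; everything here is hypothetical: the image, the dns naming, what counts as "done":)

#!/bin/sh
# stand up one container per cursed project; $1 is the project name
set -eu
NAME="$1"
incus launch images:debian/12 "$NAME"
# fish out the container's ipv6 so i can wire up dns + the reverse proxy
ADDR=$(incus list "$NAME" -c 6 -f csv | head -n1)
echo "next: publish $NAME.example.org AAAA $ADDR"

(and incus delete -f "$NAME" to knock it back down)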

trying to calibrate my sense of how many requests/second is "a lot"

  • 28M rps http: the most optimized benchmark on a huge server serving "hello world"
  • 140k rps http: the fastest i could make my axum/rust server thing go on my desktop
  • 89k rps http: same setup as above, for the multiple-query workload
  • 20k rps https: enough to dos a 16core / 32gb vps
  • 8k rps https: enough to make a small vps stop recording metrics
  • 440hz: concert pitch A (for a sense of scale)
TechEmpower web framework benchmarks (source of the 28M figure): https://www.techempower.com/benchmarks/

@pho4cexa

ooooh, I should add incus to the fork-drawer