Mastodawn

a very weeny construct 💀Dec 23

haha oops i accidentally banned my own ip. fixed it but guessing i'll have to flush the ban lists and rebuild in case i caught any more i shouldn't have

one super nice thing i'm doing this time around is using a wireguard-based vpn for all my ssh'ing. so even when i blocked my own ip address my ssh session was unaffected and i could fix it. and zero log spam from vulnerability scanners constantly trying the door 😌

a very weeny construct 💀Dec 23

i want to block any requests from google and facebook; also i want to block any isp who would tolerate scrapers

the database of ip range ("prefix") assignments is downloadable but it's big. 590 entries just for as32934 (facebook). too big to just dump into the firewall

but there's often nothing between multiple records for any given asn. maybe i could treat that as a single range, which would let me express the set of ranges to block more concisely 🤔
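a minimal sketch of the "treat neighbours as one range" idea, using python's stdlib `ipaddress` module. the prefixes are documentation ranges, not real as32934 data, and `collapse_prefixes` is a made-up helper name. note that this only merges prefixes that are exactly adjacent or overlapping; papering over unassigned gaps between records would be a separate, riskier step.

```python
import ipaddress

def collapse_prefixes(prefixes):
    """Merge adjacent or overlapping CIDR prefixes into the fewest equivalent ones."""
    nets = [ipaddress.ip_network(p) for p in prefixes]
    # collapse_addresses refuses mixed IPv4/IPv6 input, so handle each family separately
    v4 = [n for n in nets if n.version == 4]
    v6 = [n for n in nets if n.version == 6]
    merged = list(ipaddress.collapse_addresses(v4)) + list(ipaddress.collapse_addresses(v6))
    return [str(n) for n in merged]

# hypothetical input: two adjacent /25s and one unrelated /24
example = ["198.51.100.0/25", "198.51.100.128/25", "203.0.113.0/24"]
collapsed = collapse_prefixes(example)
print(collapsed)
```

the two /25s come out as a single /24, so three records become two.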

a very weeny construct 💀Jan 7

"too big to just dump into the firewall"

whoopsie, that was a wrong assumption on my part based on a bad time i had with way too many iptables firewall rules created by fail2ban many years ago

these days i'm using nftables and its set structure to hold ip addresses. interval sets live in a tree structure in the kernel (a red-black tree, iiuc, rather than a radix trie like the routing tables use), and you can dump addresses in there all day long. with auto-merge turned on it will manage merging them into ranges, and it can auto-expire them if you want. works great
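a rough sketch of feeding addresses to such a set from a script: it only builds the text of an `add element` command that could be passed to `nft -f`. the table name `filter`, the set name `blocklist`, and the helper name are all assumptions, and the per-element `timeout` only works if the set was declared with the timeout flag.

```python
import ipaddress

def render_add_element(table, set_name, addresses, timeout=None):
    """Build one nft 'add element' command for a batch of addresses."""
    items = []
    for a in addresses:
        addr = ipaddress.ip_address(a)  # raises ValueError on anything that isn't an ip
        items.append(f"{addr} timeout {timeout}" if timeout else str(addr))
    joined = ", ".join(items)
    return f"add element inet {table} {set_name} {{ {joined} }}"

# hypothetical table/set names and documentation addresses
command = render_add_element("filter", "blocklist", ["192.0.2.1", "192.0.2.2"], timeout="1d")
print(command)
```

validating every address before it goes into the command string also keeps log-derived junk from turning into firewall syntax.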

a very weeny construct 💀Jan 27

so upthread i was surprised that you can just shovel truckloads of ip addresses into nftables' "set" structure, for blockin' purposes

but i want to do stuff like detect if several addresses within some autonomous system's range are coordinating for shenanigans, and block the whole damn asn

this example, on the nftables wiki itself, loads a whole ass maxmind geoip db into nftables' "map" structure and my first reaction was "surely not"

https://wiki.nftables.org/wiki-nftables/index.php/GeoIP_matching


a very weeny construct 💀Jan 27

i mean are there limits? how many rules and addresses can i dump into nftables tables, chains, maps and sets (which, iiuc, all live in the kernel) before it crashes

a very weeny construct 💀Jan 27

anyway it's goblin week, or it was recently? so maybe imma try implementing automatic, immediate, asn matching and blocking in nftables rules 😈

a very weeny construct 💀Mar 18

woah cool i just learned about the nftables feature concatenations

i'm already 🤩 about nftables' very fast sets and maps but today i learned that you can store essentially tuples of data in them

which in some cases can let you test multiple conditions at once, replacing multiple rules with a fast set-membership check

Concatenations - nftables wiki
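a plain-python analogy for what a concatenation buys you (not nftables code): a set of tuples stands in for a set keyed on something like `ip saddr . tcp dport`. the addresses, ports, and names here are made up for illustration.

```python
# each tuple is one (source address, destination port) pair that is allowed
ALLOWED = {
    ("192.0.2.10", 22),
    ("192.0.2.10", 443),
    ("198.51.100.7", 443),
}

def is_allowed(addr: str, port: int) -> bool:
    # one membership test replaces a chain of
    # "if addr == ... and port == ..." style rules
    return (addr, port) in ALLOWED

print(is_allowed("192.0.2.10", 22))    # pair is in the set
print(is_allowed("198.51.100.7", 22))  # address is known, but not with this port
```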

about 1.5 days after asking iocaine to not just poison but also block ai scrapers masquerading as browsers, i have about 36000 ip addresses blocked at the firewall

this is for a site that is not advertised anywhere, disliked by search engines, and contains maybe 10 blog posts that rarely change. AND which preemptively blocks several whole gafam corporate ASNs so not even counting them

so i expect more popular sites are seeing many multiples of this traffic

anyway, thinking again about how to analyze this ever growing set of blocked ai scraper addresses, most of which are probably "residential ips."

calculate for each asn the percentage of its ip range that i've blocked, and above a certain threshold block the whole range? (that would be more efficient than recording every single bad address)
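the threshold idea could be sketched like this. everything named here is an assumption: the asn labels come from the documentation range, the prefixes are tiny example ranges, and the 20% cutoff is arbitrary. it also assumes an asn's prefixes don't overlap each other, and the linear scan would want a smarter lookup at tens of thousands of addresses.

```python
import ipaddress
from collections import Counter

def blocked_fraction_by_asn(asn_prefixes, blocked_ips):
    """Return {asn: fraction of that asn's address space that is blocked}."""
    nets = {asn: [ipaddress.ip_network(p) for p in ps] for asn, ps in asn_prefixes.items()}
    hits = Counter()
    for ip in set(blocked_ips):  # de-duplicate so repeats don't inflate the count
        addr = ipaddress.ip_address(ip)
        for asn, ns in nets.items():
            if any(addr in n for n in ns):
                hits[asn] += 1
                break
    return {asn: hits[asn] / sum(n.num_addresses for n in ns) for asn, ns in nets.items()}

# hypothetical data
asn_prefixes = {"AS64500": ["192.0.2.0/28"], "AS64501": ["198.51.100.0/24"]}
blocked = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4", "198.51.100.9"]

fractions = blocked_fraction_by_asn(asn_prefixes, blocked)
THRESHOLD = 0.20  # arbitrary cutoff for "block the whole asn"
to_block = [asn for asn, f in fractions.items() if f >= THRESHOLD]
print(fractions, to_block)
```

here 4 of 16 addresses (25%) in the first asn are blocked, so it crosses the cutoff; the second asn has 1 of 256 blocked and stays on per-address blocking.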

ideas contd.:

have an unblocked subdomain where a legit user of a blocked ip might fill out a form and click a "let me back in" button to get onto an allow-list

double extra forever-ban anybody that uses the "get me back in" button then starts snarfing down poison again

also, at some point, i want to bring iocaine to work. i'm on easy mode now because idgaf about my site's visibility to search engines

but what to do when boss requests that when customers ask their ai bullshit to order from our website on their behalf, maybe i shouldn't reply with an HTTP redirect into the fucking sun with gigabytes of foul invective zip bomb for the response body

ideas contd.:

live-updated status page listing all the ip addresses i've blocked, in nice formats for easy import into firewalls, tools for consuming and contributing to said databases

serious looking landing page for blocked addresses "your ip address is sending malicious traffic to this domain and has been reported. check for compromise immediately."

live-updated ASN leaderboard naming and shaming those with the most ip addresses used by ai scrapers

a very weeny construct 💀Mar 28

ideas contd.:

undo my mild mitigation against syn flood, crank the synack retries back up, and collect the ip addresses guilty of doing it. for blocking

✅ make caddy 'abort' the connection after one ioproxy poison reply, which closes the socket and blocks ip addresses faster

a very weeny construct 💀Mar 28

ideas contd.:

i'm extremely doubtful that most isps will give any shits at all about complaints that llm bots are using their network to destroy websites

i was thinking upthread about an error message to show to legit users of residential ips who get blocked from services; showing them a scolding message like "your ip has been sending malicious traffic"

but maybe more effective will be to direct them, with contact info, to their own isp's abuse line

a very weeny construct 💀Mar 29

ideas cont'd.:

poison url generator that encodes the spider's address, so when the headless browsers on residential ips begin scraping them we know which big tech cos are buying access to residential ip address proxies to disguise themselves
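one way that generator might look, as a sketch only: pack the spider's address plus a short hmac tag into a url-safe token, so a later request for that url reveals which crawler was originally handed it. `SECRET`, the function names, and the token layout are all made up here; nothing in this thread says iocaine works this way.

```python
import base64
import hashlib
import hmac
import ipaddress

SECRET = b"change-me"  # hypothetical server-side key

def encode_spider(ip):
    """Turn a crawler's address into an opaque, tamper-evident url token."""
    raw = ipaddress.ip_address(ip).packed  # 4 bytes for ipv4, 16 for ipv6
    tag = hmac.new(SECRET, raw, hashlib.sha256).digest()[:8]
    return base64.urlsafe_b64encode(raw + tag).rstrip(b"=").decode()

def decode_spider(token):
    """Recover the address from a token, or None if it was forged or mangled."""
    try:
        data = base64.urlsafe_b64decode(token + "=" * (-len(token) % 4))
    except ValueError:
        return None
    raw, tag = data[:-8], data[-8:]
    expected = hmac.new(SECRET, raw, hashlib.sha256).digest()[:8]
    if len(raw) not in (4, 16) or not hmac.compare_digest(tag, expected):
        return None
    return str(ipaddress.ip_address(raw))

token = encode_spider("192.0.2.44")
print(token, decode_spider(token))
```

the hmac tag is there so a scraper can't forge tokens that frame somebody else's address.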

there are so many! 133k addresses in my firewall now. starting to wonder if maybe ip blocklists are untenable and i need a blocked-by-default-policy with a request-access mechanism instead

a very weeny construct 💀Mar 29

i've got iocaine set to block only new connections when an ip requests a poison page. that keeps their ip from returning, but doesn't kick them off my server immediately

i tried adding an abort to the end of iocaine's handle_response block in caddy, (and rebooted)

i think what i'm seeing now is scrapers successfully getting kicked out at the first request, but their sockets now get stuck in fin-wait-1 state until they time out

a very weeny construct 💀Mar 29

sudo sysctl net.ipv4.tcp_orphan_retries=1 clears out those FIN-WAIT-1 sockets nice and quick 😎

i wonder what the downsides are. if any. maybe those retries are useful on shittier connections but this is a vps in a datacenter not a mobile phone on a subway car

a very weeny construct 💀Mar 31

isps that support ipv6 typically allocate for each of their customers a /64

therefore,

if i'm going to block a scraper's ipv6 address at the firewall, probably safe to block their whole /64

right?
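a small sketch of widening an ipv6 address to its /64 before blocking, using the stdlib `ipaddress` module. `block_target` is a made-up name, and the /64 assumption is exactly that: some providers hand out bigger or smaller allocations, so this could over- or under-block.

```python
import ipaddress

def block_target(ip):
    """Return what to put in the blocklist: the /64 for ipv6, the address itself for ipv4."""
    addr = ipaddress.ip_address(ip)
    if addr.version == 6:
        # strict=False masks off the host bits instead of raising an error
        return str(ipaddress.ip_network((addr, 64), strict=False))
    return str(addr)

print(block_target("2001:db8:1234:5678:abcd:ef01:2345:6789"))
print(block_target("192.0.2.44"))
```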

a very weeny construct 💀Mar 31

hell yeah, iocaine with firewalling 💥

usually ai llm scraper traffic starts and stops abruptly; this gradual fall-off conforms to my expectation of what should happen over time as each ip address used to scrape gets blocked

also nice: cpu overhead is about 1% with 120k blocks so far. memory requirement is tiny (proportional to poison training corpus)

a nice quiet environment is the perfect place to host some prose and programs, and this motivates me to do so again!

a very weeny construct 💀Apr 1

accomplishment: got most of my web thingies, still protected by iocaine, back under haproxy instead of caddy, like i wanted

scraper bot traffic to my sites remains nice and low. i have only 130k addresses blocked which isn't the high water mark so i think they must also have backed off of my ip a bit

they'll probably be back once i start updating the sites and reopen the poison maze to the big tech corpo network crawlers

a very weeny construct 💀Apr 7

ideas cont'd:

blocking millions of ip addresses used by ai scrapers feels good, but how sure am i that it is effective? maybe the scrapers make each request with the least-recently-used stolen ip that they have access to and my blocklist won't be effective until they loop around

to examine this maybe i can accept everything for 30 seconds to examine the spike, if any, in bots served during that time. don't worry, blocked or not, they receive only markov trash

a very weeny construct 💀Apr 8

idea implemented: stuck a counter on iocaine's firewall rule banning ips that request poisoned urls, and graphed it

now i can quantify how effective firewalling is:

at the moment, with less than a million ips blocked, it is shielding me from about 75% of the wave of bots (the rest get served poison and added to the block list)

and that doesn't even include attempts from corpo networks to run their crawlers! i should graph that too

love to see it

a very weeny construct 💀Apr 9

uh oh. what if my graph of successfully blocked requests is just bots that keep trying at the moment iocaine blocks them?

then, i'm not actually avoiding 75% of bot traffic and iocaine's firewall feature isn't doing much

well, i can test this: copy the blocked ips to another set that takes effect before the one that iocaine is actively updating

if iocaine's firewall is working as intended, i should see the number that reach my graph decrease, at least a little:

a very weeny construct 💀Apr 12

occasionally i pull the caddy request logs down to my local machine for analysis

wow, the past 7 days' worth is up to 3 gigabytes. of just logs

which seems like a lot but i get nowhere near the amount of traffic other sites do

a very weeny construct 💀Apr 12

ai bros currently sustaining ~120 requests per second to my site for the past 12 hours. effortlessly dropping ~100 at the firewall, serving markov-generated nonsense to the remaining 20. per second

fuck ai, what a stupid waste

a very weeny construct 💀Apr 14

i sort of eyeballed previously that iocaine's firewall was letting me avoid 75% of bot attempts to scrape my site. so for every bot i served trash to, there were about 3 more connection attempts blocked at the firewall from known-bad ip addresses, during each unit of time

hmm maybe i can graph that

🤩 yesss i love it, now blocking 16:1, 94%
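the ratio-to-percentage arithmetic, spelled out as a tiny sketch. the function name is made up; the inputs are the numbers quoted in this thread (3:1, 16:1, and the earlier ~100 dropped vs ~20 served per second).

```python
def firewall_effectiveness(dropped, served):
    """Percent of bot attempts dropped at the firewall rather than served poison."""
    total = dropped + served
    return 100.0 * dropped / total if total else 0.0

print(round(firewall_effectiveness(3, 1), 1))     # the eyeballed 3:1 ratio
print(round(firewall_effectiveness(16, 1), 1))    # the 16:1 ratio
print(round(firewall_effectiveness(100, 20), 1))  # ~100 dropped, ~20 served per second
```

so 3:1 is 75%, 16:1 is 16/17 or about 94%, and 100 dropped out of 120 is about 83%.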

a very weeny construct 💀Apr 14

tweaked my graph to calculate a nice percentage instead

this graph shows the result of iocaine firewalling off ip addresses that are used to scrape my site: about 55% to 95% (as you can now see in the chart) of scraping attempts are callously, effortlessly, ignored at the firewall. and the rest get markov trash!

hell yeah iocaine

now that fortifications are deployed it's time to build fantastic stuff safely behind the walls

a very weeny construct 💀Apr 30

occasional impulses to learn a little bit more about the popular llms so that i might craft more effective, damaging poison for them to consume

but then i think, no actually i don't want to let any of that shit into my brain

i'll just stick with markov nonsense for now and block millions of ip addresses

a very weeny construct 💀May 1

grumpy feeling of yet another wave of scrapers wasting 2%-4% of my available cpu so i can feed them poison

offset by the joy of seeing the firewall effectiveness meter increasing: a larger and larger percentage of the whole slopbot wave gets dropped before it even reaches the poisoning machine

cmon assholes show me every one of your fraudulently obtained residential ip addresses i'll make a blocklist half a billion entries long and share it with everyone idgaf

a very weeny construct 💀May 1

at some point maybe, normal people are going to have to adopt a security posture of internet traffic that is hostile by default

all the good stuff--art, feels, shitposts, messages--inaccessible until you demonstrate evidence of your anti-corporatehood, to be granted tentative permission to visit

because i can block half a billion or more ipv4 addresses but that won't scale to ipv6

time to build out client certs and weighted networks of trust

a very weeny construct 💀May 3

don't stop i'm almost there

you may recall that i and others have observed slop scrapers to use a two phase approach:

simple web-crawling scripts that announce themselves as bots in their user-agent string and run from within big-tech corporate networks hammer your site to map it out, following all the links

later, huge waves of what appear to be legitimate browsers hammer the urls that the crawlers above previously recorded, coming from random residential ip networks around the world

i'm protecting my site with iocaine. it detects the crawlers by user-agent string and serves them randomized text containing fake links that have an identifiable sigil in them

that's how we know when disguised browsers return that they're actually bots from the big tech companies

we serve the same crap to the returning bots and block their ip address

a while back, i got tired of the crawlers and blocked them too. it was fairly trivial to look up the entire network range for several of the big tech companies pounding on my websites with crawlers and block the entire range. and that was that, no more crawlers. just bots returning to the old generated poison urls

it's been a few months, any crawler that has tried got a "connection refused." maybe they've given up? so i set up a counter and graphed it

maybe after blocking the worst offenders of big tech corporate networks scraping my site for months, they've given up?

nope, chart says ai bros at their desks inside google or meta are still having their crawlers attempt to connect to my site 10 times per second. constantly since i started monitoring it 30 minutes ago

dang it feels good to have iocaine defending my site. glad to have a shield to build stuff under while tech bros continue to enshittify the internet atmosphere

technomancy May 20

@pho4cexa what's the charting setup you got look like?

https://telemetry.orbital.rodeo

@technomancy haha lol it's an absolute mess!

collectd
a couple messy but short scripts to let it collect data from nftables counters and prometheus format metrics from caddy and iocaine
rrdtool
a big horrible script to call rrdgraph to generate all the graphs and the html scaffold
static served by caddy

the goal was maximal efficiency on my old tiny vps but i haven't measured any competitors

planning on cleaning it up or exploring other options... someday
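not the actual scripts (those aren't shared yet), just a guess at what the nftables half of such a collector could look like: parse the json that `nft -j list counters` prints and pull out packet and byte totals. the sample text is hand-written to follow the shape described in the libnftables-json docs, the counter name `poison_drops` is hypothetical, and this only covers named counters, not anonymous ones attached directly to a rule.

```python
import json

def counters_from_nft_json(text):
    """Extract {counter name: {"packets": n, "bytes": n}} from nft's json output."""
    doc = json.loads(text)
    out = {}
    for item in doc.get("nftables", []):
        counter = item.get("counter")
        if counter:
            out[counter["name"]] = {"packets": counter["packets"], "bytes": counter["bytes"]}
    return out

# hand-written sample, not captured from a real system
SAMPLE = """
{"nftables": [
  {"metainfo": {"json_schema_version": 1}},
  {"counter": {"family": "inet", "name": "poison_drops", "table": "filter",
               "handle": 5, "packets": 1200, "bytes": 72000}}
]}
"""

parsed = counters_from_nft_json(SAMPLE)
print(parsed)
```

from there a collector would just emit the numbers in whatever format the graphing side expects.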


@technomancy i'll share the source code as soon as i get either my own forge set back up or decide to just go with codeberg