Reflections on running a self-hosted personal site in the year two thousand twenty-six
Yet again, a long drought between my last post and this one. Unusual for me in the reasonably recent past, now a common occurrence. I keep asking myself “why?” Why am I not posting as much when I loved it so much before?
Edit: Wow… I just realized my last post was almost a year ago.
While recently troubleshooting some bonkers bad bot traffic on an org website that I help maintain, I finally realized: It is getting a lot harder to maintain a self-hosted personal site. I’m seeing all of the same problems that org site is seeing, just at a slightly smaller scale and with less visibility since I haven’t had client-side analytics since late 2020.
Picture the scene.
You have a self-hosted personal site that has been part of your life for well over 10 years. Maybe it’s a tech blog, maybe it’s where you share the trials and tribulations of being a parent, maybe it’s where you share your DIY misadventures, maybe it’s where you share recipes. The important thing is that it’s your space in an ocean of digital not-your-spaces, and that is why you return to it.
But your bandwidth consumption has been creeping up. The onslaught of email spam, always annoying, has graduated from “firehose” to “endless tidal bore”. Your few ancient posts with comments enabled attract hundreds of OnlyFans-splattered comments a month. You receive at least one legitimate “increased attack rate” notification per day, sometimes multiple. But it’s hard to sift that legit notification email from the many malicious ones telling you to upgrade your nonexistent Norton Antivirus, that you just purchased a Microsoft license that you definitely didn’t buy, etc. Or from the many hundreds of emails you receive a day from not-likely-real “people” offering to guest post, asking for a link, offering a redesign quote, threatening that your SEO isn’t good enough. You have 2FA turned on for CMS login, so now you log in less often because of that small added friction. How are AI bots archiving the more personal details you like to share, like your birth story?
It’s not that fun anymore.
I have to make it fun again, get rid of the most annoying parts. I want to keep my site. I want to use it like I used to use it, and I want to like it like I used to like it. I want people to reach out via email and say hi, give me their thoughts. (Thank you to those of you that have, I’m sorry about the lack of reply. I’m digging out of the depths of the spam garbage dump that is my inbox and will reach sunlight someday!)
This post outlines a few baby steps I’m taking to reclaim my space. Please reach out if you have other suggestions.
I’m hoping to cover the org-facing bot proliferation approach in a separate post: how to differentiate “bad” bots from “good” bots (as much as that is or isn’t possible), and how to keep your analytics data truthful for decision-making purposes. When reading the steps below, keep in mind that the priorities for an org site are *very* different from those for a personal site.
***
Here’s a condensed list of the things I want to mitigate on my personal site:
- Personal posts being easily consumed by bots
- The ever-increasing influx of spam email coming in to my publicly-available email inbox
- Unnecessarily high bandwidth consumption
- Comment / pingback spam
I definitely don’t want to block all bots. Even though AI tools are likely contributing to all of the above since they make it a lot easier to automate certain tasks, I don’t want to make it impossible to find some of my less-personal posts through those tools.
Along those lines, this is my initial plan.
1. Route traffic through Cloudflare
It probably sounds like a major oversight that I haven’t used Cloudflare previously, and maybe it was, but it felt like overkill on what is a relatively simple setup. No longer.
I’ll basically follow Flywheel’s documentation in combination with Cloudflare’s docs, being careful to ensure that Wordfence still sees the real visitor IPs.
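As a rough illustration of what “seeing the real visitor IP” means once a proxy sits in front of the site: the origin server only sees Cloudflare's address as the connecting peer, and the visitor's address arrives in the `CF-Connecting-IP` header. The sketch below is not Wordfence's or Flywheel's code. It is a minimal Python model of the idea, assuming the header should only be trusted when the connecting peer really is Cloudflare. The function name and the short list of ranges are illustrative, and the current range list should be taken from Cloudflare rather than hardcoded.

```python
from ipaddress import ip_address, ip_network

# Illustrative subset of Cloudflare's published IPv4 ranges.
# Assumption: a real setup would load the current list from Cloudflare,
# since the ranges can change over time.
CLOUDFLARE_NETWORKS = [
    ip_network("173.245.48.0/20"),
    ip_network("103.21.244.0/22"),
    ip_network("104.16.0.0/13"),
]


def real_visitor_ip(remote_addr: str, headers: dict) -> str:
    """Return the visitor IP, trusting CF-Connecting-IP only from Cloudflare.

    `remote_addr` is the address of the connecting peer as seen by the origin.
    `headers` is a plain dict of request headers (hypothetical shape).
    """
    peer = ip_address(remote_addr)
    from_cloudflare = any(peer in net for net in CLOUDFLARE_NETWORKS)
    forwarded = headers.get("CF-Connecting-IP")
    if from_cloudflare and forwarded:
        return forwarded
    # Direct hits on the origin could spoof the header, so ignore it.
    return remote_addr
```

The design point is the trust check: if the header were honored for every request, anyone reaching the origin directly could claim any IP address they liked.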
2. Add WAF rules to block or challenge certain traffic
There are a handful of endpoints and paths that should be blocked or challenged with WAF rules to reduce spam and prevent resource waste.
I’m probably sticking to a free plan, so there are some limitations. One is that I can’t rely on cf.bot_management.score in rules, and another is that you only get five rules for free. Here’s what I’m thinking.
Block WordPress comment submissions and XML-RPC (after disabling Jetpack)
(lower(http.request.uri.path) eq "/wp-comments-post.php")
or
(lower(http.request.uri.path) eq "/xmlrpc.php")
I’ve already turned off comments and trackbacks / pingbacks in WordPress, but this blocks the traffic before it hits the site.
I need to make sure I disable and uninstall Jetpack first though, since it uses xmlrpc.php. I currently only use Jetpack for backups, downtime monitoring, and account protection, and each of those things can be covered sufficiently for my purposes by Flywheel, Cloudflare, and Wordfence respectively.
Challenge WordPress login and admin area
(
lower(http.request.uri.path) eq "/wp-login.php"
or starts_with(lower(http.request.uri.path), "/wp-admin/")
)
and not lower(http.request.uri.path) eq "/wp-admin/admin-ajax.php"
Slightly on the fence about this since it is another point of friction for logging in on occasion… But hopefully this will kill off those “increased attack rate” emails from Wordfence. I have a crazy amount of scanner traffic looking for vulnerable files according to the server logs.
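Since a mistake in this rule could lock the admin area behind an unexpected challenge (or miss it entirely), it can help to dry-run the logic against a few sample paths before deploying. The sketch below is a plain Python mirror of the expression above, not anything Cloudflare provides. The function name and sample paths are made up for illustration, and it assumes the path is passed without a query string, as `http.request.uri.path` would be.

```python
def would_challenge(path: str) -> bool:
    """Mirror of the login/admin challenge expression (illustrative only)."""
    p = path.lower()
    is_login_or_admin = p == "/wp-login.php" or p.startswith("/wp-admin/")
    is_ajax = p == "/wp-admin/admin-ajax.php"
    # Challenge login and admin paths, but let admin-ajax.php through.
    return is_login_or_admin and not is_ajax


# Hypothetical sample paths covering the cases worth checking.
samples = [
    "/wp-login.php",
    "/wp-admin/",
    "/wp-admin/admin-ajax.php",
    "/WP-Admin/options.php",
    "/wp-admin",
    "/2026/hello/",
]

for path in samples:
    verdict = "challenge" if would_challenge(path) else "pass"
    print(f"{path:32} {verdict}")
```

One thing the dry run makes visible: `/wp-admin` without the trailing slash does not match the `starts_with` check, so it passes unchallenged in this model.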
Block obvious scanner clients on sensitive paths
(
lower(http.user_agent) contains "python-requests"
or lower(http.user_agent) contains "go-http-client"
or lower(http.user_agent) contains "curl"
or lower(http.user_agent) contains "wget"
)
and (
starts_with(http.request.uri.path, "/wp-admin/")
or http.request.uri.path eq "/wp-login.php"
or http.request.uri.path eq "/wp-comments-post.php"
or http.request.uri.path eq "/xmlrpc.php"
)
This is an attempt to block some of that vulnerability scanning that is prompting those frequent “increased attack rate” emails.
Block unneeded SEO bots
lower(http.user_agent) contains "serpstatbot"
or lower(http.user_agent) contains "seokicks"
or lower(http.user_agent) contains "7siters"
or lower(http.user_agent) contains "sogou"
or lower(http.user_agent) contains "petalbot"
Honestly, I’m happy with the SEO on this site in that everything is written in a human- and bot-approachable way. I don’t want to go farther than that. So these bots don’t provide enough value to justify the noise they generate.
Block AI-related user agents from personally-sensitive categories
(
starts_with(http.request.uri.path, "/category/b/")
or starts_with(http.request.uri.path, "/category/ab/")
or starts_with(http.request.uri.path, "/category/children/")
)
and (
lower(http.user_agent) contains "gptbot"
or lower(http.user_agent) contains "chatgpt-user"
or lower(http.user_agent) contains "claudebot"
or lower(http.user_agent) contains "anthropic-ai"
or lower(http.user_agent) contains "perplexitybot"
or lower(http.user_agent) contains "bytespider"
or lower(http.user_agent) contains "ccbot"
or lower(http.user_agent) contains "amazonbot"
or lower(http.user_agent) contains "facebookbot"
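)
I want to block bots from posts within personally-sensitive categories, but unfortunately my permalinks use the post slug only, with no primary category slug. That means a path-based rule like this one only matches the category archive pages, not the posts themselves.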
Honestly… This is probably sort of futile. There are a million other ways to get to the more personal content on my site since everything is so interconnected. Maybe I should block certain tags? Or if the slug contains something like “family”. Something to consider later, maybe with a Cloudflare Worker.
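To make the “slug contains something like family” idea a bit more concrete, here is a minimal sketch of the decision logic in Python. It is only a model of the check, not a Cloudflare Worker (those are written in JavaScript), and the keyword list and function name are hypothetical. It matches whole slug words rather than substrings, on the assumption that a post like “familiar-css-tricks” should not be caught by “family”.

```python
import re

# User agent markers taken from the rule above.
AI_AGENT_MARKERS = (
    "gptbot", "chatgpt-user", "claudebot", "anthropic-ai", "perplexitybot",
    "bytespider", "ccbot", "amazonbot", "facebookbot",
)

# Hypothetical keyword list; "family" is the only example from the post.
SENSITIVE_SLUG_WORDS = ("family",)


def should_block(path: str, user_agent: str) -> bool:
    """Block AI-related user agents on paths whose slug has a sensitive word."""
    ua = user_agent.lower()
    if not any(marker in ua for marker in AI_AGENT_MARKERS):
        return False
    # Split the path into lowercase words on anything that is not a-z or 0-9.
    slug_words = set(re.split(r"[^a-z0-9]+", path.lower()))
    return any(word in slug_words for word in SENSITIVE_SLUG_WORDS)
```

Like the rule it extends, this only stops bots that identify themselves honestly in the user agent string.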
3. Do some major email cleanup
I want to keep a public email address on my site since I really do like hearing from real folks. That said, the amount of spam I get to my piperhaywood.com email address is UNREAL. Here’s what I’ll try.
Reduce exposure by only exposing my email address on one challenged page
My public-facing email address is in the global footer. I’m going to remove it from there and have it only on my Contact page.
If I had a paid Cloudflare plan, I’d probably set up a separate WAF challenge rule along the lines of:
http.request.uri.path in {"/contact" "/contact/"}
But since there are only five rules on free accounts, I’ll probably consolidate that with the existing challenge rule pertaining to WordPress login and admin.
Adding a rule like this does mean that real people will see a Cloudflare challenge before they get to the page. I’m ok with that for now since 85% of the people I need to talk with already have my personal email address, and the rest either reach out through Bluesky + Mastodon or can go through the challenge.
Retire old public-facing email address
Obviously, the above is pointless if I don’t retire my old public-facing email address.
I need to:
- Automatically move emails sent to my wildcard email address to a “Catch-all” folder for periodic review
- Set up a new public-facing email address and add it to my /contact page
- Find my old public-facing email address and remove it from my website footer and anywhere else it appears
- Automatically move emails sent to my old public-facing email address to an “Old addresses” folder for periodic review
More aggressive email filtering
I already have some phrase filters in place for my email, but I’m adding another one that sends anything that matches the regular expression below to a “Likely spam” folder automatically.
(?i)\b(guest post|link insertion|seo optimization|seo services|backlink|high authority site|domain authority|sponsored post|collaboration opportunity)\b
4. Monitor logs for abuse and consider ASN blocks
Above are some lightweight moves, but I am considering blocking Autonomous System Numbers (ASNs) for networks perpetrating major abuse.
An ASN is a unique number assigned to an autonomous system: a network, usually run by a single operator such as a hosting provider or ISP, that announces groups of IP addresses on the internet. So if blocking an IP address is like blocking a person, blocking an ASN is like blocking a neighborhood. ASNs are also a lot more stable than IP addresses.
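Before blocking a whole “neighborhood”, it helps to know how much traffic actually comes from it. Below is a minimal sketch of that kind of tally in Python, assuming access log lines that start with the client IP (as in the common and combined log formats). The prefix-to-ASN table is entirely hypothetical: it uses documentation IP ranges and private-use AS numbers, and a real version would need an actual IP-to-ASN data source.

```python
from collections import Counter
from ipaddress import ip_address, ip_network
from typing import Iterable, Optional

# Hypothetical lookup table: documentation ranges mapped to private-use ASNs.
PREFIX_TO_ASN = {
    ip_network("192.0.2.0/24"): 64512,
    ip_network("198.51.100.0/24"): 64513,
}


def asn_for(ip: str) -> Optional[int]:
    """Return the ASN for an IP, or None if it is not in the table."""
    addr = ip_address(ip)  # raises ValueError for malformed input
    for network, asn in PREFIX_TO_ASN.items():
        if addr in network:
            return asn
    return None


def requests_per_asn(log_lines: Iterable[str]) -> Counter:
    """Count requests per ASN, assuming the client IP is the first field."""
    counts: Counter = Counter()
    for line in log_lines:
        fields = line.split(maxsplit=1)
        if not fields:
            continue  # skip blank lines
        try:
            counts[asn_for(fields[0])] += 1
        except ValueError:
            continue  # first field was not an IP address
    return counts


# Made-up log lines for illustration.
sample_lines = [
    '192.0.2.10 - - [01/Jan/2026:00:00:01 +0000] "GET /wp-login.php HTTP/1.1" 403 0',
    '192.0.2.77 - - [01/Jan/2026:00:00:02 +0000] "POST /xmlrpc.php HTTP/1.1" 403 0',
    '198.51.100.5 - - [01/Jan/2026:00:00:03 +0000] "GET / HTTP/1.1" 200 512',
    '203.0.113.9 - - [01/Jan/2026:00:00:04 +0000] "GET /feed/ HTTP/1.1" 200 2048',
]

for asn, count in requests_per_asn(sample_lines).most_common():
    print(asn, count)
```

Requests from addresses outside the table are counted under `None`, which is a useful reminder of how much traffic a given ASN block would leave untouched.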
***
Will this make enough of a difference? Time will tell.
#AI #bots #email #personalSite #security #spam