Treehouse service notice: we are aware of intermittent connectivity issues for certain internet providers. This is caused by a partial outage affecting Hurricane Electric connectivity in Seattle.

We don’t have any particular recommendations at this time other than to try different network paths (example: VPN).

#TreehouseOpsLog

#TreehouseOpsLog on our tech debt todo list, non-exhaustive:
- set up logs collection and aggregation so it doesn't take forever to figure out where those 502s come from
- rework our nginx / docker compose file setup for easier horizontal scaling
- finish rolling out alpine 3.23
- set up internal service sso (this one's mostly an excuse to figure out how to set up authentik or whatever)
- acquire remaining parts / schedule downtime for hardware upgrades
- migrate backups to cheaper hosting
- figure out why woodpecker keeps dying during jobs (it's probably untuned postgres on proxmox zfs *again*. somehow. sqlite was worse, but it's still annoying to see 15 minute build jobs fail)
- update to nginx >=1.29.4 for some very nice bug fixes and performance improvements

all of this is taken from our ops book, powered by mdbook+curl-PUT-to-git-pages (and a hacky woodpecker+apko+justfile build system, which is itself an image built using a similar setup in another repo)

figuring out how to release some of our ops stuff for public curiosity is on the very long term todo list; part of the reason it's not public yet is that writing for public consumption is harder than writing for a restricted audience.

here's a public (edited for brevity) transcript of an alert that i caused / resolved earlier

first person to figure out the root cause gets an internet point

https://hackmd.io/@k-sparkles/ByjE29uIbe

#TreehouseOpsLog

[PUBLIC] 2026-01-28 proxmox mastodon disk usage alert - HackMD


redis/valkey backups are scary…

  • Turn off automatic rewrites with CONFIG SET auto-aof-rewrite-percentage 0
  • Make sure you don't manually start a rewrite (using BGREWRITEAOF) during this time.
  • Check there's no current rewrite in progress using INFO persistence and verifying aof_rewrite_in_progress is 0. If it's 1, then you'll need to wait for the rewrite to complete.
  • Now you can safely copy the files in the appenddirname directory.
  • Re-enable rewrites when done: CONFIG SET auto-aof-rewrite-percentage <prev-value>
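
a rough sketch of the steps above as a script, not our actual tooling. assumptions: `client` is any object with redis-py style `config_get`, `config_set` and `info` methods, and `backup_aof`, `aof_dir`, `dest_dir` and `poll_seconds` are made-up names for illustration. it can't stop someone from running BGREWRITEAOF by hand mid-copy, so that step is still on the operator.

```python
import shutil
import time


def backup_aof(client, aof_dir, dest_dir, poll_seconds=1.0, copy=shutil.copytree):
    """Copy the AOF directory while automatic rewrites are turned off.

    `client` is assumed to behave like a redis-py client (hypothetical wiring).
    """
    key = "auto-aof-rewrite-percentage"

    # Remember the current setting so it can be restored afterwards.
    previous = client.config_get(key)[key]

    # Turn off automatic rewrites.
    client.config_set(key, 0)
    try:
        # Wait until no rewrite is in progress (one may already have been running).
        while int(client.info("persistence")["aof_rewrite_in_progress"]) != 0:
            time.sleep(poll_seconds)

        # Now the files in the appenddirname directory can be copied.
        copy(aof_dir, dest_dir)
    finally:
        # Re-enable rewrites even if the copy failed.
        client.config_set(key, previous)
```

the `try`/`finally` is the important part: if the copy blows up, rewrites still get turned back on instead of staying disabled until someone notices the AOF eating the disk.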

#TreehouseOpsLog

okay, I think I've figured out how redis backups work. next step: running it by some other staffers, cleaning things up and pushing the tools image, and getting this change onto social.treehouse.systems

after that, a new services host is probably my personal #1 priority. forgejo upgrade will come naturally with that.

#TreehouseOpsLog

bikeshed

Treehouse Gitea

we’re back, after the worst unplanned outage in treehouse history

post mortem to come once everyone returns from hibernation

#TreehouseOpsLog

Due to an unfortunate typo, Treehouse was down for ~2.5h. All systems are now back to normal.

#TreehouseOpsLog

we continue our migration off of "legacy minio" to the miracle of "cloud storage but it's cheaper than the server that we were running minio on and need to retire anyway"

#TreehouseOpsLog

we're done. probably.

treehouse.systems. 15 IN TXT "we're done!"

#TreehouseOpsLog