Cursed homelab update:
Still trying to calm down the fans on Server #3, I decided to upgrade Kubernetes from 1.31 to 1.34 (well, I wanted to upgrade beyond that, but... you'll see...)
Kubernetes upgrades with Kubeadm are delightfully straightforward: upgrade the stuff managed by Kubeadm, then upgrade Kubelet. (These are NOT upgrade instructions, please follow the documentation.) Other than some (hopefully) very rare manual steps to migrate configuration, this is so simple as to be trivially automated, and I'm seriously thinking of adding this functionality to @geerlingguy's excellent Ansible role. (Right after I contribute support for multiple control plane nodes.)
(This paragraph contains a rant about AI, so feel free to skip it.)
My process is to basically follow a set of instructions with the various commands spelled out for copying-and-pasting. Doing, say, a migration from Terraform to OpenTofu is similarly easy, which is why I'm so frustrated that a colleague "automated" this with some AI agent tool thing and it fucked it up.
The upgrade went very well, including an extra step to upgrade to the latest patch version of 1.31, as I misunderstood the documentation. Well, that was right up until the migration from 1.33 to 1.34 broke on server #1. No idea what happened, but the etcd pod on that server ended up in the "Pending" state and got stuck, and this broke Kubeadm's upgrade process.
(Note here: my cluster is now extremely "unbalanced", with server #3, the DL380, having 89% of my CPU cores and 70% of my RAM, so there's a very specific order to do everything in (biggest node to smallest), and stuff occasionally times out on servers #1 and #2 as they're both slower and overloaded.)
After a bunch of attempts to unstick it, I went nuclear and "kubeadm reset" the node, cleaned up the remaining items in the cluster, and re-joined the cluster, which worked, and I then continued the upgrade process without issue.
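For anyone wondering why 1.31 to 1.34 is several upgrades rather than one: kubeadm doesn't support skipping minor versions, so you step through each one, and in my case every step goes biggest node to smallest. Here's a rough Python sketch of that plan. The node names and core counts are made-up placeholders, not my real hardware, and this only prints an order, it doesn't run kubeadm or replace the documentation.

```python
def minor_hops(current: str, target: str) -> list[str]:
    """List the minor releases to step through, one at a time."""
    cur_major, cur_minor = (int(p) for p in current.split(".")[:2])
    tgt_major, tgt_minor = (int(p) for p in target.split(".")[:2])
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError(f"cannot plan an upgrade from {current} to {target}")
    return [f"{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


def upgrade_plan(cores_by_node: dict[str, int], current: str, target: str) -> list[tuple[str, str]]:
    """Pair every minor hop with every node, biggest node first."""
    order = sorted(cores_by_node, key=cores_by_node.get, reverse=True)
    return [(version, node) for version in minor_hops(current, target) for node in order]


if __name__ == "__main__":
    # Hypothetical core counts, for illustration only.
    nodes = {"server1": 4, "server2": 6, "server3": 80}
    for version, node in upgrade_plan(nodes, "1.31", "1.34"):
        print(f"upgrade {node} to {version}")
```

Each minor version is finished on all three nodes before the next one starts, which is how I ended up with server #1 stuck partway between 1.33 and 1.34.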
I wanted to be on 1.35 (current is 1.36) but this was annoying enough that I'm stopping for now.
So has this calmed down server #3's fans?
Well, not really. On one hand, 1.31 did occasionally have etcd and the API server chew through ~10 cores of CPU for just long enough that the fans spun up, and 1.34 seems to only use ~5-7 cores for these bursts, which is much better. On the other hand, the bursts are still there. I believe these spikes are due to some issue with server #1 or #2, so I think the only way to "fix" this is to reduce the contention on servers #1 and #2 by upgrading their hardware.
So I need to figure out another way to cram more RAM and CPUs into those servers, hopefully without spending too much money (or buying another server-class machine; one a year is more than enough).
#homelab #cursedhomelab #kubernetes