Cursed homelab update:
Cleaning up:
I've redistributed my VMs between the servers (Libvirt's migration feature is magic when it works), given them zram swap devices instead of local virtual disks, and upgraded Cilium to 1.18.
And I think Kubernetes is slightly more stable.
The issue I have is that the system pods (controller-manager, apiserver, scheduler and cilium-operator) frequently restart on all three servers. This causes various other bits of that infrastructure to "panic", which in turn makes server #3 spin its fans up for a moment.
(By "panic" I mean they have to quickly compensate)
The short version of my theory is that servers #1 and #2 are overloaded enough that important parts of Kubernetes get stuck just long enough for timeouts to be frequent, and that the entire stack seems unable to retry timed-out requests, so things can't do The Thing, and crash.
(Sidebar: there are two schools of thought on error handling: recover from errors vs crash on errors. Option 1 needs extensive error handling code to keep itself up when errors happen. Option 2 requires that things are small and quickly restartable. I advocate for something in the middle: choosing whether and how to handle errors based on their type and severity.)
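To make the "something in the middle" idea concrete, here is a minimal sketch in Python. It is only an illustration of the policy, not how any Kubernetes component actually behaves. The helper name `call_with_retry` and its parameters are hypothetical: timeouts are treated as transient and retried a few times with backoff, while every other error is allowed to crash the caller.

```python
import time


def call_with_retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run `operation`, retrying only on TimeoutError.

    Hypothetical helper for illustration. Any other exception propagates
    immediately (the "crash on errors" half of the policy).
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == attempts - 1:
                # Out of retries: give up and let a supervisor restart us.
                raise
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is only there so the delay can be swapped out when experimenting; in normal use the default `time.sleep` applies.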
So the plan was simple: move the heaviest VM from server #1 to #3, move one of the lighter VMs from server #2 to #1 and hopefully things will calm down.
They did, and that heavy VM is running faster. But Kubernetes still isn't happy.
On the suspicion that this might be a networking issue, I upgraded Cilium (which I've been putting off for a while) and it now seems to be slightly calmer, though I want to see every system pod's last restart be more than 1 hour old before I call this "fixed".
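A rough way to check that "over 1 hour" criterion, sketched in Python. Assumptions: the system pods all live in the `kube-system` namespace (that may not match every setup), `kubectl` is on the PATH, and the `finishedAt` time of a container's previous termination is a good enough stand-in for "last restart". The function names are made up for this sketch.

```python
import json
import subprocess
from datetime import datetime, timedelta, timezone


def recent_restarts(pod_list, now, window=timedelta(hours=1)):
    """Return (pod, container, finished_at) for containers whose previous
    instance terminated less than `window` before `now`."""
    hits = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            terminated = cs.get("lastState", {}).get("terminated")
            if not terminated or not terminated.get("finishedAt"):
                continue  # never restarted, or no timestamp recorded
            # Kubernetes serialises timestamps as UTC, e.g. 2025-01-01T11:30:00Z
            finished = datetime.strptime(
                terminated["finishedAt"], "%Y-%m-%dT%H:%M:%SZ"
            ).replace(tzinfo=timezone.utc)
            if now - finished < window:
                hits.append((pod_name, cs["name"], finished))
    return hits


def main(namespace="kube-system"):
    # Assumption: the namespace is kube-system; adjust if yours differs.
    raw = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    hits = recent_restarts(json.loads(raw), datetime.now(timezone.utc))
    for pod_name, container, finished in hits:
        print(f"{pod_name}/{container} last terminated at {finished.isoformat()}")
    if not hits:
        print("No restarts in the last hour.")
```

Calling `main()` prints any container that restarted within the last hour; an empty result is the "fixed" condition described above.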




