Today, trying to get the k3s cluster back to a working, smooth-sailing state...
It has been somewhat unstable since the move, and with little time to take care of it, outages were around every corner.
Turns out, CNPG wasn't clearing out WAL on local replicas, inflating local disk usage, making them unavailable for scheduling...
Now that I understand why everything goes down, it is time to set up ntfy and link it with alertmanager+prometheus to get proper insights on the cluster in real time.
Now, back to rebuilding all Longhorn volumes because of those random outages *sigh*
But it underlines the importance of having good backups! No data was lost, despite the random downtime.
#k3s #devops #kubernetes #prometheus #alertmanager #cluster #cnpg #wal #postgres #psql #ntfy #longhorn #volume #backups #homelab #selfhosted