#ceph #selfhosting
I've had an annoying problem on two of my k8s nodes for many months that I never had time to really dig into: they would freeze up almost every week. I could not see anything troubling in journalctl, Grafana, or anywhere else. When a node froze it was still pingable, but there was no SSH or console access. Then I noticed it was happening every Monday morning. Aha! fstrim has a weekly timer, and I have an LV on my root SSD that holds a Ceph OSD. Trim conflict.
I confirmed it by manually running fstrim.service: within a couple of minutes the system was hung. RHEL 9 had this timer disabled by default. RHEL 10 has it enabled by default. Disabling it should stop the outages.
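For anyone who wants to script the check across several nodes, here is a minimal sketch of the idea, assuming systemd and root access. The helper names `timer_state` and `disable_timer` are made up for illustration; the only real commands involved are `systemctl is-enabled` and `systemctl disable --now`. The `run` parameter is there only so the helpers can be exercised without touching a real system.

```python
import subprocess

UNIT = "fstrim.timer"  # the weekly trim timer discussed above


def timer_state(unit=UNIT, run=subprocess.run):
    """Return what `systemctl is-enabled` prints, e.g. "enabled" or "disabled"."""
    # is-enabled exits non-zero for disabled units, so check=True is not used here.
    result = run(
        ["systemctl", "is-enabled", unit],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()


def disable_timer(unit=UNIT, run=subprocess.run):
    """Disable the timer and stop it immediately (needs root)."""
    # --now stops the running timer in addition to disabling it at boot.
    run(["systemctl", "disable", "--now", unit], check=True)
```

On a node, calling `timer_state()`, then `disable_timer()` if it returned `"enabled"`, then `timer_state()` again should show whether the change took effect. This only turns off the scheduled trim; it does not address why trimming hangs the node in the first place.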