PSA:
Maintenance work has been finished.
Sadly, there is no good news regarding the Ceph issue that all of this work was meant to fix.
Ceph is hitting a long-standing bug (open for 7 years now).
The main tracking bug is #47380: https://tracker.ceph.com/issues/47380
Related bugs:
#43893 (lingering osd_failure ops): https://tracker.ceph.com/issues/43893
#24322 (original report by Sage Weil, 2018): https://tracker.ceph.com/issues/24322
#50637 (OSD slow ops not clearing): https://tracker.ceph.com/issues/50637
Bug #47380 is the most relevant: it is still open, was updated about a month ago, and is tagged for Tentacle.
Sadly, the dedicated switch for the datacenter won't help here either.
It's a little bit disappointing that #Ceph has had this kind of issue for such a long time.
The problem here is that slow ops on the mon cause OSDs to show latency/commit as 0/0. Restarting the OSDs helps: afterwards they show real values and vanish from the slow ops health warning. While they are listed there, peering of OSDs is halted as well.
A restart of the mon leader might also help, but I can't confirm that, as I have only just tried it for the first time. No long-term experience yet.
I'm now thinking about not using Ceph anymore because of this, but honestly that doesn't seem like an option, as it would leave the services without any redundancy.
Wasted more than 6 hours just today on that annoying bug. GRMPF!
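To spot the affected OSDs without reading the table by eye, something like the following rough sketch might work. It assumes the JSON output of `ceph osd perf --format json` contains an `osd_perf_infos` list (nested under `osdstats` on newer releases, at the top level on older ones) with `commit_latency_ms` and `apply_latency_ms` per OSD; please verify that layout against your own cluster. The function names and the sample data are made up for illustration.

```python
import json
import subprocess


def zero_latency_osds(perf):
    """Return the ids of OSDs reporting 0 for both commit and apply latency."""
    # Assumption: newer releases nest the list under "osdstats",
    # older ones keep "osd_perf_infos" at the top level.
    infos = perf.get("osdstats", perf).get("osd_perf_infos", [])
    stuck = []
    for info in infos:
        stats = info.get("perf_stats", {})
        if stats.get("commit_latency_ms") == 0 and stats.get("apply_latency_ms") == 0:
            stuck.append(info["id"])
    return sorted(stuck)


def read_osd_perf():
    """Run `ceph osd perf` and parse its JSON output (needs a working ceph CLI)."""
    out = subprocess.run(
        ["ceph", "osd", "perf", "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


# Hypothetical sample in the assumed layout, not real cluster output.
sample = {
    "osdstats": {
        "osd_perf_infos": [
            {"id": 0, "perf_stats": {"commit_latency_ms": 12, "apply_latency_ms": 12}},
            {"id": 1, "perf_stats": {"commit_latency_ms": 0, "apply_latency_ms": 0}},
            {"id": 2, "perf_stats": {"commit_latency_ms": 0, "apply_latency_ms": 5}},
            {"id": 3, "perf_stats": {"commit_latency_ms": 0, "apply_latency_ms": 0}},
        ]
    }
}

print(zero_latency_osds(sample))  # [1, 3]
```

On a live cluster that would be `zero_latency_osds(read_osd_perf())`. A genuinely idle OSD can also report 0/0, so the result is only a list of restart candidates to cross-check against the slow ops health warning.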
/ @ij




