PSA:

The maintenance work is finished.

Sadly, there is no good news about the Ceph issue that I did all of this to fix.

Ceph is hitting a long-standing bug (open for 7 years now).

The main tracking bug is #47380: https://tracker.ceph.com/issues/47380
Related bugs:

#43893 (lingering osd_failure ops): https://tracker.ceph.com/issues/43893
#24322 (original report by Sage Weil, 2018): https://tracker.ceph.com/issues/24322
#50637 (OSD slow ops not clearing): https://tracker.ceph.com/issues/50637

Bug #47380 is the most relevant: it is still open, was updated about 1 month ago, and is tagged for Tentacle.

Sadly, the dedicated switch for the datacenter won't help here either.

It's a little disappointing that #Ceph has had an issue like this for so long.
The problem is that slow ops on the mon cause OSDs to show latency/commit as 0/0. Restarting OSDs helps: I restart them until they show real values and vanish from the slow ops health warning. While they show up there, peering of OSDs is halted as well.
Restarting the mon leader might also help, but I can't confirm that, as I only just tried it for the first time. No long-term experience yet.
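
To figure out which daemons are candidates for a restart, something like the following might help. It is only a rough sketch: the two sample strings are my assumption of what the slow ops health warning text looks like, and `daemons_with_slow_ops` is a made-up helper, not part of any Ceph tooling.

```python
import re

# Assumed examples of slow ops health warning text (not copied from a real cluster).
SAMPLE_MESSAGES = [
    "3 slow ops, oldest one blocked for 120 sec, daemons [osd.4,osd.7] have slow ops.",
    "1 slow ops, oldest one blocked for 45 sec, mon.node1 has slow ops",
]

# Matches daemon names such as "osd.4" or "mon.node1".
DAEMON_RE = re.compile(r"\b(?:osd|mon)\.[\w-]+")


def daemons_with_slow_ops(message):
    """Return the unique daemon names mentioned in a health message, sorted as strings."""
    return sorted(set(DAEMON_RE.findall(message)))


if __name__ == "__main__":
    for msg in SAMPLE_MESSAGES:
        print(daemons_with_slow_ops(msg))
```

The list only tells you which daemons are named in the warning. Whether restarting them (or the mon leader) actually clears the state is exactly what the bug reports above are about.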

I'm now thinking about dropping Ceph because of this, but honestly that doesn't seem like an option, as it would leave the services without any redundancy.

Wasted more than 6 hours just today on that annoying bug. GRMPF!

/ @ij

In case you never hear from me again: I am now running a three-node OpenStack cluster in my homelab, using Ceph storage coming from the same three nodes.

Ceph deployed with cephadm.
OpenStack deployed with kolla-ansible.

The basics are working. First demo VM deployed via OpenStack client and via OpenTofu. Now I can study and learn the admin side of things, finally.

#OpenStack #Ceph #homelab #HellYeah #WhyNot @homelab

[ Blog ] Remove #Ceph #datastore from #Proxmox

Removing a Ceph datastore from Proxmox involves several steps, especially if you want to completely clear Ceph from your system.
Keep in mind that this process permanently deletes all data stored on the Ceph datastore. Make sure you have a working backup of any critical data before proceeding.
 
Migrate VM disks off http://rviv.ly/9WJn4b

[ Blog ] Proxmox upgrade #Ceph #Reef to #Squid

If you are running Ceph Reef in your #Proxmox infrastructure and plan to upgrade to Proxmox 9, you must first upgrade Ceph Reef to Squid to meet the prerequisites.

As a best practice, make sure you have a working backup of your VMs and containers before proceeding with the upgrade.

 
Prerequisites
To upgrade Ceph Reef to http://rviv.ly/tmWeXQ #aggiornamento

It's once again time for a new #introduction

Hi! I'm Crabbypup, or just 'crabby', though only in name.

I'm in Kitchener-Waterloo, and I'm a linux-flavoured computer toucher.

I share stuff about the region, open source software in general, and #linux in particular.

I like to tinker in my #homelab, where I run #proxmox, #ceph, and a bunch of other #selfhosted services including #homeassistant.

I'm a rather inconsistent poster, and a prolific booster, but I'm glad to be here.

[ Blog ] Proxmox #node replacement in #Ceph cluster

If a core Proxmox server fails and takes its Ceph OSDs with it, the Proxmox node replacement doesn't have to be a nightmare.

To fix this issue you must cleanly decommission the failed server and correctly perform the Proxmox node replacement to ensure your Ceph data remains resilient.

The Ceph cluster should have the status http://rviv.ly/hDkYPo #replace

[ Blog ] Shutdown #Proxmox #cluster with #Ceph storage

To #shutdown a Proxmox cluster without data loss or corruption, especially when Ceph storage is in use, you must follow a specific procedure.

When working with a high-availability infrastructure, performing maintenance tasks can be a delicate process. Shutting down a cluster, even for a planned event like a power outage or hardware http://rviv.ly/GCyTZM

@korkenzieher OK, that was a really helpful RTFM. I now have a virtual IP up and running and can connect to the S3 endpoint via that vIP.

Sadly, the TLS settings seem to be a little less capable in the Tentacle release. If I understand the documentation (https://docs.ceph.com/en/tentacle/cephadm/services/rgw/#high-availability-service-for-rgw) correctly, I can hand over a TLS certificate and key.

In the `latest` documentation one can specify that this should be a self-signed certificate provided by cephadm.

So until I get the Let's Encrypt certificate, I have HTTP traffic served on port 443, as just specifying `ssl: true` seems to do nothing in Tentacle...
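
A quick way to check whether a port really answers with TLS or just plain HTTP is a small probe like the one below. It is a minimal sketch using only the Python standard library; `speaks_tls` is a made-up helper name and `s3.example.internal` is a placeholder, not my real vIP. Certificate verification is switched off on purpose, because the only question here is whether a TLS handshake succeeds at all.

```python
import socket
import ssl


def speaks_tls(host, port, timeout=3.0):
    """Return True if a TLS handshake with host:port succeeds, else False."""
    context = ssl.create_default_context()
    # Only probing for TLS, not validating the certificate (assumption: test setup).
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                return tls.version() is not None
    except OSError:
        # Covers ssl.SSLError (e.g. plain HTTP on the port), refused connections, timeouts.
        return False


if __name__ == "__main__":
    # Placeholder endpoint; replace with the virtual IP or hostname to test.
    print(speaks_tls("s3.example.internal", 443))
```

A `False` result does not say why the handshake failed, so an unreachable host and a plain HTTP listener look the same here.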

#ceph #cephadm #rgw #radosgw #s3 #storage

@korkenzieher Sometimes I do not understand the Ceph documentation structure. Why are there two almost identical code blocks in that section, without any explanation of what the difference is?

#ceph