Adventures in #selfhosting!
I just finished failing to do a "correct" and "proper" upgrade of CloudNativePG (#cnpg) from standard #longhorn volume backups to the Barman Cloud plugin.

I got the plugin loaded according to the migration docs, but couldn't get it to write to #s3, nor would the pods become ready. I worked at it for hours, and saw lots of other people online who'd recently hit the same issues and log messages that I was seeing.

The reason I did this in the first place was that I noticed that I had some duplicate backup jobs causing issues with #fluxcd reconciliation.

In the end, I gave up and went back to the original Longhorn backups, which have worked and which I've already done disaster recovery with (don't ask), and deleted the duplicate jobs.

Currently I'm waiting for the previous primary/write node to fully restart and clear out the Barman sidecar. Then I'll turn Flux back on and hopefully things will be good.

#keyboardvagabond #kubernetes #comingSoonTM

As an aside, I had to settle for a 5G router because enabling fiber takes at least 10 days (thank you, owner of those light cables in the ground)

And that router is working extra hours; 75 Mbps is not enough for Longhorn to rebuild all its volumes
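(Rough math, assuming on the order of 1 TB of replica data needs to resync - that amount is an assumption, I haven't totalled the volumes: 75 Mbps is roughly 9 MB/s, so a terabyte takes about 30 hours even with the link saturated the whole time.)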

Curious to see the dashboard tomorrow morning

#longhorn #selfhosted #homelab #dashboard #router #5g #fiber #isp #download #upload

Nice, I finally powered the NAS back on and everything came back up

Longhorn backups, CloudNativePG backups, MinIO, OpenMediaVault, WireGuard

The full stack came back without a hiccup

(Small sike on me tho': I had to SSH into the machine to restart the containers and WireGuard because I had forgotten to enable them at boot...)

#selfhosted #homelab #kubernetes #k3s #longhorn #wireguard #cnpg #postgres #minio #openmediavault #backups #sike #nas #ssh

We are SO back

This moving experience was... interesting

I will describe it as a "Controlled catastrophe"

Services went down, but only some
Most data was still available while the rack was in transit, Synapse decided to behave most of the time (even the bridges kept operating), and now that everything is back up and running, it's all behaving as expected

But I learned:
- Disaster recovery is not on point
- Losing a node may not be handled well by the cluster
- The cluster ain't so HA as of now
- Data is safe tho'

Let's see if I can iron out all those quirks before the next move!

#selfhosted #homelab #matrix #element #ha #kubernetes #k3s #longhorn #outage #moving #disasterrecovery

If there is a race condition in a CSI driver, is that likely a #kubernetes bug or a bug in the implementer, such as #longhorn? I'm curious about this code in particular: https://github.com/longhorn/longhorn-manager/tree/master/csi More specifically, I am looking at https://github.com/longhorn/longhorn-manager/blob/master/csi/deployment.go

I am not a #golang dev, but I do not see anything about locking there. Maybe the locking is something K8s does? Maybe it is something the OS does? Really not sure where to poke exactly for the disk not being detached properly.
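For my own understanding, here's roughly the pattern I was expecting to find: a minimal, made-up Go sketch of serializing attach/detach with a sync.Mutex. None of these names come from longhorn-manager; it only illustrates what mutual exclusion around volume state looks like.

```go
// Illustrative only: a hypothetical attach/detach guard, NOT longhorn-manager code.
package main

import (
	"fmt"
	"sync"
)

// volumeAttacher is a made-up type that serializes attach/detach in one process.
type volumeAttacher struct {
	mu       sync.Mutex
	attached map[string]bool // volumeID -> currently attached?
}

func newVolumeAttacher() *volumeAttacher {
	return &volumeAttacher{attached: make(map[string]bool)}
}

// Attach marks a volume attached; the mutex keeps a concurrent Detach
// from interleaving with it.
func (a *volumeAttacher) Attach(volumeID string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.attached[volumeID] = true
	fmt.Println("attached", volumeID)
}

// Detach only proceeds if the volume is currently marked attached.
func (a *volumeAttacher) Detach(volumeID string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if !a.attached[volumeID] {
		fmt.Println("skip detach, not attached:", volumeID)
		return
	}
	delete(a.attached, volumeID)
	fmt.Println("detached", volumeID)
}

func main() {
	a := newVolumeAttacher()
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			a.Attach("pvc-1234")
			a.Detach("pvc-1234")
		}()
	}
	wg.Wait()
}
```

As far as I understand, Kubernetes also has its own attach/detach controller reconciling desired state, so the serialization might live there rather than in the driver - which is exactly the part I can't tell from the code.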

An unfortunate side effect of Longhorn experiencing I/O latency and saturation is that volumes attached to pods get remounted read-only. This has very, very unfortunate effects on running databases, caches, etc. Any tips on making #Longhorn behave would be greatly appreciated. I've looked briefly into OpenEBS' distributed, replicated storage, but the requirements currently rule it out - specifically, replicated #OpenEBS needs an entire physical disk for itself.
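In the meantime, here's a minimal sketch of a watchdog for the silent read-only flip: it just parses /proc/mounts and flags anything under a path prefix that carries the "ro" option. The prefix is an assumption for illustration; adjust it for wherever your volumes actually land.

```go
// Minimal sketch of a read-only-remount watchdog, not a Longhorn tool.
// It parses /proc/mounts and reports mounts under a prefix that are mounted "ro".
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Hypothetical prefix to watch; change this for your setup.
	prefix := "/var/lib/kubelet/pods"

	f, err := os.Open("/proc/mounts")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read /proc/mounts:", err)
		os.Exit(1)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// /proc/mounts format: device mountpoint fstype options dump pass
		fields := strings.Fields(scanner.Text())
		if len(fields) < 4 || !strings.HasPrefix(fields[1], prefix) {
			continue
		}
		for _, opt := range strings.Split(fields[3], ",") {
			if opt == "ro" {
				fmt.Println("READ-ONLY:", fields[1], "(", fields[0], ")")
			}
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "scan error:", err)
	}
}
```

It doesn't fix anything, but at least the alert points at the volume instead of at a "crashed" database.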

A node crashed today, and unfortunately the remaining three were overburdened with a seemingly spontaneous filesystem check of some rather large volumes, including the ones holding the three database replicas and three media storage replicas for #mstdndk. We're running #Longhorn on #Kubernetes for replicated block storage, but we're doing this on a bit of a budget, which means spinning metal and 1Gbit connections between servers. You can get quite far by very carefully prioritizing the I/O and CPU of certain processes, but once in a rare while it tumbles over. If we had the money, we'd fit each server with 16TB NVMe drives and a 10Gbit backbone, but unfortunately that's not currently the case.

Is there any benefit to running this on Kubernetes? I have no doubt there is, but I'm also convinced that our current problem is replicated block storage and the requirements associated with it.

Our current setup is four i7-8700s with 64GB RAM, 1x10TB of spinning metal, and 2x512GB SSDs.

A frosty #longhorn #calf
Your #infra, your rules. Discover how #Harvester empowers teams to reduce costs & scale smart with #opensource power. Catch this #oSC25 session to learn how Harvester + #Longhorn can transform your #virtualization stack! https://youtu.be/OOLTpwQWspI?si=2InT3vvAccWHwN6B