Unpopular opinion: I have fought with ZFS under Talos for months, but in reality what I needed was Longhorn.
Yeah, yeah, I know, different things. But that's just to say that ZFS is not the silver bullet that some people try to convince you it is.
Yeah, I'm plenty aware that using networked volumes with Kubernetes is the better way to go, but I gotta hand it to Longhorn: the distributed replicas make it a breeze to move stuff around and do physical maintenance in the nodes.
So it appears that, for the last ~months, my MTU configuration was REALLY wrong
The hint was the immich Longhorn replica not rebuilding. What I hadn't connected to it was the extreme slowness of any service using a db cluster whose master node wasn't in the same region
I had blamed the slowness on packets hopping a lot between regions, but it turns out it was just db requests exceeding the configured MTU value and being silently dropped
Now that BOTH the wireguard and flannel MTU values are set properly, everything is so damn snappy
This feels like new skin
#homelab #selfhosted #selfhosting #wireguard #mtu #vpn #mesh #longhorn #immich #flannel #devops #linux #opensource #networking
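For anyone wondering what "set properly" can look like, here is a rough sketch of the arithmetic, not the poster's actual values. It assumes flannel runs its VXLAN backend on top of the WireGuard interface (the post doesn't say which backend is in use) and that the physical link MTU is 1500. The function names are made up for illustration.

```python
# Per-packet overhead in bytes added by each encapsulation layer.
# WireGuard: outer IP header (20 for IPv4, 40 for IPv6) + UDP (8) + WireGuard framing (32).
WIREGUARD_OVERHEAD = {"ipv4": 60, "ipv6": 80}
# VXLAN (assumed flannel backend): outer IPv4 (20) + UDP (8) + VXLAN (8) + inner Ethernet (14).
VXLAN_OVERHEAD = 50


def wireguard_mtu(link_mtu: int, outer: str = "ipv6") -> int:
    """Largest MTU the WireGuard interface can carry over a link of link_mtu."""
    return link_mtu - WIREGUARD_OVERHEAD[outer]


def flannel_vxlan_mtu(underlay_mtu: int) -> int:
    """Largest MTU for a VXLAN overlay running on top of underlay_mtu."""
    return underlay_mtu - VXLAN_OVERHEAD
```

With a 1500-byte link, `wireguard_mtu(1500)` gives 1420 (sized for the IPv6 worst case) and `flannel_vxlan_mtu(1420)` gives 1370. If the overlay MTU is left higher than that, full-size packets no longer fit through the tunnel and can vanish without any visible error.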
For the last ~6 months, my immich Longhorn PVC wouldn't rebuild replicas across regions, and would time out instead
Today, I figured out I had misplaced my MTU configuration for the WireGuard network under k3s...
So some packets were getting dropped silently...
Woops
#kubernetes #k3s #longhorn #network #networking #wireguard #wg #mesh #homelab #selfhosted #selfhosting #mtu
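One way to catch this kind of silent drop is to probe the path with pings that are not allowed to fragment. The sketch below is an illustration only: it assumes Linux iputils `ping` (`-M do` forbids fragmentation, `-s` sets the payload size), IPv4, and that ICMP is not filtered between the nodes. `ping_fits` and `find_path_mtu` are hypothetical helper names.

```python
import subprocess
from typing import Callable

ICMP_OVERHEAD = 28  # IPv4 header (20) + ICMP echo header (8)


def ping_fits(host: str, mtu: int) -> bool:
    """True if one unfragmented IPv4 ping of total size `mtu` gets a reply."""
    payload = mtu - ICMP_OVERHEAD
    cmd = ["ping", "-c", "1", "-W", "2", "-M", "do", "-s", str(payload), host]
    return subprocess.run(cmd, capture_output=True).returncode == 0


def find_path_mtu(fits: Callable[[int], bool], low: int = 576, high: int = 1500) -> int:
    """Binary-search the largest MTU in [low, high] for which fits(mtu) is True.

    Assumes fits(low) is True and that fitting is monotonic in packet size.
    """
    while low < high:
        mid = (low + high + 1) // 2
        if fits(mid):
            low = mid        # mid goes through, try larger
        else:
            high = mid - 1   # mid is dropped, try smaller
    return low
```

Run against a peer on the tunnel, for example `find_path_mtu(lambda m: ping_fits("10.0.0.2", m))` with a placeholder address, the result should match the MTU configured on the interface. A smaller number points at a misconfigured layer somewhere on the path.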
Using #django and #longhorn RWX volumes: when I run collectstatic and put the static files on an RWX volume, it tries to write some 900 files in quick succession, and this kills the RWX volume.
The same happens when creating folders via migrations. I get I/O errors.
I solved that with an RWX Django storage class that allows for retries. One retry is enough and it works.
So am I doing it wrong?
PS: you could argue that static files should belong in an emptyDir. The underlying issue gives me headaches though
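The storage class itself isn't shown in the post, so the following is only a guess at the general shape of such a workaround: a small decorator that retries a call when it raises `OSError` (which is how I/O errors surface in Python). The name `retry_on_oserror` and the default delay are assumptions for illustration.

```python
import functools
import time


def retry_on_oserror(retries: int = 1, delay: float = 0.1):
    """Retry the wrapped call up to `retries` extra times when it raises OSError."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return func(*args, **kwargs)
                except OSError:
                    if attempt == retries:
                        raise  # out of retries: surface the original error
                    time.sleep(delay)  # give the volume a moment before retrying
        return wrapper
    return decorator
```

In a Django project, one option would be to subclass `FileSystemStorage` and apply the decorator to an overridden `_save(self, name, content)` that calls `super()._save(name, content)`. A retry hides the symptom rather than explaining why the volume chokes on bursts of small writes, so the underlying issue still deserves a look.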
Show scheduled on Sept. 11 has #SexPistols making explosive return to the #Longhorn Ballroom in #Dallas for first time since 1978, this time sans Johnny Rotten
I've been a little rough and irresponsible with my #baremetal #Kubernetes cluster, especially when it comes to randomly rebooting nodes. Today I fixed that.
I'm running a bunch of somewhat delicate workloads, including database clusters with CSIs like #Longhorn and #OpenEBS. Checking that everything is in working order has been a demanding task, and often something I've skipped before rebooting or upgrading nodes, occasionally with horrific results.
Last night I finally took the time and wrote a pretty thorough script that checks that everything is working and healthy, before politely cordoning off a node, draining it and applying upgrades.
I felt so confident today that I tested it by running this new safe upgrade script for all the nodes in the cluster - and it worked! All nodes are now fully upgraded and running kernel 6.12.73 on Debian 13.
This also fixes the outstanding issue caused by #Hetzner no longer supporting obtaining IP addresses through DHCP.
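The script from this post isn't published, so here is only a minimal sketch of the "check, then cordon, then drain" idea. It assumes `kubectl` is on the PATH, that Longhorn lives in the `longhorn-system` namespace, and that its Volume objects expose `status.state` and `status.robustness`. It leaves out the OpenEBS, database and upgrade steps entirely, and all function names are made up.

```python
import json
import subprocess


def kubectl_json(*args: str) -> dict:
    """Run kubectl with -o json and return the parsed output."""
    out = subprocess.run(["kubectl", *args, "-o", "json"],
                         check=True, capture_output=True, text=True).stdout
    return json.loads(out)


def not_ready_nodes(node_list: dict) -> list[str]:
    """Names of nodes whose Ready condition is not "True"."""
    bad = []
    for node in node_list["items"]:
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(c["type"] == "Ready" and c["status"] == "True" for c in conditions)
        if not ready:
            bad.append(node["metadata"]["name"])
    return bad


def unhealthy_longhorn_volumes(volume_list: dict) -> list[str]:
    """Names of attached Longhorn volumes whose robustness is not "healthy"."""
    bad = []
    for vol in volume_list["items"]:
        status = vol.get("status", {})
        # Detached volumes are skipped: they have no running replicas to judge.
        if status.get("state") == "attached" and status.get("robustness") != "healthy":
            bad.append(vol["metadata"]["name"])
    return bad


def safe_drain(node: str) -> None:
    """Cordon and drain `node` only if the cluster looks healthy first."""
    nodes = kubectl_json("get", "nodes")
    volumes = kubectl_json("-n", "longhorn-system", "get", "volumes.longhorn.io")
    problems = not_ready_nodes(nodes) + unhealthy_longhorn_volumes(volumes)
    if problems:
        raise RuntimeError(f"refusing to drain {node}, unhealthy: {problems}")
    subprocess.run(["kubectl", "cordon", node], check=True)
    subprocess.run(["kubectl", "drain", node, "--ignore-daemonsets",
                    "--delete-emptydir-data", "--timeout=10m"], check=True)
```

Refusing to start while any replica is degraded is the important part: draining a node that holds the last healthy copy of a volume is how a routine reboot turns into a restore from backup.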
I accidentally had a QLC drive (that's the really shit kind) in an NVMe RAID array hosting a Longhorn cluster. For longer than I'd care to admit, I could not understand why it would regularly shit itself. It didn't help that Longhorn 1.11.0 has a memory leak and OOMs were triggering replica rebuilds on a weekly basis.