@dustinrue, that's cool. It's probably much like this setup here in my home, well, lab. But there are a few users on the Mastodon instance, so I try to aim for production quality.

Anyways, I've now put a lot of effort into moving all the NFS pods, which turn ReadWriteOnce volumes into ReadWriteMany endpoints, onto one single node: the node I trust the most, and not the one which now hangs every day due to what I believe are memory chip issues... So I don't have three single points of failure, but one, which I hope is an improvement.

Let's see what happens in a day or so when #Curie goes down again, giving me a lot of practice and exercise in hard reboots... Until I get a new node (which should come by mail tomorrow), the two remaining nodes don't seem to have enough capacity to run everything anyhow, so when Curie goes down, all the rest fall like the North American electrical netw... sorry, like dominoes.

#RukiiNet #SelfHosting update:
Just after writing this #Curie went down again, and it didn't help that the #NFS pods were all on a different node. It all went down regardless.

Even got some data corruption again; it's always a huge manual hassle to bring everything back up. I read somewhere that #MicroK8S tends to handle hard reboots badly if certain singleton cluster pods, like coredns, calico, the NFS controller, or the hostpath provisioner, are on the node which goes down. I wonder if it's possible to just add replicas for those...

I found a new (well, old and known) bug with #OpenEBS, and a mitigation. In some cases #Jiva has replicas in a read-only state for a moment while it syncs them, and if the moon phase is right, there's an apparent race condition where the iSCSI mounts become read-only even though the underlying volume has already become read-write.

The fix is to go to the node which mounted these, run "mount | grep ro,", and ABSOLUTELY UNDER NO CIRCUMSTANCES UNMOUNT (learned the hard way). Instead, I think it's possible to just remount these read-write.
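A minimal sketch of that mitigation as a script, assuming mount(8) output in the usual "/dev/X on /path type fs (flags)" form; the filter and the remount one-liner in the comment are mine, not anything from the OpenEBS docs:

```shell
# Pick out Jiva global mounts that have fallen back to read-only.
# Reads a mount(8) listing on stdin so it can be dry-run offline.
ro_jiva_mounts() {
  awk '/jiva/ && /\(ro,/ { print $3 }'
}

# On the affected node, remount in place -- never unmount:
#   mount | ro_jiva_mounts | xargs -r -n1 sudo mount -o remount,rw
```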

There's also an irritating thing where different pods run their apps under different UIDs, and the Dynamic NFS Provisioner StorageClass needs to be configured to mount the data with the matching UID. I originally worked around this by just setting chmod 0777, but the apps insist on creating files with a different permission set, so when their volumes get remounted the permissions stay but the UID changes, and after a remount they no longer have write access to their own files.

This compounds with the fact that each container runs under its own UID, so each needs its own special StorageClass for that UID... Gods.
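For reference, a sketch of what such a per-UID StorageClass can look like with the OpenEBS Dynamic NFS Provisioner. The FilePermissions block is my understanding of the provisioner's config (available in recent versions); the name, backend class, and the UID/GID values are placeholders that have to match the app container's runAsUser:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-rwx-uid1001
  annotations:
    openebs.io/cas-type: nfsrwx
    cas.openebs.io/config: |
      - name: NFSServerType
        value: kernel
      - name: BackendStorageClass
        value: openebs-jiva-csi-default
      # FilePermissions chowns/chmods the exported share at
      # provisioning time; must match the pod's runAsUser/runAsGroup.
      - name: FilePermissions
        data:
          UID: "1001"
          GID: "1001"
          mode: "0770"
provisioner: openebs.io/nfsrwx
reclaimPolicy: Delete
```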

I got the new #IntelNUC for the fourth node in the cluster, to replace the unstable Curie node, but its memory is coming Thursday.

#RukiiNet #SelfHosting:
And it went down again, and caused a tangle while coming back up. At least I managed to reproduce the thing where the node's iSCSI mounts become read-only. Tried to remount them, unsuccessfully...

tero@betanzos:~$ mount | grep ro, | grep jiva
/dev/sda on /var/snap/microk8s/common/var/lib/kubelet/plugins/kubernetes.io/csi/jiva.csi.openebs.io/e1c8e8f0b1da20ffea013552ead7fda30ad63cbc1ba31fc788c9b4951c8ad74c/globalmount type ext4 (ro,relatime)
/dev/sdd on /var/snap/microk8s/common/var/lib/kubelet/plugins/kubernetes.io/csi/jiva.csi.openebs.io/6015ae8728c659ae798e1919f079c38b672b760a73cf1613b4b2bcd1ece14ff9/globalmount type ext4 (ro,relatime)
tero@betanzos:~$ sudo mount -o remount,rw /var/snap/microk8s/common/var/lib/kubelet/plugins/kubernetes.io/csi/jiva.csi.openebs.io/e1c8e8f0b1da20ffea013552ead7fda30ad63cbc1ba31fc788c9b4951c8ad74c/globalmount
mount: /var/snap/microk8s/common/var/lib/kubelet/plugins/kubernetes.io/csi/jiva.csi.openebs.io/e1c8e8f0b1da20ffea013552ead7fda30ad63cbc1ba31fc788c9b4951c8ad74c/globalmount: cannot remount /dev/sda read-write, is write-protected.

The only way to get around that issue safely was to restart the node. The one flaky node is now dying multiple times a day, and the replacement hardware can't really arrive soon enough.

At least fewer things explode now when a node crashes. Of course the goal would be that nothing would be corrupted or entangled in complex race conditions, but that's apparently impossible for MicroK8S...

#RukiiNet #SelfHosting update:
I lost the last update in a database clean up, damn.
Stuff is stabilizing: there are now three cluster nodes, all of which are rock solid. No more daily outages.
Everything else works, but the ElasticSearch pods need some permissions tuning, probably caused by an upstream change in the Bitnami container images.
After that, I need to dive deep into that file permissions issue, as it spans all the pods, some of which I've temporarily worked around. The volume mounts and all seem to be ok, but when a container entrypoint creates some directories, it tends to create them so that it no longer has access to them itself. Created with 0000 permissions...
Anyhow, my backups were run by the #Curie node, which was the unstable one, so I'll need to migrate them to some other node for them to work now that Curie isn't a node anymore.
I'd hope a new era of stability for rukii.net starts from here.

#RukiiNet #SelfHosting update:
Cleaned up the permissions so that all services work perfectly now. Yeeted #Flux; I have no idea why people like it. It's a pain.

I think I found the reason for the seemingly random connection errors. It was a huge mystery; I went through everything with a microscope, from #MetalLB ARP tables to routing, ingress services and gateways, and the service level... I think the root cause in the end was that I had nodes on both wired Ethernet and WiFi, so (although it shouldn't happen) MetalLB sometimes advertised an interface which, for some reason, didn't work. I don't know why; it should work with redundant interfaces as well, but apparently not.

Backups are manual until I get around to moving the automation from one host to another.

Now there should be continuous uptime to get me back to two nines over a 30-day window.

The diagram shows #RukiiNet outages from StatusCake between 2022-12-20 and today. One server started failing severely towards the end; I've replaced it, so it should be solid green from now on.
#SelfHosting
#RukiiNet #SelfHosting update:
Since deploying #MicroK8S from scratch to a cluster of 3 stable nodes, and disabling WiFi on them (the WiFi interfaces seemed to somehow degrade networking), the cluster has been boringly stable.
So I guess it's 100% uptime from now on.
#RukiiNet #SelfHosting update:
The site has been boringly up, 100%, since the last fixes. Today I got a disk-full error but mitigated it before it became an issue.
However, I decided to add 100 GB of space to the mastodon-system volume, which meant creating a new PersistentVolumeClaim, copying everything over, and flipping the app to it while moving the hundreds of gigabytes of cached data from the old PVC to the new one.
So the images are broken for a while until the cache has moved; it takes some hours.
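The copy between the claims can be done with a throwaway Job that mounts both of them; a sketch only, where the claim names, namespace defaults, and image are assumptions:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: mastodon-system-migrate
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: copy
          image: alpine:3.17
          # rsync -a preserves ownership and permission bits, which
          # matters given the UID issues described earlier in the thread.
          command: ["sh", "-c", "apk add --no-cache rsync && rsync -a /old/ /new/"]
          volumeMounts:
            - { name: old, mountPath: /old }
            - { name: new, mountPath: /new }
      volumes:
        - name: old
          persistentVolumeClaim:
            claimName: mastodon-system
        - name: new
          persistentVolumeClaim:
            claimName: mastodon-system-new
```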
#RukiiNet #SelfHosting update:
The site was down for 12 hours 50 minutes last night: there were storm-caused networking issues, and I was asleep.
Two nodes got disconnected from the cluster, and the one node left wasn't enough to keep things running, for various reasons including Jiva's requirement of two replicas for quorum.
I rebooted them in the morning; the nodes came up, but the application didn't. It required tuning the permissions of the /opt/mastodon/public/assets and system directories again, as they went to 0000 again for some weird reason.
ElasticSearch still seems degraded; I need to look into what's gone sideways with it today.
#RukiiNet #SelfHosting update:
Everything has been working fine. Will upgrade Mastodon to version v4.1.0 at some point to change that.
#RukiiNet #SelfHosting update:
Updated to v4.1.0, was boringly uneventful.
#RukiiNet #SelfHosting update:
Here's an all-time uptime graph from #StatusCake. Managed the first week with 100% uptime recently, and the weekly uptime is now at a single nine. Monthly uptime isn't quite there yet, but it's improving steadily.
#RukiiNet #SelfHosting update:
Got a long electricity #blackout today, almost an hour.
What happens is that all the nodes reboot, but faster than the router-switch between them, so the #MicroK8S #Kubernetes daemons die and don't respawn without a reboot.
Second, it takes a long time for all the pods to recover from the Unknown state.
Third, Jiva starts resyncing, which tends to fill up one of the smaller hard drives in one of the nodes. I should probably buy a larger one; 500 GB isn't quite enough. When it syncs, it uses this layered file system thing where it accrues deltas and finally does a garbage collection to squash the space, but it never gets there if the hard disk fills up in between. Then I need to manually delete the Jiva replica PVC and let it resync from scratch.
Then the Redis pods get stuck, because against all reason they corrupt their append-only files on a hard reboot. And when #Bitnami #Redis reboots with a corrupt AOF file, instead of just recovering it, it dies with an error message explaining how to recover the file manually... It's a huge hassle; it's easier to just scale the Redises down, delete the PVCs, and scale them back up.
Then #ElasticSearch is dead because of unreachable shards or whatnot. That means I need to bash into the master pod and delete all the indices with curl -XDELETE. After that, I need to bash into the Mastodon web pod and run tootctl search deploy all over again...
Gods I hate blackouts.
@tero ...or buy a #UPS; good ones have gotten much cheaper. I bought a little Powerwalker for some 500 Euros which keeps my (energy efficient) home server up for 90 minutes. Of course after the purchase we had no blackout for months, but it feels surprisingly good to have a tiny little independence from the grid.
@tero I read through your challenges and hope you can get your stuff fixed. I did some tests with containerized storage and decided against it. It only seems to work in a low-demand scenario. It gets unstable fast if it can't keep up with iops/bandwidth demand. Perhaps something more simple like an NFS provisioner (as a service on a host, not as a container) would be an option?

@cardes, I don't think containerization as such would form a performance bottleneck. I just don't think OpenEBS Jiva with its replication scheme would work in a high volume case.
Some sort of a RAID-like storage would be better. For example OpenEBS Mayastor.

In my case, I think if I just gradually build some automation for getting the cluster back up, it should be fine. For example, I can just add the Redis AOF fix command to its start-up commands. Not sure what to do with ElasticSearch except a script to delete all indices and recreate them.
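For the Redis part, one way to bake that fix into start-up is an initContainer on the Redis StatefulSet. A sketch only: the image tag, mount path, and AOF locations are assumptions (Redis 7 splits the AOF into an appendonlydir; older versions use a single appendonly.aof), so they'd need checking against the deployed Bitnami chart:

```yaml
initContainers:
  - name: fix-aof
    image: bitnami/redis:7.0
    command: ["sh", "-c"]
    args:
      - |
        # redis-check-aof --fix truncates the AOF back to the last
        # valid entry instead of the server refusing to start on it.
        for f in /data/appendonlydir/*.aof /data/appendonly.aof; do
          [ -f "$f" ] && yes | redis-check-aof --fix "$f"
        done
        true
    volumeMounts:
      - name: redis-data
        mountPath: /data
```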

@tero I have little experience with OpenEBS; we did tests based on Longhorn and OpenShift Data Foundation (basically Rook/Ceph). As you already said, replication is a big performance killer, but so are CPU scheduling and priority. You can work around it a bit by assigning storage workloads to dedicated nodes, but even that will always depend on the reliability of the Kubernetes layers.
@tero In small setups it's best to rely on systemd services providing e.g. NFS directly from a hardware node with a RAID setup. It's a lot less prone to errors. I know elasticsearch needs block storage, so you might need something else like iSCSI. From my experience elasticsearch tends to create CPU load spikes; if you limit it with resource limits, the chance of corruption increases, and this may get worse if storage runs at the same priority (container-wise).
@tero I hope my loose thoughts can be helpful, and sorry for the long posts. The thoughts came to mind while writing.
@tero elasticsearch: is that a multi-node cluster that didn't form correctly again? Because otherwise (unless you ran into a bug) this should IMO only happen if you have disk corruption.
@xeraa, the multi-node cluster, yes, but it formed correctly after rebooting. The problem was that the nodes came up before the network, and Ubuntu/Snap MicroK8S is built to give up if there's no network.
@xeraa, ah sorry, ElasticSearch, that is weird and I have no idea why that happens. No disk corruption. Except the inherent one that happens at a hard shutdown.
@tero would be interesting to see what's in the elasticsearch log.
otherwise, there's also a command-line tool to clean up data with that problem, which might come in handy: https://www.elastic.co/guide/en/elasticsearch/reference/current/shard-tool.html

@xeraa, thanks! I'll check the logs in more detail when I get another blackout and this repeats. And it is a multi-node ElasticSearch cluster as well with three replicas (I added one above the default 2 because I had one node go down and it stopped working).
@xeraa, I switched the storage of the ElasticSearch nodes to hostpath instead of OpenEBS Jiva. The Jiva replication doesn't seem to bring increased reliability, seems the opposite.
@tero yeah, generally we don't recommend any other replication, since elasticsearch replicates on its own already. that will only increase your cost (unnecessarily) and might make things less stable if they interfere with each other

@xeraa, yeah, I believe Jiva might have flushed its file writes in a weird order or something, and a hard crash leaves the persistent file in an intermediate state which is not necessarily a "partially applied append".

My journaling filesystem on the host doesn't help, as Jiva runs its own layered filesystem in its own files on top of it.

#RukiiNet #SelfHosting update:
Got an electricity blackout in the morning. This caused a 5 hour downtime.

The JivaVolumes got into a tangle because one node only has a 500 GB hard disk, and resyncing the large JivaVolume there temporarily doubles its size, which fails.

I always have to delete it manually and start resyncing from an empty state.

Good news is that nothing else broke down now. ElasticSearch, PostgreSQL, Redis, all came up nicely.

I also got a UPS, which I'll install at some point; it should prevent further long downtimes.

#RukiiNet #SelfHosting update:
Had a couple of outages totalling 2 hours due to a disk becoming full.
Mitigated the problem by switching to 2 replicas per JivaVolume, so now not all volumes are replicated across three nodes. I need to buy a larger hard disk for the node which only has 500 GB, whereas the others have 1 TB.
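For reference, with the Jiva CSI operator the replica count lives in a JivaVolumePolicy referenced from the StorageClass. A sketch, with the names as placeholders and the field layout as my best recollection of the jiva-operator API, so verify against the installed CRDs:

```yaml
apiVersion: openebs.io/v1alpha1
kind: JivaVolumePolicy
metadata:
  name: jiva-2-replicas
  namespace: openebs
spec:
  target:
    # Two replicas instead of three, so not every
    # volume lands on the 500 GB node.
    replicationFactor: 2
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-jiva-2r
provisioner: jiva.csi.openebs.io
parameters:
  cas-type: jiva
  policy: jiva-2-replicas
```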
#RukiiNet #SelfHosting update:
Had a pre-emptive, controlled 5-hour downtime to install the UPS, update the OSes, and groom the JivaVolumes a bit.