One of my projects this week is to bring up a K8S cluster on our Proxmox homelab, to perhaps eventually migrate EphemeraSearch to it.

EphemeraSearch currently runs on a 7-node K8S cluster at Hetzner.

I'm going to drop some notes in this thread, to perhaps consolidate them into a blog post or something later.

#kubernetes #homelab #selfhosted #proxmox

The whole thing is provisioned with Tofu; and one of my favorite things to do is to verify that the end-to-end provisioning works fine.

So that means a lot of "tofu destroy" + "tofu apply".

However, the TF configuration includes the Talos disk images used by the cluster, and I didn't want to re-download them every single time.

My first intention was to use "tofu taint" on the virtual machines. But they are declared in a for_each block; and you can't use "tofu taint" or "tofu plan -replace" on a for_each resource (unless you enumerate each resource individually).

However, you can do a targeted destroy:

tofu plan -destroy -target proxmox_virtual_environment_vm.k8s_nodes

And destroy will follow dependencies (if you destroy a resource, the resources that depend on it will automatically be destroyed), so in my case I could also do e.g.:

tofu plan -destroy -target talos_machine_secrets.this

(Because pretty much every Talos-related resource depends on this directly or indirectly).
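To make the dependency direction concrete, here is a minimal sketch of how the images and VMs could be laid out. It assumes the bpg/proxmox provider (which the resource name proxmox_virtual_environment_vm suggests); the variables, datastore IDs, file name, and disk settings are hypothetical placeholders, and the image URL is assumed to point at an uncompressed disk image.

```hcl
# Sketch only: assumes the bpg/proxmox provider; every variable is hypothetical.
terraform {
  required_providers {
    proxmox = { source = "bpg/proxmox" }
  }
}

variable "talos_image_url" {
  type = string # where the Talos disk image is downloaded from
}

variable "nodes" {
  type = map(object({ hypervisor = string })) # one entry per Kubernetes node
}

# One download per hypervisor. Nothing here depends on the VMs, so a destroy
# targeted at the VMs leaves the images in place.
resource "proxmox_virtual_environment_download_file" "talos_image" {
  for_each     = toset([for n in var.nodes : n.hypervisor])
  content_type = "iso"
  datastore_id = "local"
  node_name    = each.key
  url          = var.talos_image_url
  file_name    = "talos.img"
}

# The VMs depend on the image, not the other way around.
resource "proxmox_virtual_environment_vm" "k8s_nodes" {
  for_each  = var.nodes
  name      = each.key
  node_name = each.value.hypervisor

  disk {
    datastore_id = "local-zfs"
    interface    = "virtio0"
    size         = 20
    file_id      = proxmox_virtual_environment_download_file.talos_image[each.value.hypervisor].id
  }
}
```

With a layout like this, targeting proxmox_virtual_environment_vm.k8s_nodes selects every instance of the for_each resource at once, whereas a single instance would have to be addressed as proxmox_virtual_environment_vm.k8s_nodes["<key>"].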

#terraform #opentofu #talos #kubernetes #homelab #selfhosted

Also, I want the K8S cluster to support IPv6, which meant replacing Talos' default CNI (Flannel) with Cilium.

(OK, it might be possible to support IPv6 with Flannel on Talos, but the Talos docs say very little about how to customize Flannel, and I wanted Cilium for other reasons too - e.g. LoadBalancer support with L2 announcements, replacing kube-proxy...)

This means declaring "cni: none" in the Talos machine config, and then either:

1) manually installing Cilium after provisioning the cluster

2) finding a way to automatically install Cilium when the cluster is provisioned.

Of course I went for option 2, right :-)

Which leads us to a rabbit hole of multiple options:

1) wait for the cluster to be up (=K8S API is functional) and then use the Helm provider to create a helm_release resource on the cluster

Problem: there is no easy and clean way to wait for the cluster to be up.

Talos has a talos_cluster_health data source, but this one waits for all nodes to be "Ready", which isn't going to happen since the CNI hasn't been deployed yet. (There is a skip_kubernetes_checks option but it doesn't seem to help.)

Declaring something like a kubernetes_nodes resource in Tofu sort of works... until you reprovision the cluster. Then you realize that you can't even do a "tofu plan", because Tofu tries to refresh that resource's status, which requires the cluster to be up. So, this is a non-starter.

2) use Talos' "inlineManifests" feature, which instructs Talos to apply a bunch of YAML to the cluster when it's provisioned

Problem: this requires Cilium YAML manifests; and the way I install it is typically with the Helm chart.

Solution: use a helm_template data source to do the equivalent of the "helm template" command, and render the Cilium chart into ready-to-apply YAML manifests.

Next problem: the Cilium Helm chart is very sophisticated, and depends on Capabilities.KubeVersion - in other words, when we invoke the helm_template data source, we need to pass it the correct kube_version.

Next solution: that version is available in the talos_machine_configuration data sources.

And with that (and a good amount of Cilium configuration!) our cluster comes up fully functional!
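The pieces above could fit together roughly as follows. This is a sketch under assumptions, not the exact configuration used here: it assumes the hashicorp/helm and siderolabs/talos providers, all variable names are hypothetical, and the Cilium values shown are only a small illustrative subset (the Talos documentation for Cilium lists more values that are typically needed, such as the API server host/port and cgroup settings). The machine config patch is applied at the talos_machine_configuration_apply step so that the Helm rendering can read kubernetes_version from the data source without creating a dependency cycle.

```hcl
# Sketch only: assumes the hashicorp/helm and siderolabs/talos providers.
terraform {
  required_providers {
    helm  = { source = "hashicorp/helm" }
    talos = { source = "siderolabs/talos" }
  }
}

# Hypothetical inputs.
variable "cluster_name" { type = string }
variable "cluster_endpoint" { type = string } # Kubernetes API URL
variable "kubernetes_version" { type = string }
variable "cilium_version" { type = string }
variable "controlplane_ips" { type = set(string) }

resource "talos_machine_secrets" "this" {}

data "talos_machine_configuration" "controlplane" {
  cluster_name       = var.cluster_name
  cluster_endpoint   = var.cluster_endpoint
  machine_type       = "controlplane"
  machine_secrets    = talos_machine_secrets.this.machine_secrets
  kubernetes_version = var.kubernetes_version
}

# Equivalent of "helm template": renders the chart without a live cluster.
data "helm_template" "cilium" {
  name         = "cilium"
  namespace    = "kube-system"
  repository   = "https://helm.cilium.io/"
  chart        = "cilium"
  version      = var.cilium_version
  kube_version = data.talos_machine_configuration.controlplane.kubernetes_version

  # Illustrative subset of values; a real setup needs more.
  values = [yamlencode({
    ipam                 = { mode = "kubernetes" }
    kubeProxyReplacement = true
    ipv6                 = { enabled = true }
    l2announcements      = { enabled = true }
  })]
}

resource "talos_machine_configuration_apply" "controlplane" {
  for_each                    = var.controlplane_ips
  node                        = each.key
  client_configuration        = talos_machine_secrets.this.client_configuration
  machine_configuration_input = data.talos_machine_configuration.controlplane.machine_configuration

  config_patches = [yamlencode({
    cluster = {
      network = { cni = { name = "none" } } # no default Flannel
      proxy   = { disabled = true }         # Cilium replaces kube-proxy
      inlineManifests = [{
        name     = "cilium"
        contents = data.helm_template.cilium.manifest
      }]
    }
  })]
}
```

Only control plane nodes need the inlineManifests patch; worker nodes still need the CNI setting but not the rendered YAML.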

#kubernetes #talos #proxmox #cilium #opentofu

Let's continue the Proxmox + Tofu + Talos + Cilium adventure, with two little footnotes. "Devil is in the details!"

First: Talos "inlineManifests" behavior.

When you add some inlineManifests to your Talos MachineConfig and push that MachineConfig, the manifests get applied immediately. Yay!

However, when you update or remove some inlineManifests and push the MachineConfig ... Nothing happens. Talos does a full (potentially destructive!) reconcile only when executing a cluster upgrade. (This is pretty well explained in the Talos docs[1])

This means that our initial installation of Cilium will work immediately, but subsequent configuration changes won't work (the YAML won't be applied) until we run a "talosctl upgrade-k8s". (Pro-tip: make sure to specify "--to" with the current k8s version, otherwise it'll execute a "real" upgrade which implies downloading new images and restarting the whole control plane one component at a time - which takes a while.)

So, are we there yet?

Not quite!

The second issue: each time I'd do a "tofu plan", it would tell me that something had changed. Which is kind of annoying. If you don't change your Tofu configuration, variables, etc, normally, you'd expect "tofu plan" to tell you a reassuring:

No changes. Your infrastructure matches the configuration.

So, what is going on?

[1] https://docs.siderolabs.com/kubernetes-guides/advanced-guides/inlinemanifests#how-talos-handles-manifest-resources

#terraform #talos #opentofu #homelab #kubernetes #cilium

When looking in the "tofu plan" output, we'd actually see a *huge* change. That huge change is the YAML rendering of the Cilium Helm chart. And since that YAML gets included in our Talos MachineConfigs... Yeah, that was annoying, because each "tofu apply" would repush a new MachineConfig to our Talos nodes.

(That push turns out to be mostly a no-op, but still. Unclean! Boo!)

My first intuition was: "the Cilium Helm chart is probably generating some UUID, secrets, keys, whatever". And, yes, that's exactly it! Cilium generates its own internal CA, and then uses it to issue a couple of certificates.

This is not a problem when using Helm "normally" (i.e. "helm upgrade --install ...") because the Cilium Helm chart is sophisticated enough to do this conditionally, only on the initial install.

However, when rendering the chart YAML "out of the box" (as we do here with Tofu and Talos, or as one would do with e.g. Flux or Argo), the Helm renderer has no access to the Kubernetes API, and doesn't know that there is already a certificate and that it shouldn't generate a new one.

Thankfully, the solution is fairly straightforward:

- generate that certificate (e.g. with the Tofu "tls" provider, it boils down to a couple of resources of a few lines each)

- pass that certificate (and associated key) in the Cilium Helm chart values

- also set a couple of values to tell Cilium to generate the other certificates later (instead of generating them from within Helm)

...And with that, our "tofu plan" now tells us the expected message:

No changes. Your infrastructure matches the configuration.
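The three steps above might look like the sketch below. It assumes the hashicorp/tls provider; the resource names and validity period are hypothetical. The Cilium value names (tls.ca.cert, tls.ca.key, hubble.tls.auto.method) and the base64 encoding reflect my understanding of the chart and should be checked against the Helm reference for the chart version in use.

```hcl
# Sketch only: assumes the hashicorp/tls provider.
terraform {
  required_providers {
    tls = { source = "hashicorp/tls" }
  }
}

# Stable CA key, generated once and kept in the Tofu state.
resource "tls_private_key" "cilium_ca" {
  algorithm = "RSA"
  rsa_bits  = 4096
}

# Self-signed CA certificate built from that key.
resource "tls_self_signed_cert" "cilium_ca" {
  private_key_pem       = tls_private_key.cilium_ca.private_key_pem
  is_ca_certificate     = true
  validity_period_hours = 87600 # roughly 10 years, arbitrary choice
  allowed_uses          = ["cert_signing", "crl_signing"]

  subject {
    common_name = "Cilium CA"
  }
}

locals {
  # Values to merge into the Cilium chart values.
  # Assumption: the chart expects the CA cert and key base64-encoded.
  cilium_tls_values = {
    tls = {
      ca = {
        cert = base64encode(tls_self_signed_cert.cilium_ca.cert_pem)
        key  = base64encode(tls_private_key.cilium_ca.private_key_pem)
      }
    }
    # Assumption: "cronJob" makes Cilium issue the Hubble certificates
    # in-cluster instead of at Helm render time.
    hubble = {
      tls = { auto = { enabled = true, method = "cronJob" } }
    }
  }
}
```

Because the key and certificate are ordinary resources, they only change when they are replaced, so the rendered YAML stays identical from one plan to the next. The trade-off is that the CA private key now lives in the Tofu state, which should be protected accordingly.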

Next up for today: setting up the Proxmox CSI plugin [1] so that our K8S cluster has a StorageClass - actually two StorageClasses; we want our users to be able to request fast, efficient local volumes (using Proxmox local-zfs) as well as distributed, resilient ones (using Ceph)!

[1] https://github.com/sergelogvinov/proxmox-csi-plugin

... And we have the CSI provider. This makes it possible to create a PVC (PersistentVolumeClaim) in the Kubernetes cluster, and this will automatically create a volume in Proxmox and attach it to a Kubernetes node.
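For illustration, the two StorageClasses could be rendered from Tofu as plain YAML, for instance to feed them into the same inline manifest mechanism as Cilium. This is a sketch under assumptions: the class names and the "ceph" storage ID are hypothetical, and the provisioner name and parameters reflect my understanding of the Proxmox CSI plugin documentation. The plugin's Helm chart can also declare StorageClasses through its own values, which may be simpler.

```hcl
# Sketch only: class names and Proxmox storage IDs are hypothetical.
locals {
  # StorageClass name => Proxmox storage ID
  storage_classes = {
    "proxmox-local" = "local-zfs"
    "proxmox-ceph"  = "ceph"
  }

  # One YAML document per StorageClass, joined into a multi-document string.
  storage_class_manifests = join("---\n", [
    for name, storage in local.storage_classes : yamlencode({
      apiVersion  = "storage.k8s.io/v1"
      kind        = "StorageClass"
      metadata    = { name = name }
      provisioner = "csi.proxmox.sinextra.dev" # assumed driver name
      parameters = {
        storage                     = storage
        "csi.storage.k8s.io/fstype" = "ext4"
      }
      reclaimPolicy        = "Delete"
      allowVolumeExpansion = true
      # Wait until a pod is scheduled, so that the volume is created
      # where the pod will actually run.
      volumeBindingMode = "WaitForFirstConsumer"
    })
  ])
}

output "storage_class_manifests" {
  value = local.storage_class_manifests
}
```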

That part was both the easiest and the hardest.

The easiest because there wasn't much to do (install a Helm chart, so we repeat the templating technique used earlier with Cilium; and create a Proxmox user and associated token) but also the hardest because there are many little variations possible here.

Example: the Proxmox CSI plugin [1] needs to have the well-known labels topology.kubernetes.io/region and zone. "Region" here means "Proxmox cluster" - which allows us to have a Kubernetes cluster spanning multiple Proxmox clusters; and "Zone" means "Proxmox hypervisor". This is used by the CSI plugin to know where volumes should be created.

But there are many ways to set these labels!

1) through Talos MachineConfiguration [2]

2) by installing the Proxmox CCM [3]

3) by installing something like topomatik [4]

For now, I went with the first option, because I'm already generating MachineConfigurations in the TF configuration, so adding a few lines there was trivial.

But in the long run, I might settle for topomatik, as I believe it would behave correctly if I end up migrating worker nodes from one hypervisor to another. (I don't have any plans to do that at the moment, though!)
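As an illustration of the first option, a per-node config patch could be generated like this. It is a minimal sketch: the variable names are hypothetical, and it relies on the machine.nodeLabels field described in the Talos node labels guide [2].

```hcl
# Sketch only: variable names are hypothetical.
variable "proxmox_cluster_name" {
  type = string # used as the "region"
}

variable "nodes" {
  type = map(object({ hypervisor = string })) # hypervisor is used as the "zone"
}

locals {
  # One YAML patch per node, to be added to that node's Talos config_patches.
  topology_patches = {
    for name, node in var.nodes : name => yamlencode({
      machine = {
        nodeLabels = {
          "topology.kubernetes.io/region" = var.proxmox_cluster_name
          "topology.kubernetes.io/zone"   = node.hypervisor
        }
      }
    })
  }
}
```

The labels are static: they are derived from where Tofu placed the VM, so they would become stale if a node were later moved to another hypervisor outside of Tofu, which is the case where something like topomatik would help.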

[1] https://github.com/sergelogvinov/proxmox-csi-plugin

[2] https://docs.siderolabs.com/kubernetes-guides/advanced-guides/node-labels

[3] https://github.com/sergelogvinov/proxmox-cloud-controller-manager

[4] https://github.com/enix/topomatik

@jpetazzo I'm doing something similar in a different virtualization platform, although also based on the same building blocks (Ceph, Qemu). You may want to check if the CSI driver keeps working on a kubernetes node if you migrate that node to a different physical host. I quickly discovered that it was breaking and thus I discarded everything that came with migration, live or not.
@godzilla I'll test that this weekend! If memory serves me well (from back when I had set up a similar stack with scripts instead of terrafu), live migration worked, but provisioning new volumes on a node after its migration did not. We'll see.

@godzilla going back to this - when using ceph-based volumes, it's still possible to migrate Kubernetes nodes to different hypervisors. However, when using local volumes, Proxmox prevents VM migration because the volumes are tied to a placeholder VM id. The Proxmox CSI plugin offers a manual tool (pvecsictl) to facilitate the migration [1].

I think I'll do exactly like you did, and discard migration of Kubernetes nodes. That could be annoying on a large setup if the infrastructure operators want to e.g. consolidate / reorganize / migrate hypervisors; but in my case, it's totally fine, and Kubernetes already handles moving / restarting workloads around; and stateful workloads are going to either leverage replication and local volumes (e.g.: CNPG) or ceph volumes :)

[1] https://github.com/sergelogvinov/proxmox-csi-plugin/blob/main/docs/pvecsictl.md

@jpetazzo wow! That CSI driver seems much more expressive than the one I'm using (lxd-csi). In mine, a node label/annotation reflects the physical host name, and that seems to prevent the CSI driver from working on a different host should the Kubernetes node migrate, which *has* to be something like an oversight on the part of the driver. However, since PV attachment works fine without migration, I should be good to go even without it. I didn't think about using local volumes. I was thinking of using just Ceph-based PVs for persistent data and EmptyDir for writable temporary storage. Indeed, local volumes bring a whole new bunch of failure options with them.
@godzilla something that may or may not be relevant to your interests : topomatik - that's a thing you can run on k8s nodes to automatically update labels (e.g. the topology labels) using any combination of LLDP, smbios data, dmi stuff iirc... (Sorry for brevity, about to go offline)
@jpetazzo ouch! Maybe I'm starting to see some shortcomings of the stuff I was building. The performance of memory-mapped files on those Ceph filesystem-type volumes mounted with virtiofs is abysmal. I'll need something different for MySQL databases.
@godzilla different story, but: for PostgreSQL databases, I *love* to use the CNPG operator on K8S, because it makes it ridiculously easy to have streaming replication, backups... And once you have replication, it's safe enough to use local storage; and local storage performance is amazing. There are also operators for MySQL/MariaDB. Just a thought!
@jpetazzo but Ceph on its own would work flawlessly. If for testing I start a privileged pod (container) with a volumeDevices block-type volume from a lxd-csi (Ceph) PVC I get about 300 MB/s random read performance after formatting and mounting the device inside the container, versus 3 MB/s on a volumeMounts filesystem-type lxd-csi volume. There is an awful waste somewhere, or I'm obviously using it in reverse. It's possibly related to virtiofs which handles the filesystem-mode volumes. I wish there was a magical annotation to force the CSI driver to use a block device inside the VM and mount it in the container instead of relying on such virtualization.

@godzilla oh wow, indeed :(

Out of curiosity how is that declared? Is that through lxd-csi "ceph" or "cephfs" driver?

@jpetazzo it's ceph, the volume per se is a block device. Unfortunately it's mounted on the virtualization host and passed to the VM as a virtiofs. I really can't find an explanation for that: even knowing that the LXD platform supports lightweight container-type instances besides virtualized ones, nothing would prevent passing the Ceph RBD device to both containers and VMs. Possibly they are just conflating ReadWriteMany and ReadWriteOnce access modes.

@godzilla

That virtiofs thing is really trippy.

I'm not an expert with CSI, but my understanding is that in all cases (block-based and fs-based), the filesystem gets mounted on the node, and then there is a bind-mount with the containers.

I would expect virtiofs to be used only when crossing a VM boundary (e.g. sharing from hypervisor to VM, or from VM to nested VM).

I'm going to ask a silly question, but - could that be caused by a fancy container runtime that either uses some virtualization, or disables bind-mounts for some reason?

Oh, or maybe some userid remapping? (I think stuff like shiftfs is normally used to do this efficiently, but who knows...)

@jpetazzo the stack is built like this:

1. (bottom layer) physical hosts with Microcloud. This provides something Proxmox-like, with embedded Ceph. This way we are able to recycle several machines with many non-RAID disks (EOL hyperconvergence nodes).

2. Kubernetes nodes: Microcloud LXD VMs (Ubuntu with k3s, boot disk is a Ceph volume) having a special virtual socket that allows calling LXD APIs, and a special flag that allows them to handle disk volumes.

2.5. lxd-csi driver, as a helm manifest that mostly installs some DaemonSet calling LXD APIs and the CSI instrumentation (controllers, etc).

When I create a PersistentVolume, the CSI driver actually creates an LXD (ceph) volume and dynamically adds that to the VM (which is the kubernetes node where my pod mounting the PV will run). *This* happens via virtiofs, unexpectedly. Then the VM makes some bind mount in the container and the PV is good to go, at 3 MB/s random read performance.