Cursed homelab mini-update:

The 80GB server, or how I upgraded both our laptops.

One of the outcomes of the Qotomihilation is that I ended up with two perfectly nice SODIMMs of 32GB DDR4 RAM from Crucial. So I did what any enterprising techie in the rampocalypse would do: bought converters so I could plug them into a desktop motherboard.

Reviews told me that AMD seems to be the most tolerant of these sorts of shenanigans, and it turns out that my current gaming machine is AMD (2nd gen Ryzen), DDR4, and currently has 3x 8GB sticks in its 4 slots.

Obviously the solution is: 2x 8GB DIMMs + 2x 32GB SODIMMs for a total of 80GB.

However it won't even boot with a converted RAM stick installed.

So where to deploy them? We have two laptops, both DDR4 era, so they're the obvious second choice:

My laptop: a Thinkpad T480, 8th gen Intel, claims it doesn't support more than 32GB of RAM. Boots fine with 1x 32GB stick in it. Still boots fine with an 8GB stick in the second slot. Memtest86+ passes, so upgrade complete! (I then found reports of people cramming a full 64GB into these laptops without issue.)

My partner's laptop is an Ideapad L3, 11th gen Intel, but I can't find a maximum RAM spec. So I YOLO'd the other 32GB stick into it (36GB total with the 4GB soldered to the board), and it runs fine and would have passed a Memtest86+ run if the battery hadn't run out at about 75%, so upgrade complete!

This left me with an 8GB Crucial stick and a 4GB SK Hynix stick to play with, so I threw them back into the Ryzen system, and it still won't boot.

So project complete:
80GB server upgrade: ❌
40GB laptop upgrade: ✅
36GB laptop upgrade: ✅
2x spare SODIMMs: ✅
4x useless converters: ✅

All in a day's work over here in the cursed homelab!

(I'm unlikely to buy anything else that's DDR4 era, so if these RAM sticks or converters are free to a good home, or if you know how to convince motherboards to work with the converters, I'm all ears)

#cursedhomelab #homelab #rampocalypse

Cursed Homelab mini-update:

Unbreaking Kubernetes.

Once again I've realised that my Kubernetes cluster was broken, and I've unbroken it.

This time the symptom was "exec" style health checks not working.

It turns out that cgroups v1 and cgroups v2 are quite different, need to be configured in not one, not two, but three different places, and that it's possible to hack this configuration so that it mostly works without setting all three to the correct values for version 2.

How do you do this?
1. Kubernetes: cgroup driver => systemd
2. Containerd: systemd enabled => false
3. Systemd: unified cgroup hierarchy => off

And this almost entirely works, but it's weirdly glitchy and you will end up running into problems.

When I set all of this up a couple of years ago, I stumbled on this configuration; it worked (i.e. it started containers) and I left it at that.

Roll forward a couple of years, and my fancy new super-powered server keeps spinning up the fans because containerd-shim is using a lot of CPU for no apparent reason. So I started looking at logs and, yeah, lots of failures to find the cgroups needed to run "exec" style health checks.

So clearly my cgroup configuration is wrong.

But how? cgroupv2 is mounted, but systemd is complaining about the unified hierarchy being off. Kubernetes is using the systemd driver for cgroups, so digging deeper into the docs, there's a setting for containerd that needs to be set too, and it's off.
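
If you want to check which mode a node is actually in, a couple of quick checks (assuming the default config paths):

  stat -fc %T /sys/fs/cgroup    # "cgroup2fs" means unified/v2, "tmpfs" means v1 or hybrid
  cat /proc/cmdline             # any systemd.unified_cgroup_hierarchy=... lurking?
  grep SystemdCgroup /etc/containerd/config.toml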

Which means some deep surgery. I used @geerlingguy's excellent Ansible roles to set all of this up, so maybe there's a clue there? And there is: a set of tasks to set this specific configuration option, which I'm guessing weren't there when I downloaded the role a couple of years ago.

So the path forward is obvious: fix the setting, remove the systemd hack, and Just Restart everything?
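
Concretely, the fixed state I was aiming for looks roughly like this. Paths and section names are the usual defaults (kubelet config file, containerd 1.x config layout, reasonably modern systemd), so treat it as a sketch rather than gospel:

  # kubelet (e.g. /var/lib/kubelet/config.yaml)
  cgroupDriver: systemd

  # containerd (/etc/containerd/config.toml, containerd 1.x layout)
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true

  # kernel command line: make sure nothing is forcing the old hierarchy,
  # i.e. no systemd.unified_cgroup_hierarchy=0 (recent systemd defaults to unified)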

The Kubernetes documentation has dire warnings about doing this migration, stuff about pods getting broken and the like. So it's a quasi-upgrade then: one node at a time, drain it, reboot it and go.

But I'm also running a Cephadm cluster on these nodes and this is also using runc as provided by containerd via Podman, so that needs to get restarted too.

So the procedure is (rough commands sketched after the list):
1. Fix the settings and remove the hack
2. Ceph host to maintenance mode
3. Drain (and cordon) the node in Kubernetes
4. Reboot
5. Uncordon the node
6. Exit maintenance mode
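
Roughly the commands for steps 2-6, assuming cephadm and a node called node3 (the name is a placeholder, and the exact drain flags depend on your kubectl version):

  ceph orch host maintenance enter node3
  kubectl drain node3 --ignore-daemonsets --delete-emptydir-data
  # reboot node3 and wait for it to come back
  kubectl uncordon node3
  ceph orch host maintenance exit node3

(kubectl drain cordons the node as part of the drain, hence the "(and cordon)" above.)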

And it worked with one _minor_ issue:

I have three nodes, with 108 CPUs and 92GB of RAM between them, 89% and 70% of which respectively live in the new server.

So if I do server 3 first, all my pods will end up on server 3 after 1 and 2 are drained. If I do server 3 last, it'll have no pods. And slightly worse is that for a little while I'll be running my entire workload on my two slower servers. Ugh.

I ended up doing server 3 last, which means that its load increased when I fixed 1 and 2, and then the cluster nearly ground to a halt with everything on 1 and 2 while I fixed 3. Some judicious restarting of pods moved most of the workload back to 3.
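
(The "judicious restarting" was mostly just bouncing Deployments so the scheduler could spread them out again; something like the below, names being placeholders.)

  kubectl rollout restart deployment some-deployment -n some-namespace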

And now it's all working pretty well, with no weird error messages in the logs.

#homelab #cursedhomelab #kubernetes #systemd

Y'know, it's very difficult to get into a server's management interface when it gets its IP from DHCP and the DHCP server runs on the same server.

#homelab #cursedhomelab

Cursed Homelab Server Upgrade - Part 8

Living with it.

It's loud.

It's unacceptably loud.

Even after sorting out the 100% fans issue, it's still too loud with the fans at just over 30%.

Last night I slept barely ok and woke up in the middle of the night. I don't know for certain if this was due to the server, but I'd be surprised if it wasn't.

I pointed a fan at it this morning (there's no AC in the closet) and the fans dropped under 30%.

I did some stuff (compressed some data down to ~16GB) and now it's back up to just under 35%.

The issue now is twofold:
1. the intake temperature is up to 32°C (caution temp 42°C 😬)
2. the iLO controller is at 75°C (caution temp 110°C)

I don't know why it's not going much below 30% and I'm going to have to do a bunch more research to figure this out.

But the intake temperature? I know that one: the laundry, downstairs bathroom and the closet the servers are in share an "exhaust" system (in that there isn't one) and the dryer is currently on, so I think with the dryer off, the fan running, and some circulation time the intake temp should come down.

Sigh. This fucking house.

Ok, so next steps with this homelab:
1. Some form of airflow mechanism for that closet that isn't a pedestal fan on a step-stool
2. Get my forge working (installed, but pushes stall and time out)
3. Home automation

#homelab #cursedhomelab

Cursed Homelab Server Upgrade - Part 7

The bootination!

So the server is on the shelf, and I'm rushing to put server #2 on top of it, hook up all the cables, put the QNAP box on top of that, and hook it all up before Linux finishes booting and Ceph starts expecting the drives to be there.

Except it hadn't booted.

Turns out that UEFI was still expecting to boot off a SATA drive and wasn't finding one, so we had a problem: UEFI doesn't like my KVM's keyboard thing and wouldn't respond to it, and the other USB socket on the back was hooked up to the PiKVM, which is functionally offline right now (though I could have gotten in if I needed to). So I grabbed a spare USB keyboard, plugged it into the front, and started debugging.

The obvious solution here is to boot off a thumb drive, chroot into Linux, and run grub-install and efibootmgr to update UEFI's boot list.
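
For the record, the thumb-drive version goes roughly like this; device names and the Debian-ish layout are placeholders, not necessarily what's on this box:

  mount /dev/nvme0n1p2 /mnt                # root filesystem
  mount /dev/nvme0n1p1 /mnt/boot/efi       # EFI system partition
  for d in dev proc sys; do mount --bind /$d /mnt/$d; done
  chroot /mnt
  grub-install --target=x86_64-efi --efi-directory=/boot/efi
  efibootmgr -v                            # check the boot entries it created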

But this is a server, so the firmware folks have obviously been here before you: you can browse any GPT-partitioned drive's EFI partition from the boot menu and just boot whatever you want.

And of course I ran into the usual issue of grub-install thinking the mirrored EFI partitions are an MBR disk, so there were still some efibootmgr shenanigans, but it's now booting fine.

Next step: networking. Ripped everything out of the three bridges, brought all the interfaces up, then noted which ones went down when I unplugged their cables. Hooked them up, rebooted for good measure, and we're booted.
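
The "bring everything up and watch" bit is basically this (eno1 being a placeholder interface name):

  ip -br link                   # list interfaces and their state
  ip link set eno1 up           # repeat for each NIC
  watch -n1 ip -br link         # pull cables one at a time and see which one drops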

Then to undo all the "64GB of server in 24GB" stuff.

And then the fans slowly spun up to 100% while the server sat there doing nearly nothing (load of 2-3).

I'd been worried previously about iLO getting paranoid about stuff and running the fans too fast, but this was different: this was 100%, not 30% when 20% will do. I scoured the internet, poked around in all sorts of help articles, found people who'd hacked iLO 4 (this server has always run 5) to enable manual fan control, and found nothing.

HPE's article on this said that it'll spin the fans up if a sensor is within 10 degrees of its caution temperature, and the only one that close was "76-AHCI HD Max", which was claimed to be at the front of the case, in the drive cages. Er. So it's in the drive backplanes!?!?!?

Ok, fair enough, let's check that, and iLO claims that sensor isn't even plugged in (I'll get back to this *), but it's more than happy to tell me about the external drives on the non-HPE SATA adapter (and the NVMe drives too).

Weird. Ok, can I just change the cooling scheme? How about "Enhanced CPU Cooling"? iLO resets and the fans spin down, then slowly go back up to 100%.

But while that happens, the AHCI HD Max sensor goes away, and I came across an article saying that if AMSd isn't running, iLO runs the fans at full speed. (I also came across an article claiming that the conservative limits on this sensor were a conspiracy by HPE to make people buy their drives.)

So what happens if I just switch AMSd off for a moment?

And the fans spun down to ~30% and stayed there.

My best guess at this point is that AMSd is reading the sensors in the chips in the NVMe drives (76 degrees and happy), averaging them with the external drives (30-40 degrees), and presenting the resulting temperature (54 degrees) to iLO, which then freaks out because it's near the 60 degree caution level.

And that's where I am now, it's all working well (*) and I'm reasonably happy. (Also internet is working flawlessly now)

#homelab #cursedhomelab

Cursed Homelab Server Upgrade - Part 6

The adapterening!

So as discussed in a previous nanoupdate, the adapters I bought have a bunch of issues, but ultimately only two matter for me right now:
1. They're not physically compatible with 2.5" drive caddies and backplanes (though they are tantalisingly close)
2. The presence lines are floating so the backplane doesn't light up the port

Let's fix that!

So the electrical bit is easy: just solder a bridge on the U.2 (SFF-8639) connector between one presence pin and the convenient ground pin next to it.

And for extra added fun, solder in a completely inappropriate wire to join the M.2-connector end of the blue NVMe activity LED to the appropriate pin on the SFF connector, giving the backplane (and therefore the caddy) an activity signal.

Physical bits were much more complicated.

I needed to hold a PCB (upside down) at an exact point in space so a connector would line up with the correct spot on the backplane.

To the CADMobile!

I ended up modelling the outline of the connector and the shape of the board, then cutting the board shape out of some 2.5"-drive-sized rails. To hold these together I designed a web structure that clipped into the rails and turned the assembly into a 2.5" drive with a big hole in it to hold the PCB.

This was printed in 3 parts: each rail, and the web. The rails had pockets cut out for the web to snap into. I printed the web the exact same size as the pockets, so the parts fit together with just friction, and the layer lines essentially made the connection permanent.

Of course that wasn't enough. The people who designed the PCB gave it cutouts for the screws, but they were about 1mm off, so I had to cut some small chunks out of the PCB to clear the screws I'd be screwing into the rails.

It was an adventure, but it all came together in the end and is working flawlessly. (nearly, will post a followup)

This is when I discovered that my NVMe caddies were from two very different batches.

The obvious difference was the eject buttons: one set lit up white, the others lit up black.

What was worse was that at least two of the ones that lit up black were "glitchy" in some form and would light up the ring of activity LEDs without a drive inside (and lit it up in a different colour which was also weird.)

I found the two that lit up nicely and put my drives into them, plugged them into the server, took servers #2 and #3 offline, swapped the SATA card over ...

I discovered why there's a weird black plastic block attached to the riser cages: it's so you can stand them upright while installing cards in them. Clever!

I then plugged the drives into their adapters and the adapters into the server and...

...it worked. Next post for details.

#cursedhomelab #homelab

Cursed homelab nanoupdate:

HPE DL380 Gen10s seem to be utterly _paranoid_ about non-HP PCIe devices:

On bed, NVMe SSDs installed, doing nothing: nearly silent.
On shelf, NVMe SSDs installed + PCIe AHCI adapter: ramped up to 100% and nearly deafening

Turns out that if you tell it to use "enhanced CPU cooling", they spin down to ~30% and are relatively quiet. (It's still the loudest of the 3, but one has Noctua fans so it's not really relevant competition.)

#cursedhomelab #homelab #hpe

Cursed homelab nanoupdate:

It turns out that the crappy cheap NVMe adapters work perfectly if you short the presence pin to ground.

They even light up the activity LED in the caddy if you connect the NVMe activity LED to the activity pin.

#cursedhomelab #homelab

Let's take a look at one of those adapters:

Step 1: what the heck is going on with the power supply rails?

TL;DR: 12V from the U.2 connector goes into a step-down converter, which converts it to 5V, which is fed into a second step-down converter to convert it to 3.3V. The 5V input on the U.2 connector is connected to the common 5V rail via what appears to be a diode.

I'm guessing this is done this way so it can be powered by either 12V or 5V and potentially draw current from both. Looking at a similar adapter from a different brand, they seem to be doing only one stage of power conversion, so maybe this is extra fanciness or paranoia or something. It's probably not a big deal.

All grounds on the SATA side are connected together, so that seems legit.

Step 2: what about all the other signals in the power part of the connector?

On your standard SATA power connector, the only fun pin was pin 11, which could be used to delay spinup of a drive. By the time this evolves into U.2, we have sleep/wake, activity, two presence-ish lines, and a bunch of signals on the "key". The sleep/wake lines go through to the M-key connector, and all but one line on the key go somewhere, but both presence detection lines are floating.

I'm assuming at this point that the PCIe lines are hooked up adequately: all the PCIe-related signals on this side of the connector go somewhere on the M key socket.

So maybe it's the presence lines?

Next step: find some documentation on what they're supposed to be/do.

(Obviously what I should be doing right now is dropping $100 on two slightly higher quality adapters from a local retailer, but I'm stubborn.)

#homelab #cursedhomelab

Cursed Homelab Server Upgrade - Part 4

Whoops I got these out of order.

Let's talk HDD caddies or trays.

HPE's Gen10 servers have caddies which are the exact intersection of "price optimised" and "crazy enterprise nonsense".

My experience is with SilverStone and QNAP, both of which use a literal metal tray you screw the drive into.

HPE is a bit different, but probably identical to literally everyone else in this space: Two rails on the sides, latch mechanism on the front and they're flimsy as fuck without a drive in them.

Except unlike everyone else I know of that just has two long light pipes to pipe light from the backplane to the front, HPE has a whole fricking flex circuit with firmware. (1)

So what this means is that if you want to use an empty caddy as a slot filler, you can't because they're too flimsy. (2)

The problem is that the connector end of the caddies needs to be spread to the correct width, so I modelled and 3D printed a block that is the shape of the last 20% of a drive with a cutout so it doesn't interfere with the backplane connector.

Note 1:

There are actually 5 different types of 2.5" caddies:
1. Nearly passive ones with two LEDs only
2. SATA/SAS caddies with firmware and LEDs: these have a ring of LEDs for drive activity and LEDs for identification, do-not-eject and drive status.
3. NVMe caddies: same as the smart SATA/SAS ones, but have an additional light-up power button
4. "uFF/SCM" caddies which take two smaller caddies containing M.2 SATA drives.

#2 and #3 have the flex circuit with firmware, #4 has a PCB that I believe is either using the 2 SAS lanes for the second drive or some SAS -> SATA expansion magic

#2 and #3 are very common, I haven't seen any of #1, and #4 is apparently so cursed that I have only seen 1 for sale online.

I bought 10x of the SATA/SAS ones and 6x of the NVMe ones.

Note 2:

Also the slots they go into are barely a suggestion of a slot. The caddies have a bunch of bowed pieces of metal and plastic so they sit in the right spot, but the rails they slide into are just lines stamped into the top and bottom of the tray. The caddies pop out of those rails with little force, and the big grounding springs on the side are more than strong enough to do this.

Also, for some reason the latch is at the top, but the electronics connect to a flex-pin-thing at the bottom, so the bottom of the caddy wants to pop out.

And the enclosure stuff is nuts: if the SATA ports are in AHCI mode, you get nothing (as far as I can tell); if software RAID is enabled (3), it's clearly doing enclosure stuff and gets confused by empty caddies (like, seriously, it re-scans the backplane multiple times); NVMe is definitely doing enclosure stuff, but I don't know how/why/what, and having caddies in empty slots does confuse it slightly.

Note 3:

Yes, I said software RAID as an alternative to AHCI. The S100 "RAID" card on the main board is essentially software RAID with hardware assistance and is explicitly not supported on Linux. My best guess is that the Windows(?) driver does most of the heavy lifting and the card itself mostly just handles configuration and enclosure management.

#homelab #cursedhomelab