Fun little thing I have been working on: teach systemd to boot directly into a disk image downloaded via HTTP within the initrd.

In v257 systemd learnt the ability to download disk images at boot via systemd-import-generator, both DDIs and tarballs, and place them in /var/lib/machines/, /var/lib/portables/, /var/lib/confexts/, /var/lib/extensions/. The goal was to provide a way to provision any of these resources automatically at boot. But now that we have this, we can take it a step further:

download the root disk image itself with this. There were a bunch of missing bits to make this nice though:

First of all, for raw disk images we need to attach them to a loopback block device, to make them mountable. Easy-peasy, systemd-dissect --attach already delivers that.

Then, for tar disk images we need to bind mount the downloaded and unpacked image to /sysroot/ (which is where the rootfs goes before we transition into it).

Then, to make this nicer, it makes sense to allow deriving the rootfs download URL directly from the UEFI HTTP boot URL. Or in other words: if you point your UEFI to boot a UKI from some URL (e.g. http://example.com/somedir/myimage.efi), then that UKI's initrd is smart enough to derive from that same URL a different URL for the rootfs (by replacing the final component, so that it becomes http://example.com/somedir/myimage.raw.xz).
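
To illustrate the "replace the final component" idea, here is a minimal Python sketch. It is not the code in systemd; the function name `derive_rootfs_url` is made up for this example, and it simply assumes the image name is given relative to the directory the UKI was loaded from.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def derive_rootfs_url(boot_url: str, image_name: str) -> str:
    """Swap the last path component of boot_url for image_name.

    Hypothetical helper for illustration only, not systemd's implementation.
    """
    parts = urlsplit(boot_url)
    # Keep scheme, host and port; only the path's final component changes.
    directory = posixpath.dirname(parts.path)
    new_path = posixpath.join(directory, image_name)
    return urlunsplit((parts.scheme, parts.netloc, new_path, "", ""))


print(derive_rootfs_url("http://example.com/somedir/myimage.efi", "myimage.raw.xz"))
```

The same logic would turn a boot URL like http://192.168.47.11:8081/image.efi plus the name image.raw into http://192.168.47.11:8081/image.raw.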

Net result of this: I can now point my UEFI to a single URL where it will load the UKI from. A few seconds later the initrd will pick up the rootfs from the same source, and boot it up. Magic!

Why all this though?

It's mostly to tighten my test loop a bit, for physical devices. So here's what this entails:

1. You build your image with mkosi on your development machine, and ask it to serve your image via HTTP. In other words: `mkosi -f serve`.

2. You boot into the target machine once, and register an EFI variable that enables HTTP boot from your development machine. Simply do `kernel-bootcfg --add-uri=http://192.168.47.11:8081/image.efi --title=testloop --boot-order=0`, using @kraxel's wonderful tool.

3. You simply reboot that target machine. It will now fetch the UKI kernel, which then fetches the root disk image. And every time you reboot this happens again. The target machine's local disks are unaffected.

4. …

5. Profit!!

Sounds simple? That's because it is.

(Well of course, you wonder where the magic sauce is. It's here: you need to build your UKIs a certain way: i.e. add to the kernel cmdline: `rd.systemd.pull=verify=no,machine,blockdev,bootorigin,raw:root:image.raw rootflags=x-systemd.device-timeout=infinity ip=any`)
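
For anyone wondering how to read that `rd.systemd.pull=` value: as far as I understand it, it is a colon-separated triplet of option list, local image name, and source (here a name relative to the boot origin). The following Python sketch splits it up under that assumption. `parse_pull` is a hypothetical helper, not systemd's parser, and it attaches no meaning to the individual options.

```python
def parse_pull(value: str):
    """Split an rd.systemd.pull= value into (options, local_name, source).

    Assumes the layout OPTIONS:LOCALNAME:SOURCE, with OPTIONS being a
    comma-separated list of flags and key=value pairs. Illustration only.
    """
    # Split at most twice, so a full URL in the last field keeps its colons.
    options_str, local_name, source = value.split(":", 2)
    options = {}
    for item in options_str.split(","):
        key, sep, val = item.partition("=")
        # Bare flags become True, key=value pairs keep their string value.
        options[key] = val if sep else True
    return options, local_name, source


opts, name, source = parse_pull(
    "verify=no,machine,blockdev,bootorigin,raw:root:image.raw"
)
print(opts)
print(name, source)
```

Read this way, the example asks for a raw image called "root", fetched as image.raw relative to wherever the UKI came from, without download verification.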

So, two take-aways here:

1. Really nice test loop now for testing immutable, modern OSes on physical devices, with onboard tooling

2. Yeah, you can frickin' boot into a damn tarball now, with just a UKI.

WIP PR for all of this is here:

https://github.com/systemd/systemd/pull/36314

oh, and one more comment: this will only work on systems that are relatively high on the systemd adoption scale: you definitely need a systemd-based initrd for this. For deriving the rootfs URL from the UEFI network boot URL you need a systemd-stub based UKI.

and even one more comment:

next steps: instead of downloading root fs via http, access it via nvme-over-tcp.

Benefit: better performance (no ahead of time download, but download as needed), and even better: persistency!

@pid_eins How about WebDAV? 
@pid_eins a lot of people still default to iSCSI because it's been around a long time. But NVMe/TCP is what people should really be defaulting to these days for this kinda solution.
@pid_eins It all sounded very good until the last moment. The whole point of downloading the whole thing is to let the thing be stored compressed or shared in unlimited ways. Once you start downloading block-by-block, you're throwing it all out the window. Might as well just back the root with that image on a translucent (CoW) filesystem or something.
@pid_eins any thoughts on preventing tampering yet? Or restricting an image to a specific machine?

@pid_eins what would be needed for a verification (verify=yes)?

Overall this sounds really cool and a somewhat interesting replacement scenario for PXE in some cases 🤔

@dvzrv right now verify=yes means gpg (specifically: SHA256SUMS signed with some key whose public key is baked into the initrd). We really want to get away from gpg though, hence I hope to add pkcs7 or so eventually, and maybe other stuff.
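
A rough Python sketch of the checksum half of that scheme, for illustration: look up the downloaded file in a SHA256SUMS manifest and compare digests. The signature check on the manifest itself (the gpg part) is deliberately left out, the helper names are made up, and the manifest layout assumed is the usual `sha256sum` output format.

```python
import hashlib


def parse_sha256sums(text: str) -> dict:
    """Map file names to hex digests from sha256sum-style manifest text."""
    sums = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, name = line.split(None, 1)
        name = name.strip()
        # sha256sum marks binary mode with a leading '*' before the name.
        if name.startswith("*"):
            name = name[1:]
        sums[name] = digest.lower()
    return sums


def matches_manifest(name: str, data: bytes, manifest: str) -> bool:
    """True if data hashes to the digest listed for name in the manifest."""
    expected = parse_sha256sums(manifest).get(name)
    return expected is not None and hashlib.sha256(data).hexdigest() == expected


# Tiny self-contained demo with made-up content standing in for an image.
image = b"pretend this is a disk image"
manifest = f"{hashlib.sha256(image).hexdigest()}  image.raw\n"
print(matches_manifest("image.raw", image, manifest))
print(matches_manifest("image.raw", image + b"tampered", manifest))
```

Note that this only proves the download matches the manifest; without verifying the manifest's signature it says nothing about who produced it.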
@dvzrv and of course: just use DDIs, i.e. signed verity enabled disk images. way better security, and you simply don't have to bother about download-time verification, because you have something much better: continuous use-time verification.
@pid_eins looking forward to opening that up with VOA then 😅
@pid_eins @kraxel Huh maybe useful for @liminix as well ?
@pid_eins For one, no more need for USB media!

@highvoltage well, you probably need it once to create that HTTP boot URL BootXXX efi variable so that the target system just goes to your development device asking for the UKI.

(you could of course also use DHCP/pxe stuff instead, but uh, that's pain, you'd have to use a separate network for that, and run your own DHCP server, much more painful)

@highvoltage that said, some fancy bioses allow you to enter the URL also interactively in firmware setup. I think tianocore does, but never tried it that way.
@pid_eins Modern Dell, Lenovo and HP firmware too.

@pid_eins @highvoltage It's true that DHCP/PXE has always been a pain, which is why I built a provisioner in #mgmtconfig to make it trivially easy: https://purpleidea.com/blog/2024/03/27/a-new-provisioning-tool/

Of course it could be modified to also host this file for this systemd style provisioning too.

You can now boot the latest Debian daily ISOs via UEFI HTTP boot if your firmware supports that and you don't want to deal with silly USB sticks. The needed pmem modules were missing from the installer initrd but not anymore. @highvoltage @pid_eins

@pid_eins Slowly but surely, systemd is turning into a container engine and I'm here for it!

Out of curiosity, did you ever take a look at boot2container (https://gitlab.freedesktop.org/gfx-ci/boot2container)? It is my podman- and u-root based initrd that boots any container(s) without any installation, based on the kernel cmdline.

That's IMO the next level of flexibility, but I must admit I have not worked on its security at all... but this is mostly meant for CI purposes (DUTs or gateway) so the needs are different.

@mupuf OCI/podman is really not my world, sorry. I didn't drink that Kool-Aid.

@pid_eins aside from OCI or DDIs, are there any plans for a more practical or efficient image format?

It currently feels somewhat cumbersome to try to generate and distribute raw ddi's + extensions for things like portable services or nspawn. It also feels a bit wasteful when you're basing multiple containers on the same image.

I'd love to see something git- or ostree-like…

@risen uh, i am happy with ddis.

To say this politely: I am not a believer in the security model ostree folks and OCI folks subscribe to. I subscribe to the idea that we should do W^X also for file systems: i.e. a file system is either writable, or it may contain executable files, but never both, as part of guaranteeing that attackers cannot gain persistency, no matter what.

DDIs fit perfectly into the model, but ostree (with or without composefs glue) does conceptually not…

@risen … come close, and well, OCI is just terrible by any standard.
@pid_eins I agree with your idea of non-writable images, but I'd see using ostree (or something similar) more as a way of efficiently deploying updated images. Just like you would drop (write) a new ddi in /var/lib/machines/example.raw.v/

@pid_eins @risen I love using disk images for my system drive, but I really do not want to reserve space for X images during install.

I used to just drop disk images as a file into a simple file system and had a mount unit mount that before mounting the system image as a loopback file.

The downside is obviously that someone could corrupt the filesystem holding the images and I have no way to detect that:-( But on the upside: As many images as I want (and have space for).

@hunger did you see what android did there? they basically did a poor man's LVM based on dm-linear. It's called "dynamic partitions". see:

https://source.android.com/docs/core/ota/dynamic_partitions/implement

We should be able to do something similar. Maybe something as simple as this: if some special bit is set in the GPT flags of a partition we want to use, look for "extension" partitions whose identifying uuid is hashed from the original in counter mode. Pick up all such extensions partitions, then merge them via dm-linear.
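
One way "hashed from the original in counter mode" could look, as a Python sketch: derive the n-th extension UUID by running HMAC-SHA256 keyed with the base UUID over a counter. This is purely an assumption for illustration; nothing here is a defined scheme, and the function name, the choice of HMAC-SHA256, and the counter encoding are all made up.

```python
import hashlib
import hmac
import uuid


def extension_uuid(base: uuid.UUID, counter: int) -> uuid.UUID:
    """Derive the counter-th extension UUID from a base UUID.

    Hypothetical scheme: HMAC-SHA256(key=base, msg=counter), truncated to
    128 bits and stamped as a version 4 UUID.
    """
    digest = hmac.new(base.bytes, counter.to_bytes(8, "little"), hashlib.sha256).digest()
    return uuid.UUID(bytes=digest[:16], version=4)


# Arbitrary example UUID, not a real partition identifier.
base = uuid.UUID("12345678-1234-5678-1234-567812345678")
for n in range(3):
    print(n, extension_uuid(base, n))
```

The point of such a derivation is that boot code can enumerate candidate UUIDs for counter 0, 1, 2, … and stop at the first one that has no matching partition, without storing any extra table.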

@hunger the android folks did measurements which suggest this basically has no IO performance cost.

I think doing this kind of setup at boot would be superduper easy within the systemd framework. Other side of the story would be then to teach repart to optionally fulfill grow requests with such extension partitions if needed, and for sysupdate to know how to write them.

finally, it might make sense to have some separate tool we can call on some mounted fs to make space available.

@hunger an OS upgrade with sysupdate would become a bit more complex: instead of just calling sysupdate we would call the rootfs shrinker, then repart, then sysupdate, then reboot.

@pid_eins so they have a partition and put a GPT into that. Then they manage the embedded GPT dynamically? Should be super easy to support: if the disk GPT has some special UUID, then loop-back mount that partition and continue discovery on the contained GPT...

Sorry, I need to read up on dm-linear :-)

@hunger so they do a 2nd level of gpt partitions, i am not sure that's necessary, we should be able to just use the first level
@hunger i mean gpt by default allows 128 partitions iirc, which should be a lot. it's not that we are going to put bazillions of images there

@pid_eins I used a custom image based system for years and routinely kept about 10 images around. At that point the EFI partition used to overflow as I had UKIs for each image -- each booting that one image only:-)

I kept the initial install, one per customer, and the images going a few days back.

Especially the per customer images proved useful: Getting back to a customer, I always tried the newest image first, having the last one I know worked before as a fallback.

@hunger no need to read up on dm-linear: it just glues together a bunch of block devices. some people might call that raid0.
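
To make the "glue" concrete: a dm-linear mapping is described by a table with one line per segment, of the form `<logical start> <length> linear <device> <offset>`, all in 512-byte sectors, with the segments laid end to end (concatenated, not striped). A small Python sketch that builds such a table; the device names and sizes are invented for the example.

```python
def linear_table(segments):
    """Build a device-mapper table concatenating whole devices.

    segments is a list of (device_path, size_in_sectors) pairs. Each device
    is mapped from its own sector 0, one after the other.
    """
    lines = []
    start = 0
    for device, sectors in segments:
        lines.append(f"{start} {sectors} linear {device} 0")
        start += sectors
    return "\n".join(lines)


# Hypothetical partitions: a 1 GiB one followed by a 512 MiB extension.
print(linear_table([("/dev/sda5", 2097152), ("/dev/sda7", 1048576)]))
```

A table like this is the kind of input `dmsetup create <name>` reads on stdin, so the merged device would be 1.5 GiB in this made-up case.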

@pid_eins partially related to this, there's ongoing work in U-Boot to utilise the "pmem" feature on ARM boards at least.

This will enable the bootloader to download one disk image with everything on it and make it accessible to the initrd via /dev/pmemX just like any other block device. So you can directly HTTPS boot distro images or installers with little/no modifications

@cas uefi ramdisk support works the same way: they insert a fake pmem entry in the memory table and linux can directly consume it then.

but i am not too keen to rely on that tbh. i much prefer to download a smallish UKI as first step, and then the big root from linux userspace. simply because firmware code quality sucks ass, and linux is quite OK...

@pid_eins that. Honestly. Sounds.

Freaking awesome.