Mastodawn

Lennart Poettering Feb 10, 2025

Fun little thing I have been working on: teach systemd to boot directly into a disk image downloaded via HTTP within the initrd.

In v257 systemd learnt the ability to download disk images at boot via systemd-import-generator, both DDIs and tarballs, and place them in /var/lib/machines/, /var/lib/portables/, /var/lib/confexts/, /var/lib/extensions/. The goal was to provide a way to provision any of these resources automatically at boot. But now that we have this, we can take it a step further:

Lennart Poettering Feb 10, 2025

download the root disk image itself with this. There were a bunch of missing bits to make this nice though:

First of all, for raw disk images we need to attach them to a loopback block device, to make them mountable. Easy-peasy, systemd-dissect --attach already delivers that.

Then, for tar disk images we need to bind mount the downloaded and unpacked image to /sysroot/ (which is where the rootfs goes before we transition into it).
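
The two cases above (raw image versus unpacked tarball) can be sketched as a small Python helper that only builds the command line and never runs it. This is an illustration, not how the systemd units actually do it: the function name, the `kind` parameter, and the example paths are hypothetical. The two commands themselves, `systemd-dissect --attach` and `mount --bind`, are the ones named in the posts.

```python
from typing import List


def rootfs_setup_command(kind: str, image_path: str, sysroot: str = "/sysroot") -> List[str]:
    """Return the argv that would make a downloaded rootfs usable.

    kind is a hypothetical label: "raw" for a raw disk image file,
    "tar" for a tarball that has already been unpacked into a directory.
    """
    if kind == "raw":
        # Raw disk images must be attached to a loopback block device
        # before the file systems inside them can be mounted.
        return ["systemd-dissect", "--attach", image_path]
    if kind == "tar":
        # An unpacked tarball is already a directory tree, so a bind mount
        # onto the future root directory is enough.
        return ["mount", "--bind", image_path, sysroot]
    raise ValueError(f"unknown image kind: {kind!r}")


if __name__ == "__main__":
    # Example paths are made up for illustration.
    print(rootfs_setup_command("raw", "/run/example/rootfs.raw"))
    print(rootfs_setup_command("tar", "/run/example/rootfs"))
```

Note that attaching a raw image only yields a block device; mounting its root partition onto /sysroot/ is a separate step not shown here.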

Lennart Poettering Feb 10, 2025

Then, to make this nicer, it makes sense to allow deriving the rootfs image URL directly from the UEFI HTTP boot URL. Or in other words: if you point your UEFI to boot a UKI from some URL (e.g. http://example.com/somedir/myimage.efi), then that UKI's initrd is smart enough to derive from that same URL a different URL for the rootfs (by replacing the final component, so that it becomes http://example.com/somedir/myimage.raw.xz).
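
The URL derivation described above can be sketched in a few lines of Python. This is a minimal sketch, not systemd's implementation: the function name and the `new_suffix` parameter are hypothetical, and it assumes the boot URL ends in `.efi` and that only that suffix of the final path component changes, as in the example given in the post.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def derive_rootfs_url(uki_url: str, new_suffix: str = ".raw.xz") -> str:
    """Derive a rootfs image URL from the URL a UKI was booted from."""
    parts = urlsplit(uki_url)
    # Split the path into its directory and its final component.
    directory, filename = posixpath.split(parts.path)
    stem, ext = posixpath.splitext(filename)
    if ext.lower() != ".efi":
        raise ValueError(f"not a UKI URL (expected .efi): {uki_url!r}")
    # Keep scheme, host and directory; swap only the final component.
    new_path = posixpath.join(directory, stem + new_suffix)
    return urlunsplit(parts._replace(path=new_path))


if __name__ == "__main__":
    print(derive_rootfs_url("http://example.com/somedir/myimage.efi"))
```

With the URL from the post, this yields http://example.com/somedir/myimage.raw.xz.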

Lennart Poettering Feb 10, 2025

Net result of this: I can now point my UEFI to a single URL where it will load the UKI from. A few seconds later the initrd will pick up the rootfs from the same source, and boot it up. Magic!

Why all this though?

Martin Roukala (né Peres) Feb 11, 2025

@pid_eins Slowly but surely, systemd is turning into a container engine and I'm here for it!

Out of curiosity, did you ever take a look at boot2container (https://gitlab.freedesktop.org/gfx-ci/boot2container)? It is my podman- and u-root based initrd that boots any container(s) without any installation, based on the kernel cmdline.

That's IMO the next level of flexibility, but I must admit I have not worked on its security at all... but this is mostly meant for CI purposes (DUTs or gateway) so the needs are different.

Lennart Poettering Feb 11, 2025

@mupuf OCI/podman is really not my world, sorry. I didn't drink that Kool-Aid.

Mourad De Clerck Feb 11, 2025

@pid_eins aside from OCI or DDIs, are there any plans for a more practical or efficient image format?

It currently feels somewhat cumbersome to try to generate and distribute raw DDIs + extensions for things like portable services or nspawn. It also feels a bit wasteful when you're basing multiple containers on the same image.

I'd love to see something git- or ostree-like…

Lennart Poettering Feb 11, 2025

@risen uh, i am happy with ddis.

To say this politely, I am not a believer in the security model ostree folks and OCI folks subscribe to. I subscribe to the idea that we should do W^X also for file systems: i.e. a file system is either writable, or it may contain executable files, but never both, as part of guaranteeing that attackers cannot gain persistency, no matter what.

DDIs fit perfectly into the model, but ostree (with or without composefs glue) does conceptually not…

Lennart Poettering Feb 11, 2025

@risen … come close, and well, OCI is just terrible by any standard.
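
The W^X rule for file systems stated above can be illustrated with a small check over mount options. This is only a sketch of the idea, not something systemd ships: the function names are hypothetical, the input is assumed to be in the `/proc/mounts` line format, and "may contain executable files" is approximated here by the absence of the `noexec` mount option.

```python
from typing import List


def violates_wx(mount_options: str) -> bool:
    """True if a mount is both writable and allows execution."""
    opts = set(mount_options.split(","))
    writable = "rw" in opts
    executable = "noexec" not in opts
    return writable and executable


def find_violations(mounts_text: str) -> List[str]:
    """Return mount points that are writable and executable at once.

    Expects lines shaped like /proc/mounts:
    device mount-point fstype options dump pass
    """
    violations = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3]
        if violates_wx(options):
            violations.append(mount_point)
    return violations


if __name__ == "__main__":
    # Sample data made up for illustration.
    sample = (
        "/dev/loop0 / erofs ro,relatime 0 0\n"
        "tmpfs /tmp tmpfs rw,nosuid,nodev,noexec 0 0\n"
        "/dev/sda3 /home ext4 rw,relatime 0 0\n"
    )
    print(find_violations(sample))
```

In the sample, the read-only root and the `noexec` tmpfs pass, while the writable, executable /home mount is reported.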

Tobias Hunger Feb 11, 2025

@pid_eins @risen I love using disk images for my system drive, but I really do not want to reserve space for X images during install.

I used to just drop disk images as a file into a simple file system and had a mount unit mount that before mounting the system image as a loopback file.

The downside is obviously that someone could corrupt the filesystem holding the images and I have no way to detect that:-( But on the upside: As many images as I want (and have space for).

Lennart Poettering Feb 12, 2025

@hunger did you see what android did there? they basically did a poor man's LVM based on dm-linear. It's called "dynamic partitions". see:

https://source.android.com/docs/core/ota/dynamic_partitions/implement

We should be able to do something similar. Maybe something as simple as this: if some special bit is set in the GPT flags of a partition we want to use, look for "extension" partitions whose identifying UUID is hashed from the original in counter mode. Pick up all such extension partitions, then merge them via dm-linear.
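
A rough Python sketch of that idea follows. Everything specific in it is an assumption made for illustration: the post does not say which hash to use, so HMAC-SHA256 keyed by the original partition UUID over a little-endian counter is just one possible choice, and the function names and device paths are hypothetical. The dm-linear table lines use the standard device-mapper format `<start> <length> linear <device> <offset>`, with all values in 512-byte sectors.

```python
import hashlib
import hmac
import uuid
from typing import List, Tuple


def extension_uuid(base: uuid.UUID, counter: int) -> uuid.UUID:
    """Derive the UUID of the counter-th extension partition (assumed scheme)."""
    digest = hmac.new(base.bytes, counter.to_bytes(8, "little"), hashlib.sha256).digest()
    # Truncate to 128 bits and set the version/variant bits so the
    # result is a well-formed UUID.
    return uuid.UUID(bytes=digest[:16], version=4)


def dm_linear_table(segments: List[Tuple[str, int]]) -> str:
    """Build a dm-linear table that concatenates the given segments.

    segments is a list of (device_path, size_in_sectors), in order.
    """
    lines = []
    start = 0
    for device, sectors in segments:
        # Map [start, start + sectors) of the merged device onto the
        # whole of this partition, beginning at its offset 0.
        lines.append(f"{start} {sectors} linear {device} 0")
        start += sectors
    return "\n".join(lines)


if __name__ == "__main__":
    base = uuid.UUID("0fc63daf-8483-4772-8e79-3d69d8477de4")  # example value only
    for i in range(1, 4):
        print(i, extension_uuid(base, i))
    # Hypothetical devices and sizes.
    print(dm_linear_table([("/dev/sda5", 2048), ("/dev/sda7", 4096)]))
```

The resulting table text is what one would feed to `dmsetup create`; discovery would stop at the first counter value for which no partition with the derived UUID exists.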

Tobias Hunger Feb 12, 2025

@pid_eins so they have a partition and put a GPT into that. Then they manage the embedded GPT dynamically? Should be super easy to support: if the disk GPT has some special UUID, then loop-back mount that partition and continue discovery on the contained GPT...

Sorry, I need to read up on dm-linear :-)

Lennart Poettering Feb 12, 2025

@hunger so they do a 2nd level of gpt partitions, i am not sure that's necessary, we should be able to just use the first level

Lennart Poettering Feb 12, 2025

@hunger i mean gpt by default allows 128 partitions iirc, which should be a lot. it's not that we are going to put bazillions of images there

Tobias Hunger

@pid_eins I used a custom image based system for years and routinely kept about 10 images around. At that point the EFI partition used to overflow as I had UKIs for each image -- each booting that one image only:-)

I kept the initial install, one per customer, and the images going a few days back.

The per-customer images proved especially useful: when getting back to a customer, I always tried the newest image first, with the last one I knew had worked as a fallback.