So, if you ask me what my takeaway from the Crowdstrike issue is, I'd say: boot counting/boot assessment/automatic fallback should really be a MUST for today's systems. *Before* you invoke your first kernel you need to have tracking of boot attempts and logic for falling back to older versions automatically. It's a major shortcoming that this is not default behaviour of today's distros, in particular commercial ones.

Of course systemd has supported this for a long time:

https://systemd.io/AUTOMATIC_BOOT_ASSESSMENT/

And it's a shame that commercial distros do not hook into that. Their boot stack hasn't changed in more than a decade, is laughably bad at security (unsigned initrds, ffs!) and robustness, and if they have boot assessment enabled at all they turn it into a fantastic DoS (by showing you a boot menu instead of reverting to a working boot choice).
@pid_eins I'm not disagreeing. It makes me wonder how you would categorize/assess/mitigate the security and operations risk of having a system that's supposed to be on one kernel fall back to a previous one?
@poleguy the way automatic boot assessment with systemd works is that on each boot we make one of three assessments: "good", "bad", "dontknow". If we make the "bad" assessment we'll count down the entry's counter (and if it is zero we give up on it in the future). If we make the "good" assessment we'll drop the counter entirely from the entry, marking it as good for basically all eternity. If we do "dontknow" we don't do a thing.

@poleguy this means that a bad actor can play games with us until the point we managed to do one boot that worked correctly, but from that point on, we'll never regress anymore.

I like to believe that that's quite a sensible and simple policy that should work for most cases. It balances robustness against the chance for attackers to hold off updates indefinitely.
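The counter policy described above can be sketched in a few lines. This is a minimal simulation of the policy as stated in the thread, not systemd's actual implementation; the function name and representation (a plain integer counter, `None` meaning "blessed good") are assumptions for illustration.

```python
def assess(tries_left, verdict):
    """Apply one boot attempt's verdict to an entry's counter.

    tries_left: remaining boot attempts, or None once marked good forever
    verdict:    'good', 'bad', or 'dontknow'
    Returns the entry's new counter.
    """
    if tries_left is None:       # already blessed: never regresses again
        return None
    if verdict == "good":        # drop the counter entirely: good for
        return None              # basically all eternity
    if verdict == "bad":         # count down; at 0 the entry is skipped
        return max(tries_left - 1, 0)
    return tries_left            # 'dontknow': don't do a thing

# A fresh entry with 3 tries that fails twice, then succeeds:
c = 3
c = assess(c, "bad")        # 2 tries left
c = assess(c, "dontknow")   # still 2
c = assess(c, "bad")        # 1 try left
c = assess(c, "good")       # None: blessed, immune to further 'bad' games
```

This captures the "never regress after one good boot" property: once the counter is dropped, later "bad" verdicts have no effect.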

@pid_eins thanks. That does seem reasonable for remotely managed systems, and better than the alternative, which is manual intervention. I worry a smidge about added complexity. I can't shake the feeling that we keep adding layers of complexity to our systems. It feels okay to add complexity that is proportional to the complexity of the problem being solved, and in this case it seems sane. However, these remotely managed systems all tend to have out-of-band methods to recover already, no?

@pid_eins but would this really prevent it, when the configuration of a kernel driver goes bad? If I understand things correctly here (big if), it would only be possible to fix the issue if you store that config in a volume that can be reverted.

Otherwise you boot into the emergency shell and you are no better off than Windows systems are right now.

And given it's an endpoint protection product that is supposed to react pretty instantly to changes, I don't see how you would get these into the A/B update.

@sheogorath on Linux drivers don't really have a "configuration" per se. At least not much that you pass into the early, risky parts of the boot process. Subsystems might have some config. In a systemd world you wrap them in authenticated/signed PE addons or confext images, and those you drop next to a specific kernel image, thus you can revert them together as one, or update them as one, and so on. Or in other words: the way we parameterize kernels in modern ways also makes it easy to do assessment/fallback.
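As a rough illustration of the "drop them next to a specific kernel image" idea: systemd-stub looks for companion files in a `.extra.d/` directory named after the unified kernel image. The filenames below are made up for the sketch; only the directory convention itself comes from systemd's documentation.

```
/boot/EFI/Linux/
├── myos-42.efi                  # unified kernel image: kernel + initrd in one PE
└── myos-42.efi.extra.d/
    ├── debug.addon.efi          # signed PE addon (e.g. extra kernel cmdline)
    └── vendor-tweaks.raw        # confext/sysext image scoped to this kernel
```

Because the addons live next to one specific image, a fallback to an older kernel entry automatically leaves the newer image's addons behind as well, so the whole bundle reverts as a unit.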

@pid_eins how exactly is a successful boot defined though?

Boots to init?
Boot and all services are started successfully? Some services?

What happens if the system boots successfully, runs for ~60 seconds, and then the kernel panics when the first cron job/timer runs?

@JustinAzoff @pid_eins see the link at the start of the thread, flexible strategies are available
@JustinAzoff depends on the use case. Different systems/OSes want different things here. Some might just check whether the system manages to reach some point in the boot process, others might also require network pings to work, others might instead want to check that some services stay up for some minimum amount of time, and so on. systemd gives you the basic infrastructure for this and some super basic tests in this sense, but individual OS images might want to fill in more tests/conditions.
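systemd's hook point for such site-specific checks is `boot-complete.target`: the boot is only blessed as "good" once that target is reached, so a health-check unit ordered before it can delay or block the blessing. A minimal sketch (the `check-health` script path is hypothetical; the target and ordering are real systemd mechanisms described in the linked doc):

```ini
# /etc/systemd/system/healthcheck.service (illustrative)
[Unit]
Description=Site-specific boot health check
After=network-online.target
Before=boot-complete.target

[Service]
Type=oneshot
# e.g. ping the gateway, probe critical services, check a watchdog file
ExecStart=/usr/local/bin/check-health

[Install]
WantedBy=boot-complete.target
```

If `check-health` fails, `boot-complete.target` is never reached, the boot is counted as a failed attempt, and the counter logic from earlier in the thread kicks in.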
@JustinAzoff I assume anything to make boot more complex also opens up new threats.

@pid_eins

Well, the lesson for me (aside for other obvious ones) is that for the industrial systems it should be absolutely mandatory to be something like #SUSE #SLEMicro (or its Red Hat equivalent): snapshot based, with R/O system, where the system would automatically boot from an older snapshot if the current one fails.

The fact that airline computers are not something like this, is just mind-blowing.

Yes, preaching the same gospel @sysrich has preached for years.

https://youtu.be/idZEJ0OYfWU


@pid_eins for a system like Crowdstrike, you'd want to extend that to cover data files the kernel loads. I wonder how well that'd work with the rate of updates they were pushing out?
@jamesh I think everyone agrees you have to cover the kernel itself and the initrd with these assessment/fallback schemes. I personally would also cover the rootfs you boot into, but people have different opinions on how far the coverage should reach, and how much you "pin" through a boot attempt.
@pid_eins unfortunately this wasn't the kind of issue that would be solved by falling back to old versions. The bug in the kernel module was there for a long time or possibly from the beginning, and falling back to an older version would still just have crashed in the same way

@vurpo nope, of course boot assessment would catch this. The key is just that you "pin" enough as part of an attempt, and thus can revert sufficient parts to get things working.

On Linux you'd pin kernel *and* initrd at the very least, and in the model I propose even the entire /usr for each attempt, to maximize the coverage of the assessment logic.
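One way to picture that pinning is a counted boot loader entry that references everything in one versioned set. The `+3` suffix is systemd-boot's boot-counting convention (three tries left; a failed boot renames it to `+2-1`, a blessed boot drops the suffix); the version labels and paths below are illustrative assumptions:

```ini
# /boot/loader/entries/myos-42+3.conf (illustrative)
title      MyOS 42
linux      /myos-42/vmlinuz
initrd     /myos-42/initrd.img
options    root=LABEL=root-42 mount.usr=LABEL=usr-42
```

Because kernel, initrd, and the /usr volume are all named per-version by the one counted entry, falling back to the previous entry reverts all of them together.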

@pid_eins The shocking thing is that this was a requirement for Carrier Grade Linux two decades ago already.
When it comes to reliability and availability as part of dependable computing, our (distributed or not) systems have somewhat regressed as they were scaled up.