So, if you ask me what my takeaway from the CrowdStrike issue is, I'd say: boot counting/boot assessment/automatic fallback should really be a MUST for today's systems. *Before* you invoke your first kernel you need to have tracking of boot attempts and logic for falling back to older versions automatically. It's a major shortcoming that this is not the default behaviour of today's distros, in particular commercial ones.

Of course systemd has supported this for a long time:

https://systemd.io/AUTOMATIC_BOOT_ASSESSMENT/

And it's a shame that commercial distros do not hook into that. Their boot stack hasn't changed in more than a decade, is laughably bad at security (unsigned initrds, ffs!) and robustness, and, if they have boot assessment enabled at all, they turn it into a fantastic DoS (by showing you a boot menu instead of reverting to a working boot choice).

@pid_eins I'm not disagreeing. It makes me wonder how you would categorize/assess/mitigate the security and operations risk of having a system that's supposed to be on one kernel fall back to a previous one?

@poleguy the way automatic boot assessment with systemd works is that on each boot we make one of three assessments: "good", "bad", "dontknow". If we make the "bad" assessment we'll count down the entry's counter (and if it is zero we give up on it in the future). If we make the "good" assessment we'll drop the counter entirely from the entry, marking it as good for basically all eternity. If we do "dontknow" we don't do a thing.

@poleguy this means that a bad actor can play games with us until we manage one boot that works correctly, but from that point on we'll never regress.

I like to believe that that's quite a sensible and simple policy that should work for most cases. It balances robustness against the chance for attackers to hold off updates indefinitely.
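To make the policy above concrete, here is a minimal Python sketch of the three-way assessment as described in this thread. It is an illustration only, not systemd's actual implementation: the names `BootEntry`, `apply_assessment` and `pick_entry`, the initial counter of 3, and the kernel names are all assumptions made up for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Assessment(Enum):
    GOOD = "good"
    BAD = "bad"
    DONTKNOW = "dontknow"


@dataclass
class BootEntry:
    name: str
    # None means the counter was dropped: the entry is marked good for good.
    tries_left: Optional[int] = 3  # initial value is an assumption

    @property
    def usable(self) -> bool:
        # An entry is given up on once its counter has reached zero.
        return self.tries_left is None or self.tries_left > 0


def apply_assessment(entry: BootEntry, result: Assessment) -> None:
    """Update an entry's counter after one boot attempt (hypothetical helper)."""
    if result is Assessment.GOOD:
        # Drop the counter entirely; later "bad" boots no longer count.
        entry.tries_left = None
    elif result is Assessment.BAD and entry.tries_left is not None:
        entry.tries_left = max(entry.tries_left - 1, 0)
    # DONTKNOW: leave the entry untouched.


def pick_entry(entries: list[BootEntry]) -> Optional[BootEntry]:
    """Return the first usable entry; entries are ordered newest first."""
    for entry in entries:
        if entry.usable:
            return entry
    return None


# A new kernel that keeps failing, with an older known-good one behind it.
new = BootEntry("kernel-new", tries_left=3)
old = BootEntry("kernel-old", tries_left=None)
entries = [new, old]
for _ in range(3):
    apply_assessment(pick_entry(entries), Assessment.BAD)
print(pick_entry(entries).name)  # kernel-old
```

The point of the sketch is the asymmetry: a "bad" result only ever affects entries that still carry a counter, so an entry that booted correctly once can't be pushed out by later failures.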

@pid_eins thanks. That does seem reasonable for remotely managed systems, and better than the alternative, which is manual intervention. I worry a smidge about added complexity. I can't shake the feeling that we keep adding layers of complexity to our systems. It feels okay to add complexity that is proportional to the complexity of the problem being solved. In this case it seems sane. However, these remotely managed systems all tend to have out-of-band methods to recover already, no?