Looking at the reports of some systems failing to boot after the latest UEFI DBX update and wondering whether it's another case of https://mjg59.dreamwidth.org/22855.html
mjg59 | Samsung laptop bug is not Linux specific

We ended up writing some hilariously crappy workarounds for Linux to prevent this kind of thing, where firmware fails to boot if the UEFI variable store is too full. We check whether the write would leave under 5KB of free space (as reported by the firmware), and refuse the write if it would. Easy! Except some firmware would never actually increase the available space count when a variable was deleted, so after a while we'd just never be able to write any more variables.
I can't remember how I figured this out, but the affected machines would trigger garbage collection if we tried to create a variable bigger than the available space, so Linux just tries to create a giant variable and then deletes it again to force the firmware to actually update the free space counter
Fucking computers, man
@mjg59 Fucking UEFI. The ring -1 (or is it -42 now? does anyone even know or care?) shit nobody asked for.
@dalias @mjg59 UEFI and ACPI solve the problem of booting the same image on a wide variety of machines. From a distro point of view, that's a huge win. Whether it is worth the huge number of tradeoffs is another question.
@alwayscurious @mjg59 Device tree solves that problem. UEFI and ACPI both address a much larger-scope problem (which a lot of us don't want) of having a persistent layer under your trusted OS that you also have to trust, that continues execution after control was supposed to be passed to the OS, and that the OS is forced to interact with to access important functionality.
@dalias @mjg59 Device tree would be a solution if all of the board-specific code reached mainline, but it doesn't. See @mjg59's commentary on the subject.
@alwayscurious @mjg59 Sorry, I wasn't clear. I don't mean it's a way you can do things right now. I mean the problem of "booting same image on diverse hardware" is a much smaller-scope problem domain than what UEFI and ACPI do ("abstraction layer/runtime under your trusted OS kernel"). And some of us are very very unhappy that we're expected to accept the latter in order to get the former.
@dalias @alwayscurious @mjg59 yeah we've had enough ACPI problems in recent years that we're starting to see that as a serious problem
@dalias @alwayscurious @mjg59 ACPI was a big barrier to Linux compatibility back in the 90s, because manufacturers only cared about Microsoft so Linux had to reverse-engineer every single thing and figure out how to do it in an open context
@dalias @alwayscurious @mjg59 then everyone basically got their shit together and there wasn't a ton of new stuff that was actually critical, for about twenty years
@dalias @alwayscurious @mjg59 ... but just last month we ran into a problem where Linux is compatible with a motherboard and an m.2 card, but not when they're used together, because this new super-ACPI thing has them talking directly to each other in opaque, non-standard ways and Intel and AMD are at war
@ireneista @dalias @alwayscurious @mjg59 when I started seeing ACPI errors again, on 2023 machines, I was like "did I time travel what the fuck I thought we were past this"
@amy @mjg59 @dalias @alwayscurious yeah there's some new spec we've forgotten the name of, which lets devices other than the mobo get in the game as well. it seems fantastically ill-advised to us.
@ireneista @mjg59 @dalias @alwayscurious

Interesting. I agree that seems .. bad. I wonder if it's at all connected to Intel's new bizarre webcam stuff where the webcam no longer pre-processes frames and the processing is done in userspace. This is fine on windows where they can rampantly fuck with machines but Linux has been ... mm. bad.
@amy @mjg59 @dalias @alwayscurious oh, wow, fascinating. we remember when that happened with modems back in the day...
@amy @mjg59 @dalias @alwayscurious there was never really a satisfying solution to the "winmodem" problem. modems stopped being things people used directly; everyone learned how Fourier transforms work; and CPU speeds increased, so the problem went away but the market forces that led to it weren't addressed.
@amy @mjg59 @dalias @alwayscurious (from what we can tell it really is true that the free software community at large understands Fourier and wavelet transforms a lot better today than back then)
@amy @mjg59 @dalias @alwayscurious we wouldn't necessarily object to offloading image processing to the CPU; what we object to is having that process governed by the camera
@ireneista @mjg59 @dalias @alwayscurious yes - a standard interface to allow many different kinds of user space image processing on CPU. I think Intel has paid lip service to this idea but it's doubtful they'll make good on actually making it happen.
@amy @mjg59 @dalias @alwayscurious they have near-zero commercial incentive to make it happen, yeah :/
@ireneista @amy @mjg59 @dalias and apparently lots to not make it happen, judging by the reluctance of various vendors.
@alwayscurious @amy @mjg59 @dalias it costs money and corporations in general would prefer not to spend money. as far as we can tell, that's the most significant factor. there may be other reasons too.
@amy @[email protected] @mjg59 @dalias @alwayscurious oooooooh I hate, hate, hate that new webcam crap. Like congrats, you messed up the one reliable thing on Linux, thanks.
@amy @[email protected] @dalias @alwayscurious @mjg59 huh, wait. so me seeing ACPI errors wasn't just "Audrey made a mistake somewhere"?

huuuh.
@dalias @alwayscurious @mjg59 sorry, "at war" is overly dramatic and buys into capitalist framing too much. but you get the idea
@dalias @alwayscurious @mjg59 and, like... yeah. you have a good point. this layer below the OS is a problem.
@dalias @alwayscurious @mjg59 oh, yes, and at no point did manufacturers actually start caring about Linux, so................
@ireneista @dalias @mjg59 A law requiring hardware to have documentation that is both human-readable (for human use) and machine-readable (so drivers can be generated from it automatically) is the only way out I can think of.
@dalias @alwayscurious Device Tree inherently can't solve that problem without constraining hardware choices far more than ACPI does, but neither ACPI nor UEFI has any fundamental requirement for SMM, and you can just not use runtime services if that's a concern
@mjg59 @dalias I will add that this level of constraint would result in all power management being moved into firmware on a separate microcontroller, which is arguably not a win compared to ACPI tables that can be statically disassembled.
@mjg59 @dalias the only way I can think of to not have an abstraction layer, not constrain hardware design, and still have wide hardware compatibility is for regulations to compel power management logic to be upstreamed, which is heavy-handed at the very least.
@dalias @mjg59 I'm one of them, but look at the situation in the Arm world and how many people are stuck running ancient kernels.
@alwayscurious @mjg59 This is largely the fault of Linux being a monolithic mess. Nobody would much care that the board-hardware-support kernel was ancient if the actual OS (filesystems, syscalls, networking, etc.) and non-board specific hardware support were separate layers that could be upgraded independently.

@dalias @mjg59 to solve that, you would need:

  • User expectations that they must use recent install images.
  • An OS that is not a monolithic kernel.

Personally, I'm in favor of both.

    @alwayscurious @mjg59 I'm in favor of users not having to use "install images" at all.

    @dalias @mjg59 If there was a way to force vendors to upstream all of their board support code, then device tree would be just as good as UEFI + ACPI for portability, but right now there is no such way.

    I fully agree that UEFI + ACPI are security disasters and that device tree is much better in that regard, but UEFI + ACPI is also one of the reasons one can boot Linux on x86 systems that were never intended to run it and usually have a lot of stuff work out of the box without someone having to write drivers first. I'm not aware of good solutions that are also economically feasible in the present market and regulatory environment.

    @alwayscurious @dalias device tree fundamentally doesn't solve the "I want to boot an install image from last year on a device using a thermal sensor released this year" problem, and no amount of upstreaming changes that
    @mjg59 @alwayscurious It does if your thermal sensor driver is independent and not baked into the kernel.
    @dalias @mjg59 the install image from last year still would not have the driver for it.
    @alwayscurious @mjg59 You don't need the "install image". You set up the root fs media via an existing machine and install it in the target device ready to go.
    @dalias @mjg59 That doesn't work if the storage is not removable, and even then you need software to set up the media.

    @alwayscurious @mjg59 The only reason it "works" on x86 is that everyone's essentially running black box proprietary substrate drivers for a bunch of critical stuff.

    In theory these could just put things in a working state then wipe themselves out and pass execution permanently to the real OS. But they don't.

    @dalias @mjg59 the root cause of all of this is that hardware vendors don't write and upstream open source drivers. Even if somehow vendors could be compelled to do that, I'd rather Linux not be the upstream for all this code. Put it in a library that any OS can use.
    @alwayscurious @mjg59 The root cause is that they don't properly document hardware interfaces. A proper document is worth way more than the garbage-quality drivers they write. Making the hardware work minimally then ends up being as simple as hard-coding a sequence of register writes.
    @dalias @mjg59 what about documenting it in a machine-readable form that can be used to autogenerate a driver?
    @alwayscurious @mjg59 Theoretically that would be lovely. But a sufficiently expressive form would essentially become a programming language/virtual machine (like ACPI) that doesn't actually express how to use the hardware to a human except "execute this code on the virtual machine and then the thing comes out, or have fun reverse engineering it if you actually want to know what's happening".
    @dalias @mjg59 did you know that power management nowadays involves hard-realtime control loops that are run on a separate processor?

    @alwayscurious @mjg59 That's good, it means it should operate independently with no control channels between the power management processor and the domain that contains any user data or code except a simple channel for setting power management parameters.

    Right? [insert Padme meme]

    @dalias @mjg59 Nah, that code should be open source and must be trusted. See Plundervolt for why.
    @alwayscurious @mjg59 What attacks do you have in mind if it has no communication channels, and how would baked-in intentional breakage here be any different from baked-in intentional breakage in the cpu that you also can't see?
    @dalias @mjg59 Fault injection, stealing crypto secrets via power analysis, maybe others. There is no difference between that and a CPU backdoor, but what is the advantage of moving this to a separate processor?

    @alwayscurious @mjg59 I claim it's far weaker than a cpu backdoor because you can't target it. It doesn't have enough information to know when it wants to break things, and it doesn't have any channel to exfiltrate anything; it would have to break things in a way that causes the cpu malfunction to double as exfiltration.

    Advantage of having it on a separate processor is that you get hard realtime without having a hard-realtime rootkit below ring 0 on the real cpu where it *would* have access to all the context to do attacks.

    @dalias @mjg59 Ah, that makes sense. Also, it doesn't interfere with OS scheduling.

    I do want power management to be under OS control but that requires major changes in OS architecture.