Oxide has published my doc on power supply glitch mitigation on our servers, which required some reverse engineering of a vendor fix (that wasn't quite right).

https://rfd.shared.oxide.computer/rfd/630

To me, the most interesting part was this: because of where in our stack the mitigation gets applied, we decided _not_ to persist it in flash/eeprom on the IBC, because this implies a power cycle. We don't power cycle fully. So, we defined a way to test for its presence and re-apply it if required at each update (or power-on).

This implies that the customer gets the mitigation immediately upon updating our firmware, and that we can adjust the mitigation in any future update if we learn more.
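The check-and-reapply idea can be sketched roughly as below. This is a minimal illustration, not the Hubris code: the `Converter` trait, `BusError`, `Outcome`, `MITIGATED_VALUE`, and `FakeConverter` are all hypothetical names, and a single 16-bit setting stands in for whatever configuration the real mitigation touches.

```rust
/// Hypothetical error type standing in for a failed bus transaction.
#[derive(Debug, PartialEq, Eq)]
struct BusError;

/// Hypothetical view of the converter's volatile configuration.
/// A single 16-bit setting stands in for the real mitigation state.
trait Converter {
    fn read_setting(&mut self) -> Result<u16, BusError>;
    fn write_setting(&mut self, value: u16) -> Result<(), BusError>;
}

#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    AlreadyPresent,
    Applied,
}

/// Placeholder value; the real mitigation's contents are not shown here.
const MITIGATED_VALUE: u16 = 0x0042;

/// Test for the mitigation and apply it only if it is missing.
/// Safe to call at every update and every power-on because it is idempotent.
fn ensure_mitigation<C: Converter>(dev: &mut C) -> Result<Outcome, BusError> {
    if dev.read_setting()? == MITIGATED_VALUE {
        return Ok(Outcome::AlreadyPresent);
    }
    dev.write_setting(MITIGATED_VALUE)?;
    // Read back so a silently dropped write is reported as an error.
    if dev.read_setting()? == MITIGATED_VALUE {
        Ok(Outcome::Applied)
    } else {
        Err(BusError)
    }
}

/// In-memory fake used to exercise the logic without hardware.
struct FakeConverter {
    setting: u16,
}

impl Converter for FakeConverter {
    fn read_setting(&mut self) -> Result<u16, BusError> {
        Ok(self.setting)
    }
    fn write_setting(&mut self, value: u16) -> Result<(), BusError> {
        self.setting = value;
        Ok(())
    }
}
```

Because nothing is persisted, a converter that loses its volatile configuration simply gets the mitigation again on the next call, and a future firmware version can change `MITIGATED_VALUE` without having to undo anything.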

630 - BMR491 Glitch Mitigation Plan / RFD / Oxide

Yeah, it's not great that I left an empty Security Considerations section at the end, but I didn't expect this to be published. It's a bit of a work in progress still. 🤷
@cliffle It does scare me a bit that a management controller can make gross changes to the power supply voltages... and write them to flash.

@penguin42 good reason to trust your management controller!

(This is also part of why we don't give the Big CPU direct access to these buses -- it's another layer someone will have to crack before achieving persistent power supply changes or whatever.)

@cliffle Good! Although I assume that's on a management vlan or something which are always fun 🙂
@cliffle @penguin42 If I had to design a full-featured BMC, I would limit access to PMBus to either a separate microcontroller or the Arm secure world. This would not fully block access to the PSU, but it would validate all requests and ensure that they were reasonable.
@alwayscurious @penguin42 that's what we've been shipping! 😁

@cliffle @penguin42 So the service processor filters PMBus access? Nice!
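To make the "validate all requests" idea concrete, here is a rough sketch of an allowlist-style filter. It is not Oxide's implementation: the `Request` and `Denied` types, the policy, and the voltage limits are invented for illustration, and the sketch assumes output voltage values have already been decoded to millivolts. Only the command codes are taken from the PMBus command set.

```rust
// Standard PMBus command codes.
const STORE_USER_ALL: u8 = 0x15;
const VOUT_COMMAND: u8 = 0x21;
const READ_VIN: u8 = 0x88;
const READ_VOUT: u8 = 0x8B;

// Placeholder limits, chosen only for the example.
const VOUT_MIN_MV: u16 = 11_500;
const VOUT_MAX_MV: u16 = 12_500;

/// Hypothetical request shapes a client might submit to the filter.
#[derive(Debug, PartialEq, Eq)]
enum Request {
    Read { command: u8 },
    WriteWord { command: u8, value_mv: u16 },
    Send { command: u8 },
}

#[derive(Debug, PartialEq, Eq)]
enum Denied {
    CommandNotAllowed,
    ValueOutOfRange,
}

/// Accept only requests on the allowlist; everything else is refused.
fn validate(req: &Request) -> Result<(), Denied> {
    match *req {
        // Telemetry reads are harmless.
        Request::Read { command: READ_VIN } | Request::Read { command: READ_VOUT } => Ok(()),
        // Output voltage changes are allowed only inside a narrow window.
        Request::WriteWord { command: VOUT_COMMAND, value_mv } => {
            if (VOUT_MIN_MV..=VOUT_MAX_MV).contains(&value_mv) {
                Ok(())
            } else {
                Err(Denied::ValueOutOfRange)
            }
        }
        // Persisting settings to non-volatile memory is refused explicitly.
        Request::Send { command: STORE_USER_ALL } => Err(Denied::CommandNotAllowed),
        // Default deny: anything not listed above is rejected.
        _ => Err(Denied::CommandNotAllowed),
    }
}
```

The default-deny arm is the important design choice: a new or unexpected command is rejected until someone decides it is reasonable.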

Off-topic, but I wonder if Oxide will ever include an SQL database as a “managed service” (special-case resource). A distributed SQL database with access to node-local storage should have better performance than if it must go through Crucible, and there is no loss of redundancy because that is handled at the database layer. The same goes for object storage, which also likes local access to storage. See Bluestore vs Filestore in Ceph.

@cliffle I'm glad that section is still in the template

The way I think about this, the decision not to persist the upgrade in eeprom avoids "splitting" the server fleet into disjoint models.

If you've shipped a bunch of Foobar Rev A machines, and then an update applies a persistent power supply fix to half of them... well, now you've got a mixed fleet of Rev A hardware and Rev A With A Change hardware. We try really hard to avoid this if we can.

So, being able to apply the mitigation from the service processor means you just have to check our overall software version to check if it's present, not track separate SKUs. And I like that.

@cliffle I just had a peep at the open position descriptions and man oxide are doing some cool stuff

@cliffle This is a fantastic write-up.

NGL, I'd worry about the supplier suddenly starting to ship R1D as R1C (without telling you about the "upgrade"), violating your assumption that the inner firmware configuration stays the same.

There are so many things to track, this is crazy. This is so tedious, I don't even know how you all have achieved that.

@baloo so the good news is, applying the R1C mitigation to a fixed unit has no impact -- you just lose a safety feature that we've already built redundant coverage for.

We've started getting the R1D parts in, and while the manufacturer didn't change the part number (frustrating!), they appear to work as expected.

@cliffle that was more of a comment on this:
```
The main advantage of this is flexibility. If we want to apply a different or more complex configuration change in a future firmware version, we don’t need to worry about "undoing" the previous persistent mitigation, and we don’t need to distinguish between BMR491s that have had the mitigation applied vs those newly installed in manufacturing.
```

And that you sound like you're hoping for a single "code path" for the mitigation.

I was saying that having R1C and R1D would create two code paths, but it sounds like you already hit that.

@baloo ...yeah, I'm interested in having as few weird code paths as possible, but no fewer. Skipping the mitigation on fixed parts is desirable and worth spending the additional complexity* on.

*the additional complexity: https://github.com/oxidecomputer/hubris/blob/40077e040d5f218c78a74eb500d70d6c072ac628/drv/i2c-devices/src/bmr491.rs#L231
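For readers who don't want to click through, the shape of that extra code path is roughly a revision check in front of the mitigation. The sketch below is not the linked Hubris code: `Revision`, `parse_revision`, and `needs_mitigation` are hypothetical names, the revision is assumed to be available as a string, and treating unknown revisions like R1C is an assumption of this sketch, based only on the point above that applying the mitigation to a fixed unit does no harm.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Revision {
    R1C,
    R1D,
    Unknown,
}

/// Map a revision string to a known revision.
/// How the real firmware learns the revision is not covered here.
fn parse_revision(s: &str) -> Revision {
    match s.trim() {
        "R1C" => Revision::R1C,
        "R1D" => Revision::R1D,
        _ => Revision::Unknown,
    }
}

/// Decide whether the mitigation should be applied to this part.
fn needs_mitigation(rev: Revision) -> bool {
    match rev {
        // Affected revision: mitigation required.
        Revision::R1C => true,
        // Fixed revision: skip, so the part keeps its own safety feature.
        Revision::R1D => false,
        // Assumption of this sketch: when in doubt, apply the mitigation.
        Revision::Unknown => true,
    }
}
```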
