Oxide has published my doc on power supply glitch mitigation on our servers, which required some reverse engineering of a vendor fix (that wasn't quite right).

https://rfd.shared.oxide.computer/rfd/630

To me, the most interesting part was this: because of where in our stack the mitigation gets applied, we decided _not_ to persist it in flash/EEPROM on the IBC, since a persisted setting only takes effect after a power cycle, and we don't fully power cycle. So, we defined a way to test for the mitigation's presence and re-apply it if required at each update (or power-on).
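A minimal sketch of that "check, then re-apply" pattern, in Rust since that's what our firmware is written in. Everything here is invented for illustration (the `Ibc` type, register name, and `DESIRED_VALUE` are hypothetical, not the actual identifiers or values from the RFD); the point is only that the routine is idempotent and touches no non-volatile storage:

```rust
/// Assumed mitigation setting; the real value and register are in the RFD.
const DESIRED_VALUE: u8 = 0x2a;

/// Stand-in for the intermediate bus converter (IBC).
struct Ibc {
    /// Volatile operating register: lost on a full power cycle,
    /// but *not* written through to flash/EEPROM.
    mitigation_reg: u8,
}

impl Ibc {
    fn read_mitigation(&self) -> u8 {
        self.mitigation_reg
    }
    fn write_mitigation(&mut self, v: u8) {
        self.mitigation_reg = v;
    }
}

/// Run at every firmware update (or power-on): if the mitigation is
/// already present, do nothing; otherwise (re-)apply it.
/// Returns true if a write was needed this time.
fn ensure_mitigation(ibc: &mut Ibc) -> bool {
    if ibc.read_mitigation() == DESIRED_VALUE {
        false // already applied; idempotent no-op
    } else {
        ibc.write_mitigation(DESIRED_VALUE);
        true // (re-)applied on this pass
    }
}
```

Because the routine checks before writing, it's safe to run unconditionally on every boot and every update, which is what makes the non-persistent approach workable.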

This means the customer gets the mitigation immediately upon updating our firmware, and that we can adjust the mitigation in any future update if we learn more.

630 - BMR491 Glitch Mitigation Plan / RFD / Oxide

The way I think about this, the decision not to persist the mitigation in EEPROM avoids "splitting" the server fleet into disjoint models.

If you've shipped a bunch of Foobar Rev A machines, and then an update applies a persistent power supply fix to half of them... well, now you've got a mixed fleet of Rev A hardware and Rev A With A Change hardware. We try really hard to avoid this if we can.

So, being able to apply the mitigation from the service processor means you only have to check our overall software version to know whether it's present, not track separate SKUs. And I like that.

@cliffle I just had a peep at the open position descriptions and man oxide are doing some cool stuff