I’ve created a data center-scale discrete-event model that answers questions about how spares, data center technicians, and different repair automations affect the overall availability of rack-scale GPU systems.
Bummed that it has no value to anyone in the HPC community tho.




