@TindrasGrove Lots of topics shape different parts of this conversation. You can't prevent failure, so the focus is on preventing a service disruption: RTO and RPO for data; error budgets for application reliability; broadcast domains, STP, routing protocol design, and ECMP for networking. Infrastructure and logical systems design each bring different architectural considerations. You really have to pick one piece and drill down, but you can't focus only on that one area. Zoom out and everyone will tell you it's a disaster recovery discussion. But that's short-sighted. A better approach is to design so the disaster doesn't happen in the first place, i.e. disaster avoidance. Go one step further and the topic widens to business continuity. This is really the starting point. You have to define what's critical, what the interdependencies are, and how much loss you can sustain, then build the mitigation strategies. Not having this defined, along with a playbook for business continuity, is exactly what should keep CxOs up at night.