Why Distributed Systems Fail (And How Elite Engineers Prevent It) #DistributedSystems #SystemDesign #SoftwareEngineering

Most production outages don’t happen because software breaks. They happen because systems fail badly. Learn the real engineering behind building resilient distributed systems: circuit breakers, retry storms, load shedding, fault isolation, chaos engineering, and AWS-scale resilience patterns. A must-read deep dive for software engineers, architects, and engineering leaders building systems that must stay online. #DistributedSystems #Microservices #SystemDesign #ResilienceEngineering #Java #AWS #SoftwareArchitecture

https://atozofsoftwareengineering.blog/2026/05/11/why-distributed-systems-fail-and-how-elite-engineers-prevent-it-distributedsystems-systemdesign-softwareengineering/

If the Transaction Coordinator crashes after collecting votes but before sending its decision, participants remain blocked until recovery. This blocking behavior is a known limitation of 2PC in distributed fintech systems.
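The blocking window described above can be illustrated with a minimal in-memory sketch. It is a toy model, not a real transaction manager: the `Participant` class, the `run_two_phase_commit` function, and the `crash_before_decision` flag are hypothetical names introduced only for this illustration, and the coordinator crash is simulated by returning early.

```python
from enum import Enum


class State(Enum):
    INIT = "init"
    PREPARED = "prepared"    # voted yes, waiting for the coordinator's decision
    COMMITTED = "committed"
    ABORTED = "aborted"


class Participant:
    """Hypothetical participant that always votes yes."""

    def __init__(self, name):
        self.name = name
        self.state = State.INIT

    def prepare(self):
        # Phase 1: vote yes and hold resources until a decision arrives.
        self.state = State.PREPARED
        return True

    def decide(self, commit):
        # Phase 2: apply the coordinator's decision.
        self.state = State.COMMITTED if commit else State.ABORTED

    def is_blocked(self):
        # A prepared participant cannot safely commit or abort on its own.
        return self.state is State.PREPARED


def run_two_phase_commit(participants, crash_before_decision=False):
    # Phase 1: collect votes from every participant.
    votes = [p.prepare() for p in participants]
    if crash_before_decision:
        # Simulated coordinator crash: no decision is ever sent.
        return None
    # Phase 2: broadcast the decision.
    commit = all(votes)
    for p in participants:
        p.decide(commit)
    return commit


participants = [Participant("ledger"), Participant("payments")]
outcome = run_two_phase_commit(participants, crash_before_decision=True)
blocked = [p.name for p in participants if p.is_blocked()]
```

In this sketch every participant ends up in `PREPARED` with no way forward, which is the state they stay in until the coordinator recovers.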

#DistributedSystems #Fintech #FaultTolerance

Artemis II fault tolerance

Communications of the ACM had a fascinating post about how NASA built Artemis II’s fault-tolerant computer. 3 fascinating excerpts. (1) Eight modules with several backup scenarios: “Or…

A Learning a Day

If a lock holder crashes, the lock must expire via time-to-live. If the TTL is too short, a slow process may lose its lock while still executing, allowing a second process to proceed — a correctness failure in fintech.
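The too-short-TTL failure can be reproduced with a small sketch. The `TTLLock` class is hypothetical and in-memory only; the current time is passed in as `now` so the timeline can be replayed without real sleeps, and the 5-second TTL is an arbitrary value chosen for illustration.

```python
class TTLLock:
    """Hypothetical in-memory lock that expires after `ttl` seconds."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, owner, now):
        # The lock is free if it was never held or the holder's TTL has elapsed.
        if self.holder is None or now >= self.expires_at:
            self.holder = owner
            self.expires_at = now + self.ttl
            return True
        return False

    def is_held_by(self, owner, now):
        return self.holder == owner and now < self.expires_at


lock = TTLLock(ttl=5.0)
a_acquired = lock.acquire("A", now=0.0)        # A takes the lock at t=0
b_early = lock.acquire("B", now=3.0)           # B is refused while the TTL is live
# A is slow and is still mid-operation after t=5.
b_late = lock.acquire("B", now=6.0)            # TTL has expired, so B gets the lock
a_still_owner = lock.is_held_by("A", now=6.0)  # A has silently lost ownership
```

At t=6 both A and B believe they may proceed, which is the correctness failure the post describes.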

#DistributedLocking #FaultTolerance #Fintech

Consensus requires a quorum — a majority of nodes — to agree before a value is committed. This majority requirement is what allows the system to continue despite node failures.
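The majority arithmetic can be sketched in a few lines. The function names below are hypothetical helpers for illustration, not part of any consensus library.

```python
def quorum_size(n):
    # Smallest number of nodes that forms a strict majority of n.
    return n // 2 + 1


def tolerated_failures(n):
    # Nodes that can be down while a majority is still reachable.
    return n - quorum_size(n)


def can_commit(n, acks):
    # A value is committed only once a majority has acknowledged it.
    return acks >= quorum_size(n)
```

For a 5-node cluster this gives a quorum of 3 and tolerance for 2 failed nodes; a 4-node cluster also needs 3 and tolerates only 1, which is why odd cluster sizes are the usual choice.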

#Consensus #FaultTolerance #Fintech

BSDCan https://www.bsdcan.org/2026/ Talk Friday 2026-06-19: 15:45 - 16:35 DMS 1110
Geographically fault-tolerant SSH on OpenBSD
Rob Keizer
https://www.bsdcan.org/2026/timetable/timetable-Geographically-fault-tolerant-SSH.html
To register https://www.bsdcan.org/2026/registration.html @bsdcan #openbsd #ssh #faulttolerance
BSDCan - BSDCan

BSDCan is a technical BSD conference held in Ottawa, Ontario, Canada.

BSDCan

𝗬𝗮𝗳𝗮𝗻 𝗛𝘂𝗮𝗻𝗴 [Advisor: Guanpeng Li] will defend his doctoral thesis, entitled "Data-efficient and Fault-tolerant Exascale Computing", tomorrow, Thursday 4/16, at 3pm.

Deets at https://bit.ly/huang_4_16

#FinalExam #PhDLife #UIowaGrad26 #HPC #FaultTolerance

So apparently, NASA's secret sauce for building a "fault-tolerant" computer involves getting blocked by #Cloudflare while trying to access the article. 🚫✨ Who knew #cybersecurity was just a fancy way of saying you can't read about computers? 🤖🔒
https://cacm.acm.org/news/how-nasa-built-artemis-iis-fault-tolerant-computer/ #NASA #FaultTolerance #ComputerScience #HackerNews #ngated
How NASA Built Artemis II’s Fault-Tolerant Computer

Communications of the ACM

For those with more than a passing interest in information systems security (OK, not everyone 🤣), this might prove interesting. Although, as commented,

‘build the governance first, get the key parties committed, define the trust roots, enforce the rules – is precisely the kind of process that works in Switzerland and struggles almost everywhere else.’

https://www.theregister.com/2026/03/17/switzerland_bgp_alternative/

#IT #Security #FaultTolerance #ETH #Switzerland

Switzerland built a secure alternative to BGP. The rest of the world hasn't noticed yet

Feature: SCION: Proven in banking and healthcare, slow to spread everywhere else

The Register