You should map every external system dependency and classify each one by controllability and failure impact, so that your mapping sets priorities in the same way. For a technology B2C family business, the dependency mapping might look like this. The lead engineer maps every external system dependency into one document: a matrix with two axes. The first axis is controllability, which runs from low (the team has no control) to high (the team has full control). The second axis is failure impact, which runs from low (a failure causes minor disruption) to high (a failure causes major disruption).
The matrix has four quadrants. Quadrant one is low controllability and high failure impact. This is the danger zone, and it requires immediate action. Quadrant two is high controllability and high failure impact. This is the investment zone, and it requires building internal solutions. Quadrant three is low controllability and low failure impact. This is the monitoring zone, and it requires watching. Quadrant four is high controllability and low failure impact. This is the maintenance zone, and it requires routine care.
The lead engineer maps the three dependencies. Dependency one is the Fitbit API. It has low controllability because Fitbit controls the API, and high failure impact because data sync errors affect all users. That puts the Fitbit API in quadrant one, the danger zone, which requires immediate action. Dependency two is the Stripe API. It has low controllability because Stripe controls the API, and high failure impact because transaction failures affect revenue. That puts the Stripe API in quadrant one as well. Dependency three is the Facebook API. It has low controllability because Facebook controls the API, and medium failure impact because sharing failures affect engagement but not core functionality. On a matrix that only distinguishes low from high, that counts as low impact, so the Facebook API sits in quadrant three, the monitoring zone, which requires watching.
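The classification above can be written down as a small lookup, which keeps the matrix reviewable alongside the code. This is a minimal sketch, not the team's actual tooling: the names `Level`, `Dependency`, `QUADRANTS`, and `danger_zone` are hypothetical, and treating the Facebook API's medium impact as low is an assumption made so that it lands in the monitoring zone as the text says.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    LOW = "low"
    HIGH = "high"


# (controllability, failure impact) -> (zone, required action),
# following the four quadrants described in the text.
QUADRANTS = {
    (Level.LOW, Level.HIGH): ("danger zone", "immediate action"),
    (Level.HIGH, Level.HIGH): ("investment zone", "build internal solutions"),
    (Level.LOW, Level.LOW): ("monitoring zone", "watch"),
    (Level.HIGH, Level.LOW): ("maintenance zone", "routine care"),
}


@dataclass(frozen=True)
class Dependency:
    name: str
    controllability: Level
    failure_impact: Level

    def quadrant(self) -> tuple[str, str]:
        """Return (zone, required action) for this dependency."""
        return QUADRANTS[(self.controllability, self.failure_impact)]


# Assumption: "medium" impact is recorded as LOW because the matrix
# has only two levels and the text places Facebook in quadrant three.
DEPENDENCIES = [
    Dependency("Fitbit API", Level.LOW, Level.HIGH),
    Dependency("Stripe API", Level.LOW, Level.HIGH),
    Dependency("Facebook API", Level.LOW, Level.LOW),
]


def danger_zone(deps):
    """Names of the dependencies that need immediate action."""
    return [d.name for d in deps if d.quadrant()[0] == "danger zone"]


for dep in DEPENDENCIES:
    zone, action = dep.quadrant()
    print(f"{dep.name}: {zone} -> {action}")
```

Calling `danger_zone(DEPENDENCIES)` returns the Fitbit API and the Stripe API, which matches the two danger zone dependencies in the example.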
The mapping is complete, and the completed mapping creates clarity: two of the three dependencies are in the danger zone and require immediate action. That immediate action is anticipation. For a Lean team of two to five, the dependency mapping should be a matrix with two axes and four quadrants, and it should be done immediately. For Lean, the dependency mapping should be part of the team's value stream mapping. The mapping is a value stream activity.
2. Build an Internal Anticipation Layer That Detects External Dependency Failures Before They Reach Users
Honda built an internal anticipation layer: a system that detected external failures before they reached users. Detecting failures early prevented user impact. Preventing user impact created stability, and that stability built Honda.
You should build an internal anticipation layer that detects external dependency failures before they reach users, with the same aim of creating stability. For a technology B2C family business, the internal anticipation layer might look like this. The lead engineer builds a monitoring system: a set of automated checks that run every sixty seconds and test the three external dependencies.
Check one is the Fitbit API health check. It sends a test request: a data sync for a test user, which is a fake account with fake data. Syncing the fake data tests the API. If the sync fails, the check triggers an alert that notifies the team immediately. An immediate notification means a fast response, and a fast response prevents user impact.
Check two is the Stripe API health check. It sends a test transaction: a one cent charge against a Stripe test card. Charging the test card tests the API. If the charge fails, the check triggers an alert that notifies the team immediately. An immediate notification means a fast response, and a fast response prevents user impact.
Check three is the Facebook API health check. It sends a test share: a post to a Facebook test page. Publishing the post tests the API. If the share fails, the check triggers an alert that notifies the team immediately. An immediate notification means a fast response, and a fast response prevents user impact.
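The three checks share one shape: run a probe, and alert the team if it fails. The sketch below shows that shape under stated assumptions. The probe functions are placeholders that do not call Fitbit, Stripe, or Facebook; in a real system each would perform the test sync, test charge, or test share described above. The names `run_checks`, `monitor`, and the probe names are hypothetical.

```python
import time
from typing import Callable

# A probe returns True when the dependency answered correctly.
Probe = Callable[[], bool]
Alert = Callable[[str], None]


def run_checks(probes: dict[str, Probe], alert: Alert) -> list[str]:
    """Run every probe once; alert on and return the names that failed."""
    failed = []
    for name, probe in probes.items():
        try:
            healthy = probe()
        except Exception:
            # A probe that raises (timeout, connection error) is a failure.
            healthy = False
        if not healthy:
            failed.append(name)
            alert(f"{name} health check failed")
    return failed


def monitor(probes, alert, interval_seconds=60.0, max_rounds=None,
            sleep=time.sleep):
    """Repeat run_checks every interval_seconds.

    max_rounds=None runs forever; a number is useful for tests.
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        run_checks(probes, alert)
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            sleep(interval_seconds)


# Placeholder probes (assumptions): real ones would talk to the APIs.
def fitbit_sync_probe() -> bool:
    return True


def stripe_charge_probe() -> bool:
    raise TimeoutError("simulated outage")


def facebook_share_probe() -> bool:
    return True


alerts: list[str] = []
failed = run_checks(
    {
        "Fitbit API": fitbit_sync_probe,
        "Stripe API": stripe_charge_probe,
        "Facebook API": facebook_share_probe,
    },
    alerts.append,  # stand-in for paging or messaging the team
)
print(failed)
```

Passing the alert function in as a parameter keeps the check logic independent of how the team is notified, so the same loop works with email, chat, or a pager.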
The monitoring system is built. Building it takes one week, and that week creates an anticipation layer that detects failures before they reach users. Last month, the monitoring system detected a Fitbit API failure at three AM, before the peak hours of six AM to eight AM. The early detection gave the team three hours to implement a workaround: a manual sync option that let users sync their data themselves. The manual sync option prevented user impact and saved the company twelve thousand dollars, the cost of the subscriptions that would have been lost without the monitoring system.
For a Lean team of two to five, the internal anticipation layer should be a monitoring system that runs automated checks at least every sixty seconds and triggers alerts. For Lean, the internal anticipation layer should be part of the team's build-measure-learn cycle. The layer is a measure activity.
3. Create Internal Fallback Systems for Every Danger Zone Dependency So the Product Keeps Working When the External System Fails
Honda created internal fallback systems: backups that kept the product working when an external system failed. That created resilience, and the resilience built Honda.
You should create internal fallback systems for every danger zone dependency so the product keeps working when the external system fails, with the same aim of creating resilience. For a technology B2C family business, the internal fallback systems might look like this. The lead engineer creates fallback systems for the two danger zone dependencies: the Fitbit API and the Stripe API.
Fallback one is the Fitbit API fallback: a local data store on the user's phone. The phone stores the workout data locally, so the data is available even when the Fitbit API is down. That prevents data loss. Preventing data loss creates continuity, continuity creates user trust, and user trust reduces churn.
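A local-first store of this kind can be sketched as "write locally, sync when possible". The example below is illustrative only and makes several assumptions: it uses SQLite as the on-device store, it assumes workouts are pushed from the phone to a remote service, and the `upload` callable is a hypothetical stand-in for whatever sync call the app really makes. `LocalWorkoutStore` is not a name from the original text.

```python
import json
import sqlite3


class LocalWorkoutStore:
    """Keeps workouts on the device and tracks which still need syncing."""

    def __init__(self, path=":memory:"):
        # ":memory:" is for the demo; a phone would use a file path.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS workouts ("
            "id INTEGER PRIMARY KEY, "
            "payload TEXT NOT NULL, "
            "synced INTEGER NOT NULL DEFAULT 0)"
        )

    def save(self, workout: dict) -> int:
        """Write locally first, so the data survives an API outage."""
        cur = self.db.execute(
            "INSERT INTO workouts (payload) VALUES (?)",
            (json.dumps(workout),),
        )
        self.db.commit()
        return cur.lastrowid

    def pending(self) -> list:
        """Workouts that have not been synced yet, oldest first."""
        rows = self.db.execute(
            "SELECT id, payload FROM workouts WHERE synced = 0 ORDER BY id"
        ).fetchall()
        return [(row_id, json.loads(payload)) for row_id, payload in rows]

    def sync(self, upload) -> int:
        """Try to upload every pending workout; return how many succeeded."""
        done = 0
        for row_id, workout in self.pending():
            try:
                upload(workout)
            except Exception:
                break  # API still down: keep the rest for the next attempt
            self.db.execute(
                "UPDATE workouts SET synced = 1 WHERE id = ?", (row_id,)
            )
            done += 1
        self.db.commit()
        return done


def upload_while_down(workout):
    raise ConnectionError("simulated Fitbit API outage")


store = LocalWorkoutStore()
store.save({"type": "run", "minutes": 30})
store.save({"type": "ride", "minutes": 45})

synced_during_outage = store.sync(upload_while_down)
pending_during_outage = len(store.pending())

uploaded = []
synced_after_recovery = store.sync(uploaded.append)
```

During the simulated outage nothing is uploaded and both workouts stay on the device; once the upload call works again, both are sent and marked as synced.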
Fallback two is the Stripe API fallback: a transaction queue. The queue stores failed transactions so that no transaction is lost. Losing no transactions creates reliability, reliable transactions protect revenue, and protected revenue means lower losses.
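One way to picture the transaction queue is a wrapper that tries the charge and keeps the transaction when the charge fails. This is a minimal in-memory sketch, not a production design: `TransactionQueue`, `Transaction`, and `fake_charge` are hypothetical names, the amount is made up, and a real queue would need durable storage. The `idempotency_key` field is an assumption added so that a retried charge can be recognised by the payment provider and not applied twice.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Transaction:
    customer_id: str
    amount_cents: int
    idempotency_key: str  # lets a retry be told apart from a new charge


class TransactionQueue:
    """Tries each charge once and keeps failed ones for a later retry."""

    def __init__(self, charge):
        self._charge = charge      # callable that raises on failure
        self._pending = deque()

    def submit(self, tx: Transaction) -> bool:
        """Return True if charged now, False if queued for later."""
        try:
            self._charge(tx)
            return True
        except Exception:
            self._pending.append(tx)
            return False

    def retry_pending(self) -> int:
        """Retry each queued transaction once; return how many went through."""
        completed = 0
        for _ in range(len(self._pending)):
            tx = self._pending.popleft()
            try:
                self._charge(tx)
                completed += 1
            except Exception:
                self._pending.append(tx)  # still failing: keep it queued
        return completed

    def __len__(self) -> int:
        return len(self._pending)


# Simulated payment call (assumption): fails while the outage flag is set.
outage = {"active": True}
charged = []


def fake_charge(tx: Transaction) -> None:
    if outage["active"]:
        raise RuntimeError("simulated rate limit")
    charged.append(tx.idempotency_key)


queue = TransactionQueue(fake_charge)
for i in range(147):
    queue.submit(Transaction(f"cust-{i}", 999, f"order-{i}"))
queued_during_outage = len(queue)

outage["active"] = False
recovered = queue.retry_pending()
```

In this run all 147 transactions are queued during the simulated outage and all 147 go through on the retry, so none is lost.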
The fallback systems are built. Building them takes two weeks, and those two weeks create resilience: the product keeps working when the external system fails, which prevents user impact. Last month, the Stripe API had a rate limit failure at seven AM, during the peak hours of six AM to eight AM. The failure affected one hundred and forty-seven transactions. All one hundred and forty-seven were queued, so no transaction was lost and no revenue was lost. That saved the company eighteen thousand dollars, the cost of the subscriptions that would have been lost without the fallback system.
For a Lean team of two to five, internal fallback systems should be built for every danger zone dependency. They should ensure that the product keeps working, and they should be built within two weeks. For Lean, the internal fallback systems should be part of the team's build-measure-learn cycle. The fallback systems are a build activity.
4. Run a Feedback Loop Every Iteration to Review External Dependency Performance and Iterate on the Anticipation and Fallback Systems
Honda ran a feedback loop that reviewed external dependency performance. The review identified improvements, and the improvements were implemented. Better systems followed, and better systems created better outcomes. Those outcomes built Honda.
You should run a feedback loop every iteration to review external dependency performance and iterate on the anticipation and fallback systems, with the same aim of creating improvement. For a technology B2C family business, the feedback loop might look like this. The lead engineer runs a feedback loop: a thirty-minute meeting held every two weeks, once per iteration. The meeting reviews external dependency performance, and the review is based on data from the monitoring system.
The monitoring system has three metrics. Metric one is the number of external dependency failures: the total count of failures in the last two weeks. Metric two is the number of user impacting failures: the count of failures that reached users. Metric three is mean time to detect: the average time between a failure and its detection.
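The three metrics can be computed from a simple incident log. The sketch below assumes each incident records when the failure started, when it was detected, and whether it reached users. The `Incident` record, the `review_metrics` function, and the sample dates are all hypothetical and exist only to illustrate the calculation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    dependency: str
    failed_at: datetime
    detected_at: datetime
    reached_users: bool


def review_metrics(incidents, window_end, window=timedelta(days=14)):
    """Compute the three review metrics for the last `window` of time."""
    start = window_end - window
    recent = [i for i in incidents if start <= i.failed_at <= window_end]
    delays = [
        (i.detected_at - i.failed_at).total_seconds() for i in recent
    ]
    return {
        "failures": len(recent),
        "user_impacting_failures": sum(i.reached_users for i in recent),
        # None when there were no failures, to avoid dividing by zero.
        "mean_time_to_detect_seconds": (
            sum(delays) / len(delays) if delays else None
        ),
    }


# Made-up sample data for illustration only.
t0 = datetime(2024, 1, 15, 3, 0)
incidents = [
    Incident("Fitbit API", t0, t0 + timedelta(seconds=30), False),
    Incident("Stripe API", t0 + timedelta(days=1),
             t0 + timedelta(days=1, seconds=60), True),
    # Older than two weeks, so it falls outside the review window.
    Incident("Facebook API", t0 - timedelta(days=30),
             t0 - timedelta(days=30) + timedelta(seconds=20), False),
]

metrics = review_metrics(incidents, window_end=t0 + timedelta(days=2))
print(metrics)
```

With this sample data the review sees two failures, one of which reached users, and a mean time to detect of 45 seconds.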
The three metrics are reviewed, and the review identifies improvements. Last iteration, the review revealed that the mean time to detect was forty-five seconds. That was too long: on average, users were affected for forty-five seconds before the team knew. That user impact caused frustration, and the frustration caused churn. The improvement was to reduce the check interval from sixty seconds to fifteen seconds, which reduced the mean time to detect from forty-five seconds to twelve seconds. That is a seventy-three percent improvement. The improvement reduced user impact, and the reduced user impact reduced churn. The reduction in churn saved the company eight thousand dollars, the cost of the churn that would have happened without the improvement.
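The seventy-three percent figure follows from the drop from forty-five seconds to twelve seconds, and the arithmetic is easy to check. The sketch below also includes a rough rule of thumb for polling: if failures start at a random moment between checks and the check itself takes no time, the average wait for the next check is about half the interval. That model is an assumption added here, not a claim from the original text, and both function names are hypothetical.

```python
def percent_improvement(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


def expected_polling_delay(interval_seconds: float) -> float:
    """Average wait until the next check, assuming failures start at a
    uniformly random moment between checks and checks are instantaneous."""
    return interval_seconds / 2


mttd_gain = percent_improvement(45, 12)
print(f"{mttd_gain:.1f}% improvement")

# Under the simple model, polling alone accounts for part of the delay;
# the rest would come from probe time, timeouts, and alert delivery.
delay_at_60s = expected_polling_delay(60)
delay_at_15s = expected_polling_delay(15)
```

The measured values in the text, forty-five and twelve seconds, sit above the polling-only estimates of thirty and seven and a half seconds, which is what one would expect once probe and alerting time are included.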
The feedback loop is run every iteration, and running it every iteration creates continuous improvement. Continuous improvement creates better systems, and better systems create better outcomes. For a Lean team of two to five, the feedback loop should happen every iteration, review at least three metrics, and identify at least one improvement. For Lean, the feedback loop should be part of the team's build-measure-learn cycle. The feedback loop is a learn activity.
Closing on Engineering Over Hoping
Soichiro Honda did not build Honda by complaining about external suppliers, blaming external partners, waiting for external fixes, or hoping that external systems would become more reliable. He built it by mapping every external system dependency and classifying each one by controllability and failure impact. In the example business, the three external dependencies were mapped to a matrix with two axes and four quadrants, and the Fitbit API and the Stripe API were identified as danger zone dependencies that required immediate action.
He built an internal anticipation layer that detected external dependency failures before they reached users. In the example business, the monitoring system ran automated checks every sixty seconds. The three AM detection of the Fitbit API failure gave the team three hours to implement a workaround, and the manual sync option prevented user impact and saved twelve thousand dollars.
He created internal fallback systems for every danger zone dependency so the product kept working when the external system failed. In the example business, the local data store kept workout data available even when the Fitbit API was down, and the transaction queue ensured no transaction was lost when the Stripe API had a rate limit failure. Queuing one hundred and forty-seven transactions saved eighteen thousand dollars.
He ran a feedback loop every iteration to review external dependency performance and iterate on the anticipation and fallback systems. In the example business, the thirty-minute meeting every two weeks reviewed three metrics. The review revealed that the mean time to detect was forty-five seconds. Reducing the check interval from sixty seconds to fifteen seconds cut the mean time to detect to twelve seconds. The seventy-three percent improvement reduced churn and saved eight thousand dollars.