# Creating Effective Incident Response Through Innovation and Collaboration

A retail platform company running Extreme Programming (XP) across eight teams has a problem with slow incident response. The company runs an e-commerce platform connecting independent retailers with consumers. It has 920 employees and has been operating for thirteen years.

The product development organization for the new order fulfillment system has 74 people across eight teams. Each team has nine or ten people. When a production incident occurs, the average resolution time is four hours and thirty minutes.

That downtime means retailers can't process orders and consumers abandon their carts. The lost transaction fees cost the company $189,000 per quarter, which is 29 percent of the quarterly revenue from the order fulfillment system.

Masaru Ibuka, the co-founder of Sony, worked from a simple principle: put the right people in the room, then let them solve it. For a retail platform enterprise, the incident response problem comes down to the same thing. When a production incident occurs, the wrong people are in the room. The right expertise is missing. Incidents get resolved slowly.

Applying Ibuka's innovation through collaboration to incidents makes response faster and saves time and money.

## The Core Principle

Ibuka's innovation through collaboration was built on a simple insight, and it carries over to incident response across multiple teams: stop routing incidents through a single on-call person who has to figure everything out alone and escalate slowly.

Instead, create a cross-functional incident squad for every major incident. Include one person from each relevant team. Give that squad the authority and access to resolve the incident immediately without waiting for approvals.

The right mix of expertise is in the room from minute one. Resolution time drops from four hours and thirty minutes to under one hour.

When Sony built the first Walkman, a project that grew out of Ibuka's request for a portable stereo player, the audio team didn't work alone. The audio team, the mechanical team, and the industrial design team worked in one room. Every problem had the right mix of knowledge. Problems got solved fast.

Ibuka applied the same thinking to other products, including the earlier Trinitron television. Each time, he put cross-functional teams in one room and let them solve it. That approach built Sony.

## Four Steps to Apply Innovation Through Collaboration

### Step 1: Put the Right People in the Room

Ibuka put the right people in the room at Sony. Every problem had the right mix of knowledge. Problems got solved fast.

Do the same by creating an Incident Squad Roster. It lists one designated responder from each of the eight teams, and it defines who leads the squad for each type of major incident before any incident starts.

For a retail platform enterprise, the roster is a one-page document with eight rows, one per team.

| Row | Team | Designated responder | Experience |
|---|---|---|---|
| 1 | Order processing | Alice | 5 years |
| 2 | Payment | Bob | 6 years |
| 3 | Inventory | Carol | 4 years |
| 4 | Shipping | Dave | 7 years |
| 5 | Database | Eve | 5 years |
| 6 | Network | Frank | 8 years |
| 7 | Security | Grace | 6 years |
| 8 | Monitoring | Henry | 5 years |

Alice and Bob are both senior developers.

The squad lead is assigned based on the incident type. If it's an order processing incident, Alice leads. If it's a payment incident, Bob leads, and so on.

The roster is stored in a shared location so everyone can access it. When an incident occurs, the right people get paged. The right mix of expertise is in the room immediately.
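The roster and the squad-lead rule can be sketched as data plus a lookup. This is a minimal illustration in Python, not a description of the company's tooling: the names `Responder`, `ROSTER`, `squad_lead`, and `page_order` are hypothetical, and the sketch assumes all eight designated responders are paged with the lead first, since the roster doesn't say which teams count as relevant for a given incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Responder:
    team: str
    name: str
    years_experience: int


# One designated responder per team, copied from the roster rows.
ROSTER = (
    Responder("order processing", "Alice", 5),
    Responder("payment", "Bob", 6),
    Responder("inventory", "Carol", 4),
    Responder("shipping", "Dave", 7),
    Responder("database", "Eve", 5),
    Responder("network", "Frank", 8),
    Responder("security", "Grace", 6),
    Responder("monitoring", "Henry", 5),
)


def squad_lead(incident_type: str) -> Responder:
    """Return the responder whose team matches the incident type."""
    for responder in ROSTER:
        if responder.team == incident_type:
            return responder
    raise ValueError(f"no team on the roster handles {incident_type!r}")


def page_order(incident_type: str) -> list[str]:
    """Names to page, squad lead first (assumption: everyone is paged)."""
    lead = squad_lead(incident_type)
    return [lead.name] + [r.name for r in ROSTER if r != lead]
```

Calling `page_order("payment")` puts Bob first, followed by the other seven responders in roster order. An unknown incident type raises an error rather than silently paging nobody, which is one reasonable choice for a roster that must never come up empty.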

Last quarter, creating the one-page roster took two days of effort. Once it was in place, the right mix of expertise was available for every incident. That saved the company $56,000 compared to the previous approach.

### Step 2: Let Them Solve It

Ibuka let teams solve it at Sony. Problems got resolved fast. That's what built the company.

Give the Incident Squad full authority to make changes to production without waiting for approval. The squad can resolve the incident immediately.

The full authority has three parts.

First, the squad can deploy code to production. That means bugs get fixed immediately.

Second, the squad can roll back deployments. Bad deployments get undone right away.

Third, the squad can modify configuration. Settings get changed on the spot.

Document the authority so the squad knows what they can do. When people know their boundaries, they act fast instead of hesitating.
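One way to document the authority is as an explicit allow-list that people and tooling can both read. The sketch below is a minimal illustration in Python; the action labels and the `requires_approval` helper are hypothetical, not names from any real deployment tool.

```python
# Actions the squad may take without approval, per the three parts above.
# The identifiers are hypothetical labels chosen for this sketch.
SQUAD_AUTHORITY = frozenset({
    "deploy_code",
    "roll_back_deployment",
    "modify_configuration",
})


def requires_approval(action: str, authority: frozenset = SQUAD_AUTHORITY) -> bool:
    """True when the action falls outside the squad's documented authority."""
    return action not in authority
```

Anything not on the list still needs approval, so the boundary is unambiguous in both directions. Extending the authority later means adding one entry to the set.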

Last quarter, the Incident Squad was given full authority. The average resolution time dropped from four hours and thirty minutes to fifty-five minutes. That saved $63,000.

### Step 3: Create a Shared Incident Room

Ibuka created shared spaces at Sony. Communication was fast. Problems got solved fast.

Set up a dedicated digital war room that the Incident Squad uses during every major incident. All communication happens in one place. No time gets lost to scattered messages.

For a retail platform enterprise, the digital war room is a Slack channel called `incident-war-room`. Its content is organized into three sections.

Section one is the incident description. The squad lead posts the first message describing what's happening so everyone is on the same page.

Section two is investigation updates. Squad members post what they're finding so everyone knows the current status.

Section three is resolution steps. The squad lead documents each step so there's a clear record of what was done.

Last quarter, the war room took one day to set up. Using it during every major incident meant all communication happened in one place. The average resolution time dropped by an additional twenty minutes, saving $38,000.

### Step 4: Iterate by Running a Feedback Loop

Ibuka iterated at Sony. The company kept getting better. That's what made it endure.

Run a feedback loop after every major incident. Review the incident response. Update the Incident Squad Roster, the authority guidelines, and the war room process based on what worked and what didn't.

The feedback loop is a thirty-minute meeting with three parts.

The first ten minutes are for reviewing the incident response. The squad discusses what happened and identifies what worked and what didn't.

The second ten minutes are for updating the Incident Squad Roster. The squad changes the roster if needed.

The third ten minutes are for updating the authority guidelines and the war room process.
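The agenda is simple enough to write down as a timeboxed list. This is a small sketch in Python; `AGENDA` and `schedule` are hypothetical names used only to show that the three parts fill exactly thirty minutes.

```python
from datetime import timedelta

# The three ten-minute parts of the feedback loop meeting.
AGENDA = [
    ("Review the incident response", timedelta(minutes=10)),
    ("Update the Incident Squad Roster", timedelta(minutes=10)),
    ("Update the authority guidelines and war room process", timedelta(minutes=10)),
]


def schedule(agenda):
    """Return (start offset, topic) pairs and the total meeting length."""
    offset = timedelta()
    slots = []
    for topic, duration in agenda:
        slots.append((offset, topic))
        offset += duration
    return slots, offset
```

Here `schedule(AGENDA)` starts the third part at the twenty-minute mark and ends the meeting at thirty minutes.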

Last quarter, the feedback loop ran eight times across eight major incidents. Three updates were made: the designated responder was changed for two teams, the squad was given authority to restart services, and a new root cause section was added to the war room.

These three updates dropped the average resolution time by another fifteen minutes, saving an additional $32,000.
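The resolution-time figures from Steps 2 through 4 can be chained to see where the average ends up. The sketch below is plain arithmetic in Python on the numbers reported in this article. It assumes the reductions from Steps 3 and 4 stack on top of the fifty-five minutes reached in Step 2, which is how "additional" and "another" read.

```python
from datetime import timedelta

baseline = timedelta(hours=4, minutes=30)

# Step 2 is reported as an absolute result; Steps 3 and 4 as further reductions.
after_step_2 = timedelta(minutes=55)
after_step_3 = after_step_2 - timedelta(minutes=20)
after_step_4 = after_step_3 - timedelta(minutes=15)


def minutes(duration: timedelta) -> int:
    """Whole minutes in a duration."""
    return int(duration.total_seconds() // 60)


# Fraction of the original resolution time that was eliminated.
reduction = 1 - after_step_4 / baseline

print(minutes(after_step_3))  # 35
print(minutes(after_step_4))  # 20
print(f"{reduction:.1%}")     # 92.6%
```

Under that assumption, the average falls from 270 minutes to 20 minutes.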

## Closing on Putting the Right People in the Room

Masaru Ibuka didn't build Sony by siloing expertise and letting engineers work alone. He built it by putting the right people in the room and letting them solve it.

Last quarter, the combined savings from these four steps totaled $189,000. That's the entire amount the company was losing to slow incident response.
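The total can be checked against the per-step savings reported above. The sketch below is arithmetic only, written in Python with hypothetical variable names. The implied quarterly revenue is derived from the 29 percent figure and is not stated anywhere in this article, so treat it as an approximation.

```python
# Savings reported for each step, in dollars per quarter.
SAVINGS_BY_STEP = {
    "Step 1: roster": 56_000,
    "Step 2: authority": 63_000,
    "Step 3: war room": 38_000,
    "Step 4: feedback loop": 32_000,
}

QUARTERLY_LOSS = 189_000
LOSS_SHARE_OF_REVENUE = 0.29  # the loss is 29 percent of quarterly revenue

total_savings = sum(SAVINGS_BY_STEP.values())

# Derived figure, not stated in the article: revenue implied by the 29 percent.
implied_revenue = QUARTERLY_LOSS / LOSS_SHARE_OF_REVENUE

print(total_savings)           # 189000
print(round(implied_revenue))  # 651724
```

The four step-level savings sum to exactly the $189,000 quarterly loss, and the 29 percent figure implies order fulfillment revenue of roughly $652,000 per quarter.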

For a retail platform enterprise running XP across eight teams and 74 people, creating effective incident response requires the same innovation through collaboration. Create the Incident Squad Roster this week. Give the squad full authority. Set up the digital war room. Run the feedback loop after every major incident.

The company stops losing $189,000 per quarter. A retail platform enterprise learned effective incident response from a collaboration pioneer who showed that the best way to solve problems is to stop routing them through one person, put the right people in the room, and let them solve it.

#IncidentResponse #DevOps #SiteReliability #CrossFunctionalTeams #AgileEngineering #ProductionIncidents #TeamCollaboration #TechLeadership #ContinuousImprovement #XP