
Incident Management That Actually Works

SRE · DevOps

Every company has an incident management process. Few have one that works.

The typical pattern: something breaks, people panic, someone heroically fixes it at 3am, everyone goes back to sleep, and the same incident happens again in six months.

Good incident management is different. It’s calm, structured, and focused on learning. Here’s how to build it.

The Incident Lifecycle

Incidents have four phases:

Detection. Something is wrong. Ideally, your monitoring catches it before customers notice.

Response. People engage. Someone owns the problem. Communication begins.

Resolution. The immediate problem is fixed. Service is restored.

Learning. Post-mortem happens. Action items prevent recurrence.

Most teams focus on response and resolution. The best teams invest heavily in detection and learning.

Detection: Find Problems First

The worst way to learn about an incident is from an angry customer. Invest in detection:

SLO-based alerting. Alert when your Service Level Objectives are threatened. Not when CPU hits 80%. When users experience errors.

Synthetic monitoring. Automated tests that simulate user journeys. If the checkout flow breaks at 2am, you know before morning.

Anomaly detection. Sudden changes in traffic, error rates, or latency patterns. Something’s different - investigate.

Customer feedback loops. Support tickets, social media mentions, status page comments. Sometimes users see things monitoring misses.

Detection speed matters. The faster you know, the faster you respond, the less damage done.
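To make SLO-based alerting concrete, here is a minimal sketch of an error-budget burn check. The function names, the 10x burn threshold, and the counters are illustrative assumptions, not any real monitoring system's API; the point is that the page fires on user-visible error rate, not on CPU.

```python
# Sketch of SLO-based alerting: page on the user-facing error rate,
# not on machine metrics. Thresholds and names are illustrative.

def error_budget_burn(total_requests: int, failed_requests: int,
                      slo_target: float = 0.999) -> float:
    """Fraction of the error budget consumed in this window.

    1.0 means we are erring exactly as fast as the SLO allows;
    10.0 means ten times faster.
    """
    if total_requests == 0:
        return 0.0
    error_rate = failed_requests / total_requests
    budget = 1.0 - slo_target  # e.g. 0.1% of requests may fail
    return error_rate / budget

def should_page(total_requests: int, failed_requests: int) -> bool:
    # Page only on a fast burn: users are actually seeing errors,
    # and at this rate the budget will be gone well within the window.
    return error_budget_burn(total_requests, failed_requests) > 10.0
```

With a 99.9% target, 50 failures in 100,000 requests is a burn rate of 0.5 and stays quiet, while 2,000 failures burns at 20x and pages.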

Response: Structure Over Chaos

When an incident starts, the natural response is chaos. Everyone joins a call. Multiple people investigate the same thing. Nobody knows what’s happening.

Structure prevents this.

Clear roles. At minimum:

  • Incident Commander: owns the response, makes decisions
  • Technical Lead: directs investigation and remediation
  • Communications: updates stakeholders

For small teams, one person can wear multiple hats. But the roles should be explicit.

Dedicated channel. Create an incident channel immediately. All communication happens there. No side conversations.

Regular updates. Every 15-30 minutes, the Incident Commander posts a status update. Even if the update is “still investigating.” Silence breeds anxiety.

Decision log. Write down major decisions and why. “Rolled back deployment at 14:32 because error rate increased after deploy.” This helps the post-mortem and future incidents.

Communication Templates

Don’t write status updates from scratch during an incident. Use templates:

Initial notification:

🔴 Incident: [Brief description]
Impact: [Who/what is affected]
Status: Investigating
IC: [Name]
Channel: #incident-YYYY-MM-DD-[slug]

Regular update:

⏱️ Update [time]:
Current status: [What’s happening]
Actions taken: [What we’ve done]
Next steps: [What we’re doing next]
ETA: [If known]

Resolution:

✅ Incident Resolved
Duration: [X hours/minutes]
Resolution: [What fixed it]
Post-mortem scheduled: [Date/time]

These templates save cognitive load when you’re stressed and tired.
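Templates also work well as a small helper that a bot or script can call, so nobody types the channel name by hand under stress. This is a hypothetical sketch of the initial notification; the field names mirror the template above and nothing here is tied to a real chat API.

```python
from datetime import date

# Hypothetical helper that fills the initial-notification template.
# The #incident-YYYY-MM-DD-[slug] channel convention follows the
# template above; adapt it to your own naming scheme.

def initial_notification(description: str, impact: str,
                         ic: str, slug: str) -> str:
    channel = f"#incident-{date.today().isoformat()}-{slug}"
    return (
        f"🔴 Incident: {description}\n"
        f"Impact: {impact}\n"
        f"Status: Investigating\n"
        f"IC: {ic}\n"
        f"Channel: {channel}"
    )
```

Calling `initial_notification("Checkout errors", "EU users on web checkout", "Dana", "checkout")` produces a ready-to-post message with today's date in the channel name.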

Severity Levels

Not all incidents are equal. Define severity levels:

SEV1 / Critical: Complete outage. Core functionality unavailable. All hands on deck. Customer communication required.

SEV2 / Major: Significant degradation. Major feature unavailable or very slow. Team response required. Customers notified.

SEV3 / Minor: Limited impact. Non-critical feature affected. Single responder can handle. No customer notification.

SEV4 / Low: Minimal impact. Cosmetic issues. Handle during business hours.

Severity determines response speed, communication requirements, and escalation paths. A SEV3 at 3am can wait until morning. A SEV1 cannot.
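One way to keep those rules enforceable rather than tribal is to encode them as data. The sketch below maps the four levels above to a response policy; the specific values are illustrative assumptions to tune for your organisation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Severity(Enum):
    SEV1 = 1  # complete outage
    SEV2 = 2  # significant degradation
    SEV3 = 3  # limited impact
    SEV4 = 4  # minimal impact

@dataclass(frozen=True)
class ResponsePolicy:
    page_immediately: bool                 # wake someone up, any hour
    notify_customers: bool
    update_interval_minutes: Optional[int]  # None = no regular updates

# Illustrative mapping; the numbers are assumptions, not a standard.
POLICIES = {
    Severity.SEV1: ResponsePolicy(True, True, 15),
    Severity.SEV2: ResponsePolicy(True, True, 30),
    Severity.SEV3: ResponsePolicy(False, False, None),
    Severity.SEV4: ResponsePolicy(False, False, None),
}
```

With the policy in code, a paging tool can check `POLICIES[sev].page_immediately` instead of relying on whoever is on call remembering the rules at 3am.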

Resolution: Fix Now, Perfect Later

During an incident, the goal is restoring service. Not fixing root cause. Not writing elegant code. Restoring service.

This means:

  • Rollback first, investigate second
  • Apply workarounds even if they’re ugly
  • Throw resources at the problem (scale up, add capacity)
  • Disable problematic features

You can clean up later. Right now, users are suffering.

The exception is when the quick fix makes things worse. If rollback causes data loss, don’t rollback. Use judgment.

Post-Mortems: Learning Over Blame

Post-mortems are where most incident processes fail. They become blame sessions, box-checking exercises, or they simply don’t happen.

Good post-mortems:

Happen quickly. Within 48-72 hours while memory is fresh.

Are blameless. Focus on systems, not individuals. “The deployment process allowed this” not “Bob deployed bad code.”

Include everyone involved. The people who responded have crucial context.

Ask “why” repeatedly. Surface root causes, not proximate causes. The server crashed - why? Memory leak - why? No memory limits - why? We forgot to add them - why? No checklist for new services.

Generate concrete action items. Vague outcomes like “be more careful” are useless. “Add memory limits to deployment template” is actionable.

Are shared widely. Post-mortems are learning opportunities for the whole organisation. Don’t hide them.

Post-Mortem Template

Keep it simple:

Summary: One paragraph describing what happened.

Timeline: Chronological list of key events with timestamps.

Impact: Duration, affected users, business impact.

Root cause: The underlying reason this happened.

Contributing factors: Other things that made it worse.

What went well: Things that worked during response.

What could improve: Things that didn’t work.

Action items: Specific tasks with owners and deadlines.

One to two pages is enough. Don’t write a novel.

Action Item Hygiene

Post-mortems generate action items. Most action items never get done.

Fix this:

Assign owners. Every action item has one person responsible.

Set deadlines. “When we have time” means never.

Track centrally. Action items go in your issue tracker, not a forgotten doc.

Review regularly. Check action item progress in team meetings.

Close the loop. When an action item is done, update the post-mortem.

An action item that prevents recurrence is worth more than a hundred that don’t get done.
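The hygiene rules above boil down to making ownership and deadlines data you can query. A minimal sketch, with field names that are assumptions rather than any particular tracker's schema:

```python
from dataclasses import dataclass
from datetime import date
from typing import List

# Sketch of action-item hygiene as data: every item carries an owner
# and a deadline, so "overdue" is a query, not a surprise.

@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    done: bool = False

def overdue(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Items past their deadline and still open -- review these weekly."""
    return [i for i in items if not i.done and i.due < today]

def completion_rate(items: List[ActionItem]) -> float:
    """Fraction of action items actually closed."""
    return sum(i.done for i in items) / len(items) if items else 1.0
```

Running `overdue(...)` in a weekly team meeting is one cheap way to implement "review regularly" and "close the loop."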

On-Call Health

Incident response depends on healthy on-call rotations.

Reasonable load. More than two pages per week is too many. Fix the problems or add headcount.

Adequate rest. If someone is paged at 3am, they don’t work a full day. Let them recover.

Fair compensation. On-call is work. Pay for it - extra money, time off, or both.

Rotation size. At least 4-5 people in rotation. Smaller rotations burn people out.

Training. New on-callers should shadow experienced ones. Don’t throw people into the deep end.

Burned out on-call responders make worse decisions and eventually quit. Protect them.

Metrics That Matter

Track incident management effectiveness:

MTTD (Mean Time to Detect): How long until we know there’s a problem?

MTTR (Mean Time to Resolve): How long until service is restored?

Incident frequency: Are incidents increasing or decreasing?

Recurrence rate: How often do we have the same incident twice?

Action item completion rate: Are we actually following through?

Improve these metrics over time. If MTTR isn’t decreasing, something’s wrong with your process.
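MTTD and MTTR fall out directly if each incident record keeps three timestamps. A sketch, assuming a simple dict-per-incident shape (the field names are illustrative):

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Sketch of MTTD/MTTR from incident records. Each record holds when
# impact started, when it was detected, and when service was restored.

def _mean(deltas: List[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas)

def mttd(incidents: List[Dict[str, datetime]]) -> timedelta:
    """Mean time from start of impact to detection."""
    return _mean([i["detected"] - i["started"] for i in incidents])

def mttr(incidents: List[Dict[str, datetime]]) -> timedelta:
    """Mean time from start of impact to restored service."""
    return _mean([i["resolved"] - i["started"] for i in incidents])
```

Plotting these per month makes the trend question ("is MTTR decreasing?") a glance rather than a debate.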

Start Small

If you have no incident process, don’t implement everything at once.

Week 1: Define severity levels. Create a Slack channel naming convention.

Week 2: Write a simple post-mortem template. Do a post-mortem for your next incident.

Week 3: Define the Incident Commander role. Start using it.

Week 4: Track basic metrics. MTTR at minimum.

Iterate from there. Good incident management evolves over time.

The goal isn’t perfect process. It’s continuous improvement in how you detect, respond to, and learn from incidents.
