What is MTTR? Critical Incident Recovery Metrics to Reduce Downtime

Whether it’s scheduled maintenance or an unexpected outage, downtime is time your solutions are out of action and unavailable for use. Long or frequent periods of downtime carry significant costs for the company and ultimately undermine customer trust.

So what is MTTR? And how can improving MTTR reduce downtime? Below are four key metrics to get you started.

Uptime

The percentage of time during which a company’s solutions are in action and available for use.

Uptime is calculated as [Total Time - Downtime] / [Total Time] within a given period.

Before answering the question, “What is MTTR?”, we first need to understand uptime and why it matters. In today’s connected world, consumers tend to expect 24/7/365 availability. Best-in-class technology companies typically target 99.99% uptime. Put differently, that means being available to their customers for all but about 52.6 minutes over the course of a year. Depending on your industry and service level agreements (SLAs), you may need to target an even higher uptime percentage.
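To make the 99.99% figure concrete, here is a minimal sketch that converts an uptime target into a yearly downtime budget. The function name and the use of an average 365.25-day year are assumptions made for illustration, not part of any standard.

```python
# Minutes in an average year (365.25 days, which accounts for leap years).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(uptime_target_pct: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime a given uptime target allows in a period."""
    if not 0 <= uptime_target_pct <= 100:
        raise ValueError("uptime_target_pct must be between 0 and 100")
    return period_minutes * (1 - uptime_target_pct / 100)


# Compare a few common targets, including the 99.99% discussed above.
for target in (99.0, 99.9, 99.99, 99.999):
    budget = downtime_budget_minutes(target)
    print(f"{target}% uptime allows {budget:.1f} minutes of downtime per year")
```

For the 99.99% target this works out to roughly 52.6 minutes, matching the figure above. Each additional "nine" shrinks the budget by a factor of ten.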

Uptime is an important measure of operational availability. It answers the question, “Can our clients trust us to be there when we say we will?”

Mean Time to Resolution (MTTR)

The average time it takes to resolve an outage and restore service to end-users.

MTTR is calculated as [Total Downtime] / [# of Incidents] within a given period.

So what is MTTR? Mean time to resolution (MTTR) includes every step of the recovery process, from initial notice to root cause analysis to fix and deployment. Note that different companies may define MTTR slightly differently. A temporary patch to get the customer back up and running may be considered a resolution, even if the root cause requires a more long-term fix.
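In practice, MTTR is usually derived from incident timestamps rather than a pre-summed downtime total. The sketch below assumes each incident is recorded as a pair of hypothetical fields, the time it was first noticed and the time service was restored. The sample records are invented for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (detected_at, resolved_at).
# Whether "resolved" means the temporary patch or the long-term fix
# depends on how your company defines resolution.
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 10, 30)),     # 1.5 hours
    (datetime(2024, 1, 12, 22, 15), datetime(2024, 1, 13, 1, 15)),  # 3 hours
    (datetime(2024, 1, 20, 14, 0), datetime(2024, 1, 20, 15, 30)),  # 1.5 hours
]


def mean_time_to_resolution(records):
    """Average the time from detection to resolution across all incidents."""
    if not records:
        raise ValueError("at least one incident is required")
    total_downtime = sum(
        (resolved - detected for detected, resolved in records),
        timedelta(),
    )
    return total_downtime / len(records)


mttr = mean_time_to_resolution(incidents)
print(f"MTTR: {mttr}")
```

Whichever definition of "resolved" you choose, apply it consistently so that MTTR stays comparable from one period to the next.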

MTTR is an important metric because it’s a measure of operational resilience. It answers the question, “How long does it take the company to bounce back from an incident?”

Mean Time to Acknowledge (MTTA)

The average time it takes for a new open incident to be acknowledged.

MTTA is calculated as [Total Time to Acknowledge] / [# of Incidents] within a given period.

Mean time to acknowledge (MTTA) measures the first step in the recovery process: acknowledgement. Once someone acknowledges the incident alert, the rest of the recovery process can begin. Acknowledgement not only marks a key milestone within MTTR, but also assigns ownership to whoever acknowledged the incident. Ownership can be passed from one individual to the next, but incident response best practices suggest keeping a clear owner/lead to drive the recovery process at all times.

MTTA is an important metric because it’s a measure of operational responsiveness. It answers the question, “How long does it take the company to begin working towards a resolution?”

Mean Time Between Failures (MTBF)

The average time from one incident to the next.

MTBF is calculated as [Total Time - Downtime] / [# of Incidents] within a given period.

It’s one thing to resolve issues quickly. It’s another to prevent them from happening in the first place. MTBF acts as a counterbalance to MTTR. It ensures your teams are getting smarter, not just faster, about incident resolution.

MTBF is an important metric because it’s a measure of operational reliability. It answers the question, “How often do our systems break?”

Putting It Together - An Example

Let’s say you measure your numbers over a 30-day (720-hour) period, and you get the following:

  • 5 outages
  • 10 hours of downtime
  • 180 minutes total time to acknowledge

What’s your Uptime, MTTR, MTTA, and MTBF?

  • Uptime = [Total Time - Downtime] / [Total Time] = [720 - 10] / [720] = 98.61%
  • MTTR = [Downtime] / [# of incidents] = 10/5 = 2 hours
  • MTTA = [Total Time to Acknowledge] / [# of incidents] = 180/5 = 36 minutes
  • MTBF = [Total Time - Downtime] / [# of incidents] = [720 - 10] / [5] = 142 hours
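The same arithmetic can be sketched in a few lines of code. The class and function names below are assumptions chosen for illustration; the formulas are the ones defined earlier in this article.

```python
from dataclasses import dataclass


@dataclass
class IncidentMetrics:
    uptime_pct: float
    mttr_hours: float
    mtta_minutes: float
    mtbf_hours: float


def compute_metrics(total_hours: float, downtime_hours: float,
                    incident_count: int, total_ack_minutes: float) -> IncidentMetrics:
    """Apply the four formulas from this article to one reporting period."""
    if total_hours <= 0 or incident_count <= 0:
        raise ValueError("total_hours and incident_count must be positive")
    if not 0 <= downtime_hours <= total_hours:
        raise ValueError("downtime_hours must be between 0 and total_hours")
    uptime_hours = total_hours - downtime_hours
    return IncidentMetrics(
        uptime_pct=100 * uptime_hours / total_hours,
        mttr_hours=downtime_hours / incident_count,
        mtta_minutes=total_ack_minutes / incident_count,
        mtbf_hours=uptime_hours / incident_count,
    )


# The 30-day example: 720 hours, 10 hours down, 5 outages, 180 minutes to acknowledge.
m = compute_metrics(720, 10, 5, 180)
print(f"Uptime: {m.uptime_pct:.2f}%")
print(f"MTTR:   {m.mttr_hours:g} hours")
print(f"MTTA:   {m.mtta_minutes:g} minutes")
print(f"MTBF:   {m.mtbf_hours:g} hours")

# Cross-check: with these definitions, uptime equals MTBF / (MTBF + MTTR).
implied_uptime_pct = 100 * m.mtbf_hours / (m.mtbf_hours + m.mttr_hours)
```

The cross-check at the end follows directly from the formulas: MTBF and MTTR split the same period into uptime per incident and downtime per incident, so their ratio reproduces the uptime percentage.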

How might we interpret these results?

The 98.61% uptime is well below our best-in-class target of 99.99%, so we have room for improvement. We’ll need to dive into the other metrics to figure out where we’re falling short. A 2-hour MTTR isn’t horrible, but it’s not great either. We need to look at the distribution here. Are we consistently taking 2 hours, or was there one extreme outlier?
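One quick way to check the distribution is to compare the mean with the median and the maximum. The two sets of per-incident durations below are hypothetical; both total 10 hours across 5 incidents, so both produce the same 2-hour MTTR.

```python
from statistics import mean, median

# Hypothetical per-incident downtime in hours. Each list sums to 10 hours.
scenarios = {
    "consistent": [2.0, 1.5, 2.5, 2.0, 2.0],
    "one outlier": [0.5, 0.5, 0.5, 0.5, 8.0],
}

for label, durations in scenarios.items():
    print(
        f"{label}: mean={mean(durations):.1f}h, "
        f"median={median(durations):.1f}h, max={max(durations):.1f}h"
    )
```

When the median sits far below the mean, as in the second scenario, a single long outage is driving MTTR, and that incident deserves its own review.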

A 36-minute MTTA is unacceptably long. We should be getting this down to single digits, if not under 5 minutes. Since time spent waiting for acknowledgement is part of each outage, that would reduce total downtime by over 2 hours each month.
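The "over 2 hours" estimate can be checked with a short calculation. This sketch assumes that every minute shaved off acknowledgement time comes straight off downtime, which is a simplification; the function name is made up for illustration.

```python
def downtime_saved_hours(current_mtta_min: float, target_mtta_min: float,
                         incident_count: int) -> float:
    """Estimate hours of downtime saved per period by lowering MTTA.

    Assumes acknowledgement delay translates one-to-one into downtime.
    """
    saved_minutes = (current_mtta_min - target_mtta_min) * incident_count
    return saved_minutes / 60


# From the example: 5 incidents per month with a 36-minute MTTA.
for target in (9, 5):
    saved = downtime_saved_hours(36, target, 5)
    print(f"MTTA of {target} minutes saves about {saved:.2f} hours per month")
```

Even the looser single-digit target of 9 minutes clears the 2-hour mark under this assumption.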

MTBF is currently just under 6 days, which feels too frequent. We should investigate the incident data and see if we can identify any trends or recurring outage patterns.

The metrics here give us a quick pulse on our incident recovery process, where we need to improve, and where we need to do some further investigation. As you build out your process, metrics, and review cycles, don’t forget to segment your incidents by severity for greater clarity.
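Segmenting by severity can be as simple as grouping incidents before averaging. The severity labels and durations below are hypothetical, chosen so that the total still matches the 10 hours of downtime in the example.

```python
from collections import defaultdict

# Hypothetical incidents: (severity, downtime_hours). Labels are illustrative.
incidents = [
    ("SEV1", 6.0),
    ("SEV2", 1.5),
    ("SEV2", 1.0),
    ("SEV3", 1.0),
    ("SEV3", 0.5),
]


def mttr_by_severity(records):
    """Return the average downtime in hours for each severity level."""
    totals = defaultdict(lambda: [0.0, 0])  # severity -> [total_hours, count]
    for severity, hours in records:
        totals[severity][0] += hours
        totals[severity][1] += 1
    return {sev: total / count for sev, (total, count) in totals.items()}


for severity, hours in sorted(mttr_by_severity(incidents).items()):
    print(f"{severity}: MTTR = {hours:.2f} hours")
```

Here the blended 2-hour MTTR hides a 6-hour critical outage alongside several short, lower-severity incidents, which is the kind of detail a single average cannot show.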

Summary

In this article we answered the question, “What is MTTR?” We also reviewed the other key incident response metrics: uptime, MTTA, and MTBF. Start monitoring these incident response KPIs today and get ahead of downtime!