
Common Metrics

These are the common incident metrics to track when you want to reduce downtime.

Downtime

Downtime - the time during which your service is unavailable for use.
Whether it comes from planned maintenance or an unexpected outage, downtime is costly to fix and erodes customer trust. Simply put, downtime is expensive.

Uptime

Uptime - the percentage of time that a company’s services are available for use.
Uptime is calculated as [Total Time - Downtime] / [Total Time] within a given period.
Uptime is an important measure of operational availability. It answers the question, “Can our customers trust us to be there when we say we will?”
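As a minimal sketch of the uptime formula above, the following Python snippet computes uptime as a percentage. The function name `uptime_percent` and the sample figures (43 minutes of downtime in a 30-day period) are illustrative assumptions, not values from any real service.

```python
from datetime import timedelta


def uptime_percent(total_time: timedelta, downtime: timedelta) -> float:
    """Return uptime as a percentage: (total - downtime) / total * 100."""
    if total_time <= timedelta(0):
        raise ValueError("total_time must be positive")
    if downtime < timedelta(0) or downtime > total_time:
        raise ValueError("downtime must be between 0 and total_time")
    # Dividing one timedelta by another yields a plain float ratio.
    return (total_time - downtime) / total_time * 100


# Hypothetical example: 43 minutes of downtime in a 30-day period.
period = timedelta(days=30)
outage = timedelta(minutes=43)
print(f"{uptime_percent(period, outage):.3f}%")  # 99.900%
```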

Mean Time To Acknowledge (MTTA)

Mean Time To Acknowledge - the average time it takes from when an incident is identified to when an alert is acknowledged.
Mean Time to Acknowledge (MTTA) is calculated as the [Total Time to Acknowledge] / [# of Incidents] within a given period.
MTTA is an important metric because it’s a measure of operational responsiveness. It answers the question, “How long does it take the company to begin working towards a resolution?”
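The MTTA formula above can be sketched in Python as follows. The data shape (a list of identified/acknowledged timestamp pairs) and the sample timestamps are assumptions made for illustration; a real alerting tool would supply these values in its own format.

```python
from datetime import datetime, timedelta


def mean_time_to_acknowledge(incidents):
    """Average of (acknowledged - identified) over all incidents.

    `incidents` is a list of (identified_at, acknowledged_at) pairs.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((ack - identified for identified, ack in incidents), timedelta(0))
    return total / len(incidents)


# Hypothetical incidents within one reporting period.
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 4)),        # 4 minutes
    (datetime(2024, 1, 9, 22, 30), datetime(2024, 1, 9, 22, 42)),    # 12 minutes
    (datetime(2024, 1, 20, 14, 15), datetime(2024, 1, 20, 14, 17)),  # 2 minutes
]
print(mean_time_to_acknowledge(incidents))  # 0:06:00
```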

What is MTTR?

MTTR can have different meanings depending on its context: the R can stand for repair, recovery, or resolve. When communicating with others, make sure everyone understands which MTTR is being discussed.

Mean Time to Repair (MTTR)

Mean Time to Repair (MTTR) - the average time it takes to repair a system. This includes the repair time and any testing time.
Mean Time to Repair (MTTR) is calculated as [Total Time Spent Repairing] / [# of Repairs] within a given period.
Mean Time To Repair is a metric that support and maintenance teams use to keep repair times on track. The goal is to keep this number as low as possible by improving the efficiency of repair processes. It answers the question, "How long does it take the company to troubleshoot and repair the system?"

Mean Time to Recovery (MTTR)

Mean Time To Recovery (MTTR) - the average time it takes to recover from a system failure. This covers the time from when the system begins to fail to when it becomes fully operational again.
Mean Time To Recovery (MTTR) is calculated as [Total Downtime] / [# of Incidents] within a given period.
Mean Time To Recovery answers the question, "How quickly can we restore service to our customers?" It expresses the average downtime per incident and is a good metric for assessing the speed of the overall recovery process for your systems.

Mean Time to Resolve (MTTR)

Mean Time To Resolve - the average time it takes to fully resolve an incident. This includes the time spent detecting, diagnosing, repairing, and learning so that the failure won't happen again.
Mean Time to Resolve (MTTR) is calculated as [Full Resolution Time] / [# of Incidents] within a given period.
MTTR is an important metric because it’s a measure of operational resilience. It answers the question, “How long does it take the company to recover from an incident and implement systems and processes so the incident doesn't happen again?”
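To make the difference between the three MTTR variants concrete, here is a minimal Python sketch that computes all three from the same incident records. The `Incident` fields, the function names, and the sample timestamps are hypothetical. The sketch also assumes that repair time runs from the start of hands-on repair work until service is restored, and that resolve time is measured from the start of the failure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    failed_at: datetime          # system begins to fail
    repair_started_at: datetime  # hands-on repair work begins
    restored_at: datetime        # repaired, tested, and fully operational again
    resolved_at: datetime        # follow-up done so the failure won't recur


def _mean(durations):
    if not durations:
        raise ValueError("at least one incident is required")
    return sum(durations, timedelta(0)) / len(durations)


def mttr_repair(incidents):
    """[Total Time Spent Repairing] / [# of Repairs]."""
    return _mean([i.restored_at - i.repair_started_at for i in incidents])


def mttr_recovery(incidents):
    """[Total Downtime] / [# of Incidents]."""
    return _mean([i.restored_at - i.failed_at for i in incidents])


def mttr_resolve(incidents):
    """[Full Resolution Time] / [# of Incidents]."""
    return _mean([i.resolved_at - i.failed_at for i in incidents])


# Two hypothetical incidents.
incidents = [
    Incident(datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 10, 20),
             datetime(2024, 3, 1, 11, 0), datetime(2024, 3, 2, 10, 0)),
    Incident(datetime(2024, 3, 8, 14, 0), datetime(2024, 3, 8, 14, 10),
             datetime(2024, 3, 8, 14, 30), datetime(2024, 3, 8, 20, 0)),
]
print(mttr_repair(incidents))    # 0:30:00
print(mttr_recovery(incidents))  # 0:45:00
print(mttr_resolve(incidents))   # 15:00:00
```

The same two incidents yield 30 minutes, 45 minutes, or 15 hours depending on which R is meant, which is why it matters to say which MTTR a report refers to.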

Mean Time Between Failures (MTBF)

Mean Time Between Failures - the average time between repairable failures of a service.
MTBF is calculated as [Total Time - Downtime] / [# of Incidents] within a given period.
It’s one thing to resolve issues quickly. It’s another to prevent them from happening in the first place. MTBF acts as a counterbalance to MTTR. It ensures your teams are getting smarter, not just faster, about incident resolution.
MTBF is an important metric as it’s a measure of operational reliability. It answers the question, “How often do our systems break?”
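The MTBF formula above can be sketched in Python like this. The function name `mtbf` and the sample figures (3 incidents totalling 3 hours of downtime in a 30-day period) are illustrative assumptions.

```python
from datetime import timedelta


def mtbf(total_time: timedelta, downtime: timedelta, incident_count: int) -> timedelta:
    """Return (total time - downtime) / number of incidents."""
    if incident_count <= 0:
        raise ValueError("incident_count must be positive")
    if downtime < timedelta(0) or downtime > total_time:
        raise ValueError("downtime must be between 0 and total_time")
    return (total_time - downtime) / incident_count


# Hypothetical example: 3 incidents and 3 hours of downtime in 30 days.
print(mtbf(timedelta(days=30), timedelta(hours=3), 3))  # 9 days, 23:00:00
```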

Error Budget

Every DevOps and IT Operations team knows that incidents will happen. There is no such thing as guaranteed 100% uptime. By common industry convention, 99.9% uptime is considered very good and 99.99% excellent.
A team that almost never has downtime may not be taking enough risks. So instead of setting user expectations too high (or too low), industry experts recommend setting an error budget.
Error Budget - the maximum amount of time that a system can fail without contractual consequences.
For example, if your service promises 99.9% uptime, your team has about 8 hours and 45 minutes of acceptable downtime per year. How you choose to spend that downtime is up to you, but preferably it would be used to innovate and take risks.
The benefit of an error budget approach is that it encourages teams to minimize real incidents and maximize innovation.
Uptime     Yearly Allowed Downtime    Monthly Allowed Downtime
99%        87h 39m                    7h 18m
99.5%      43h 49m 45s                3h 39m
99.9%      8h 45m 57s                 43m 50s
99.95%     4h 22m 48s                 21m 54s
99.99%     52m 35s                    4m 23s
99.999%    5m 15s                     26s
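As a rough sketch of how allowed downtime follows from an uptime target, the Python snippet below multiplies the length of a period by the permitted failure fraction. The year length of 365.2425 days, the month defined as one twelfth of that year, and the names `error_budget` and `format_duration` are assumptions made for illustration. A different year length or rounding rule shifts the results by a few seconds.

```python
from datetime import timedelta

# Assumption: a year is the mean Gregorian year; a month is one twelfth of it.
YEAR = timedelta(days=365.2425)
MONTH = YEAR / 12


def error_budget(uptime_target_percent: float, period: timedelta) -> timedelta:
    """Allowed downtime in `period` for a given uptime target (in percent)."""
    if not 0 <= uptime_target_percent <= 100:
        raise ValueError("uptime_target_percent must be between 0 and 100")
    return period * (1 - uptime_target_percent / 100)


def format_duration(duration: timedelta) -> str:
    """Format a duration as hours, minutes, and whole (rounded) seconds."""
    total_seconds = round(duration.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h {minutes}m {seconds}s"


print(format_duration(error_budget(99.9, YEAR)))   # 8h 45m 57s
print(format_duration(error_budget(99.9, MONTH)))  # 0h 43m 50s
```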