Common incident metrics you need to reduce downtime.
Downtime is the time your service is unavailable for use.
Whether it comes from planned maintenance or an unexpected outage, downtime is any period when your services are unavailable. It is costly to fix and erodes customer trust. Simply put, downtime is expensive.
Uptime is the percentage of time that a company’s services are available for use.
Uptime is calculated as [Total Time - Downtime] / [Total Time] within a given period, multiplied by 100 to express it as a percentage.
Uptime is an important measure of operational availability. It answers the question, “Can our customers trust us to be there when we say we will?”
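The uptime formula can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name uptime_percent and the sample figures are assumptions made up for the example.

```python
from datetime import timedelta

def uptime_percent(total_time: timedelta, downtime: timedelta) -> float:
    """Return uptime as a percentage: (total - downtime) / total * 100."""
    if total_time <= timedelta(0):
        raise ValueError("total_time must be positive")
    if not timedelta(0) <= downtime <= total_time:
        raise ValueError("downtime must be between zero and total_time")
    # Dividing one timedelta by another yields a plain float ratio.
    return (total_time - downtime) / total_time * 100

# Hypothetical 30-day period with 3 hours of downtime.
period = timedelta(days=30)
outage = timedelta(hours=3)
print(f"Uptime: {uptime_percent(period, outage):.3f}%")
```

Planned maintenance counts as downtime here, matching the definition above. If your agreements exclude maintenance windows, subtract them from both totals first.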
Mean Time to Acknowledge (MTTA) is calculated as [Total Time to Acknowledge] / [# of Incidents] within a given period.
MTTA is an important metric because it’s a measure of operational responsiveness. It answers the question, “How long does it take the company to begin working towards a resolution?”
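As a rough sketch, MTTA can be computed from a list of incident timestamps. The function name and the (triggered_at, acknowledged_at) pair layout are assumptions for illustration; real incident tools expose these times under their own field names.

```python
from datetime import datetime, timedelta

def mean_time_to_acknowledge(incidents):
    """incidents: list of (triggered_at, acknowledged_at) datetime pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    # Total Time to Acknowledge: sum of each incident's acknowledgement delay.
    total = sum((ack - trig for trig, ack in incidents), timedelta(0))
    return total / len(incidents)

# Hypothetical incidents acknowledged after 4, 10, and 1 minutes.
start = datetime(2024, 1, 1, 12, 0)
sample = [
    (start, start + timedelta(minutes=4)),
    (start + timedelta(days=1), start + timedelta(days=1, minutes=10)),
    (start + timedelta(days=2), start + timedelta(days=2, minutes=1)),
]
print(mean_time_to_acknowledge(sample))
```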
MTTR can have different meanings depending on its context. The R can stand for repair, recovery, or resolve. When communicating with others, it's important that everyone understands which MTTR is being discussed.
Mean Time to Repair (MTTR) is the average time it takes to repair a system. This includes the repair time and any testing time.
Mean Time to Repair (MTTR) is calculated as [Total Time Spent Repairing] / [# of Repairs] within a given period.
Mean Time To Repair is a metric that support and maintenance teams use to keep repair times on track. The goal is to keep this number as low as possible by improving the efficiency of repair processes. It answers the question, "How long does it take the company to troubleshoot and repair the system?"
Mean Time To Recovery (MTTR) is the average time it takes to recover from a system failure. It covers the span from the moment the system begins to fail to the moment it becomes fully operational again.
Mean Time To Recovery (MTTR) is calculated as [Total Downtime] / [# of Incidents] within a given period.
Mean Time To Recovery answers the question, "How quickly can we restore service to our customers?" It expresses the average downtime and is a good metric for assessing the speed of the overall recovery process for your systems.
Mean Time to Resolve (MTTR) is calculated as [Full Resolution Time] / [# of Incidents] within a given period.
Mean Time to Resolve is an important metric because it’s a measure of operational resilience. It answers the question, “How long does it take the company to recover from an incident and implement systems and processes so the incident doesn't happen again?”
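To make the difference between the three MTTRs concrete, the sketch below computes all of them from the same incident records. The Incident class, its four timestamp fields, and the sample data are hypothetical. It also assumes that repair time runs from the start of hands-on work until service is restored, and that resolution ends when the follow-up work is finished; your own definitions may draw those boundaries differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    # All four field names are hypothetical; map them to your own tooling.
    failed_at: datetime          # system begins to fail
    repair_started_at: datetime  # hands-on repair work begins
    restored_at: datetime        # fully operational again, testing included
    resolved_at: datetime        # follow-up work to prevent recurrence is done

def _mean(durations):
    durations = list(durations)
    if not durations:
        raise ValueError("need at least one incident")
    return sum(durations, timedelta(0)) / len(durations)

def mean_time_to_repair(incidents):
    # [Total Time Spent Repairing] / [# of Repairs]
    return _mean(i.restored_at - i.repair_started_at for i in incidents)

def mean_time_to_recovery(incidents):
    # [Total Downtime] / [# of Incidents]
    return _mean(i.restored_at - i.failed_at for i in incidents)

def mean_time_to_resolve(incidents):
    # [Full Resolution Time] / [# of Incidents]
    return _mean(i.resolved_at - i.failed_at for i in incidents)

a = datetime(2024, 1, 1, 12, 0)
b = datetime(2024, 1, 10, 9, 0)
incidents = [
    Incident(a, a + timedelta(minutes=10), a + timedelta(minutes=40),
             a + timedelta(hours=24)),
    Incident(b, b + timedelta(minutes=20), b + timedelta(minutes=80),
             b + timedelta(hours=48)),
]
print("Repair:  ", mean_time_to_repair(incidents))
print("Recovery:", mean_time_to_recovery(incidents))
print("Resolve: ", mean_time_to_resolve(incidents))
```

With this sample data the same two incidents give three quite different numbers, which is why it helps to say which MTTR you mean.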
Mean Time Between Failures (MTBF) is the average time between repairable failures of a service.
MTBF is calculated as [Total Time - Downtime] / [# of Incidents] within a given period.
It’s one thing to resolve issues quickly. It’s another to prevent them from happening in the first place. MTBF acts as a counterbalance to MTTR. It ensures your teams are getting smarter, not just faster, about incident resolution.
MTBF is an important metric as it’s a measure of operational reliability. It answers the question, “How often do our systems break?”
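The MTBF formula above can be sketched the same way. The function name and the sample figures are assumptions for illustration only.

```python
from datetime import timedelta

def mean_time_between_failures(total_time: timedelta,
                               downtime: timedelta,
                               incident_count: int) -> timedelta:
    """Return (total time - downtime) / number of incidents."""
    if incident_count <= 0:
        raise ValueError("MTBF is undefined without at least one incident")
    if not timedelta(0) <= downtime <= total_time:
        raise ValueError("downtime must be between zero and total_time")
    return (total_time - downtime) / incident_count

# Hypothetical 30-day period: 3 incidents causing 6 hours of downtime in total.
mtbf = mean_time_between_failures(timedelta(days=30), timedelta(hours=6), 3)
print(mtbf)
```

A period with zero incidents has no defined MTBF under this formula, so the sketch raises an error instead of returning a misleading value.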
If your team almost never has downtime, it may mean you are not taking enough risks. So instead of setting user expectations too high (or too low), industry experts recommend setting an error budget.
Error Budget is the maximum amount of time that a system can fail without contractual consequences.
For example, if your service promises 99.9% uptime, your team has about 8 hours and 45 minutes of acceptable downtime per year (0.1% of 365 days). How you choose to spend that downtime is up to you, but preferably it would be used to innovate and take risks.
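The arithmetic behind the 99.9% example can be checked with a short sketch. The function names and the 365-day year are assumptions; an actual agreement may define the measurement period differently.

```python
from datetime import timedelta

def error_budget(uptime_target_percent: float, period: timedelta) -> timedelta:
    """Allowed downtime for a period, given an uptime target in percent."""
    if not 0 <= uptime_target_percent <= 100:
        raise ValueError("uptime target must be between 0 and 100")
    return period * (1 - uptime_target_percent / 100)

def remaining_budget(budget: timedelta, downtime_so_far: timedelta) -> timedelta:
    """Budget left to spend; a negative result means the budget is overspent."""
    return budget - downtime_so_far

year = timedelta(days=365)  # assumption: a non-leap year
budget = error_budget(99.9, year)
print("Yearly budget:", budget)

# Hypothetical: 2.5 hours of downtime already used this year.
print("Remaining:", remaining_budget(budget, timedelta(hours=2, minutes=30)))
```

Tracking the remaining budget over time gives teams a simple signal: plenty left means there is room for riskier changes, while a nearly spent budget suggests slowing down.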
The benefit of an error budget approach is that it encourages teams to minimize real incidents and maximize innovation.