Comment on page
SRE Availability Metrics
Understanding metrics and how they impact your platforms availability are fundamentals of Site Reliability Engineering.
How available is your website, service, or platform? What must you monitor and measure to ensure availability? How do you translate uptime into availability? This chart has numbers that every Site Reliability Engineer (SRE) should know. Below the chart, you will find answers to commonly asked questions about SRE and associated metrics.
There is a saying in the NFL that goes, “A player’s best ability is his availability”. The same thing is true for websites, applications, and platforms. You can have a great website or the “best” cloud platform, but if it is not available for your customers when they need it, then your business and your reputation will suffer.
In this day and age availability is everything and it comes with a cost. Availability comes in many different forms, like redundancy, load balancing, multiple data centers, and engineering response, to name a few. To calculate availability we typically look at how long service was unavailable during a specified period of time, taking into account planned maintenance and other planned downtime.
Industry jargon refers to the number of “9’s” related to availability. For instance, one 9 would be 90%, while five 9’s would be 99.999%
Metrics have become the lifeblood of many organizations. Deciding what to and what not to monitor can be just as important as the monitoring tools themselves (Prometheus, Grafana, systems, etc.). In many instances, there can be an overwhelming urge to gather metrics on every available function, potentially leading to information overload. To keep monitoring manageable and actionable, consider the following methods when determining your needs.
For hardware-related monitoring consider the USE Method.
- Utilization (% time that the resource was busy)
- Saturation (amount of work resource has to do, often queue length)
- Errors (count of error events)
For services-related monitoring consider the RED Method.
- Rate (the number of requests per second)
- Errors (the number of those requests that are failing)
- Duration (the amount of time those requests take)
For Kubernetes related monitoring of services consider the Four Golden Signs
- Latency (time taken to serve a request)
- Traffic (how much demand is placed on your system)
- Errors (rate of requests that are failing)
- Saturation (how “full” your service is)
Site Reliability Engineering (SRE) dates back to 2003 when Google assigned a team of software engineers to design a concept that would make certain Google websites were efficient, scalable, and reliable. The concepts they used were so successful that other technology companies, like Netflix and Amazon, began using similar concepts as well as improving upon them. In short order, SRE became its tower within the IT architecture domain. SRE is meant to work in concert with DevOps but focuses on such things as capacity planning and disaster recovery and response. Ultimately, SRE focuses on the automation of operations endeavoring to remove the human element so that sites, applications, and platforms can be optimized.
Understanding how availability impacts the delivery of your chosen platform starts with knowing what those numbers look like. For instance, the difference between 2 9’s and 5 9’s goes from days to minutes, per year. Therefore choosing the proper methodology such as RED, USE, or the Four Golden Signs will allow you to deliver high availability for your specific service. A good starting point to help you define your SRE operations can be found here, Google’s guide to SRE Operations