A DevOps and SRE Incident Management Guide
Incident management is the process DevOps and IT Operations teams use to respond to unplanned events or service interruptions (incidents) and restore service to normal operation as quickly as possible while minimizing business impact.
An incident is a broad term describing any event that disrupts or reduces the quality of a service. Some examples of incidents:
A business application going down (complete outage)
A very slow web application (degradation in performance)
A piece of software functionality that is broken (software bug)
Incidents can vary widely in severity, but they usually require an immediate response from on-call teams. An incident is resolved when the affected service resumes normal operation and is restored to its intended state.
Incident management is a critical process for any organization that aims to provide a reliable service to its customers. Service outages can come with significant costs. Having a well-defined incident management process can help minimize those costs. The benefits of a well-defined process include:
Faster response and resolution time
Reduced costs and/or revenue losses
Better communication both internally (response teams) and externally (customers)
Continuous learning and improvement
Incident management processes can differ slightly depending on the size, type, and maturity of a company, but in general they follow the steps below.
The key to good incident management is having a good process, clear communication, and a calm head.
Incidents can come from anywhere. In most cases, incidents will come from monitoring and alerting tools but could also be manually reported by an employee or a customer. No matter the source, the first two steps are the same: 1) the incident is identified, and 2) the incident is logged in the incident management system.
Typically incident management systems will include:
The source of the incident (monitoring system or person).
The date and time the incident was first reported.
A description of the incident (including screenshots and/or logs).
A unique identification number for the incident.
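For illustration, a logged incident record might look something like this (a hypothetical sketch in Python; field names vary by incident management system):

```python
# Hypothetical incident record as it might be logged in an incident
# management system. Field names are illustrative, not a specific product's schema.
incident = {
    "id": "INC-1042",                          # unique identification number
    "source": "monitoring:prometheus",         # monitoring system or person that reported it
    "reported_at": "2024-05-01T03:17:00Z",     # date and time first reported
    "description": "Checkout API returning 500s for ~20% of requests",
    "attachments": ["screenshots/checkout-errors.png", "logs/api-2024-05-01.log"],
    "severity": "SEV-2",
    "status": "investigating",
}
```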
Not all incidents are created equal. Start by assessing the incident's impact on the business. A couple of things to consider:
How many people are impacted?
What are the potential financial, security, and compliance implications?
Incidents should be assigned a severity that quickly and clearly communicates impact. Compare this incident to all other open incidents and determine its relative priority.
Initial Diagnosis - Ideally, your front-line support team or primary on-call responder can see the incident through from detection to resolution; if they can't, they should escalate to additional teams. At this point, all pertinent information should be logged, and communication channels in a well-known place (like a chat tool channel or a video conference bridge) should be established.
Escalate - At this point, the next team takes the logged data and continues the diagnosis. If this team can't diagnose the incident, the escalation process continues.
Communicate - The team should regularly communicate and share status updates with impacted internal and external stakeholders.
Investigation and Diagnosis - Investigation continues until the who, what, where, and why of the incident are understood. Teams might need to bring in outside resources to help with the incident.
In the resolve step, the responding team(s) implement a repair for what was identified during the Investigation and Diagnosis phase. One or more repairs will result in the affected service returning to normal. The incident is usually considered "over" when the customer impact ends. At this point, the measurement for Mean Time To Recover (MTTR) ends.
Lastly, it's important to ensure whatever caused the incident doesn't happen again. A root cause analysis and postmortem should be written for all major incidents. Learning from an incident can reveal opportunities for improvement and/or automation in the technology and processes. At this point, the measurement for Mean Time To Resolve (MTTR) ends.
The following are common categories of tools used throughout the incident lifecycle:
Monitoring - automated systems to alert you if something is wrong with your system.
Incident Tracking - a tool to serve as a central location to document and track incidents across multiple services.
Alerting System - a tool supporting on-call schedules and reliable notifications to always notify the right person on your team.
Chat Room - real-time text communication is key for diagnosing and resolving incidents as a team.
Video Call - to host a "war room" call with any responders that need to be involved.
Status Page - to communicate incident updates both internally and externally.
An effective on-call schedule is key to minimizing downtime and sustaining a healthy on-call culture.
On-call - is the practice of designating specific people to be available during specific times to respond to incidents (even outside of normal working hours).
On-call schedule - is a schedule that ensures the right person is always available to quickly respond to incidents and outages.
On-call is a critical responsibility for many IT, DevOps, and Support Operations teams that maintain services demanding 24/7 availability.
Team members take turns "being on-call" to provide coverage around the clock or outside of business hours. The person on-call is empowered to identify, respond to, and resolve any interruptions to service availability.
An effective on-call schedule ensures customers are confident they'll get quick and consistent support for any potential incidents. It minimizes the risk of missed issues and keeps employees from burning out.
Benefits of a sustainable on-call schedule:
Well rested individuals who perform better
Improved team culture
Higher employee retention and satisfaction
Better customer support
Increased bottom line
Faster response times
Better work-life balance
Less burnout
When creating an on-call schedule there is no one-size-fits-all model. Each organization and team is different, and your on-call schedules should reflect that. Companies with locations around the world will operate very differently than teams in a single location. For on-call schedules to be effective, they need to be tailored to your organization, team, and responsibilities.
Every team is different, and so are their priorities. Talk to your team members to understand individual needs and situations. Understand how your team works. There might be a consensus on what on-call should look like. For example, a team might agree to a weekly rotation where individuals are on-call seven days in a row, or a rotation of just one day at a time. Work with your team to figure out what works best for everyone.
For management, ensure on-call duties are balanced among team members and provide individuals plenty of training. Clearly define on-call responsibilities and get buy-in from the teams themselves. Lastly, always listen to your team, iterating and improving your on-call schedule and processes based on their feedback.
Responsibilities during on-call should be clearly defined and documented.
A couple of questions to consider:
How will the team assign on-call shifts (daily, weekly, follow-the-sun, ...)?
What is the maximum amount of time an individual can be on-call during any given period?
Will individuals be on-call overnight?
If on-call overnight, is there flexibility to work from home the next day? Can the engineer start work later if they need to catch up on sleep?
Are there differences between working hours and non-working hours responsibilities and response times? What is considered urgent?
What are an individual's SLAs/SLOs when being on-call?
How will the team address dynamic schedules such as vacations and personal time?
What is the compensation model for individuals that go on-call?
Can individuals do "regular" work while on-call? If so, how are their deliverable dates affected?
A well documented on-call plan that spreads responsibilities out fairly across a competent team can go a long way to prevent burnout, confusion, and frustration. It can also reassure new recruits that your organization has its on-call management under control. With a documented plan, you can be completely transparent during the interview process and make sure candidates are ready for the commitment of on-call work.
Life doesn't stop just because a person is on-call. To prevent an incident from going unresolved and possibly causing damage, it's a good idea to have a secondary (or "back-up") on-call responder.
A secondary on-call responder takes a lot of the stress off the primary, who knows they have backup they can contact and are not a single point of failure. For the business, this adds a layer of redundancy to the on-call process.
Teams are not static things, and your on-call schedule shouldn't be either. Your organization and team should be continually reviewing, refining, and improving your on-call schedules and processes.
Focusing on incident metrics is a good place to start, but you'll also want to improve what directly influences the well-being of on-call engineers:
Total number of alerts - Is the current number of alerts manageable for your team size? Should your team refine the definition of an alert? Or maybe add more team members?
Reducing false positives - How many alerts were not actionable or even an issue? How can the false positives be prevented? (automation, changing alert conditions, ...)
De-duplicating related alerts - Can duplicated alerts be grouped? Are engineers already aware of the issue?
In general, the fewer alerts a person on-call receives, the less likely they are to develop alert fatigue.
Each organization and team is different. Large companies with locations around the world will operate very differently from small companies with a single location. There's no one-size-fits-all approach for on-call. Use best practices as a starting point, then talk to your teams to tailor your own on-call schedules and processes.
There are times when schedule changes will need to be made (personal emergencies, changes of plans, vacations). People may need to swap shifts. Maybe the current rotation just isn't working for the team. Don't be afraid to revisit and modify the schedule. Giving a team the flexibility to make changes will improve overall team spirit and empower team members to support each other.
Alerting rules need to be designed properly and then continuously refined to avoid on-call teams being overwhelmed with alerts. Knowing whether an alert is worth waking up a developer in the middle of the night or can wait until morning can make the difference between happy engineers with fast response times and alert-fatigued teams who dread the on-call responsibility.
Relying on any one small group or person to handle your full on-call needs is a recipe for burnout. From a business perspective, it is also risky to have a single point of failure.
People need time off. Teams should share the responsibility of being on-call. Consider the "you build it, you support it" setup. This way, the engineers building the service are incentivized to ship stable, supportable code.
A healthy work-life balance increases loyalty and commitment to employers. An unhealthy work-life balance will do the opposite. As you work with your team to tailor your on-call schedules, make sure to set realistic expectations of what it means to be on-call.
Not all incidents are created equal - categorizing incidents based on their impact will help your team resolve incidents faster.
Incident severity levels measure the impact an incident has on the business. Severity levels are useful for quickly understanding and concisely communicating the impact of an incident.
Incidents can be classified by severity, usually using a "SEV" definition. Severities rank from SEV-1 to SEV-5. The lower the severity number, the more impactful the incident. Anything more severe than a SEV-3 (i.e., SEV-1 or SEV-2) should automatically be considered a "major incident".
Always assume the worst - If you are unsure which severity an incident should be, treat it as the more severe one.
So what are MTTR, MTTA, and MTBF? In this article, we will explore these three acronyms as well as how to calculate other common incident recovery metrics.
Whether it’s scheduled maintenance or an unexpected outage, downtime affects every aspect of your business and comes with significant costs. Understanding recovery metrics, how they are calculated, and what you can do to improve them will help you maintain SLAs, improve uptimes, and provide better services.
So, what are these commonly used incident management metrics?
Uptime and downtime are two metrics used consistently to help determine the availability, reliability, and overall performance of services. These two metrics are closely linked and directly affect each other and your business.
Downtime - is the time your service is unavailable for use.
Whether it's planned maintenance or an unexpected outage, downtime is when your services are unavailable. Downtime is costly and breaks customer trust. Simply put, downtime is expensive.
Uptime - is the percentage of time in which a company's services are available for use.
Uptime is calculated as [Total Time - Downtime] / [Total Time] within a given period.
Uptime is an important measure of operational availability.
Uptime answers, “Can our customers trust us to be there when we say we will?”
Before answering the question, “What is MTTR?” we must first understand the importance of uptime and why it matters. In today’s connected world, consumers tend to expect 24/7/365 availability. Best-in-class technology companies typically target 99.99% uptime. Put differently, that means being available for their customers for all but 52.6 minutes over the course of a year. Depending on your industry and service level agreements (SLAs), you may need to target an even higher uptime percentage.
MTTR can have different meanings depending on its context. The R can stand for repair, recovery, or resolve. When communicating with others, it's important everyone understands which MTTR is being discussed.
Mean Time To Recovery (MTTR) - is the average time it takes to recover from a system failure. This includes the time from when the system begins to fail to the time it becomes fully operational again.
Mean Time To Recovery (MTTR) is calculated as [Total Downtime] / [# of Incidents] within a given period.
Mean Time To Recovery answers, "How quickly can we restore service to our customers?"
Mean Time To Recovery expresses the average downtime and is a good metric for assessing the speed of your systems' overall recovery process.
Mean Time To Resolve - the average time it takes to fully resolve an incident. This includes the time spent detecting, diagnosing, repairing, and learning so that the failure won't happen again.
Mean Time To Resolve (MTTR) is calculated as [Total Full Resolution Time] / [# of Incidents] within a given period.
MTTR is an important metric because it measures operational resilience.
Mean Time To Resolve answers, “How long does it take the company to recover from an incident and implement systems and processes so the incident doesn't happen again?”
Mean Time To Repair (MTTR) is the average time it takes to repair a system, including the repair time and any testing time.
Mean Time To Repair (MTTR) is calculated as the [Total Time Spent Repairing] / [# of Repairs]
Mean Time To Repair is a metric that support and maintenance teams use to keep repair times on track. The goal is to keep this number as low as possible by improving the efficiency of repair processes.
Mean Time To Repair answers, "How long does it take the company to troubleshoot and repair the system?"
Mean Time To Acknowledge - the average time it takes from when an incident is identified to when an alert is acknowledged.
Mean Time to Acknowledge (MTTA) is calculated as the [Total Time to Acknowledge] / [# of Incidents] within a given period.
Mean time to acknowledge (MTTA) measures the first step in the recovery process: acknowledgment. Once someone acknowledges the incident alert, the rest of the recovery process can begin. Acknowledgment not only signifies a significant milestone within MTTR but also assigns ownership to whoever acknowledged the incident. Ownership can be passed from one individual to the next, but incident response best practices suggest keeping a clear owner/lead to drive the recovery process at all times.
MTTA is an important metric because it’s a measure of operational responsiveness.
Mean Time To Acknowledge answers, “How long does it take the company to begin working toward a resolution?”
Mean Time Between Failures - the average time between repairable service failures.
MTBF is calculated as [Total Time - Downtime] / [# of Incidents] within a given period.
It’s one thing to resolve issues quickly. It’s another to prevent them from happening in the first place. MTBF acts as a counterbalance to MTTR. It ensures your teams are getting smarter, not just faster, about incident resolution.
MTBF is an important metric because it measures operational reliability.
Mean Time Between Failures answers the question, “How often do our systems break?”
Let’s say you measure your numbers over a 30-day (720 hours) period, and you get the following:
5 outages
10 hours of downtime
180 minutes total time to acknowledge
What’s your Uptime, MTTR, MTTA, and MTBF?
Uptime = [Total Time - Downtime] / [Total Time] = [720 - 10] / [720] = 98.61%
MTTR (recovery) = [Downtime] / [# of incidents] = 10/5 = 2 hours
MTTA = [Total Time to Acknowledge] / [# of incidents] = 180/5 = 36 minutes
MTBF = [Total Time - Downtime] / [# of incidents] = [720 - 10] / [5] = 142 hours
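The same arithmetic, as a quick sketch in Python using the numbers above:

```python
# Worked example: 30-day period (720 hours), 5 outages,
# 10 hours of downtime, 180 minutes total time to acknowledge.
total_time_hours = 720
downtime_hours = 10
incidents = 5
total_ack_minutes = 180

uptime_pct = (total_time_hours - downtime_hours) / total_time_hours * 100
mttr_recovery_hours = downtime_hours / incidents
mtta_minutes = total_ack_minutes / incidents
mtbf_hours = (total_time_hours - downtime_hours) / incidents

print(f"Uptime: {uptime_pct:.2f}%")                      # 98.61%
print(f"MTTR (recovery): {mttr_recovery_hours} hours")   # 2.0 hours
print(f"MTTA: {mtta_minutes} minutes")                   # 36.0 minutes
print(f"MTBF: {mtbf_hours} hours")                       # 142.0 hours
```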
The 98.61% uptime is lower than our targeted best-in-class uptime of 99.99%, so we have room for improvement. We’ll need to dive into the other metrics to figure out where we’re falling short. A 2-hour MTTR (recovery) isn’t horrible, but it’s not great, either. We need to take a look at the distribution here. Are we consistently taking 2 hours, or was there one extreme outlier?
A 36-minute MTTA is unacceptably long. We should reduce it to single digits, if not sub-5 minutes. That would reduce total downtime by over 2 hours each month.
MTBF is currently just under 6 days, meaning an outage occurs roughly every 6 days, which feels too frequent. We should investigate the incident data and see if we can identify any trends or recurring outage patterns.
The metrics here give us a quick pulse on our incident recovery process, where we need to improve, and where we need to do some further investigation. As you build out your process, metrics, and review cycles, don’t forget to segment your incidents by severity for greater clarity.
Every DevOps and IT Operations team knows that incidents will happen. There's no such thing as 100% guaranteed uptime, because some failures are inevitable. The industry standard says that 99.9% uptime is very good, and 99.99% is excellent.
Even if your team is good at avoiding downtime or resolving incidents, it could mean you are not taking enough risks. So, instead of setting user expectations too high (or too low), industry experts recommend setting an error budget.
Error Budget - The maximum time a system can fail without contractual consequences.
So, for example, if your service promises 99.9% uptime, your team has roughly 8 hours and 45 minutes of acceptable downtime per year. How you spend that error budget is up to you, but preferably, it should be used to innovate and take risks.
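As a rough sketch, an uptime promise can be converted into a yearly error budget like this (using 365.25 days per year):

```python
# Convert an uptime promise into a yearly error budget (allowed downtime).
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours

def yearly_error_budget_hours(uptime_pct: float) -> float:
    return HOURS_PER_YEAR * (1 - uptime_pct / 100)

print(round(yearly_error_budget_hours(99.9), 2))   # ~8.77 hours (about 8h 46m)
print(round(yearly_error_budget_hours(99.99), 2))  # ~0.88 hours (about 53 minutes)
```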
The benefit of an error budget approach is that it encourages teams to minimize real incidents and maximize innovation.
In this article, we answered the question, “What is MTTR?” We also reviewed the other key metrics of incident response management, including downtime, uptime, MTTA, MTBF, and error budgeting. Now that you understand these metrics, you may be wondering how you can start monitoring your systems and respond to outages quickly. PagerTree has put together a list of the Top 7 Best APM Tools to help you get started with system monitoring. We have also compiled a list of the Top 5 oncall management software to help your team get notified 24/7.
Learn the differences between Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs), and the purposes they serve.
Regardless of whether your service is free or paid, your customers expect a certain level of quality and availability. That's why it's important to establish clear expectations with both customers and your internal team. Doing so helps foster healthy relationships between service providers and customers, while also providing your team with measurable goals and deliverables to maintain high performance. This is where Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) come into the equation.
SLAs, SLOs, and SLIs all refer to the promises companies make to provide specific service levels to their customers but at different levels. These terms might sound technical, but they're essentially about setting clear expectations about how services are delivered and maintained, ensuring reliability, satisfaction, and continuous improvement. But what exactly are they, and what's the difference?
Service Level Agreements: “The Customer Promise” - Sets the expectations between the service provider and the customer, describing the products or services guaranteed to be delivered.
Service Level Objectives: “The Internal Goal” - The objectives that must be achieved internally for each service activity, function, and process to meet the service levels promised in the SLA.
Service Level Indicators: “The Actual Measured Performance” - The specific, quantifiable metrics companies use to measure different aspects of the service levels they provide to their customers.
These terms may seem vague at first glance, but each serves a specific purpose in maintaining the relationship between service providers and customers. Let's break down each term individually and see how they are related to and differ from one another.
A Service Level Agreement (SLA) is a formal agreement between a service provider and the customer that outlines the expected level of service. No service, large or small, has 100% availability. That is why SLAs set expectations upfront, so customers know what they are getting while also holding the service provider accountable for maintaining the promised level of service. SLAs also outline the consequences for breaching the promised service level, which could include refunds, credits, or even legal action.
What is an SLA?: The promise made by the service provider to the customer regarding services, performance, and consequences if service levels are breached.
Who Writes an SLA?: Typically written by the legal department with input from product managers regarding actual performance.
Who sees the SLA?: SLAs are customer-facing agreements.
Real-world example of an SLA: PagerTree, an OnCall software solution, promises customers a 99.9% monthly availability.
Depending on the customer and provider's needs, SLAs can include as few or as many high-level components as desired. When writing an SLA, it is important to keep it as simple and clear as possible. When writing SLOs, you will have the opportunity to break down your SLA into specific measurable objectives.
A Service Level Objective (SLO) is a specific, measurable deliverable that internal teams use to meet the commitment made in the SLA. It represents the target level of service that a team commits to achieving internally. For instance, an SLO could specify 99.9% uptime or an average response time of 250 milliseconds. It defines the operational targets needed to meet or exceed the service levels agreed upon in the SLA.
What is an SLO?: SLOs are internal goals set to meet or exceed the promise of the Service Level Agreement.
Who writes an SLO?: SLOs are typically written by product managers to meet SLA requirements.
Who sees the SLO? Typically, SLOs are for internal use by the teams that need to achieve these objectives.
Real-world example of an SLO: PagerTree promises customers 99.9% uptime but internally has a goal of 99.99% uptime. The difference is an internal error budget of almost 8 hours of downtime per year.
SLOs correspond directly with your SLA, giving teams the key metrics and deliverables they need to focus on to meet the performance outlined in the SLA. SLOs are also key for budgeting in planned and unplanned downtimes, which is referred to as error budgeting.
A Service Level Indicator (SLI) is a specific, quantifiable, and measurable metric of the service that is provided. Specifically, SLIs are the metrics that you monitor to determine if your SLOs are being met. SLIs are crucial for maintaining and improving service quality because they provide a foundation for evaluating performance. They help teams identify issues, understand user experience, and make data-driven decisions to meet service level commitments outlined in SLAs.
What is an SLI?: An SLI is the actual measured metric of the service provided.
How do I monitor SLIs?: SLIs can be monitored and measured with a host of tools, including Prometheus, Datadog, and many more.
Who sees the SLI?: SLIs are for use by both internal teams and customers to determine if promises made in the SLA have been met.
Real-world example of an SLI: PagerTree monitors its systems with both internal and external monitoring software, making this data available to customers.
SLI metrics are wholly dependent on your SLAs and SLOs because they are the actual measured performance of your promised service. Service providers should aim to keep SLIs meeting or exceeding both their SLOs and SLAs, though with a built-in error budget, some SLIs may occasionally fall short of internal SLOs.
SLAs are customer-facing documents typically written by legal teams and product managers. They contain the service provider's “Promise” to the customer regarding service and quality of service.
SLOs, on the other hand, are internal documents typically written by product managers that contain the internal “Goals” for the service to meet. This goal typically leaves room for feature testing as well as planned and unplanned downtimes.
SLAs and SLOs are both projections of the level of service that should be provided. They are both written by teams, usually based on historical performance data.
SLIs are the actual “Performance” of the service provided to the customer. SLIs are monitored and measured through tools like Prometheus and should be available to both internal teams and customers to ensure SLAs and SLOs are being met.
The graph below shows the Service Level Indicator (blue), Service Level Agreement (yellow), and Service Level Objective (purple), along with examples A, B, and C.
The SLA shows a “Promised” performance of no more than 300ms response time between the customer and service provider. The SLO shows an internal goal of 250ms response time, giving the service provider a 50ms error budget.
Example A: The SLI line is below the SLO and SLA, ranging from 180ms to 250ms response times. The "Performance" of the service being provided outperforms the SLO (internal goal of 250ms) and the SLA (customer promise of 300ms).
Example B: The SLI line is between the SLO and SLA, ranging from 250ms to 300ms response times. The "Performance" of the service being provided meets the “Promise” outlined in the SLA but is missing the internal “Goal” set in the SLO. The difference between the SLO and SLA (250ms-300ms) is called the error budget. Service providers give themselves error budgets to allow teams to adjust and improve performance before breaching an SLA, as well as to test experimental features and to account for planned/unplanned outages.
Example C: The SLI line has surpassed the SLO and SLA, ranging from 301ms to 340ms response times. The "Performance" of the service being provided is underperforming the “Promise” made in the SLA, and the internal “Goal” set in the SLO. This indicates that the service provider is in breach of the SLA, and the consequences for being in breach outlined in the SLA can come into effect. These consequences can range from refunds to legal action.
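As a minimal sketch, an SLI measurement could be classified against these example thresholds like this (the 250 ms and 300 ms values are the illustrative ones from the graph, not universal defaults):

```python
SLO_MS = 250  # internal goal
SLA_MS = 300  # customer promise

def classify_response_time(sli_ms: float) -> str:
    """Classify a measured response-time SLI against the SLO and SLA."""
    if sli_ms <= SLO_MS:
        return "meeting SLO and SLA"                         # Example A
    if sli_ms <= SLA_MS:
        return "within error budget (SLO missed, SLA met)"   # Example B
    return "SLA breached"                                     # Example C

print(classify_response_time(200))  # meeting SLO and SLA
print(classify_response_time(275))  # within error budget (SLO missed, SLA met)
print(classify_response_time(340))  # SLA breached
```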
In this article, we will explore the meaning of data aggregation, learn about data aggregators, and provide tools to help you with data aggregation.
With over 328 million terabytes of data created daily, it’s no wonder data aggregation tools are becoming increasingly important in almost every industry. In this article, we will define data aggregation, explain how it works, and explain why it is important for you and your business. We will also offer a few tools and solutions to help you and your business with data aggregation.
Data aggregation is the process of collecting, processing, and presenting typically large data sets from multiple sources as more specific, easily digestible summaries. Simply put, data aggregation helps businesses sift through large amounts of data to find the information they need and presents that data in a consumable way.
A simple example of data aggregation is when you summarize the daily expenses of your business and combine them into a monthly summary. This approach helps you avoid dealing with 30 separate line items for expenses, and you can easily view your expenses for the entire month. You can then calculate an average daily expense if needed without manually working through 30 days' worth of data to obtain the information you need.
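That expense example could be sketched in a few lines of Python (the dates and amounts are made up for illustration):

```python
from collections import defaultdict

# Hypothetical daily expense records: (date, amount in dollars).
daily_expenses = [
    ("2024-01-03", 120.50),
    ("2024-01-17", 89.99),
    ("2024-02-02", 240.00),
    ("2024-02-20", 35.25),
]

# Aggregate daily line items into monthly totals.
monthly_totals = defaultdict(float)
for date, amount in daily_expenses:
    month = date[:7]  # "YYYY-MM"
    monthly_totals[month] += amount

for month, total in sorted(monthly_totals.items()):
    print(month, round(total, 2))
# 2024-01 210.49
# 2024-02 275.25
```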
A data aggregator is a tool or service that collects data from one or multiple sources, combines it, and presents it in a simplified, cohesive format. Data aggregators are utilized in various industries globally to improve decision-making, reduce labor overhead, and consolidate information for a more comprehensive perspective.
In short, data aggregators:
Collect Data: Data aggregators pull information from one or many sources.
Process Data: After collection, data aggregators merge data into more cohesive datasets.
Present Data: After merging, data is organized and presented in an easier-to-read format.
Data aggregation is used for a variety of purposes across different industries, helping organizations make sense of large data sets and derive meaningful insights. Here are some use cases of data aggregation:
Business Intelligence: By aggregating data from various sources, businesses can get a comprehensive view of their operations, customer behavior, and market trends. This helps in making informed decisions, planning strategies, and optimizing processes.
Improving Data Quality and Efficiency: Aggregation helps in cleaning and refining data, which reduces redundancy and enhances the quality of the data. This process simplifies data analysis and improves the efficiency of data storage and management.
Performance Monitoring: Organizations use data aggregation to monitor and analyze performance metrics across different departments or sectors. This is crucial for assessing the productivity, efficiency, and effectiveness of various business operations.
Risk Management: In sectors like finance and healthcare, data aggregation is crucial for risk assessment and management. By analyzing aggregated data, companies can identify potential risks and vulnerabilities early, allowing for proactive measures to be taken.
Marketing and Customer Insights: Aggregating data about customer interactions, preferences, and behaviors helps in crafting targeted marketing strategies. This can lead to better customer engagement, improved service delivery, and enhanced customer satisfaction.
Data aggregation is a fundamental process in data management and analysis, involving three main stages: collection, processing, and presentation. Each stage plays a critical role in transforming raw data into actionable insights.
Collection: The first step in data aggregation is collecting data. This stage involves gathering data from multiple databases, systems, or external sources. Effective collection requires comprehensive systems to ensure data is accurately and consistently retrieved.
Processing: Once data is collected, the next step is processing. This stage involves cleaning and organizing the data to ensure it is useful for analysis. Processing may include filtering out irrelevant data, correcting errors, and resolving inconsistencies.
Presentation: The final stage of data aggregation is presentation. This stage involves translating the processed data into a format that is easy to understand and actionable for decision-makers. This often means visualizing the data in charts, graphs, or tables that highlight the key insights from the data aggregation process.
Data aggregation is a crucial process that enables businesses to gain a holistic view of a particular subject matter. This process provides valuable insights that can be used to make informed decisions. By identifying significant trends and patterns, data aggregation helps to optimize resource management. Comprehensive data analysis enhances operational efficiency and streamlines processes.
An example of data aggregation can be illustrated using PagerTree, an oncall management solution that streamlines the process of handling alerts. In environments where IT and support teams receive numerous notifications from various monitoring tools, the risk of alert fatigue is high due to the sheer volume of alerts that may not be immediately actionable or relevant.
PagerTree addresses this challenge by aggregating alerts into single notifications. Here’s how it works:
Collection of Alerts: PagerTree integrates with various monitoring systems and tools that generate alerts. These could be about system outages, performance anomalies, or other critical events.
Data Aggregation Process: Instead of sending each alert individually to the oncall team, PagerTree aggregates these alerts based on predefined criteria such as alert type, severity level, the system affected, or time of occurrence. This process involves analyzing the context and content of each alert to determine how they should be grouped together or aggregated.
Notification Delivery: PagerTree sends a consolidated notification to the user or team. This notification provides a comprehensive but succinct overview of the situation, allowing the recipient to quickly understand the scope and scale of the issue without having to process each alert individually.
Action and Response: With a clearer, aggregated view of alerts, oncall teams can prioritize their responses more effectively, address critical issues promptly, and reduce downtime or service disruptions.
This example is just one of many use cases for data aggregation.
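A simplified sketch of the aggregation idea in Python (illustrative grouping logic only, not PagerTree's actual implementation):

```python
from collections import defaultdict

# Hypothetical incoming alerts from different monitoring tools.
alerts = [
    {"system": "checkout-api", "type": "latency", "message": "p95 > 2s"},
    {"system": "checkout-api", "type": "latency", "message": "p99 > 5s"},
    {"system": "billing-db",   "type": "outage",  "message": "primary unreachable"},
]

# Group related alerts by (system, type) so one notification covers each group.
groups = defaultdict(list)
for alert in alerts:
    groups[(alert["system"], alert["type"])].append(alert["message"])

for (system, alert_type), messages in groups.items():
    print(f"{system} / {alert_type}: {len(messages)} related alert(s) -> {messages}")
```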
Data aggregation tools, also known as data aggregators, play a key role in presenting large amounts of data in a consumable and beneficial way. Some data aggregators can be designed for specific industries and use cases, while other data aggregators are designed to be more generalized and all-encompassing.
Here are a few data aggregators:
Power BI (Microsoft): This data aggregator is designed for end-to-end business intelligence and aggregated data visualization.
Google Data Studio (Looker Studio): Useful for creating visual representations from aggregated data.
Matillion: Powerful tool for complex data aggregations, offering extensive query capabilities.
Qlik: A leading tool for business analytics with many tools to assist in data aggregation.
Data aggregation is crucial for enabling organizations to transform raw data into actionable insights. By using proper aggregation techniques and tools, businesses can enhance their decision-making processes, boost operational efficiencies, and maintain a competitive edge in their industries. As data continues to expand in both volume and complexity, the importance of data aggregation will only become more significant.
DevOps is a partnership between software development and IT teams that emphasizes communication, collaboration, integration, and automation.
DevOps is a set of tools, practices, and philosophies that integrate and automate the work of software development and IT operations teams to improve and shorten the software development cycle.
DevOps represents a change in the mindset for IT culture. DevOps focuses on incremental development and rapid delivery of software. Success relies on the ability to create a culture of accountability, improved collaboration, and joint responsibility for business outcomes.
DevOps encourages shared responsibilities. Development and Operations staff are both responsible for the success or failure of a product. Developers are expected to do more than just build and hand off to operations -- they are expected to share the responsibility of a product over its lifetime, adopting a "you build it, you run it" mentality.
Development and Operations teams work together as a single functional team that communicates, shares feedback, and collaborates throughout the entire software development and deployment cycle.
Continuous improvement is the practice of focusing on customer needs, reducing waste, and optimizing for speed, cost, and ease of delivery.
DevOps teams use short feedback loops with end users to develop products and services tailored to their needs. With shorter feedback loops, DevOps teams get immediate visibility into how end users interact with a software system, enabling them to develop further improvements.
Plan - Teams identify the business needs and collect user feedback. They explore, organize, and prioritize ideas to be worked on during this sprint.
Code - Teams write the code for the tasks they have prioritized. Using tools like git, code is stored in a central repository to be worked on collaboratively.
Build - Once the developers finish their task, they commit code to the central repository to be packaged by build tools like Maven, Gradle, or Docker.
Test - Automated tests check code to make sure it works correctly. Tools like Selenium, JUnit, and MiniTest can all be used to run tests in parallel and to ensure software quality. Additionally, during this phase, the packaged software can be pushed to a testing (or staging) environment for user acceptance tests, performance testing, security testing, etc.
Release - The build is marked as "release" and then stored in a central image repository. A central image repository ensures there is always a releasable version. The team schedules the deployment based on the organization's needs.
Operate - The release is now live and in use by customers. Teams may use techniques like feature flags to slowly release new features to customers.
Monitor - Data is collected from customer behavior, application performance, etc. The ability to observe can help identify bottlenecks affecting performance or user adoption. Feedback is then used to start the next planning stage.
Developers regularly merge code changes into a central repository, after which automated builds and tests are run. The key goal is to find and address bugs more quickly, improve software quality, and reduce the time to validate and release new software updates.
Code changes are automatically built, tested, and prepared for release. Continuous Delivery comes after continuous integration and deploys all the code changes to a staging and/or production environment after the build stage.
Developers and system administrators use code to automate system configurations and operational tasks. The use of code makes configuration changes repeatable and standardized.
DevOps is an agile approach to organizational change that seeks to bridge traditionally siloed divides between teams and establish new processes that facilitate increased transparency and greater collaboration. The goal is to align teams, people, and processes toward a more unified customer focus.
Teams that practice DevOps can release deliverables more frequently, with higher quality and stability. With more speed you can innovate for customers faster and better adapt to changing markets.
At its core, DevOps is the collaboration between development and operations teams, who share responsibilities and combine work. Fewer handoffs and code designed for the environment in which it runs make teams more efficient and save time.
DevOps teams can perform security audits and security testing during automated workflows to integrate security into the end product. Automated deployments can prevent unauthorized access to production systems.
Docker is an open-source platform for developing, shipping, and running applications. Docker enables developers to automate the deployment of applications inside lightweight, portable containers.
Docker has revolutionized the way we build, ship, and run applications. In this article, we'll delve into the fundamental aspects of Docker, exploring its purpose, benefits, security considerations, essential tools, and associated terms.
Ultimately, Docker provides a consistent environment across different infrastructures, making developing, testing, and deploying software applications easier.
The primary purpose of Docker is to simplify the software development and deployment process.
Docker eliminates the "it works on my machine" problem by "containerizing" applications, ensuring you can ship, test, and deploy your application in any environment without worrying about incompatibility issues, regardless of the underlying machine's configuration settings.
To fully understand Docker, you must familiarize yourself with key tools and terms.
Docker Registry: A Docker Registry is a storage and distribution service for Docker images. It allows users to upload, store, and share Docker images privately or publicly. DockerHub is a popular example of a Docker Registry.
Docker Hub: Docker Hub is a cloud-based registry service provided by Docker that hosts a vast collection of public and private Docker images. It allows users to share, distribute, and collaborate on containerized applications and services.
Docker Desktop: Docker Desktop is a desktop application with an easy-to-use interface for building, running, and managing Docker containers on Windows and macOS operating systems. It includes the Docker Engine, CLI tools, and other utilities.
Kubernetes: Kubernetes is an open-source container orchestration platform for automating containerized applications' deployment, scaling, and management. It provides features such as service discovery, load balancing, auto-scaling, and rolling updates to ensure the reliability and scalability of applications running in containers.
Docker combines virtualization and container technology to provide an isolated sandbox environment. This environment facilitates the creation of lightweight containers, streamlining the application development and deployment processes.
Docker uses a client-server architecture.
The Docker client talks to the Docker daemon, which does the heavy lifting of building and running your Docker containers. The Docker client and daemon can run on the same system or on different systems. The Docker client and daemon communicate using a REST API, over UNIX sockets or a network interface.
A Docker registry is a storage and distribution service for Docker images. It is a centralized repository where Docker images can be uploaded, stored, managed, and shared. Docker registries enable developers to publish their images privately or publicly, allowing others to access and deploy them in their environments.
Docker Daemon - The Docker daemon (dockerd) listens for Docker API requests and manages Docker objects such as images, containers, networks, and volumes.
Docker Client - The Docker client (the docker command) is the primary way users interact with Docker. When a user types a command like docker run, the client sends these commands to the Docker daemon (dockerd) to be executed.
Docker Registries - Docker registries store Docker images (Docker Hub is the most popular public registry). When a user uses the docker pull or docker run command, Docker pulls the required image(s) from a configured registry. Using docker push will push an image to the configured registry.
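For example, pulling, tagging, and pushing images typically looks like this (the registry and image names are placeholders):

```bash
# Pull an image from Docker Hub (the default registry)
docker pull nginx:latest

# Tag a local image for a registry/repository you control (placeholder name)
docker tag my-app:1.0 registry.example.com/my-team/my-app:1.0

# Push the tagged image to that registry
docker push registry.example.com/my-team/my-app:1.0
```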
Docker simplifies the process of packaging, deploying, and running applications through a streamlined workflow that begins with a Dockerfile and culminates in a running container. Let's explore each step of this process in detail:
Containers and virtual machines have similar resource isolation and allocation benefits but function differently because containers virtualize the operating system (OS) instead of the hardware. Containers are more portable and efficient.
Containers: Containers share the host operating system's (OS) kernel and are lightweight, with minimal overhead. They provide fast startup times and efficient resource utilization.
Virtual Machines: Virtual machines run on top of a hypervisor and have their own guest operating system. They are heavier than containers, with higher resource overhead and slower startup times.
Portability: Docker containers can run on any platform that supports Docker, providing consistency across environments.
Scalability: Docker enables horizontal scaling, allowing applications to handle increased workloads by adding more containers.
Security: Docker containers are isolated from each other and the host machine, which adds a layer of security.
Isolation: Containers isolate applications and their dependencies, preventing conflicts and ensuring reproducibility.
Consistency: Docker streamlines the CI/CD process, reducing the likelihood of deployment errors caused by environment differences.
Version Control: Docker images can be version-controlled, enabling developers to track changes and roll back to previous versions if needed.
Docker allows you to ship, test, and deploy your applications in any environment without worrying about incompatibility issues, regardless of the machine's configuration settings.
Docker simplifies the software development lifecycle by streamlining the process of building, shipping, and running applications. It improves productivity, accelerates time to market, and enhances collaboration among development teams.
Docker is suitable for a wide range of use cases, including:
Developing and testing applications in isolated environments.
Continuous integration and continuous delivery (CI/CD) pipelines.
Scaling applications horizontally to handle varying workloads.
Containerizing legacy applications for modernization and portability.
A key practice of DevOps is to automate as much of the development and deployment lifecycle as possible. Automation is a key element that helps reduce human error while increasing productivity.
Deploy - The packaged code is deployed to the production servers. Using tools like Terraform and Chef, software can safely and predictably be deployed without downtime to end users.
Monitoring applications' performance and logs helps teams measure the impact of changes and better understand how those changes affect end users. Monitoring and logging can surface insights into the root causes of problems, and monitoring becomes more important as applications and infrastructure are updated more frequently. Performing real-time analysis helps organizations more proactively monitor their services.
Infrastructure as Code can help you manage your development, staging, and production environments in a repeatable and efficient manner.
Using practices like continuous integration and continuous delivery, DevOps teams can more reliably ensure the quality of application updates while maintaining a positive experience for end users.
Docker is an open-source platform for developing, shipping, and running applications.
Docker provides the ability to package and run an application in a loosely isolated environment called a container. The isolation and security let you run many containers simultaneously on a given host. Containers are lightweight and contain everything needed to run the application, so you don't need to rely on what's installed on the host.
Dockerfile: A Dockerfile is a text file that contains instructions for building a Docker image. It specifies the base image, environment variables, and commands to run during image creation.
Docker Image: Docker images are read-only templates that contain instructions for creating a container. A Docker image is a snapshot or blueprint of the code, runtime, configurations, libraries, and dependencies required inside a container for an application to run.
Docker Container: Docker containers are lightweight, portable, and self-sufficient execution environments that run isolated applications. They encapsulate an application and its dependencies, allowing it to run consistently across different environments.
Docker Compose: Docker Compose is a tool for defining and running multi-container Docker applications. It uses a YAML file to configure the services, networks, and volumes required for a multi-container application and simplifies the process of managing complex Docker deployments.
Dockerfile - A Dockerfile serves as the recipe for building a Docker image. It contains instructions that specify how to construct the image, including the base image to use, environment variables, dependencies to install, and commands to execute. The Dockerfile provides a declarative, version-controlled blueprint for creating consistent and reproducible images across different environments.
Build the Image - Once a Dockerfile is created, the next step is to build the Docker image using the docker build command. This command reads the instructions in the Dockerfile and executes them sequentially, layer by layer, to construct the image. Each instruction in the Dockerfile creates a new image layer, which is cached for subsequent builds to improve build performance.
During the build process, Docker pulls the necessary base image layers from a registry (e.g., Docker Hub) and executes the instructions in the Dockerfile to customize the image. Once the build is complete, Docker generates a new image with a unique identifier known as a digest.
Run the Container - With the Docker image built, the final step is to run a container based on that image using the docker run command. This command creates a new container instance from the specified image and starts it according to the configuration defined in the Dockerfile.
When running a container, Docker provisions resources such as CPU, memory, and network interfaces based on the container's configuration. The containerized application runs within this isolated environment, leveraging the dependencies and runtime environment specified in the Docker image.
Once the container runs, it can be managed, monitored, and scaled using Docker's command-line interface (CLI) or container orchestration tools such as Docker Swarm or Kubernetes.
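In practice, the build-and-run workflow boils down to two commands (the image name and ports here are placeholders):

```bash
# Build an image from the Dockerfile in the current directory
docker build -t my-app:1.0 .

# Run a container from that image, mapping host port 8080 to container port 3000
docker run -d -p 8080:3000 --name my-app my-app:1.0
```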
Efficiency: Docker containers share the host operating system's kernel, resulting in lower overhead and faster startup times than traditional virtual machines (see below).
Security is a top priority for Docker, and the platform offers several features to ensure the integrity and isolation of containers. These include:
Namespaces and Control Groups: Docker utilizes Linux kernel features such as namespaces and control groups (cgroups) to isolate containers from each other and the host system.
Image Signing and Verification: Docker supports image signing and verification to ensure that only trusted images are used in production environments.
Security Scanning: Automated vulnerability scanning tools can be used to identify container image vulnerabilities and provide remediation recommendations.
Role-Based Access Control (RBAC): Docker Enterprise Edition offers role-based access control to manage access to Docker resources based on user roles and permissions.
| Severity | Description | Example |
|---|---|---|
| SEV-1 | Critical incident with very high impact. | A customer facing service is completely down for all customers. |
| SEV-2 | Critical incident with significant impact. | A customer facing service is down for a subset of customers. |
| SEV-3 | Minor incident with low impact. | Partial loss of functionality causing inconvenience to customers. |
| SEV-4 | Minor issues requiring action, but not affecting customer ability to use the service. | Slower than average load times. |
| SEV-5 | Cosmetic issues or bugs not affecting customer ability to use the service. | Application text is misspelled. |
| Uptime | Yearly Allowed Downtime | Monthly Allowed Downtime |
|---|---|---|
| 99% | 87h, 39m | 7h, 18m |
| 99.5% | 43h, 49m, 45s | 3h, 39m |
| 99.9% | 8h, 45m, 57s | 43m, 50s |
| 99.95% | 4h, 22m, 48s | 21m, 54s |
| 99.99% | 52m, 35s | 4m, 23s |
| 99.999% | 5m, 15s | 26s |
A Docker image is a read-only template with instructions for creating a Docker container.
Docker images are the building blocks of containerized applications, providing a standardized and portable way to package and distribute software.
Docker images are read-only templates that contain everything needed to run a containerized application. This includes the application code, runtime, libraries, dependencies, and configuration files. Images are created from a Dockerfile, which specifies the steps to build the image layer by layer.
Frequently, an image is derived from another image and customized. For instance, you might create an image based on the Ubuntu image but enhance it by installing the NGINX web server, your application, and the configuration details for its execution.
You can create your own images or utilize those created by others and shared in a registry. To create your own image, you compose a Dockerfile using a straightforward syntax to outline the steps required to build and run it. Each directive in a Dockerfile generates a layer within the image. When you modify the Dockerfile and rebuild the image, only the changed layers are rebuilt. This efficiency contributes to Docker images' lightweight, compact, and swift nature, distinguishing them from other virtualization technologies.
While Docker images serve as the blueprint for containers, containers are the runtime instances of those images. Containers are ephemeral, isolated environments that run the application specified in the image. In essence, images are static, immutable artifacts, while containers are dynamic, running instances that can be started, stopped, and destroyed.
Docker images are stored in repositories known as Docker registries. These registries can be public or private and serve as centralized locations for storing and sharing Docker images. Docker Hub is a popular public registry, while organizations often use private registries for proprietary or sensitive images.
Alpine images refer to Docker images based on the Alpine Linux distribution. Alpine Linux is renowned for its minimalism and small footprint, making Alpine images significantly smaller than their counterparts based on other Linux distributions. These lightweight images are ideal for reducing container size and improving resource efficiency.
Optimizing Docker images is crucial for enhancing performance, reducing resource consumption, and accelerating container deployment. Several strategies can help optimize images:
Use Minimal Base Images: Start with a minimal base image, such as Alpine, to minimize the image size and reduce dependencies.
Leverage Multi-Stage Builds: Use multi-stage builds to separate build dependencies from the final application image, resulting in smaller, more efficient images.
Remove Unnecessary Files: Remove unnecessary files, dependencies, and build artifacts from the image to reduce bloat and improve security.
Layer Caching: Leverage layer caching during the image build process to speed up subsequent builds by reusing cached layers.
Optimize Dockerfile Instructions: Optimize Dockerfile instructions to minimize the number of layers and reduce image size. Use techniques like combining multiple commands into a single RUN instruction and cleaning up temporary files.
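Several of these strategies can be combined. The following is a rough sketch of a multi-stage build for a hypothetical Node.js application (the base images, file names, and the npm run build step are assumptions):

```dockerfile
# Build stage: install dependencies and compile the app
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build

# Final stage: copy only what is needed to run, keeping the image small
FROM node:20-alpine
WORKDIR /app
COPY --from=build /app/dist ./dist
COPY --from=build /app/node_modules ./node_modules
CMD ["node", "dist/server.js"]
```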
A Dockerfile is the blueprint for building Docker images, providing a declarative and reproducible way to define the environment and dependencies for containerized applications.
A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image. A Dockerfile adheres to a specific format and set of instructions, which you can find in the Dockerfile reference.
A Dockerfile allows users to specify various instructions to build the image and configure the container environment. Some common instruction keywords are:
FROM - The FROM instruction creates a new build stage from a base image. It's usually one of the first lines in a Dockerfile.
WORKDIR - WORKDIR sets the working directory for any subsequent RUN, CMD, ENTRYPOINT, COPY, and ADD instructions in the Dockerfile. It simplifies file path references within the Dockerfile and improves readability.
COPY - The COPY instruction copies files and directories from the host machine to the image filesystem. It is commonly used to add application code, configuration files, and dependencies to the image.
RUN - The RUN
command in a Dockerfile is used to execute commands during the image build process. When you include a RUN
instruction in your Dockerfile, Docker will execute the specified command within the container's filesystem at build time.
CMD - The CMD
command in a Dockerfile is used to specify the default command to run when a container based on the image starts. Unlike the RUN
command, which executes commands during the image build process, the CMD
command sets the default command that will be executed when the container is launched.
EXPOSE - The EXPOSE
keyword in a Dockerfile is used to document which ports a container listens on during runtime. It does not actually publish the port or make it accessible from outside the container. Instead, it serves as a form of documentation for developers, administrators, and container orchestration tools to understand which ports are intended to be used by the application running inside the container.
ENV -ENV
sets environment variables within the container. Environment variables can be used to pass configuration settings, specify runtime parameters, or customize the behavior of applications running in the container.
A full list of instruction keywords can be found in the Dockerfile reference.
The following example shows a Dockerfile that containerizes a NodeJS application.
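The exact contents depend on the application; the following is a minimal sketch that assumes the app's entry point is server.js, it listens on port 3000, and its dependencies are declared in package.json:

```dockerfile
# Start from a small official Node.js base image
FROM node:20-alpine

# All subsequent instructions run relative to /app
WORKDIR /app

# Copy dependency manifests first so this layer is cached
# until package.json changes
COPY package*.json ./
RUN npm install --production

# Copy the rest of the application source code
COPY . .

# Document the port the application listens on
EXPOSE 3000

# Default command when a container starts from this image
CMD ["node", "server.js"]
```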
To build an image from a Dockerfile, use the docker build command followed by the path to the directory containing the Dockerfile. Docker builds the image layer by layer, executing each instruction in the Dockerfile and caching intermediate layers for faster subsequent builds.
Tagging Dockerfile builds provides a way to version and identify images, making managing and distributing them easier across different environments. Tags typically consist of an image name and version number or identifier.
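For example (the image name myapp and the registry host are placeholders):

```bash
# Build an image from the Dockerfile in the current directory and tag it
docker build -t myapp:1.0 .

# Add an additional tag, e.g. for pushing to a private registry
docker tag myapp:1.0 registry.example.com/myapp:1.0
```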
Use Minimal Base Images: Start with a minimal base image to reduce image size and minimize dependencies.
Optimize Layers: Combine related commands into a single RUN instruction to reduce the number of layers and improve build performance.
Leverage Caching: Utilize layer caching to speed up build times by caching intermediate layers during subsequent builds.
Cleanup: Remove unnecessary files and dependencies after installing packages to reduce image size and improve security.
Security: Regularly update base images and dependencies to patch security vulnerabilities and ensure the integrity of the image.
The official Docker documentation provides extensive best practices for Dockerfiles.
A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings.
Docker containers are lightweight, portable, and self-contained environments that encapsulate an application and its dependencies.
Docker containers provide a consistent runtime environment across different systems, enabling applications to run seamlessly in various environments, regardless of the host machine's configuration.
Docker containers run on top of a shared OS kernel provided by the host system.
Docker containers leverage operating system (OS) level virtualization to isolate applications from the underlying host system. Each container shares the host operating system's (OS) kernel but has its own filesystem, processes, network interfaces, and resource limits. This isolation ensures that containers remain independent and do not interfere with each other or the host system.
Containers: Containers are instances of Docker images running as isolated processes on a host system. They include the application code, runtime, libraries, and dependencies required to run the application.
Images: Images are read-only templates used to create containers. They contain all the files and configurations needed to run an application. Images are typically built from a Dockerfile and can be shared and reused to create multiple containers.
Docker volumes are the preferred way to store persistent container data since they provide efficient performance and are de-coupled from the Docker host.
Applications that require data persistence, such as databases, file storage systems, or stateful applications, typically rely on persistent storage to store and retrieve data across container restarts or redeployments.
To ensure the persistence of data beyond the container's lifecycle, Docker offers two persistent storage solutions:
Volumes are the preferred way to store container data since they provide efficient performance and are de-coupled from the Docker host.
Volumes are independent of the container's lifecycle and can be easily managed, backed up, and replicated. Additionally, volumes can be attached to multiple containers simultaneously to enable sharing of data and files.
With bind mounts, changes made to files or directories within the container are reflected on the host and vice versa. Bind mounts provide flexibility but are tightly coupled to the host filesystem.
Docker volumes are stored in a location managed by Docker, typically within the Docker data directory (default: /var/lib/docker) on the host machine. The specific location depends on the Docker storage driver and configuration settings.
Yes, multiple containers can mount to the same Docker volume simultaneously. This allows multiple containers to share data and collaborate on a common dataset stored in the volume.
Yes, Docker volumes are persistent. They exist independently of the container's lifecycle and are preserved even if the associated container is removed. This makes Docker volumes suitable for storing data that needs to persist across container restarts or redeployments.
In Docker Compose, you can define volumes using the volumes section in your docker-compose.yml file. This allows you to manage volumes and volume mounts declaratively, making it easier to define and configure storage requirements for multi-container applications.
Docker networking refers to the ability for containers to connect to and communicate with each other, or to non-Docker workloads.
Creates an internal bridge network on the Docker host, allowing containers to communicate with each other.
Each running container is assigned its own IP address.
Provides network address translation (NAT) for outbound traffic and internet connectivity.
Removes network isolation between the container and the Docker host.
Containers share the network namespace with the host, using the host's network interfaces directly. For example, if you run a container that binds to port 80, it will bind to <host_ip>:80.
Offers improved networking performance but reduces isolation.
Disables networking for the container.
Useful for scenarios where network access is not required or should be restricted.
Enables communication between containers across multiple Docker hosts (Docker Swarm).
Offers IPsec encryption at the level of the Virtual Extensible LAN (VXLAN). Note: Encryption imposes a noticeable performance penalty, so test this option before using it in production.
Don't attach Windows containers to encrypted overlay networks.
Overlay network encryption isn't supported on Windows.
Docker Swarm does not report an error when a Windows host attempts to connect to an encrypted overlay network, but networking for the Windows containers is affected in the following ways:
Windows containers cannot communicate with Linux containers on the network.
Data traffic between Windows containers on the network isn't encrypted.
Provides high-performance, native connectivity for containers.
Allows each container to have its own unique IP address on the host network.
Suitable for scenarios requiring high throughput and low latency.
Enables each container to have its own MAC address and IP address on the host network.
Offers network connectivity similar to physical hosts.
Ideal for applications requiring direct host-like networking capabilities.
The Macvlan driver is helpful, especially for legacy applications or applications that need to monitor network traffic.
You can create a Docker network using the docker network create command, specifying the network driver type and any additional configuration options.
You can run a Docker container in a specified network by passing the --network=<network_name> flag to the docker run command.
To connect a running container to a network, you can use the docker network connect command, specifying the container ID or name and the network name.
You can disconnect a container from a network using the docker network disconnect command. Containers are immediately disconnected and do not need to be restarted.
List all your Docker networks with the docker network ls command.
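A hedged end-to-end sketch of those commands (network, container, and image names are illustrative):

```bash
# Create a user-defined bridge network
docker network create --driver bridge app-net

# Start a container attached to that network
docker run -d --name web --network=app-net nginx:alpine

# Connect an already-running container (here named "api") to the network
docker network connect app-net api

# Disconnect it again (no restart required)
docker network disconnect app-net api

# List all networks on this host
docker network ls
```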
Docker Compose is a tool for defining and running multi-container applications.
Simplicity: Docker Compose abstracts away the complexity of managing multiple containers, providing a simple and intuitive way to define and run applications.
Consistency: With Docker Compose, you can define your application's configuration declaratively, ensuring consistency across different environments.
Scalability: Docker Compose enables you to scale your application effortlessly by defining and running multiple instances of your containers with a single command.
Development Environments: Docker Compose is widely used for setting up development environments, allowing developers to spin up their application stack quickly and consistently.
Testing Environments: Docker Compose facilitates the creation of isolated testing environments, enabling automated testing of multi-container applications (think GitHub Actions building and testing your application as part of its CI/CD process).
Production Deployments: While Docker Compose is primarily used for development and testing, it can also be leveraged for deploying small-scale production environments or prototyping solutions.
The easiest and recommended way to install Docker Compose is as part of the Docker Desktop installation package, which provides a seamless experience for managing both Docker and Docker Compose on your local machine.
docker compose build - Build or rebuild services.
docker compose up - Create and start containers.
docker compose down - Stop and remove containers and networks.
docker compose restart - Restart service containers.
docker compose ps - List containers.
docker compose run - Run a one-off command on a service.
docker compose exec - Execute a command in a running container.
docker compose top - Display the running processes.
The default path for a Compose file is compose.yaml (preferred) or compose.yml, placed in the working directory. Compose also supports docker-compose.yaml and docker-compose.yml for backward compatibility with earlier versions. If both files exist, Compose prefers the canonical compose.yaml.
Below is a sample Compose file:
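A minimal sketch of a Compose file (service names and images are illustrative):

```yaml
services:
  web:
    build: .            # build the image from the Dockerfile in this directory
    ports:
      - "8000:5000"     # HOST_PORT:CONTAINER_PORT
    depends_on:
      - redis
  redis:
    image: redis:alpine # use a prebuilt image from a registry
```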
docker compose run -e in the CLI - Set environment variables as an explicit flag on the docker compose run command.
The shell environment - Set environment variables from the command line environment when using docker compose up.
environment attribute in the Compose file - Set environment variables directly inside compose.yaml.
--env-file argument in the CLI - Point Compose at an environment variable file from the command line.
env_file attribute in the Compose file - Specify the environment variable file from compose.yaml.
.env file placed at the root of your project directory - The .env file should be placed at the root of the project directory next to compose.yaml. The .env file is the default method for setting environment variables in containers.
ENV directive in the Dockerfile - Bake environment variables into the image with the ENV directive.

Volumes in Docker Compose enable you to persist data generated by your containers across container restarts or deployments. They provide a reliable mechanism for managing and sharing data between containers.
docker compose up will automatically create volume(s) if they do not already exist.
docker compose up will mount the volume(s).
docker compose down will not remove or destroy the volume(s).
The following example shows how a volume can be connected to multiple containers simultaneously.
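A sketch of such a Compose file, matching the dbdata volume and mount paths described below (the image names are illustrative):

```yaml
services:
  backend:
    image: example/backend:latest
    volumes:
      - dbdata:/etc/data

  backup-service:
    image: example/backup:latest
    volumes:
      - dbdata:/var/lib/backup/data

volumes:
  dbdata:
```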
Running docker compose up will create the dbdata volume if it was not already created in Docker Engine. The dbdata volume would then be mounted to the backend container at /etc/data and to the backup-service container at /var/lib/backup/data.
The external attribute tells Docker Compose the volume already exists and is managed outside of the Docker Compose lifecycle. If docker compose up is run and the volume doesn't exist, Docker Compose will return an error.
So as an example, let's imagine your app is in a directory called myapp and has the following compose.yaml:
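A minimal sketch of what that compose.yaml might contain (the build context and images are illustrative):

```yaml
services:
  web:
    build: .
    ports:
      - "8000:8000"
  db:
    image: postgres
    environment:
      POSTGRES_PASSWORD: example
```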
Running docker compose up would result in:
A default network named myapp_default being created.
web and db service containers would be created and connected to the myapp_default network.
Containers could then look up services based on their names (web or db). For example, the connection string to the Postgres container would look like: postgres://db:5432.
Port mapping in Docker Compose enables you to expose container ports to the host system (HOST_PORT) or to other containers (CONTAINER_PORT) within the same Docker network. It allows external access to containerized services and facilitates communication between containers.
Docker Compose port mapping is specified by the pattern: HOST_PORT:CONTAINER_PORT
The HOST_PORT is how services outside the network can connect to the service.
The CONTAINER_PORT is how services inside the network can connect to the service.
Let's use the following Compose file as an example:
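A sketch consistent with the connection URLs described below, where 8001 is the HOST_PORT and 5432 is the CONTAINER_PORT (the password value is a placeholder):

```yaml
services:
  db:
    image: postgres
    ports:
      - "8001:5432"   # HOST_PORT:CONTAINER_PORT
    environment:
      POSTGRES_PASSWORD: example
```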
Applications or users (outside the Docker Compose network) could use the HOST_PORT to connect to the service. The connection URL to the database might look like postgres://localhost:8001 if started locally.
Services (inside the Docker Compose network) would use the CONTAINER_PORT to connect to the service. The connection URL to the database would be postgres://db:5432.
test: An array of strings specifying the command for checking health.
interval: A duration of how often to run the check after the container has started.
timeout: The maximum duration a single health check is allowed to run before it is considered a failure.
retries: Number of consecutive failures of the health check for the container to be considered unhealthy.
start_period: A duration of the allowed initialization time for the container. Failed health checks during this period will not count against the retries until after the first successful check.
start_interval: A duration of how often to run the check during the container initialization.
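Putting the parameters above together, a hedged example of a healthcheck in a Compose file (the image and the /health endpoint are assumptions):

```yaml
services:
  web:
    image: example/web:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
      start_interval: 5s
```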
Dockerfiles and Docker Compose files serve complementary roles in the containerization process.
Docker Compose:
Scope: Docker Compose is primarily focused on simplifying the management of multi-container Docker applications on a single host or development environment.
Ease of Use: Docker Compose provides a simple and intuitive way to define, run, and manage multi-container applications using a single YAML file (compose.yaml).
Features: Docker Compose offers features such as service definition, container networking, volume management, and dependency management, making it well-suited for development and testing environments.
Scaling: While Docker Compose supports scaling of container instances, it is limited compared to Kubernetes and is typically used for smaller-scale deployments.
Kubernetes:
Scope: Kubernetes is a powerful container orchestration platform designed for deploying, managing, and scaling containerized applications across clusters of machines.
Complexity: Kubernetes has a steeper learning curve than Docker Compose due to its extensive feature set and complex architecture.
Scaling: Kubernetes excels at scaling containerized applications across multiple nodes in a cluster, providing features such as automatic scaling, self-healing, and rolling updates.
High Availability: Kubernetes offers built-in support for high availability and fault tolerance, with features like pod replication, load balancing, and service discovery.
Production Deployments: Kubernetes is well-suited for production deployments of containerized applications, offering advanced features for managing large-scale, mission-critical workloads.
Docker Compose is ideal for local development, testing, and small-scale deployments where simplicity and ease of use are paramount.
Kubernetes is better suited for production deployments, large-scale applications, and environments requiring high availability, scalability, and advanced orchestration features.
Ultimately, the choice between Docker Compose and Kubernetes depends on factors such as the size and complexity of your application, your deployment environment, and your organization's requirements for scalability, availability, and automation.
A crash course of the Prometheus monitoring system. Learn why Prometheus is a popular choice for monitoring modern infrastructure.
Prometheus boasts many features that make it a powerful monitoring and alerting tool for modern infrastructure. Here are some key features:
Alerting: Prometheus features a built-in alerting system that allows users to define alerting rules based on specific conditions and thresholds. When these conditions are met, Prometheus can trigger alerts, notifying users or external systems via various channels such as email or webhooks.
Service Discovery: Prometheus supports service discovery mechanisms that automatically discover and monitor targets in dynamic environments. This includes integrations with Kubernetes, Consul, and other service discovery systems, as well as static configuration options.
Data Retention and Storage: Prometheus offers configurable retention periods for stored metrics data, allowing users to define how long data should be retained for analysis and alerting purposes. It stores data locally in its time-series database, making it easily accessible for querying and visualization.
Prometheus follows a "pull-based" model. It periodically "scrapes" metrics from configured targets, stores them in its time-series database, allows querying and visualization of metrics data, and supports alerting based on defined rules.
Prometheus Server: The core component that collects and stores time-series data based on metrics scraped from instrumented targets. It includes a time-series database (TSDB) for storing metrics data.
Prometheus Configuration: Administrators configure Prometheus to scrape specific targets for metrics data. This configuration includes details such as scrape intervals, targets, and other settings.
Service Discovery: Prometheus supports various service discovery mechanisms, such as static configurations, DNS-based discovery, Kubernetes service discovery, etc., to dynamically discover and monitor targets.
Exporters: Exporters are agents that run alongside services or systems to expose metrics in a format that Prometheus can scrape.
Multi-Dimensional Numeric Time Series: Prometheus is particularly well suited for recording multi-dimensional numeric time series.
Reliability: Prometheus is distributed as a single binary with no dependencies, making it very reliable, even when other parts of your infrastructure are down.
Machine Learning or AI-based Anomaly Detection: Prometheus supports basic alerting and aggregation, but other tools will offer more advanced analytics capabilities.
Docker Swarm is a container orchestration tool that enables the management and deployment of containerized applications at scale.
Simplicity: Docker Swarm offers a straightforward setup and management experience, making it accessible to developers and operations teams.
High Availability: Docker Swarm provides built-in high availability features, ensuring that applications remain accessible even in the event of node failures.
Cost-effectiveness: Docker Swarm helps reduce infrastructure costs by enabling the efficient use of resources and supporting dynamic scaling based on demand.
Even when Docker is in Swarm mode, you can still run standalone containers and swarm services on any host in the swarm. However, only swarm managers can control the swarm, while Docker daemons can manage standalone containers. Daemons can be managers, workers, or both in a swarm.
Yes. You can use the --compose-file flag to have Docker Swarm deploy a Docker Compose stack.
A Docker host refers to a physical or virtual machine (e.g., a server or a cloud instance) on which the Docker Engine is installed and running.
A Docker node refers to a member in a swarm mode cluster. Every Swarm node must be a Docker host, but not every Docker host is necessarily a member of a swarm cluster.
Docker Swarm:
Swarm operates at the level of entire clusters, allowing you to manage multiple Docker hosts as a single entity.
Swarm is designed for production environments where high availability, scalability, and resilience are critical.
Swarm supports declarative service definitions, meaning you specify the desired state of your services, and Swarm works to maintain that state.
Swarm includes built-in support for service discovery and load balancing.
Docker Compose:
Docker Compose is a tool for defining and running multi-container Docker applications. It allows you to define the services, networks, and volumes for your application in a single YAML file, known as a compose.yaml file.
Compose is typically used for development and testing environments where you need to spin up multiple containers that work together as part of your application stack.
It operates at the level of individual applications or projects, allowing you to define the relationships between containers within a single application.
Compose is lighter and simpler to use than Swarm, making it ideal for local development and testing workflows.
While Compose can run on a single host, it does not provide the same level of scalability and fault tolerance as Swarm.
Docker Swarm is a container orchestration tool for managing clusters of Docker hosts in production environments, while Docker Compose is a tool for defining and running multi-container Docker applications, primarily used in development and testing environments.
Architecture:
Docker Swarm is built into the Docker Engine, providing a simple and easy-to-use clustering solution for Docker containers. It uses a manager-worker architecture, where manager nodes control the cluster and schedule workloads onto worker nodes.
Scalability and Features:
Kubernetes is known for its scalability and extensive feature set, including advanced scheduling, service discovery, load balancing, rolling updates, auto-scaling, and more. It is designed to manage large-scale production environments with thousands of containers.
Docker Swarm is simpler and lighter than Kubernetes, making it easier to set up and manage for smaller-scale deployments. It provides basic features for container orchestration but may lack some of the advanced capabilities Kubernetes offers.
Ecosystem and Community:
Kubernetes has a larger and more mature ecosystem, and it is widely adopted in the industry. A vibrant community supports it and has a rich set of third-party tools and integrations.
Docker Swarm has a smaller ecosystem than Kubernetes, with fewer third-party tools and integrations available. However, it benefits from tight integration with Docker tools and workflows, making it more approachable for users already familiar with Docker.
Ease of Use:
Docker Swarm is designed to be easy to set up and use, especially for users already familiar with Docker. It provides a simple and intuitive user interface for managing container clusters.
Kubernetes has a steeper learning curve than Docker Swarm, but it offers more flexibility and control over container orchestration. It may require more effort to set up and manage, particularly for users new to Kubernetes concepts.
Docker Swarm is a simpler and more lightweight container orchestration solution suitable for smaller-scale deployments and users already familiar with Docker, while Kubernetes is a more powerful and feature-rich platform designed for large-scale production environments with complex requirements. The choice between Docker Swarm and Kubernetes depends on factors such as the size and complexity of your deployment, your familiarity with Docker and Kubernetes, and your specific use case requirements.
By default, data inside a container is only preserved for the duration of the container's lifespan; once the container is removed or destroyed, the data becomes inaccessible. Thus, persistent storage becomes necessary when you need to retain data beyond the lifespan of a container.
Volumes are dedicated storage units managed by Docker. Only Docker containers can access volumes.
Bind mounts allow you to mount a directory or file from the host machine into a container. Bind mounts can be accessed by both Docker processes and non-Docker processes.
One common scenario where you would use a bind mount is when you are developing an application and want to make code changes on your host machine that are immediately reflected within the container without rebuilding the image.
tmpfs mounts, or temporary filesystems, are temporary storage areas created in a container's memory space. They are helpful for storing transient data or temporary files within a container (think log files or caching). tmpfs is ephemeral and does not persist data across container restarts. Additionally, you can't share tmpfs mounts between containers.
To mount a volume, you can use the -v or --volume flag with the docker run command, specifying the volume name and mount path within the container.
Alternatively, you can define volume mounts in a Docker Compose file using the volumes section (see below).
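For example (the volume name, image, and mount path are illustrative):

```bash
# Create a named volume (docker run would also create it on first use)
docker volume create app-data

# Mount the volume into the container at /var/lib/app
docker run -d --name myapp -v app-data:/var/lib/app nginx:alpine
```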
Volume drivers extend Docker's native volume functionality by enabling integration with external storage systems or cloud providers. Storage drivers allow you to use specialized storage solutions, such as network-attached storage (NAS), block storage, or cloud storage, as Docker volumes.
A Docker network is a virtualized network that allows containers to communicate with each other and external networks. It provides isolation, security, and flexibility for containerized applications, enabling efficient data exchange and container connectivity.
Docker provides several network drivers applicable to various scenarios.
The bridge driver is the most common driver type. It uses a software bridge that lets containers connected to the same bridge network communicate while providing isolation from containers that aren't connected to that bridge network.
Bridge is the default network driver in Docker. When you first install Docker, a default bridge network (called bridge) is created automatically. Newly started containers connect to it unless otherwise specified.
The host driver allows containers to share the host's network stack without isolation. Containers are not allocated their own IP address, and port bindings will be published directly to the host's network interface.
The none driver completely isolates a container from the host and other containers.
Overlay networks are distributed networks that span multiple Docker hosts. When using overlay networks, Docker transparently handles routing traffic to and from the correct Docker hosts and the correct destination containers. When encrypted, overlay networks allow containers to communicate securely.
The IPvlan driver gives users complete control over both IPv4 and IPv6 addressing. The IPvlan driver is helpful when you are integrating containerized services with an existing physical network, need high performance, or want fine-grained control over IP address assignment.
The Macvlan driver allows containers to appear as physical devices on your network. It works by assigning each container on the network a unique MAC address.
The Macvlan network type requires you to dedicate one of your host's physical interfaces to the virtual network. The network should be appropriately configured to support the potentially large number of MAC addresses that could be created by running many containers. See the warnings provided in the Docker Macvlan documentation.
Bridge networks are the optimal choice for most scenarios encountered in Docker. Containers within this network communicate using individual IP addresses and DNS names. Moreover, they possess access to your host's network, granting them connectivity to the internet and your local area network (LAN).
Host networks excel when direct binding of ports to your host's interfaces is necessary, without concern for network isolation. They enable containerized applications to network like they are running directly on your host.
Overlay networks become essential when communication between containers on different Docker hosts is required (think Docker Swarm). These networks facilitate the establishment of distributed environments, ensuring high availability across the network.
IPvlan networks are an advanced option, catering to specific requirements concerning container IP addresses, tags, and routing. They offer fine-grained control over network configurations, making them suitable for complex network setups.
Macvlan networks should be used when containers must emulate a physical device on your host's network. This functionality proves beneficial, particularly when running applications tasked with monitoring network traffic.
In Docker Compose, you can define networks in the networks section of your docker-compose.yml file. This allows you to configure custom networks for multi-container applications.
Docker leverages your host's network stack to establish its networking infrastructure. This involves manipulating iptables rules and network namespaces to route traffic to your containers efficiently, ensuring isolation between Docker networks and your host.
iptables, the standard Linux packet filtering tool, dictates how traffic traverses your host's network stack. Docker networks add filtering rules that route matching traffic to your container's application. Docker automatically configures these rules and eliminates the need for manual interaction with iptables.
Each Docker container is allocated its own network namespace, leveraging a Linux kernel feature to create isolated virtual network environments. Additionally, containers generate virtual network interfaces on your host, enabling communication beyond their namespace using your host's network.
While Docker's networking implementation involves complex and low-level details, it abstracts them away from end users, delivering a seamless container networking experience that is predictable and efficient. For complete documentation, see the official Docker networking documentation.
Docker Compose is a powerful tool that simplifies the management and orchestration of Docker applications. It allows you to define and run multi-container applications using a single YAML file, streamlining the development, deployment, and scaling of containerized environments.
Installation instructions are available in the official Docker documentation.
Docker Compose reads a Compose file (commonly named compose.yaml) that defines the configuration of your multi-container application. You then interact with your Compose application through the Docker Compose CLI.
Commands like docker compose up are used to create and manage the containers specified in the Compose file, handling tasks such as container creation, networking, volume mounting, and service dependencies. You can then use docker compose down to stop and remove the created containers and networks.
Docker Compose provides a set of CLI commands for managing your multi-container application. Some common commands are listed below:
The Docker Compose file (compose.yaml) serves as the blueprint for your multi-container application. It defines the services, networks, volumes, and other configurations required to run your application stack. The Docker Compose file follows the Compose file specification.
Environment variables provide a flexible way to pass configuration information to containers at runtime. Docker Compose provides 7 different ways to set environment variables. They are listed below, with examples in the official documentation.
Environment variables should not be used to pass sensitive information (like passwords). Secrets should be used instead.
Pass an environment variable from the command line.
Use the ENV directive in the Dockerfile.
Docker Engine manages volumes and has the following behavior:
Docker Compose supports several volume attributes, but an important one worth mentioning is the external attribute.
Docker Compose simplifies container networking by automatically creating a default network for your application stack. This network allows containers to discover and communicate with each other using service names defined in the compose.yaml file. The default network is named based on the "project name" (the root directory in which the compose.yaml resides).
Docker Compose supports health checks for containers, allowing you to define conditions for determining if a container is still working. Health checks can help detect unresponsive applications even though the process is still running.
Dockerfiles are used to define the contents and build process of individual Docker images.
Compose files are used to define and manage multi-container applications, orchestrating the deployment and management of multiple containers as a cohesive application stack.
While both Docker Compose and Kubernetes are popular tools for managing containerized applications, they serve different purposes and are designed for different use cases.
Docker Desktop - Download links for Docker Desktop.
Play with Docker - Run Docker directly in your browser.
Docker Docs - Reference, Guides, and Manuals for Docker.
Docker Hub - Repository for official images.
PagerTree Resources - PagerTree Docker Commands Cheat Sheet and other resources.
Docker CLI Cheat Sheet - Official Docker CLI Cheat Sheet.
Awesome Docker - A curated list of Docker resources and projects.
Awesome Compose - A curated list of Docker Compose samples.
A comprehensive open-source book teaching fundamentals, best practices, and intermediate Docker functionalities.
Prometheus is an open-source systems monitoring and alerting toolkit. It is designed to monitor the health and performance of systems and applications in dynamic cloud-native environments.
Prometheus collects metrics data from targets, such as servers, containers, databases, or even applications instrumented with Prometheus client libraries. Metrics are identified and organized by key-value pairs called labels. Metrics can then be queried using PromQL, which is then used by other systems like Grafana for visualization or Alertmanager for alerting and notifications.
Prometheus is widely recognized for its simplicity, reliability, and scalability, making it a popular choice for monitoring modern infrastructure. Its active community and ecosystem of exporters and integrations further contribute to its appeal as a comprehensive monitoring solution for today's dynamic IT environments.
Time Series Data Collection: Prometheus collects time series data, which is crucial for monitoring the behavior of systems and applications over time.
Multi-dimensional Data Model: Metrics in Prometheus are identified using key-value pairs called labels. This multi-dimensional data model allows for flexible and efficient querying and aggregation of metrics.
PromQL: PromQL is a powerful query language that enables users to perform complex queries on collected metrics data. PromQL allows for tasks such as metric selection, filtering, aggregation, and mathematical operations.
Scalability: Prometheus is designed to be highly scalable and capable of handling large volumes of metrics data across distributed environments. It supports federation, allowing multiple Prometheus servers to collaborate and aggregate metrics data from different sources.
Exporters and Integrations: Prometheus has a rich ecosystem of exporters and integrations that allow users to monitor a wide range of systems and applications. Exporters are agents or libraries that expose metrics in a format that Prometheus can scrape, enabling monitoring of third-party services, databases, and custom applications.
Grafana Integration: Prometheus integrates seamlessly with Grafana, a popular open-source visualization tool. Grafana allows users to create customizable dashboards and visualizations of Prometheus metrics data, enabling detailed monitoring and analysis.
Pushgateway: In scenarios where direct scraping is not feasible (e.g., short-lived jobs), the Pushgateway allows applications to push metrics to Prometheus, which are then scraped by the Prometheus server.
Alertmanager: This component handles alerts generated by Prometheus. It manages the routing, grouping, and notification of alerts to various integrations such as email and webhooks.
Visualization and Querying: Prometheus provides a built-in expression browser for querying and graphing metrics data. However, it can also integrate with visualization tools like Grafana for more advanced visualization and dashboarding capabilities.
Monitoring Distributed or Cloud Native Applications: Prometheus is well suited for monitoring containerized environments like Kubernetes. Its service discovery mechanisms make it easy to monitor dynamic systems. Its querying and alerting functionality can notify you when your metrics begin to breach thresholds.
Require 100% Accurate Metrics: Because of server restarts and the way Prometheus scrapes data, some metrics may be lost. Use cases like "per-request billing" are unsuitable for Prometheus.
Logging or Tracing: Prometheus exclusively deals with numeric metrics. A dedicated logging solution is better suited for logs.
High Cardinality Metrics: If your monitoring requirements involve a very high number of unique metric dimensions or labels (known as high cardinality), Prometheus may struggle to handle the volume efficiently. In such cases, other solutions designed for handling high cardinality data might be more appropriate.
Long-Term Data Storage: While Prometheus is excellent for real-time monitoring and short-term analysis, it's not optimized for long-term data retention. If your use case requires storing metrics data for extended periods (months or years), you might need to integrate Prometheus with remote storage backends designed for long-term retention.
This is mostly a marketing claim. You can reliably run Prometheus with tens of millions of active series. If you need more than that, there are several options for scaling and federating Prometheus.
The counter functions (rate(), irate(), and increase()) automatically handle server restarts ("counter resets"). Check out the counter resets section, where we discuss how values after a reset are adjusted.
rate() and increase() must be passed a range vector and operate on the counter and histogram datatypes.
Summary quantiles cannot be averaged across a cluster. Summaries are computed on the metric producer and cannot be averaged on the Prometheus server.
It allows users to create and manage a cluster of Docker hosts, known as a Swarm, and deploy applications across the cluster seamlessly.
Docker Swarm provides a simple yet powerful solution for automating the deployment, scaling, and management of containerized applications in production environments.
Scalability: With Docker Swarm, you can easily scale your applications horizontally by adding or removing nodes from the cluster.
Resource Efficiency: Docker Swarm optimizes resource utilization by efficiently scheduling and distributing tasks across nodes in the cluster.
A Docker Swarm is a group of Docker nodes that work together. Some nodes act as managers to handle membership and delegate tasks, while others act as workers to run services. Each node can be a manager, a worker, or both.
When you create a service in a swarm, you specify its desired state, including the number of replicas you want, the network and storage resources it needs, and which ports it should use. Swarm makes sure the service stays in that state. For example, if a worker goes down, Docker Swarm automatically moves its tasks to other nodes. A task is just a running container that's part of a swarm service and is managed by a manager, not on its own.
Swarm services have an advantage over standalone containers because you can change their settings, like which networks or volumes they're connected to, without restarting them manually. Swarm updates the configuration, stops old tasks, and starts new ones to match.
Just like you can use Docker Compose to set up containers, you can use Docker Swarm to define and run stacks of swarm services.
A Docker host refers to a physical or virtual machine (e.g., a server or a cloud instance) on which the Docker Engine is installed and running.
The term "host" is not specific to Swarm but rather the entire Docker ecosystem.
A node is an instance of the Docker engine participating in the swarm. You can run one or more nodes on a single Docker host.
The term "node" is Swarm specific.
A cluster is a group of Docker nodes (2 or more) running together as a single virtual host.
Manager nodes are responsible for managing and coordinating the activities of a Docker Swarm cluster, including deployment, distribution, fault tolerance, and security.
Worker nodes are responsible for executing tasks and communicating with the Swarm manager to receive task assignments, report their status, and request updates.
Services are the primary unit of work in Docker Swarm. They define an application's desired state, including the number of replicas, networking configuration, and resource constraints.
Tasks represent individual service instances running on a worker node. Docker Swarm manages the distribution and execution of tasks across the cluster.
Docker Swarm uses overlay networks to facilitate communication between services running on different nodes in the cluster. Overlay networks provide transparent network connectivity and support load balancing and service discovery.
Volumes in Docker Swarm enable persistent storage for containerized applications. They allow data to be shared and preserved across container restarts or redeployments.
You can deploy services to Docker Swarm using Compose files (ex: docker stack deploy --compose-file compose.yaml stack_name) or the Docker CLI. Services define the desired state of an application, including the number of replicas and resource constraints.
Docker Swarm makes it easy to scale services horizontally by adjusting the number of replicas. You can scale services up or down based on demand to meet performance and capacity requirements. (ex: docker service scale stack_name=5)
Docker Swarm allows you to apply rolling updates easily (ex: docker service update --image myapp:latest stack_name). If the update fails, the deployment will halt. Rolling updates ensure applications can maintain high availability.
Docker Swarm uses overlay networks to enable communication between services running on different nodes in the cluster. Overlay networks provide a transparent and efficient way to connect services across the Swarm.
Docker Swarm supports volume management for persistent data in containerized applications. Volumes allow data to be shared and preserved across container restarts or redeployments, ensuring data integrity and availability.
Docker Swarm and Docker Compose are tools provided by Docker for managing containers, but they serve different purposes and operate at different levels of abstraction.
Docker Swarm is a container orchestration tool for deploying and managing a cluster of Docker hosts. It enables you to create a cluster of Docker hosts and deploy services across them, providing features like scaling, load balancing, service discovery, and rolling updates.
Docker Swarm and Kubernetes are both container orchestration platforms, but they have different architectures, features, and use cases. Here are some key differences between Docker Swarm and Kubernetes:
Kubernetes, on the other hand, has a more complex architecture consisting of several components such as the API server, scheduler, controller manager, and etcd for maintaining cluster state. It follows a master-node architecture, with a control plane (master) managing one or more worker nodes.
Learn about the Prometheus time series data model. Understand what metrics and labels are. Learn best practices for naming conventions and base units.
Prometheus stores all data as time series: streams of numeric values sampled at ongoing timestamps.
Every time series is uniquely identified by its metric name and optional labels.
In Prometheus, everything revolves around metrics. A metric is a feature (i.e., a characteristic) of a system being measured. Typical examples of metrics are:
http_requests_total
http_request_size_bytes
system_memory_used_bytes
node_network_receive_bytes_total
In Prometheus, metric labels are sets of key-value pairs that help categorize and differentiate subdimensions.
Typical examples of labels for the http_requests_total metric include:
method: GET|PUT|POST|DELETE
status: 100..599
path: /api/v4/alerts
CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, IP addresses, email addresses, or other unbounded values.
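Putting the metric name and labels together, a single time series is identified by a combination such as the following (the label values are illustrative):

```promql
http_requests_total{method="POST", status="200", path="/api/v4/alerts"}
```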
Samples form the bulk data and are appended to a series over time.
Timestamps - 64-bit integers in millisecond precision.
Sample values - 64-bit floating point numbers.
Metric names should follow these rules:
Use a single-word application prefix ("namespace") for the application's domain (e.g., pagertree_notifications_total, process_cpu_seconds_total, http_request_duration_seconds).
Use a single unit specified in base units.
Use a suffix describing the unit (e.g., pagertree_notifications_total, node_memory_usage_bytes, process_cpu_seconds_total).
Represent a logical thing being measured (e.g., number of notifications sent, bytes of data transfer, request duration).
Prometheus does not have any hard-coded units. Base units should be used for better compatibility. The following lists some metrics families with their base units. The list is not exhaustive.
Learn what PromQL is and how to use it to query Prometheus. Learn how to select, filter, and aggregate time series.
PromQL is the query language of the Prometheus monitoring system. It is a central feature of Prometheus that enables dashboarding, alerting, and ad-hoc querying of the collected time series data.
PromQL allows you to select, aggregate, and otherwise transform and compute on time series data in a flexible way. The result of an expression can either be shown as a graph, viewed as tabular data in Prometheus's expression browser, or consumed by external systems via the HTTP API.
PromQL is used only for reading data.
Learn about Prometheus time series, vectors, instant vectors, range vectors, ranges, and filters.
A set of related time series is called a vector.
Instant Vector - a set of time series where every timestamp maps to a single data point at that “instant”.
Imagine evaluating the expression http_requests_total at a given timestamp. http_requests_total is an instant vector selector that selects the "latest sample" for any time series with the metric name http_requests_total. More specifically, "latest" means "at most 5 minutes old and not stale", relative to the evaluation timestamp. So this selector will only yield a result for series that have a sample at most 5 minutes before the evaluation timestamp and where the last sample before the evaluation timestamp is not a stale marker (an explicit way of marking a series as terminating at a certain time in the Prometheus TSDB).
Range vector - a set of time series in which every timestamp maps to a “range” of data points recorded some duration into the past.
A range query works exactly like many completely independent instant queries that are evaluated at subsequent time steps over a given range of time. Of course, this is highly optimized under the hood, and Prometheus doesn't actually run many independent instant queries.
For example, http_requests_total[5m] would return all the data points falling in a 5-minute window at the evaluation timestamp.
Instant vectors can be graphed, but range vectors cannot.
Instant vectors can be compared, and arithmetic operations can be performed on them, but range vectors cannot.
You can change any instant vector selector into a range vector selector by appending a duration specifier [<number><unit>]. For example, [5m] for a 5-minute range.
Valid duration units:
ms - milliseconds
s - seconds
m - minutes
h - hours
d - days
w - weeks
y - years
Matcher Types:
= : Equals
!= : Not Equals
=~ : Regular Expression Match
!~ : Regular Expression Not Match
The following example would select all the metrics with the name "http_requests_total" that have a job label matching exactly demo and a path label starting with /api.
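A selector matching that description (since regex matches are fully anchored, the trailing .* is needed to match anything after /api):

```promql
http_requests_total{job="demo", path=~"/api.*"}
```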
Regular expression matches are fully anchored. A match of path=~"/api" is treated as path=~"^/api$". You can test your regex matches with any tool that supports the Golang regex flavor.
Did you know that the metric names in Prometheus are actually stored as labels? The __name__ label actually stores the metric name. This can be useful when trying to dynamically match metric names.
Learn about the four different Prometheus metric types: counter, gauge, summary, and histogram.
Counters track values that can only increase, like HTTP request counts or CPU seconds used.
Functions that are commonly used with counters are rate(), irate(), and increase().
Gauges track values that can increase or decrease, like temperatures or disk space.
Usually, gauges do not need to be operated on by functions before they can be graphed.
Summaries calculate client-side-calculated quantiles from observations, like request latency percentiles. They also track the total count and total sum of observations.
NOTE: Summaries cannot be aggregated across labels or multiple instances.
NOTICE: The client has already reduced summaries into a floating point number.
Histograms track cumulative bucketed counts of observations, such as request durations. They also track the total count and total sum of observations.
Histograms need to be processed by the Prometheus server. They also will always have the label "le" denoting the upper bound of the bucket.
NOTICE: Histograms are observed as counts, similar to the counter metric type.
Both histogram and summary metrics can be used to calculate quantiles, but they have different trade-offs. The most important is that summary metrics cannot be aggregated over dimensions or multiple instances. The official documentation provides an in-depth analysis of the differences.
Learn how Prometheus handles counter resets with rate, irate, and increase functions.
Counter metrics can reset to zero when a scraped process restarts (e.g., the server is restarted). Counter functions automatically handle counter resets by assuming that any decrease in a counter value was a reset. Internally, these functions compensate for the reset by adding the last sample value before the reset to all sample values after the reset.
irate() is much more responsive than rate(). It is good for high-resolution metrics. It should not be used for alerting conditions.
increase() - "absolute increase" - calculates the absolute increase over a given time value, including extrapolation.
Logically, only the increase() function includes extrapolation because it measures an absolute increase. rate() and irate() functions calculate a slope (derivative), which will not change even if extrapolation is included.
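For illustration, assuming an http_requests_total counter (the metric name and time windows are illustrative):

```promql
# Average per-second rate of requests over the last 5 minutes
rate(http_requests_total[5m])

# Per-second rate based on the last two samples only (fast-moving graphs, not alerting)
irate(http_requests_total[5m])

# Absolute number of requests added over the last hour (extrapolated)
increase(http_requests_total[1h])
```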
Pushgateway allows short-lived jobs to expose their metrics to Prometheus.
In essence, Pushgateway enables Prometheus to collect metrics from jobs or instances that are not long-lived or constantly available. It provides a way for these ephemeral metrics to be stored and queried alongside metrics from other sources in Prometheus.
Pushgateway should only be used to capture the outcome of a service-level batch job. A "service-level" (or "application-level") batch job is not related to a specific machine or job instance. An example of a service-level batch job could be the number of images an application resizes.
Do not try to use Pushgateway to turn Prometheus into a push-based system or do machine-level monitoring. There are several reasons for this:
Pushgateway becomes a single point of failure for all pushed metrics and a potential bottleneck.
You lose the benefits of Prometheus' service discovery support and automatic health monitoring (up metric).
Pushgateway does not expire metrics automatically, which could lead to Prometheus scraping stale metrics unless they are manually deleted.
Pushgateway exposes metrics about itself as well as the groups of metrics that other jobs have pushed to it under the /metrics HTTP path. Thus, you can scrape it like any other target.
Add the following to the scrape_configs section in your prometheus.yml to scrape the Pushgateway:
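A hedged sketch of that configuration (the Pushgateway address is a placeholder; honor_labels: true keeps the labels pushed by the jobs themselves, which relates to the note on the scrape option further below):

```yaml
scrape_configs:
  - job_name: "pushgateway"
    honor_labels: true          # keep labels pushed by the jobs themselves
    static_configs:
      - targets: ["pushgateway.example.org:9091"]
```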
Pushing metrics is as simple as making a POST request to your Pushgateway instance using the following URL format:
/metrics/job/<JOB_NAME>{/<LABEL_NAME>/<LABEL_VALUE>}
The following example makes a POST request with the following metric labels: {__name__="some_metric",job="some_job"}.
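A minimal sketch using curl (the Pushgateway address is a placeholder); pushing the body "some_metric 3.14" to the job some_job produces exactly those labels:

```bash
echo "some_metric 3.14" | curl --data-binary @- http://pushgateway.example.org:9091/metrics/job/some_job
```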
You can delete metric groups via Pushgateway's web interface or HTTP API.
By default, Pushgateway only stores metrics in memory and does not persist them across restarts. Use the --persistence.file command line flag to persist Pushgateway metrics across restarts.
With the persistence file option, Pushgateway will write the metrics to disk every 5 minutes. To change this, you can use the --persistence.interval configuration flag.
By default, Pushgateway runs on port 9091.
No. Pushgateway only stores the latest value of each metric that was pushed to it.
Alertmanager handles alerts generated by Prometheus. It manages the routing, grouping, and notification of alerts to various integrations such as email and webhooks.
Key features of Alertmanager include:
Grouping: Similar alerts can be grouped together to avoid overwhelming the users with redundant notifications.
Inhibition: Prevents certain alerts from firing if another specific alert is already open. This helps prevent flooding with redundant notifications.
Silencing: Administrators can silence certain alerts during maintenance or in response to known issues, preventing unnecessary notifications.
Routing: Alerts can be routed to different destinations based on certain criteria, such as severity level, alert type, or specific attributes.
Example: Your database goes down, and all services can no longer reach it. Prometheus' alerting rules were configured to send an alert for each service that cannot communicate with the database. As a result, many alerts were sent to Alertmanager. Alertmanager groups these alerts into one and sends a single alert/notification.
Example: An alert is firing about an entire cluster that is not reachable. Alertmanager is configured to inhibit all other alerts concerning the cluster if this alert condition is already firing. This prevents duplicate alerts/notifications from being sent that might be downstream from the actual issue.
Silences are a way to mute alerts for a given time. Silences are configured in the web interface of Alertmanager.
Example: Incoming alerts are checked to see whether they match all the equality or regular expression matchers of an active silence. If they do, no notifications will be sent out for that alert.
The following is an example configuration file:
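A minimal alertmanager.yml sketch showing the general shape of routing, receivers, and an inhibition rule (the webhook URL and matcher values are assumptions):

```yaml
route:
  receiver: default
  group_by: ["alertname", "cluster"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: default
    webhook_configs:
      - url: "https://example.com/alert-hook"

inhibit_rules:
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: ["alertname", "cluster"]
```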
Do not load balance traffic between Prometheus and Alertmanager. Instead, point Prometheus to a list of all Alertmanagers.
rate() - "rate of increase" - calculates a per-second increase of a counter as averaged over a specified window.
irate() - "instantaneous rate of increase" - calculates a per-second increase over the time window, only considering the last 2 points.
Pushgateway is a component in the Prometheus ecosystem that allows short-lived jobs to expose their metrics to Prometheus. It is an intermediary between short-lived job instances (which may not be consistently available for scraping) and Prometheus. Here's how Pushgateway works and what it does:
Job Persistence: Pushgateway allows short-lived jobs to push their metrics to it, rather than Prometheus scraping them directly. These jobs may be transient, such as cron jobs or batch jobs that run periodically and do not live long enough to be scraped.
Push Mechanism: Instead of the traditional pull mechanism used by Prometheus, where Prometheus scrapes metrics endpoints on targets, Pushgateway flips the model. Jobs push their metrics to the Pushgateway using HTTP POST requests.
Grouping and Labeling: Pushgateway allows jobs to attach labels to the metrics they push. These labels can help group and identify metrics from different instances of the same job or different jobs altogether.
Prometheus Compatibility: Pushgateway is designed to work seamlessly with Prometheus. Prometheus can be configured to scrape metrics from the Pushgateway, just like it would with any other Prometheus target.
Note the honor_labels scrape option. Because the Pushgateway proxies metrics from other jobs that usually attach their own job label to a group of metrics, you will want to prevent Prometheus from overwriting any such labels with the target labels from the scrape configuration.
where <JOB_NAME> is used as the value of the job label, followed by any number of other label pairs.
No. Pushgateway does not support a TTL or expiring metrics. Deleting metrics must be done via the web interface or HTTP API.
If you need aggregation or distributed counters for ephemeral jobs, consider an aggregating gateway designed for that purpose instead.
is responsible for handling alerts sent by client applications such as and then managing those alerts by grouping, deduplicating, routing, and sending them to various receiver integrations like email, webhook, , etc.
Integration: Supports integration with various notification systems and channels like email, webhook, , etc.
Grouping categorizes alerts with a similar into a single notification. The group is configured by a routing tree in the .
Inhibition suppresses notifications for certain alerts if certain other alerts are already firing. Inhibitions are configured through the Alertmanager .
Alertmanager is configured via command-line flags and a configuration file (YAML format). The can be found in the official docs, and the can be used to help build route trees.
Notifications sent to receivers are constructed via templates. Alertmanager ships with default templates, but they can also be customized.
By default, Alertmanager starts in high availability mode. To configure an Alertmanager cluster, use the --cluster.* flags (such as --cluster.listen-address and --cluster.peer).
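For example, a two-node cluster could be started like this (host names are placeholders):

```bash
# Instance 1
alertmanager --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-2.example.internal:9094

# Instance 2
alertmanager --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-1.example.internal:9094
```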
Prometheus best practices recommend exposing metrics in base units:

Category | Base Unit
---|---
Time | seconds
Temperature | celsius
Length | meters
Bytes | bytes
Bits | bytes
Percent | ratio (0-1)
Voltage | volts
Electric current | amperes
Energy | joules
Mass | grams

For example, expose a request duration in seconds (e.g., a hypothetical http_request_duration_seconds) rather than in milliseconds, and convert bits to bytes before exposing them.
Integrating Prometheus with remote storage systems allows you to achieve better scalability, durability, outsourcing of storage operations, or additional use cases for your Prometheus metrics.
Remote write allows Prometheus servers to push time series data to remote storage systems or other systems that can accept Prometheus's data format. This feature is particularly useful for long-term storage, data aggregation, or integrating Prometheus with other monitoring or analytics systems.
Scalability - By default, Prometheus stores all data on the local filesystem and thus is limited to the size of the single node's filesystem. Remote storage is especially helpful in scenarios with a large volume of metrics or if you need to retain data for a long period of time.
Long-Term Retention - Remote storage solutions are typically designed to handle long-term retention of metrics data efficiently. By using remote storage, you can retain historical metrics data for weeks, months, or even years without impacting the performance of your Prometheus server.
Cost Effectiveness - Some remote storage solutions, such as cloud-based object storage services like Amazon S3 or Google Cloud Storage, offer cost-effective storage options. Storing metrics data in these services can be more economical than storing it on local disks or in-memory storage solutions.
High Availability - Remote storage solutions often provide built-in redundancy and high availability features, ensuring your metrics data is resilient to hardware failures or other issues. This improves the reliability of your monitoring infrastructure.
Features - Prometheus focuses on storing data in its own labeled time-series model and does not offer a solution for processing data in other formats or data that is not in time-series form. Remote storage solutions may offer advanced querying, aggregation, rate limiting, and analysis capabilities that go beyond what Prometheus itself provides.
Multi-Tenant Support - If you're running a multi-tenant environment or providing monitoring services to multiple teams or customers, remote storage solutions can offer features for isolating and managing metrics data for different users or tenants.
Rather than trying to solve all these limitations, Prometheus allows you to integrate with third-party remote storage solutions that can compete to offer the best tradeoffs around storage features, scalability, redundancy, and cost.
Prometheus allows you to send all, or a subset, of the sampled data to a remote HTTP endpoint using a "remote write" protocol defined by Prometheus. Prometheus only buffers sampled data for a few seconds before forwarding it, so transfer to third-party systems is near real-time (NRT).
When Prometheus performs a remote write, it uses an adapter to send time series data in a format the third-party storage can understand. The data can then be read back (for use with PromQL) later using the remote read feature.
You can configure remote storage via the top-level remote_write and remote_read sections of the Prometheus configuration file.
The remote_write section allows for the definition of one or more URLs to which Prometheus should send sample data.
Additionally, you can configure many additional parameters, such as authentication, relabeling, parallelism, and batch sizes.
The remote_read section allows defining one or more URLs from which to read samples when executing PromQL.
Like the remote write, you can configure additional parameters for filtering and authentication.
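A minimal sketch of these sections in prometheus.yml (the endpoint URLs, credentials, and relabel rule below are placeholders):

```yaml
remote_write:
  - url: "https://metrics-store.example.internal/api/v1/write"   # placeholder remote-write endpoint
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/remote_write_password        # keep credentials out of the main config
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_.*'          # example: drop Go runtime metrics before sending
        action: drop

remote_read:
  - url: "https://metrics-store.example.internal/api/v1/read"     # placeholder remote-read endpoint
    read_recent: false          # recent data is served from local storage only
```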
Remote storage options include, but are not limited to, systems such as Thanos, Cortex/Mimir, VictoriaMetrics, InfluxDB, and M3DB.
The remote write and read protocols are based on protocol buffers ("protobuf") and are sent over HTTP. Requests and responses are Snappy-compressed to reduce their size on the wire.