Learn about what makes a great incident management tool and about 5 alternatives to the market leader, PagerDuty.
TL;DR: Here’s the shortlist of the top 5 best PagerDuty alternatives in 2024: PagerTree, OpsGenie, iLert, Splunk On-Call, and OnPage.
But PagerDuty can be expensive, has a steep learning curve, and its customer support can be lacking. We’ve compiled a list of the top 5 best PagerDuty alternatives, looking specifically at features, reliability, and pricing.
PagerDuty offers a robust incident management solution. However, there are many reasons why people might explore alternatives.
Cost - PagerDuty paid plans start at $25/user/month. Cost-conscious organizations will be looking for more cost-effective alternatives. (Note: when comparing pricing below, we will compare plans that support Single Sign-On.)
Feature Set - PagerDuty offers many features. Some might even call it feature bloat. Most organizations only need the core subset, like on-call, escalations, and notifications. Why pay for features you don’t use?
Learning Curve - PagerDuty can be complex to set up and learn. Most teams will want a solution that is easy to set up and simple to understand. Time to go live is critical for small and agile teams.
Customer Service - PagerDuty is the largest and most well-known brand for on-call incident management software. However, their customer service and support are lacking, especially for small accounts. Businesses will be looking for the best customer support possible.
When evaluating PagerDuty alternatives, consider the following criteria to make the best decision possible:
Ease of Use - Is the tool easy to use and understandable? How easy is it to onboard new users? You don’t want to be confused when figuring out who is on-call, especially during an incident.
Integration Capabilities - Ensure the tool integrates with your team’s existing toolset.
Scalability - Can the tool scale with your organization? Price, feature set, and onboarding should all be weighed when evaluating scalability.
Support and Training - Does the tool have easy-to-read support and training documentation?
By carefully assessing these criteria, you can identify the best PagerDuty alternative for your team's needs.
Here's our list of the top 5 best PagerDuty alternatives in 2024.
PagerTree’s Tagline: “On-Call. Simplified. - PagerTree empowers teams to share on-call responsibility and respond faster when incidents occur.”
PagerTree’s value proposition: On-call scheduling, escalations, and alert notifications starting at $15/user/month.
OpsGenie has core features like on-call scheduling, escalations, and notifications. Their “Standard” price plan ($23/user/month) includes features like integrations and reports. Advanced features like live call routing are provided for a $10 upcharge per number.
OpsGenie was acquired by Atlassian in 2018. Since then, its reliability has been called into question. In April 2022, OpsGenie had a 2-week (that’s not a misprint, yes, 14 days) outage. Unlucky customers were forced to move to an alternative solution. Otherwise, they had to wait until their account was prioritized for restoration.
OpsGenie Tagline: “On-call and alert management to keep services always on.”
OpsGenie Value Proposition: “Centralized alert management starting at $23/user/month”
iLert offers all the core features like on-call scheduling, escalations, and notifications. Additionally, iLert offers status pages and live call routing for an additional fee (+$5/user/month).
If you are a European customer or have strict data storage requirements, iLert might be for you. iLert is made and supported in Germany. If you are looking for US-based customer support, you could wait 12+ hours for responses.
iLert tagline: “One platform for alerting, on-call management, and status pages.”
iLert value proposition: “On-call management and status pages starting at $24/user/month”
Splunk offers all the core features like on-call schedules, escalations, and notifications. Splunk also bundles enterprise-focused features like Real User Monitoring (RUM), Log Observability, and Application Performance Monitoring (APM). The extra features come at the extra cost of complexity. Splunk can have a higher learning curve than other tools. Splunk doesn’t publish pricing, so you know it will be expensive.
OnPage's primary benefit is that it offers HIPAA-compliant notifications. If you are looking for a healthcare-centric tool, OnPage could be for you. If you are not in the healthcare space, we suggest looking at another tool. OnPage's user interface and on-call scheduler lack modern ease of use. Only their marketing has adapted to target the IT space. Additionally, pricing is only offered in yearly installments.
OnPage tagline: “Rise Above the Clutter® Elevate urgent notifications and facilitate secure team collaboration in critical situations.”
OnPage value proposition: “HIPAA compliant on-call scheduling and notifications starting at $29/user/month paid annually.”
To keep this list short, we only reviewed the PagerDuty alternatives we thought were best. We also compiled a list of a couple of other PagerDuty alternatives. You may find them interesting, but they didn’t quite make the cut.
On-call alert and notification tools focus on the core feature set of on-call scheduling. They handle escalations and notifications.
Incident Management and Analysis tools will generally be integrated into on-call alert and notification tools. They will also provide tools for retrospectives and postmortem analysis.
There are a few open-source alternatives that we want to mention. As always, with open source, these options might (or might not) be supported. They will have to be self-hosted.
In conclusion, choosing the right PagerDuty alternative depends on your team's specific needs. Each alternative mentioned above brings its own strengths to the table. Evaluate your requirements. Explore product features. Make an informed decision that’s best for your team.
Note: This list is based on features, user feedback, and industry trends as of 2024. Always check for the latest updates and reviews before making a decision.
PagerDuty is a leading incident management platform that has been around since 2009. PagerDuty aims to streamline an organization's incident response process. It has features like on-call scheduling, escalation policies, and alerts.
Core Features - The tool, at a minimum, should have on-call scheduling, escalation policies, and notifications. Additional features like live call routing and reporting are also a plus.
Reliability - Check the tool’s historical uptime on its status page (usually found at https://status.domain.com). The tool should have minimal downtime. The tool’s vendor should also communicate clearly and effectively during an outage.
PagerTree is the best overall alternative to PagerDuty. It is efficient for on-call scheduling and reliable for notifications. Additionally, it’s the most cost-effective solution on our list at $15/user/month.
PagerTree excels at the core features. These include drag-and-drop on-call scheduling, escalation layers, and reliable multi-channel notifications. Extra features like live call routing and reports are provided at no extra charge. PagerTree has ample documentation, scalable pricing, and a reliable track record. It provides the best all-around functionality at a fraction of the price.
You can start a 14-day free trial here:
OpsGenie is the best-known competitor to PagerDuty. They offer a streamlined approach to alert management.
iLert is the new kid on the block, but we actually really like this tool. iLert offers the modern user interface and reliable product you would come to expect from a German startup.
Formerly known as VictorOps, Splunk On-Call is an incident management tool. It caters to enterprise organizations.
Splunk’s tagline: “Splunk On-Call - Make expensive service outages a thing of the past. Remediate issues faster, reduce alert fatigue, and keep your services up and running.”
OnPage offers “incident alert management”. They primarily target the healthcare industry, including hospitals, doctors, and nurses.
Start your 14-day free trial of PagerTree today!
Welcome to an in-depth exploration of the Linux file system! In this comprehensive guide, we'll demystify the various directories found in a typical Linux distribution, explaining their purposes and functionalities. Whether you're a seasoned sysadmin or a curious newcomer, this article will enhance your understanding of the backbone of Linux's structure and operation.
The /bin directory is a fundamental part of the Linux file system, playing a crucial role in system functionality. It contains essential user binary files, the basic programs and utilities necessary for the system to operate and for users to interact with it. These binaries include common commands like ls, cp, and mv, which are indispensable for file management, and bash, the default command-line shell for many Linux distributions. Unlike other directories that house more specialized or user-installed software, /bin is reserved for these core components, ensuring that the system remains operational and accessible, even in single-user modes or when other file systems are not yet mounted.
Similar to /bin, but this directory contains applications that only the super user (hence the prefixed s) will need. Applications in this directory need to be run with the sudo command. Typically this directory contains tools that can install, delete, and format. As you can imagine, some of these programs can cause system damage if used improperly.
The /etc directory is crucial for system configuration. It contains all the configuration files required by the system and other applications. Unlike /bin or /sbin, /etc does not hold executable programs, but rather static configuration files. Here, you'll find everything from user account information (in /etc/passwd), to network configurations, to the services started at boot. It's like the settings menu of your operating system, but in a folder. You can remember it with "everything to configure".
Short for "device", the /dev directory is a bit unique. It's where Linux stores device files, representing hardware components or drivers. For example, /dev/sda typically represents the first hard disk in your system. These are not regular files - they are special files that help the system communicate with its hardware. It's a crucial part of the Linux file system, though not something a regular user would interact with directly.
This is a virtual directory, meaning it doesn’t exist on your disk. It’s dynamically created by the system. The /proc directory contains information about system resources and the status of the operating system. Each running process has a folder here named by its process ID. You can peek into these folders to see detailed information about each process, but remember, it's mostly for viewing, not modifying.
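For example, you can explore /proc with ordinary file tools (PID 1 is typically the init process; exact file contents vary by distribution):

```bash
# Kernel and memory info exposed as virtual files
cat /proc/version
cat /proc/meminfo

# Inspect a single process by its PID
ls /proc/1
cat /proc/1/status
```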
Stands for "variable files". /var is where files that are expected to grow are stored. This includes things like logs (/var/log), mail (/var/mail), and spool files. It’s a dynamic folder, and its contents change as the system runs. This is where you'll look when troubleshooting or when trying to understand more about what's happening on your system.
Just as it sounds, /tmp is for temporary files. When applications or the system need to store a file temporarily, it goes here. These files are usually deleted upon reboot or after a set period. It’s a scratch space for the system and applications.
Short for "Unix System Resources," /usr is one of the largest directories. It contains additional user applications and their files. Think of it as a secondary hierarchy for user utilities and applications. /usr/bin, for example, has many user commands, while /usr/lib contains libraries for /usr/bin and /usr/sbin.
Back in the early days of UNIX, the /usr directory was where users’ home directories were kept. However, now /home is where users keep their stuff.
This is where the personal folders of each user on the system are located. If your username is “user”, you'll find your personal files in /home/user. It’s akin to the Users directory in Windows. This is where you'll spend most of your time: your documents, downloads, pictures, and personal data reside here.
The /boot directory contains the files needed to start up the system - the boot loader, the kernel, and other files needed during the boot process. It's small but essential. Without these files, the system can't start properly.
Similar to /usr/lib, /lib contains essential libraries needed for the binaries in /bin and /sbin. These libraries are fundamental to the operation of the system and the applications running on it.
An abbreviation for "optional", /opt is used for storing additional software and packages that are not part of the default installation. Often, third-party applications are installed here. It's a common directory for software that’s not included in the standard distribution.
Short for "mount". This directory is a temporary mount point where system administrators can mount file systems before they are integrated into the system's file structure. Think of it as a place to plug in external resources temporarily.
Similar to /mnt, /media is used for mounting removable media like USB drives, CD-ROMs, etc. It’s a more modern version of /mnt, and many systems automatically mount external drives here.
This stands for "service". /srv contains data related to services offered by the system. For instance, if your Linux machine is hosting a website, the website data might live here. It's not used in all distributions, but when it is, it's for service data.
And there you have it, a brief rundown of the most common directories in the Linux file system. Each serves a distinct purpose, and understanding them can be crucial for effective system management and navigation.
In this article we will help you understand system monitoring, what you should look for in your system monitoring tool, and give you our top 7 best APM tools.
Monitored systems can include:
Servers
Networks
Applications
Configurations
APM tools allow system administrators to instrument, collect, and analyze crucial operational insights to keep systems operating at their peak.
We have compiled a list of the top 7 industry-leading system monitoring tools using 5 key factors to evaluate and scrutinize APM tools.
Here are the top 7 APM tools using our key criteria:
When evaluating an APM tool, consider these 5 key factors to ensure you are picking the right tool to monitor your systems.
Reliability: Monitoring software's primary role is to provide consistent and accurate data on your system's health and performance. The software should have a proven track record of minimal downtime and accurate measurement, reporting, and alerting.
Scalability: APM tools must be able to grow with your service, handling an increasing number of devices and metrics without a drop in performance.
Integrations: The monitoring system you choose should easily fit into your existing technology stack. Its ability to integrate smoothly with other tools and platforms can significantly enhance its utility and the benefits you gain from it.
Pricing: APM tools range in pricing anywhere from free to hundreds of dollars. Pricing should align with your budget and the value it delivers. Look for transparent pricing models that scale sensibly with your usage.
Ease of use: System monitoring software should provide a user-friendly interface, clear documentation, and responsive customer support. This will dramatically improve the experience of setting up and maintaining your monitoring solution.
99.8% availability
700+ built-in integrations
Watchdog - a built-in intelligence layer that continuously analyzes billions of data points
“Advertised” starting price of $54/month per host.
PRTG's APM tool at a glance:
On-premises and hosted options
Failover node system for high availability
Excellent usability on desktop, mobile, and web.
Maintenance plans to maintain services
$159/month for hosted services
$2,149 one-time fee for a perpetual license
LogicMonitor's APM tool at a glance:
2000+ integrations
Industry-leading 99.9% availability
Active Discovery feature monitors changes in your environment
Host of prebuilt dashboards
Plans start at $22/month per resource (device)
New Relic's pricing starts at $10/month for the first user, with each additional user costing $99/month.
New Relic's APM tool at a glance:
99.8% uptime
750+ integrations
The APM tool gives access to all of New Relic's tools, like AIOps and Security Monitoring
New Relic AI integrated into the system to assist users
Pricing starts at $10/month but quickly scales up.
AppDynamics’ APM tool at a glance:
99.5% uptime
Business-oriented service monitoring
Access to Cisco University
Starting price of $33/month
Elastic's APM at a glance:
99.96% historical uptime
Self-managed and cloud-managed services
Application dependency mapping
Starting price “as low as $95/month”
AppOptics offers full-stack visibility into your application as well as an auto-instrumenting application service, allowing you to quickly diagnose issues within your environment. They present data in a simple, easy-to-understand way that allows you to find issues quickly and dig deeper into them with more detailed views. AppOptics boasts integrations into multiple on-call software tools, allowing you to be notified of any issues 24/7.
SolarWinds does not offer a publicly available uptime SLA, nor do they have a public historic uptime.
AppOptics' service for application monitoring includes infrastructure system monitoring and starts at $24.99/month per host. They sell hosts in packs of 10, meaning the minimum monthly cost for AppOptics is $249.90/month.
AppOptics APM tool at a glance:
Simple and efficient data presentation
150+ integrations
Built-in on-call alerting integrations
Non-public SLA and historic uptime
$24.99/month per host
Performing ping tests involves checking your internet connectivity. Reliable addresses to ping include:
Failure to receive a response from these addresses may indicate a problem on your end.
Interpreting ping results is crucial. Analyzing server hostnames, response times, Time to Live (TTL), and packet loss provides insights into network performance. Troubleshooting connection issues becomes more effective when armed with this information.
Below is what the ping command will return:
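Illustrative output from a Windows machine (the hostname, IP address, and timings here are examples):

```
C:\> ping google.com

Pinging google.com [142.250.72.46] with 32 bytes of data:
Reply from 142.250.72.46: bytes=32 time=14ms TTL=57
Reply from 142.250.72.46: bytes=32 time=13ms TTL=57
Reply from 142.250.72.46: bytes=32 time=15ms TTL=57
Reply from 142.250.72.46: bytes=32 time=14ms TTL=57

Ping statistics for 142.250.72.46:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 13ms, Maximum = 15ms, Average = 14ms
```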
Host name confirmation: The first line displays the server's hostname translated to an IP address. It also confirms an active connection to the server was made.
Bytes sent to server - The number of bytes sent to the server.
Response times - Total roundtrip time for the response to return.
TTL (Time to Live) - The number of router hops the packet may traverse before being discarded.
Ping Statistics - Overall statistics for the ping test. It includes number of packets sent, received, and lost.
Approximate Round Trip Times - Minimum, maximum, and average times for the ping test. Higher times indicate a poorer quality connection or servers that are far away.
Request Timed Out: There is a problem in establishing a connection. This occurs when the destination host is either non-existent, powered off, or disconnected from the network.
Firewall Impact: Firewalls, based on port numbers and IP addresses, may permit or restrict traffic. In some instances, ping might be blocked as a precaution against potential reconnaissance by malicious actors.
Destination Host Unreachable: The "destination host unreachable" error points to a failure in finding the route to the intended destination. This could emanate from issues within the local host or the default gateway. To resolve this, check your IP settings and verify the default gateway address.
Unknown Host: This error indicates a challenge in translating the hostname to the corresponding IP address, suggesting a potential DNS server problem.
In addition to the above errors, there are instances of packet data loss, where some requests receive replies while others do not. Possible culprits for this issue include malfunctioning network cards, damaged cables, or problems with switches and routers.
A postmortem describing the issue, root cause, and remediation of our outage on July 30, 2023 00:30 - 01:15 (UTC)
During the migration to Fly.io's v2 platform, the provided command (migrate-to-v2) times out if a Postgres cluster doesn't replicate and fail over fast enough.
The migrate-to-v2 command first puts the database in a read-only state. When the timeout occurs, the command fails to remember to put the database back in a writable state.
After the database is read-only, new and existing connections will not be able to write to the database. This caused PagerTree to functionally fail for approximately 45 minutes on July 30, 2023 from 00:30 -> 01:15 UTC.
The Postgres cluster can be put back into a writable state with the following commands:
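A sketch of those commands, pieced together from the timeline below (application_db_name is a placeholder for your application's database name):

```sql
-- Applies only to the current connection, not to new ones:
SET default_transaction_read_only TO off;

-- The authoritative fix: applies to all new connections.
ALTER DATABASE application_db_name SET default_transaction_read_only = off;

-- Optionally terminate lingering app sessions so they reconnect in read/write mode:
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE application_name LIKE '/app%' OR application_name LIKE 'side%';
```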
Note that the SET command will only work for the current connection. You need the ALTER DATABASE command to solve the root cause.
To fix this, you need to connect to the Postgres cluster and tell it to forget the orphan VM.
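Fly's Nomad-era Postgres clusters were orchestrated by stolon, so the cleanup might look something like the sketch below (the exact stolonctl flags depend on your cluster configuration; the keeper ID comes from the sentinel warning logs):

```bash
# From a shell on one of the Postgres VMs (e.g. fly ssh console):
stolonctl status                    # identify the dead keeper
stolonctl removekeeper <keeper_id>  # unregister the orphaned VM
```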
All times are UTC and any references to communication or actions taken by PagerTree were performed by Austin Miller.
Tuesday, July 25th, 2023 at 15:30 we attempted to migrate our staging Postgres cluster and found errors. We ran into the timeout error (but didn't understand it as the root cause), and once the cluster had replicated, we used the migrate-to-v2 troubleshoot command to kill the Nomad VMs and mark the application as v2. Additionally, we started troubleshooting why the staging database was left in a read-only state. The psql command SHOW default_transaction_read_only; showed the application's database in a read-only state. We restarted the database using fly postgres restart -a <postgres_fly_app_name> and killed any active sessions using select pg_terminate_backend(pid) from pg_stat_activity where application_name like '/app%' or application_name like 'side%';. The database went back into a read/write state and we thought everything had been fixed.
Tuesday, July 25, 2023 at 16:32 (almost in parallel to the previous bullet point) we reached out to the Fly support team stating the issue we had found with the migration of the staging Postgres cluster. At 17:06 we replied, reporting that killing the sessions had fixed the issue. Sam Wilson from Fly support responded 5 hours later at 21:37, reporting they were glad we were able to work around it, that our app had been successfully migrated to v2, and that they were also seeing read/write enabled on the staging primary.
Thursday, July 27 at 03:30 we attempted to perform the migrate-to-v2 command, which failed (unspecified error code). Applications were turned off between 03:30 and 03:40, resulting in 10 minutes of application downtime.
Thursday, July 27, 2023 at 23:15 we were reminded via an automatic email that our production Postgres cluster was still scheduled for an automatic upgrade the following week.
Friday, July 28, 2023 at 22:30 we attempted the migration again without success, but now with an error "Page Not Found". There was approximately 5 minutes of downtime for the PagerTree app.
Friday, July 28, 2023 at 22:52 we emailed the Fly support team with the new "Page not found" error. We asked if they could look into their logs to see what could be happening. We also expressed concern for the automatic migration of apps the following week when the migration command was failing.
Saturday, July 29 at 05:09 Brian Li from the Fly support staff suggested we try deleting a volume from an orphaned VM, then using LOG_LEVEL=debug on the migrate-to-v2 command. We deleted the orphan volume.
Sunday, July 30 at 00:30 we attempted to migrate the cluster with Brian Li's suggestion. The PagerTree application was taken offline and a service outage began. The migration looked to be going smoothly now that the orphaned volume had been deleted.
Sunday, July 30 at 00:35 the migrate command had timed out. At this point the new v2 machines had been created but were still replicating from the leader. We decided to wait until replication had completed before running the troubleshooting command and deleting the v1 VMs (similar to what we had done with our staging Postgres cluster, bullet #2).
Sunday, July 30 at 00:40 replication had completed, so we tried to bring the PagerTree application back online. We immediately started to see errors in Honeybadger: ActiveRecord::StatementInvalid: PG::ReadOnlySqlTransaction: ERROR: cannot execute INSERT in a read-only transaction.
The following describes the actions taken between Sunday July 30 at 00:40 and 01:16 without specific timestamps.
We tried killing the connections using select pg_terminate_backend(pid) from pg_stat_activity where application_name like '/app%' or application_name like 'side%'; in hopes that a new connection would be in read/write mode. This failed, and at this point we knew the database was in a read-only mode.
Logging into the Postgres cluster and running SHOW default_transaction_read_only; on the application database confirmed our suspicion. We tried running SET default_transaction_read_only TO off; to fix the issue. With a successful command run, we believed the database would now be in a read/write mode. We would later learn this only sets the option for the current connection.
We restarted the PagerTree application but again saw the errors regarding the read-only transactions.
We searched the internet for how to make the appropriate change for all new connections. After searching for 5 or 10 minutes we found a working solution: alter database application_db_name set default_transaction_read_only=off;
We restarted the PagerTree application and confirmed that the database was now in a writable state.
Sunday, July 30 at 01:43 we declared the incident resolved.
Impact - The PagerTree application was down for 46 minutes and impacted all customers and integrations. Incoming requests, alerts, and notifications were all impacted during these 46 minutes.
Root Cause - migrate-to-v2 timeout and database left in read-only state.
Recurrence - This also happened in our staging Postgres upgrade, but we thought a Postgres restart and killing existing connections fixed the issue.
Corrective actions - alter database application_db_name set default_transaction_read_only=off; is the authoritative fix for the read-only state of the database.
Future Monitoring - We have added a database write check to our monitoring. A write test is now performed every minute.
Tutorial showing how to implement multi-tenant single sign-on (SSO) using Ruby on Rails, Devise, and SAML. Works with identity providers like Okta, Google, Azure, etc.
Recently while scrolling on Twitter I saw this tweet by John Nunemaker.
In this blog post, I want to describe how we implemented multi-tenant SSO at PagerTree to work with any SAML2 identity provider (Okta, Google, Azure, etc.).
STOP HERE - This is not a Copy Pasta™ blog post. Some things are very specific to the PagerTree implementation. You'll need to adapt the code to work for your project. This post is to help do most of the heavy lifting.
This blog post will make a lot of assumptions about its implementation (it's a highly niche implementation).
This implementation uses the emailAddress attribute of SAML as the primary identifier for Users.
We've snipped a lot of PagerTree specific code for the purposes of brevity and staying focused.
One of the most confusing things in SSO implementations is that there is no "standard" naming convention. I have seen many aliases and synonyms all over the web.
idp - Identity Provider (IdP) - Your customer's authentication provider (ex: Okta, Google, Azure, etc.)
idp_entity_id - The unique tenant identifier in the IdP's database.
idp_sso_service_url - The URL your app needs to redirect the user to with the AuthNRequest. It will be at the IdP's domain.
sp - Service Provider (SP) - Your app, the one you are building (ex: PagerTree)
sp_entity_id - A unique tenant identifier in the SP's database.
assertion_consumer_service_url - The endpoint on the SP where the IdP should send the user after they have authenticated.
authnrequest - Programmatic authentication request.
slo - Single Logout
saml - Security Assertion Markup Language - An XML-based standard for exchanging authentication and authorization data between parties, enabling Single Sign-On (SSO) functionality.
If you are not familiar with SSO that's ok, I am going to go over the basic ideas (a full explanation is outside the scope of this article).
If you've ever logged in to an app using your Microsoft, Google, or work account, it likely used SAML to exchange information about your authentication. The IdP is responsible for the authentication of users (aka verifying users are who they say they are).
The basic workflow looks like this:
The user comes to the SP application (aka your application).
The user provides the SP application with the authentication email (usually their work email).
The SP looks up the user and the IdP configuration this user is associated with. The user is then redirected to the IdP (idp_sso_service_url) with an authentication request in the format of an AuthNRequest.
At this point, the user must provide valid credentials to the IdP. Once valid credentials are provided and the IdP confirms the user should have access to the SP application, the user is redirected back to the SP application at the assertion_consumer_service_url.
The SP is then responsible for granting access to the application based on the trusted response.
SP initiated - When a user comes to your app and clicks "Login using SSO" providing you their email address. This is probably the most common workflow and was described above.
IdP initiated - When a user logs in via their "app portal" from the IdP. Not very common (I've never used it myself), but we need to support it. It doesn't change the code, but I am including it here for completeness.
We need to add a model to hold each tenant's SSO configuration(s). I will briefly explain what each property is:
account_id - The tenant this belongs to.
meta - Free form hash where we can store any future data.
sp_entity_id - The unique identifier for this configuration.
name - A user friendly name so they can remember this configuration (ex: "Okta Config", "Okta Dev Config")
vendor - Enum identifying the IdP vendor. When debugging with customers why their configuration doesn't work, it's helpful to know the vendor (some vendors do some wonky stuff).
metadata_url - The URL to the IdP's metadata XML.
metadata_xml - The raw metadata XML (some vendors don't provide a metadata URL). The user should be able to copy and paste it into our app.
settings - A JSON representation of the parsed XML.
assertion_response_options - A hash of configurable options (per tenant) that we can pass into the Ruby SAML library.
Our IdPConfig model will hold an SSO configuration. Each account can have many IdPConfigs, but there will only ever be 0 or 1 active IdPConfig for an account at a time.
A couple of important notes:
Line 78 - We use SecureRandom.hex and not a UUID. Azure does not like dashes in the sp_entity_id; a hex key will work across all known providers.
Line 95 - We use OneLogin::RubySaml::IdpMetadataParser to parse the XML provided by the user or the IdP's metadata_url.
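Here's a sketch of what the model can look like, assuming ActiveRecord and the ruby-saml gem (method names and validations are illustrative, not PagerTree's exact code):

```ruby
# app/models/idp_config.rb
class IdpConfig < ApplicationRecord
  belongs_to :account

  before_validation :generate_sp_entity_id, on: :create

  validates :name, presence: true
  validates :sp_entity_id, presence: true, uniqueness: true

  private

  # Azure rejects dashes in the sp_entity_id, so prefer a hex key over a UUID.
  def generate_sp_entity_id
    self.sp_entity_id ||= SecureRandom.hex(16)
  end

  # Parse the IdP metadata (pasted XML or fetched from metadata_url)
  # into the settings hash consumed by ruby-saml.
  def parse_metadata
    parser = OneLogin::RubySaml::IdpMetadataParser.new
    self.settings =
      if metadata_xml.present?
        parser.parse_to_hash(metadata_xml)
      else
        parser.parse_remote_to_hash(metadata_url)
      end
  end
end
```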
The important paths are as follows:
/sso - Where the user comes in the SP initiated workflow. We ask them for their email here.
/public/saml/consume - Where the IdP redirects the user after they have provided their credentials to the IdP. This is the assertion_consumer_service_url. The payload of the request will be the assertion of who the user is.
/public/saml/metadata - A convenience endpoint for users to get information in XML format about the SP. IdPs sometimes will ask for this. It's a programmatic way for the SP to provide the IdP with details like the assertion_consumer_service_url.
/public/saml/slo - The IdP will make a request here if the user is logged out. This is known as single logout. We need to destroy the user's session when this URL is called.
You'll need to read through the sessions controller, but I will give a brief summary:
Line 7 - before_action :set_idp_config - Set the IdP Config for SSO methods.
Line 10 - def destroy - Override the Devise destroy method. Send the IdP a logout request if our user logs out from our app.
Line 27 - def sso - Render the SSO page to capture the user's email.
Line 82 - def saml_callback - Process the IdP response. This is the assertion_consumer_service_url.
Line 91 - if !user - Create a user if they don't exist in our database but were authenticated by the trusted IdP. This can occur when an SSO administrator grants access to your application and it's the user's first time logging in to your app.
Line 118 - def saml_metadata - The convenience method providing metadata that describes the SP configuration.
Line 126 - def saml_logout - Process the IdP initiated single logout request.
Line 164 - def verify_can_username_password - SSO users should be forced to use SSO.
So in /app/controllers/accounts_controller.rb we have something like this:
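A sketch of that guard, under the assumption that switching into an SSO-enforced account requires a session that was authenticated via that account's IdP (all names here are illustrative):

```ruby
# app/controllers/accounts_controller.rb
class AccountsController < ApplicationController
  before_action :verify_sso_session, only: [:switch]

  private

  # Users on SSO-enabled accounts must always authenticate via SSO,
  # so block username/password sessions from switching into them.
  def verify_sso_session
    account = current_user.accounts.find(params[:id])
    return if account.active_idp_config.blank?
    return if session[:sso_account_ids]&.include?(account.id)

    redirect_to sso_path, alert: "This account requires single sign-on."
  end
end
```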
As service providers, we understand that 100% uptime for our service isn't an achievable goal, but we do everything in our power to provide our customers with the best service and uptime possible. We implement tools and processes to allow ourselves the ability to respond to issues before they affect our customers.

One type of tool we implement is system monitoring tools. Having access to all of our systems in a clean, easy-to-read dashboard helps us see trends and issues before they become serious problems. Understanding our systems and resolving issues before our customers see them helps improve customer satisfaction and service uptime and helps us meet our SLAs. But what is system monitoring? System monitoring, also known as “application performance monitoring” (APM), is the process of tracking and evaluating the performance and health of a service provider’s infrastructure.
Datadog is a widely used application performance monitoring tool, thanks to its extensive list of over 700 integrations that make it highly adaptable for any stack. The company promises 99.8% availability, ensuring its tools are available when you require them the most. Datadog is unmatched in the services it provides, encompassing everything from log management to full infrastructure monitoring.
They champion their “code level distributed tracing tool, Watchdog,” which enables users to “detect and resolve root causes faster,” improving application performance and security. This Watchdog AI automates root cause analysis, helps detect anomalies, and minimizes downtime.
The Datadog APM plan advertises a cost of $36/month per host, but in order to utilize the APM tool you are required to also have an infrastructure monitoring plan. These start at $18/month per host. This brings the cost of Datadog (on its face) to $54/month per host at its base level. Datadog's APM tool at a glance:
PRTG offers multiple solutions for system monitoring and APM, with both on-premises and hosted options. In addition to these solutions, PRTG offers different payment methods to meet your needs, standing out as one of the only tools to offer a one-time perpetual license.
The PRTG Monitoring system stands out from many APM tools due to its “failover node” system, which allows 4 simultaneous PRTG core servers to be active on any one machine at a time. This failover system could potentially offer the highest level of availability, and PRTG proclaims its system as “high availability” thanks to this feature. PRTG boasts its “excellent usability” on all platforms, including web, desktop, and mobile, making it one of the few to actually brag about how well its user interface works.
PRTG offers two pricing options: perpetual licensing for its on-premises services and per-month/annual pricing for its hosted services. We will look at one price point for each.
Hosted: Services start at $159/month and cover (according to PRTG) 50 devices.
Perpetual license: One-time fee starting at $2,149 per license, covering 50 devices.
Maintenance plans: To stay up to date with services and tech support, PRTG offers maintenance plans that cost ¼ of the original price of your selected plan. This is not required.
When it comes to integrating an APM tool into your system, LogicMonitor's 2000+ integrations and self-proclaimed “lightning-fast implementation” make it the go-to system monitoring tool for any business looking for fast deployment and smooth integrations. LogicMonitor’s Active Discovery feature actively monitors changes in your environment, automatically discovering and monitoring new virtual machines, volumes, or devices.
LogicMonitor comes with a host of pre-built dashboards offering immediate insight into the data that is most relevant to you and your industry. With an industry-leading 99.9% availability, LogicMonitor is the choice for those looking for the highest level of promised availability from a hosted service.
Pricing for LogicMonitor's infrastructure monitoring plan starts at $22/month per resource (device) with a minimum of 30 resources. Additional add-on plans are available for log retention, AI-powered anomaly detection, and application traces.
New Relic's APM tool, APM 360, not only monitors your applications with a 99.8% uptime SLA but also gives you access to a slew of New Relic tools, including security monitoring and AIOps. With over 750 integrations and an industry-leading 8 supported programming languages, APM 360 dramatically increases usability for developers. It features an automatic service map that visualizes dependencies within your systems, simplifying the understanding of complex data structures. APM 360 excels in transforming difficult-to-comprehend data into easily digestible visuals, ensuring that insights are accessible and actionable. New Relic AI is one of their newest features, allowing users to find and fix issues, get quick insights, translate data and queries, and instrument their system with ease.
New Relic offers dynamic pricing to meet the needs of every business, large or small. They utilize a usage-based formula to determine pricing on all plans: users + data ingested = price.
AppDynamics, the self-proclaimed leader in the APM tool sphere, offers a unique business-oriented system monitoring service to its clients, connecting data, development teams, and IT teams to business results. Given their mission statement includes “AppDynamics is on a mission to help companies see their technology through the lens of the business,” this would make sense. By offering observability across entire organizations, implementing end-to-end observability on a code and transaction level, and presenting data at a level at which C-suite and devs can both understand, AppDynamics upholds that mission.
AppDynamics promises one of the lowest availability rates in the industry: 99.5% uptime. While this number pales in comparison to LogicMonitor's 99.9% uptime, it gives AppDynamics more room to test new features and expand its offering for customers.
As an added bonus to AppDynamics, all paid plans give users “unlimited standard access” to Cisco University.
Pricing for the AppDynamics APM tool starts at $33/month per core. Users should be aware of additional add-on features when researching their tool.
Elastic offers both self-managed and cloud-managed services to its customers, giving them the flexibility to choose the service they want. Elastic offers a clean and simple view of Application Performance Monitoring, with clean and easy-to-understand dashboards. Their application dependency mapping helps teams identify application problems quickly by automatically visualizing the relationships between services inside your application ecosystem. Partnered with the rest of the Elastic Stack, you will gain a level of observability like none other.
While Elastic does not present a public availability percentage, it does have a historically amazing uptime. Within the last six months, its tool has not fallen below 99.96% uptime, which shows its ability to deliver.
Elastic APM offers a free plan for self-hosted monitoring but requires you to contact their sales team for licensing. For cloud-hosted services, Elastic's pricing is advertised as “as low as $95/month.” Pricing is based on your cloud production configuration, but Elastic does offer a pricing calculator to help you determine your actual cost.
SolarWinds' APM tool, AppOptics, delivers effective and efficient application performance monitoring for the needs of any business, small or large. They boast cost-effective scaling to match the growth of your business.
System monitoring or APM tools are essential for maintaining uptimes, detecting service failures, and evaluating performance. Whether you’re looking for a service with high uptimes, every integration under the sun, or unmatched services, there is a tool that fits your needs. When selecting an APM tool, factors like reliability, scalability, integration, cost, and usability must be considered to meet the current and future needs of your business. System monitors are only one tool in your arsenal to maintain service levels and improve overall performance. Most organizations partner these tools with on-call and incident alerting tools like PagerTree to ensure their teams are being notified when incidents occur.
The ping network test, a core utility since the 80s, plays a crucial role in confirming connectivity between IP-networked devices. In this guide, we'll delve into what the ping command is, how to run a ping network test, common IP addresses to ping, interpreting results, and troubleshooting errors.
Ping is a command available on Windows, macOS, and Linux that sends data packets to a specific IP address, gauging the existence of connectivity between devices. Originating from sonar technology, where a sound wave is emitted and an echo is awaited, ping measures the round-trip time for data requests, revealing network health and potential issues.
Executing a ping test varies by operating system. For Windows, open the Command Prompt, type "ping," and enter the desired IP address or domain. Mac users can use Network Utility, while Linux users employ the Terminal and the ping command for more in-depth analysis.
Cloudflare (1.1.1.1 and 1.0.0.1)
Google DNS (8.8.8.8 and 8.8.4.4)
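For example (flag syntax differs slightly between operating systems):

```
# Windows (sends 4 echo requests by default)
ping 1.1.1.1

# macOS / Linux (-c limits the request count)
ping -c 4 8.8.8.8
```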
Additionally, we ran into another issue of a growing WAL log. The WAL log is what allows replica databases to catch up to the leader. If the leader believes a replica has not caught up, it will continue to keep the WAL log around; this can fill up the database's entire hard drive, causing the database cluster to fail.
In the monitoring page of your app, you might see output like this: 2023-07-31T16:52:25.689Z WARN cmd/sentinel.go:276 no keeper info available {"db": <db_id>, "keeper": <keeper_id>}. What this is trying to tell you is that the leader thinks there is a replica somewhere out there that hasn't been updated. In our case, an orphan (dead) VM was never unregistered.
Monday, July 24, 2023 at 22:04 the PagerTree team was notified by an automated email that Fly.io would need to migrate our staging and production Postgres apps from the deprecated Nomad (v1) to the Machines (v2) platform the following week. It was advertised that the command flyctl migrate-to-v2 would handle the migration, but in our experience we had run into issues during upgrades on Fly.io. We decided to proactively upgrade the application so we could address any issues ahead of time.
Thursday, July 27 at 03:58 we reported our findings to Fly. The Postgres cluster was also left in a strange state (orphaned VM) with lots of errors. (This would later be found to be the cause of the WAL log issue.) We asked the Fly support team to look into the issue and advise.
Thursday, July 27 at 17:50 UTC we asked for an update on the ticket since we noted our database hard drive filling up and we had not yet received a response from Fly support.
Thursday, July 27, 2023 at 18:38 Nina Vyedin from the Fly support staff responded with a very generic answer and referred us to try running the migrate-to-v2 command again and troubleshooting with debug logs.
Sunday, July 30 at 01:16 we posted an update to our status page that the incident had been recovered and we were monitoring.
Since we had implemented multi-tenant SSO at PagerTree, I thought I could help out. After all, I had done this once or twice (turns out it is closer to 400). After sharing a raw gist, I realized a blog post would be more helpful to the community.
Ruby on Rails is the framework (7.0.4)
The Devise gem is used for authentication (4.8.1)
The ruby-saml gem for SAML parsing (1.14)
The acts_as_tenant gem for tenant management (0.5.1)
Check out our documentation on how this looks in practice.
/saml_callback - Alias for /public/saml/consume (see below). We had to support some legacy URLs when we migrated platforms.
Line 6 - skip_before_action :verify_authenticity_token - On requests from the IdP, don't verify the CSRF token.
Line 28 - user = User.find_by_email(email) - The email address is the link between IdP and SP.
In PagerTree, users can be part of multiple accounts. However, we don't want users to be able to have a personal account and login via username and password and then switch to an SSO enabled account. For SSO enabled accounts, a user should always be required to authenticate via SSO.
The Multi-Tenant SSO setup is a fairly advanced topic. Having done this several times before, I am sure I missed some things and could likely make other things clearer. If you have any constructive feedback you can reach out. I can't address every comment, but with your input I will try my best to update this content to make it even clearer for others in the community.
PowerShell is a powerful scripting language and command-line shell that is widely used for automation, administration, and managing Windows environments.
PowerShell is a powerful scripting language and command-line shell that is designed specifically for system administration and automation tasks in Windows environments. Whether you're a seasoned sysadmin or just starting with PowerShell, having a cheat sheet of essential commands at your fingertips can greatly enhance your productivity. In this blog post, we will cover some fundamental PowerShell commands, starting from the basics and gradually progressing to more advanced concepts.
The Set-ExecutionPolicy command allows you to manage the script execution policy on your system. It determines whether PowerShell scripts can be run and helps ensure system security. Here's an example of setting the execution policy to allow running scripts.
If you have not yet run PowerShell on your computer and are getting errors because of permissions, you likely need to run the Set-ExecutionPolicy command.
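For instance, RemoteSigned (one common choice) allows locally created scripts to run while requiring downloaded scripts to be signed:

```powershell
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
```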
PowerShell piping is a powerful feature that allows you to take the output of one command (or cmdlet) and use it as input for another command. It enables you to chain together multiple commands to perform complex operations with ease.
To understand piping, let's consider a simple example. Suppose you want to retrieve a list of running processes on your computer using the Get-Process command. By default, running Get-Process will display a table showing various details about the processes. However, what if you only want to see the processes related to a specific application, such as "chrome"?
In a non-piping scenario, you might need to run a separate command to filter the results manually. However, with PowerShell piping, you can achieve this in a more straightforward way. Here's an example:
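```powershell
# Show only processes named "chrome"
Get-Process | Where-Object { $_.Name -eq "chrome" }
```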
Let's break down this example step by step:
Get-Process: This command retrieves a list of all running processes on your computer.
| : The vertical pipe character (|) is the piping operator in PowerShell. It takes the output from the left side and passes it as input to the command on the right side.
Where-Object: This command is used to filter objects based on specific criteria. In this case, we want to filter the processes based on their name.
{ $_.Name -eq "chrome" }: This is a script block, which is essentially a piece of code enclosed within curly braces. It specifies the condition we want to use for filtering. Here, we're checking if the process name ($_.Name) is equal to "chrome". The $_ is an automatic variable referencing the current object in the pipeline.
By using the piping operator, we can take the output of Get-Process and directly pass it to Where-Object for further processing. As a result, only the processes with the name "chrome" will be displayed.
Piping can be used with multiple commands, allowing you to perform complex operations in a single line. You can chain together as many commands as needed, each building upon the output of the previous one.
Get-Alias retrieves the list of aliases (shortcuts) for PowerShell commands. It helps you understand and use PowerShell shortcuts effectively. Here's an example of retrieving all the aliases:
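```powershell
# List every alias defined in the current session
Get-Alias
```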
Where-Object filters objects based on specified criteria. It's handy for selecting specific data from a collection. Here's an example (using the Where-Object alias) of filtering processes based on their name:
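```powershell
# "?" is a built-in alias for Where-Object
Get-Process | ? { $_.Name -eq "chrome" }
```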
To set environment variables within a PowerShell session, you can use the $env: notation. Here's an example of how to set an environment variable:
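```powershell
# Scoped to this session (and child processes); the variable name is a placeholder
$env:MY_APP_ENV = "staging"
```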
Get-Process retrieves information about the running processes on your computer. It's an excellent command for monitoring and managing processes. Here's an example of listing the top 10 processes by CPU usage:
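```powershell
Get-Process | Sort-Object CPU -Descending | Select-Object -First 10
```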
Stop-Process terminates a running process. It allows you to end processes gracefully or forcefully if needed. Here's an example of forcefully stopping a process using its process ID (PID):
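```powershell
# 1234 is a placeholder PID
Stop-Process -Id 1234 -Force
```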
Get-Service retrieves information about services running on your system. It's helpful for managing and monitoring services. Here's an example of retrieving all the running services:
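```powershell
Get-Service | Where-Object { $_.Status -eq "Running" }
```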
Stop-Service stops a running service. It helps you manage services effectively. Here's an example (using Powershell piping) of stopping a service by its display name:
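```powershell
# "Print Spooler" is an example display name
Get-Service -DisplayName "Print Spooler" | Stop-Service
```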
Get-EventLog allows you to access event logs on your computer. It helps in analyzing system events and troubleshooting. Here's an example of retrieving the Application event log:
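```powershell
# The 50 most recent Application log entries
# (Get-EventLog ships with Windows PowerShell 5.1; PowerShell 7+ uses Get-WinEvent)
Get-EventLog -LogName Application -Newest 50
```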
Invoke-WebRequest allows you to send HTTP requests and retrieve web content. It's useful for automating web interactions. Here's an example of downloading a file from a URL:
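```powershell
# The URL and output path are placeholders
Invoke-WebRequest -Uri "https://example.com/file.zip" -OutFile "C:\Temp\file.zip"
```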
Export-CSV enables you to export PowerShell objects to a CSV (Comma-Separated Values) file. It's useful for storing and analyzing data. Here's an example of exporting process information to a CSV file:
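```powershell
Get-Process | Export-Csv -Path "C:\Temp\processes.csv" -NoTypeInformation
```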
Format-Table allows you to format and display PowerShell output in a tabular form. It's handy for better readability and presentation. Here's an example of formatting the output of the Get-Process command:
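```powershell
Get-Process | Format-Table Name, Id, CPU -AutoSize
```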
Invoke-Command enables you to execute commands on remote systems or run commands in the background. It provides a way to manage and automate tasks across multiple machines. Here's an example of how to execute a command on a remote system:
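```powershell
# Requires PowerShell Remoting to be enabled; Server01 is a placeholder
Invoke-Command -ComputerName Server01 -ScriptBlock { Get-Service }
```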
The ForEach-Object -Parallel command allows for parallel processing of data, making your scripts more efficient. It splits the input into multiple threads and processes them concurrently. Here's an example of how to parallelize a loop to perform actions on multiple computers simultaneously:
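```powershell
# Requires PowerShell 7+; computer names are placeholders
"Server01", "Server02", "Server03" | ForEach-Object -Parallel {
    Test-Connection -ComputerName $_ -Count 1
} -ThrottleLimit 3
```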
PowerShell is a versatile tool for automating tasks, managing systems, and performing various administrative tasks in Windows environments. This cheat sheet covered essential commands, from basic system information retrieval to more advanced concepts like parallel processing and remote command execution. By familiarizing yourself with these commands and their usage, you can become more efficient and effective in your PowerShell scripting journey. Happy scripting!
(Note: While this blog post aims to provide a comprehensive overview of the mentioned commands, it is essential to refer to the official PowerShell documentation for in-depth explanations and additional examples.)
Prometheus is an open-source monitoring and alerting toolkit that has gained significant popularity in DevOps and systems monitoring. At the core of Prometheus lies PromQL (Prometheus Query Language), a powerful and flexible query language used to extract valuable insights from the collected metrics. In this guide, we will explore the basics of PromQL and provide query examples for an example use case.
You have a high availability web app that you maintain. You'd like to have some observability into the traffic of your application. Your environment consists of 3 production web servers and 1 staging web server. Below is a table of instance vectors for your servers.
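For illustration, assume instance vectors like these (the metric name, labels, and values are hypothetical):

```
http_requests_total{instance="web-1", env="production"}  100
http_requests_total{instance="web-2", env="production"}  120
http_requests_total{instance="web-3", env="production"}  110
http_requests_total{instance="web-4", env="staging"}      10
```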
PromQL allows you to query time series data, which consists of metrics and their corresponding labels. The basic syntax for querying time series is as follows:
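```
metric_name{label1="value1", label2="value2"}
```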
Example:
To query the total HTTP requests metric for your fleet of servers, you would use:
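```
http_requests_total
```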
The above example would return an instance vector for each server in your fleet.
Instance vector selectors allow you to filter and focus on specific labels to extract relevant metrics. To filter the time series, append a comma-separated list of label matchers in curly braces {}
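For example, using the labels from the illustrative table above:

```
http_requests_total{env="production"}
```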
The above example would return an instance vector for each production server in your fleet.
Additionally, PromQL provides the following label matching operators:
= : Select labels that are exactly equal to the provided string.
!= : Select labels that are not equal to the provided string.
=~ : Select labels that regex-match the provided string.
!~ : Select labels that do not regex-match the provided string.
Regex matches are fully anchored: a match of env=~"foo" is treated as env=~"^foo$". You can test your regex matches using the Golang regex flavor.
So, to select all of our staging servers, we could use the following query:
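```
# An equality matcher (env="staging") would also work here
http_requests_total{env=~"stag.*"}
```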
PromQL provides various aggregation functions to summarize and aggregate time series data. Here are a few commonly used functions:
sum: Calculates the sum of all matching time series.
avg: Computes the average value of matching time series.
min: Returns the minimum value among all matching time series.
max: Returns the maximum value among all matching time series.
Example:
To calculate the average HTTP requests across all production instances, you can use:
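```
avg(http_requests_total{env="production"})

# Given the illustrative values above: (100 + 120 + 110) / 3 = 110
```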
The above would first return the instance vectors and then generate the average.
PromQL allows you to work with range vectors, representing time series data over a specified time range. This is particularly useful for analyzing trends and patterns. Here are a few important range functions:
rate: Calculates the "per-second rate of increase" of a time series over a specified time range.
irate: Similar to rate, but calculates the "instantaneous per-second rate of increase" of a time series over a specified time range by only considering the last 2 points.
increase: Computes the "absolute increase" in a time series value over a specified time range.
Enjoying this content? Check out our full article on Counter Rates and Increases here: https://pagertree.com/learn/prometheus/promql/counter-rates-and-increases
Example:
To calculate the number of HTTP requests you are getting for your entire production fleet.
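```
# The [5m] lookback window is an arbitrary example
sum(increase(http_requests_total{env="production"}[5m]))
```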
The above would first return the instance vectors, then calculate the difference between the vector values (t1 - t0), then sum them.
PromQL is a versatile and powerful query language that empowers users to extract valuable insights from Prometheus metrics. By mastering the basics covered in this cheat sheet, you'll be well-equipped to explore and analyze your monitoring data effectively. Remember, this blog post only scratches the surface. Experiment with different functions and operators to make the most of PromQL's capabilities.
By keeping this cheat sheet handy, you'll be able to navigate PromQL queries efficiently and unlock the full potential of Prometheus for monitoring and alerting in your systems.
At PagerTree we monitor our systems extensively; here are some of the common queries we use. The metrics (and metric names) we use are provided by the discord/prometheus_exporter gem or our own metric label name.
⭐ Link to full Prometheus Knowledge Hub
New features including reduced pricing, a crisp UI, and better alert aggregation are finally here.
Today I am excited to announce we have officially shipped PagerTree 4.0!
Here are the highlights:
Better UI - Both desktop and mobile, your eyes (and mine) can be relieved.
Multiple Accounts Per User - Users can now be a part of multiple accounts.
This effort has been a year and a half in development, and I sincerely want to thank each and every one of our customers for the constructive feedback, ideas, and countless hours on Zoom calls. Without you this journey wouldn’t be possible.
If you have any other features you think we could add or improve on, make sure to give us a shout! We love it when customers suggest new ideas.
Sincerely,
Austin
General
Search - Now powered by Elasticsearch and way more relevant.
I18n - We now support English, Spanish, French, German, and Dutch languages in the UI.
Tagging - Most models now support tags that can be used when searching.
Better UI - Both desktop and mobile have been redone. Better organization. Clear call to action.
Billing - Now handled through Stripe Billing Portal.
Pricing
Reduced to our original pricing model (Elite $15 - Pro $10 - Basic $0).
We’ve introduced a new “Enterprise” package (Enterprise $25).
Authentication and Security
Alerts
Public Pages - You can now make alerts public to the internet (default: false)
Users
Multiple emails per user
Multiple phones per user
The PagerTree iOS (iPhone) app now supports Critical Alerts bypassing do not disturb and the mute switch!
By default, Critical Alerts will be created for all critical urgency alerts in PagerTree.
In this blog post you’ll learn how to build a polymorphic select box in Ruby on Rails. Seems trivial, but isn’t. Let me save you some time.
Today I want to show you how to build a polymorphic select box in Ruby on Rails. Seems trivial, but it’s not. Let me show you the way and save you some time.
Now to the PagerTree specific issue at hand. In PagerTree, we have many models (think alerts, broadcasts, etc.). Those objects can be assigned/routed to many other objects. So for this example, a broadcast message can be sent to users, teams, and stakeholders; we’ll call these “broadcast recipients”.
The solution involves using signed global IDs and a utility method to set the broadcast recipients. I write this blog post in hopes that it can save you time and help you implement a polymorphic select box in Rails in a clean and secure manner.
My initial setup looked something like this: A broadcast can be created with many broadcast recipients. The broadcast recipient could be a user, team, etc. The broadcast controller only accepts “permitted” params, builds the broadcast, and saves it to the database.
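A minimal sketch of that starting point (model, attribute, and param names are assumptions):

```ruby
class Broadcast < ApplicationRecord
  has_many :broadcast_recipients
end

# The polymorphic "linker" model.
class BroadcastRecipient < ApplicationRecord
  belongs_to :broadcast
  belongs_to :recipient, polymorphic: true # a User, Team, etc.
end

class BroadcastsController < ApplicationController
  def create
    @broadcast = Broadcast.new(broadcast_params)
    @broadcast.save!
    redirect_to @broadcast
  end

  private

  def broadcast_params
    params.require(:broadcast).permit(:subject, :body, recipient_ids: [])
  end
end
```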
It’s a pretty standard setup. I would have expected that Rails would know what to do when our end-user selected a user or a team, but it didn’t. What was actually sent back by the form were the ids of the User/Team. That’s a big problem: server-side we don’t know what kind of object it was, and therefore don’t know whether it was User #1 or Team #1. You get the point.
I also made the assumption that Rails would automatically know the type of object, and it probably would have in a normal situation. The problem stems from the fact that the broadcast recipient is a linker model, and Rails didn’t know how to populate the polymorphic linker model.
A Global ID is an app wide URI that uniquely identifies a model instance
So essentially, I could make a select box with Global Ids, then the user could pick what they wanted and submit them back to our server. That’s a great start, but what happens if they modify the HTML and inject a different kind of model that shouldn’t be able to receive a broadcast (ex: an integration)? Even worse, what if the user is malicious, and starts injecting random models to see what they can poke around with (ex: AdminUser)? Eeek! We don’t want that.
Luckily though, there is also a version of Global IDs that are signed, namely Signed Global IDs (SGIDs). This makes it really hard for a malicious user to figure out the global ID or to inject their own. What’s even better is that SGIDs can be signed with an expiration time. This means they are not valid forever, and repeated calls will never generate the same SGID.
It’s worth noting that expiring SGIDs are not idempotent because they encode the current timestamp; repeated calls to to_sgid will produce different results.
The final solution looks like the following (notice the utility function used to set the recipient_users and recipient_teams).
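A sketch of that solution (method and purpose names are assumptions); locate_signed returns nil for anything expired, tampered with, or signed for a different purpose:

```ruby
class Broadcast < ApplicationRecord
  has_many :broadcast_recipients

  def recipient_users=(sgids)
    set_recipients(sgids)
  end

  def recipient_teams=(sgids)
    set_recipients(sgids)
  end

  private

  # Resolve SGIDs back into trusted models, then build the linker records.
  def set_recipients(sgids)
    records = Array(sgids).filter_map do |sgid|
      GlobalID::Locator.locate_signed(sgid, for: :broadcast_recipient)
    end
    records.each { |record| broadcast_recipients.build(recipient: record) }
  end
end
```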
We also need to add a helper for our view that sets the selected property. Because the SGIDs change on every call, we have to manually check if the broadcast had them already selected. Yes, this is not very efficient, but it’s a sacrifice we are willing to make for security.
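A sketch of such a helper (names are assumptions); since SGIDs differ on every call, we compare against the associated records rather than the SGID strings:

```ruby
module BroadcastsHelper
  def broadcast_recipient_options(broadcast)
    options = (User.all + Team.all).map do |record|
      sgid = record.to_sgid(expires_in: 1.hour, for: :broadcast_recipient).to_s
      selected = broadcast.broadcast_recipients.any? { |br| br.recipient == record }
      [record.name, sgid, { selected: selected }]
    end
    options_for_select(options)
  end
end
```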
In short, a polymorphic select box using linker models in Ruby on Rails is not so trivial. Using SGIDs and a utility accessor method, we can make a polymorphic select box that is simple and secure. I hope you’ve found value in this article and it has saved you some time :)
Reduced Pricing - has been reduced from $20 to $15.
Better Docs - New docs (with improved search) that are more clear and concise than before.
2FA - will keep your account safer.
We are excited to get this major release shipped, just in time for the holidays. You can check out the full details of the upgrade. Over the coming months we will continue to add features from our roadmap.
Documentation - Better documentation. Less redundant, clearer, and more concise.
2FA - has been added.
SSO - has been simplified.
Accounts -
- Alerts can now be commented on by users.
- Open Source - Check out .
- Notes and Attachments Section
- Many scheduling bugs have been fixed
- Better messaging - known as “Communications”
- Easier and Faster Filtering
- Option to enable auto recharge forever
A new version of the PagerTree iOS app is available today that adds support for Apple's Critical Alerts.
Critical Alerts are special notifications that bypass the mute switch, focus, and Do Not Disturb settings to generate an audible notification in emergency situations. Critical Alerts are only available to applications that have applied and been approved by Apple.
If you don’t have time to spare and read the background, jump down to the solution. The solution involves SGIDs and a utility accessor method, plus a form helper.
It’s no secret that as we build out PagerTree we’ve made the design decision to use Ruby on Rails as our framework of choice. Why, you ask? Ruby on Rails is a stable, battle-tested, and relatively simple MVC framework. Some of the largest companies use Ruby on Rails for their applications. As we think about the next chapter of PagerTree, we want it to follow standard conventions and tools so that it’s easy for new developers to work on.
As mature as the Ruby on Rails framework is, one would assume there is a standard and simple way to implement this. However, I found that there was not (at least in this use-case, with a linker model, namely our broadcast recipient). I spent a couple hours researching and fumbling through some code before I decided to ask Chris Oliver for some help on this issue (Chris, also known as excid3 on the internet, is the founder of GoRails). I thought this would be a 10 minute call; it turned into a 45 minute call with a solution that was way more complex than I expected.
Enter Global IDs (I believe this is already included in Rails 6.2+). A Global ID looks like gid://YourApp/Some::Model/id. It has the ability to uniquely identify a model by keeping its type and id.
| Server Instance | environment | http_requests_total (t0) | http_requests_total (t1) |
|---|---|---|---|
| web_prod_1 | production | 100 | 110 |
| web_prod_2 | production | 200 | 220 |
| web_prod_3 | production | 300 | 330 |
| web_stg_1 | staging | 10 | 20 |
Understanding SRE metrics and how they impact your platform's availability is fundamental to Site Reliability Engineering.
How available is your website, service, or platform? What must you monitor and measure to ensure availability? How do you translate uptime into availability? This chart has numbers that every Site Reliability Engineer (SRE) should know. Below the chart, you will find answers to commonly asked questions about SRE and associated metrics.
There is a saying in the NFL that goes, “A player’s best ability is his availability”. The same thing is true for websites, applications, and platforms. You can have a great website or the “best” cloud platform, but if it is not available for your customers when they need it, then your business and your reputation will suffer.
In this day and age, availability is everything, and it comes at a cost. Availability comes in many different forms, like redundancy, load balancing, multiple data centers, and engineering response, to name a few. To calculate availability, we typically look at how long a service was unavailable during a specified period of time, taking into account planned maintenance and other planned downtime.
Industry jargon refers to the number of “9’s” related to availability. For instance, one 9 would be 90%, while five 9’s would be 99.999%.
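For example, three 9’s (99.9%) works out to roughly 0.001 × 365 days × 24 hours ≈ 8.76 hours of allowable downtime per year, which lines up with the chart in this article.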
Metrics have become the lifeblood of many organizations. Deciding what to and what not to monitor can be just as important as the monitoring tools themselves (Prometheus, Grafana, systems, etc.). In many instances, there can be an overwhelming urge to gather metrics on every available function, potentially leading to information overload. To keep monitoring manageable and actionable, consider the following methods when determining your needs.
For hardware-related monitoring, consider the USE Method.
Utilization (% time that the resource was busy)
Saturation (amount of work resource has to do, often queue length)
Errors (count of error events)
For services-related monitoring consider the RED Method.
Rate (the number of requests per second)
Errors (the number of those requests that are failing)
Duration (the amount of time those requests take)
For Kubernetes-related monitoring of services, consider the Four Golden Signals.
Latency (time taken to serve a request)
Traffic (how much demand is placed on your system)
Errors (rate of requests that are failing)
Saturation (how “full” your service is)
Tom Wilkie of GrafanaLabs did a great talk on these at GrafanaCon EU 2018. For more information on these methodologies watch the video below or check out this article by Grafana Labs.
Site Reliability Engineering (SRE) dates back to 2003, when Google assigned a team of software engineers to design a concept that would make certain Google websites efficient, scalable, and reliable. The concepts they used were so successful that other technology companies, like Netflix and Amazon, began using similar concepts as well as improving upon them. In short order, SRE became its own tower within the IT architecture domain. SRE is meant to work in concert with DevOps but focuses on such things as capacity planning and disaster recovery and response. Ultimately, SRE focuses on the automation of operations, endeavoring to remove the human element so that sites, applications, and platforms can be optimized.
Understanding how availability impacts the delivery of your chosen platform starts with knowing what those numbers look like. For instance, the difference between 2 9’s and 5 9’s goes from days to minutes per year. Choosing the proper methodology, such as RED, USE, or the Four Golden Signals, will allow you to deliver high availability for your specific service. A good starting point to help you define your SRE operations is Google’s guide to SRE Operations.
You have identified a data breach, now what? In this blog post I’ll teach you how to streamline your incident response during a data breach with best practices.
You have identified a data breach, now what?
Your Incident Response Playbook is up to date. You have drilled for this; you know who the key players on your team are, and you have their home phone numbers, mobile phone numbers, and email addresses, so you get to work. It is seven o’clock in the evening, so you are sure everyone is available and ready to respond. You begin typing “that” email and making phone calls, one at a time.
There are a number of things wrong with this scenario:
How often do you drill and practice incident response?
Are we lucky enough for these incidents to happen at a decent time of the day?
How long does it take to write “that” email?
How long does it take to contact every person on your list?
The average cost of a data breach in 2020 was $3.86 million, according to a new report from IBM and the Ponemon Institute. – Dan Swinhoe, CSO Online
Regardless of where you identify a data breach on the Cyber Kill Chain or MITRE ATT&CK frameworks, internal notification and incident response are crucial, as every second counts.
Manual processes can be the single point of failure in our ever-evolving automated world. Passive communication channels, like email, leave the sender wondering if the recipient has received and read said email. This assumes that a person is sending an email alert. Today many security appliances and platforms are configured to send emails to static email addresses. In most instances, these appliances and platforms are outbound only with no way of confirming delivery or read receipts.
The proliferation of instant messaging applications allows us to look at our screen and see those three dots scrolling which in turn tells our brain the message has been delivered, read, and is being responded to, in real-time. It is this type of technology that continues to transform the digital workplace forcing companies to find solutions that allow us to “work as we live”.
Justine Phillips, a partner with Sheppard Mullin specializing in Privacy and Cybersecurity and a thought leader in data breach response, offered some insight into some of the challenges organizations face when responding to data breaches and other cybersecurity events.
Automation and real-time alert routing saves valuable time to contain and remediate a cyber event. It also gets the right people engaged at the right time to begin the forensic investigation. Many laws impose time-sensitive deadlines and the clock starts running when the event is discovered. – Justine Phillips
Spending on cybersecurity prevention, detection, and incident response has increased exponentially over the last decade, and that trend continues. As good as many of these products are, the number one notification channel continues to be electronic mail. E-mail is a passive form of communication, with the sender often having no idea if the email was received, let alone read. Forget about people changing email addresses or leaving organizations; once an email address has been entered into an appliance or platform, it is often forgotten. If your sentry sends up a signal and nobody sees it, your incident response will be delayed, or worse, never started.
To bridge this gap we should be looking at automation and alert routing platforms. An alert routing platform should have the ability to tie in your monitoring systems along with your preferred channel of communication: Voice, SMS, Push, Instant Messaging, and yes, even email. For instance, PagerTree allows you to take a single email address and transform it from a serial communication channel into a powerful multi-channel mechanism that triggers multiple communications to multiple people across many channels. In addition to multi-channel communications, PagerTree requires users to actively accept or reject a notification. This allows incident commanders to know, in real-time, who has acknowledged or rejected a given alert notification.
Regardless of your chosen alert routing platform, you should be looking for some of the following characteristics:
Configuration options that give you the ability to customize how you reach your team: Voice Call, SMS, Push Notification, Instant Messaging, Email
Easy to use scheduling calendars for one or many team/members
Configure how often a communication channel is repeated
Configure how long to try reaching a team member before moving on to the next person on the schedule
Utilize escalation layers in the event a team member is unavailable
Alerts initiated via email, webhooks, or other custom integrations
Redundant telecommunication channels
Key Performance Indicators (KPIs) like Mean Time to Respond
API and other integration opportunities
Ultimately, your alert routing platform should provide you with the confidence to move on to the next steps in your incident response playbook and not focus on who has and has not been alerted. This will allow you to focus your team’s efforts on mitigating the data breach or cybersecurity event, mere seconds after the notification process is initiated, saving you time, frustration, and money.
If you are still searching for an intelligent alert routing platform, check out PagerTree. Click here to start a fully functional, risk-free trial that could help your organization if it faces a data breach.
GitHub Actions are a great way to automate the build and deploy process for your repos.
In this tutorial, I will show you how to build and deploy a Jekyll static site to AWS S3 + Cloudfront using GitHub Actions. At PagerTree we use GitHub Actions to automate the building and deploying of our marketing site pagertree.com.
These days, if you have to do anything manually more than a couple of times, you should probably be automating it. GitHub Actions make it easy to automate software workflows. At PagerTree, we use GitHub Actions to deploy our marketing site in a continuous and reliable way.
For this tutorial, I’ll make the assumption that you are fairly familiar with git and Jekyll and already have a static website hosted on AWS S3 + Cloudfront.
Below I’ve listed what you’ll need for this tutorial. I’ll assume you are dangerous enough to create the following on your own and won’t cover how to create these, as it’s out of the scope of this post.
Jekyll static site
GitHub Account and Repo
Our desired workflow should look something like the following:
On push to our repo’s main branch or when manually clicked in GitHub:
Build the main branch.
Deploy the generated static site files to AWS S3.
Create an AWS Cloudfront invalidation.
This is pretty minimal, and you can get waaay fancier, but for the purpose of this tutorial it should help us understand how to use GitHub Actions.
Your GitHub Actions definitions live in a special directory in your repo (<repo>/.github/workflows/). Inside this directory, you’ll have all your workflow files (yml format).
Workflows will trigger off events (aka specific activities) that happen in GitHub. There are quite a few, but for this tutorial we will focus on the push and workflow_dispatch events.
In your <repo>/.github/workflows/ directory, create a new file called build_and_deploy.yml. Copy and paste the following into your newly created GitHub Action workflow:
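Here’s a sketch of what that workflow could look like (action versions and the AWS region are assumptions - adjust them to your setup):

```yaml
name: CI / CD

on:
  push:
    branches: [main]
  workflow_dispatch:

jobs:
  build_and_deploy:
    runs-on: ubuntu-latest
    steps:
      # Checkout the main branch
      - uses: actions/checkout@v4

      # Installs Ruby (respecting .ruby-version) and runs `bundle install`
      - uses: ruby/setup-ruby@v1
        with:
          bundler-cache: true

      # Build the site with the production environment
      - name: Build
        run: bundle exec jekyll build
        env:
          JEKYLL_ENV: production

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      # Upload the generated static files to S3
      - name: Deploy
        run: aws s3 sync _site "s3://${{ secrets.AWS_S3_BUCKET_NAME }}" --delete

      # Invalidate Cloudfront so we can see the new site immediately
      - name: Invalidate
        run: >
          aws cloudfront create-invalidation
          --distribution-id "${{ secrets.AWS_CLOUDFRONT_DISTRIBUTION_ID }}"
          --paths "/*"
```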
This workflow file is responsible for building and deploying the site. We’ve named it “CI / CD”. It’s pretty self-explanatory, but I’ll explain the process:
When a push is made into the main branch (or manual button click in GitHub), run this workflow.
The job - Use the Ubuntu latest virtual environment (see all environment options here)
Checkout our main branch.
Setup our Ruby environment (see docs) - Installs Ruby (with specified Ruby version if you have a .ruby-version file) and runs ‘bundle install’.
Build the site (with the production environment).
Uploads output files from our _site directory to our S3 bucket.
Creates a Cloudfront invalidation (so we can see our new site immediately).
Pretty straight forward, but we still need to create a few resources in AWS and configure secrets in our GitHub repository.
We’ll need to create 2 AWS resources, namely an IAM Policy and User.
IAM Policy - will grant restricted access to deploy to our S3 bucket and create an invalidation on our Cloudfront distribution. You’ll attach this policy to the IAM User.
IAM User - will be the credentials the GitHub Action uses to run its aws-cli commands.
Below is the AWS IAM Policy you’ll need to create. You must modify it by replacing a couple of the items below (make sure to replace the ‘<’ and ‘>’ too).
<your-bucket-name> - Your S3 bucket name (ex: www.acme.com)
<your-aws-account-number> - The 12 numeric characters of your AWS account.
<your-distribution-id> - The 14 alphanumeric characters of your associated Cloudfront distribution.
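A sketch of that policy, using the placeholders above:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::<your-bucket-name>"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::<your-bucket-name>/*"
    },
    {
      "Effect": "Allow",
      "Action": ["cloudfront:CreateInvalidation"],
      "Resource": "arn:aws:cloudfront::<your-aws-account-number>:distribution/<your-distribution-id>"
    }
  ]
}
```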
In AWS, create a new IAM Policy
Create a new IAM User with programmatic access and attach the IAM Policy you just created above.
Copy the AWS access key ID and Secret access key to somewhere safe, as we will need these in our next step.
In order to use the special variables like ${{ secrets.AWS_ACCESS_KEY_ID }}, we’ll need to configure them in the GitHub Actions Secrets. To do this:
In GitHub, navigate to Your Repo > Settings > Secrets > Actions
For each secret below, click the New repository secret button, fill out the form, and click Add Secret
AWS_ACCESS_KEY_ID - What you copied in the previous step as AWS access key ID.
AWS_SECRET_ACCESS_KEY - What you copied in the previous step as Secret access key.
AWS_S3_BUCKET_NAME - The bucket name you set previously in your IAM Policy (ex: www.acme.com).
AWS_CLOUDFRONT_DISTRIBUTION_ID - The Cloudfront distribution id you set previously in your IAM Policy.
The easiest way to test your new GitHub action, is to:
Make a small change to your Jekyll site
Commit the change, and push to main.
Navigate to your website (https://www.acme.com), do a hard refresh (Ctrl + F5) and then you should see the changes you just made.
In the GitHub Actions panel, you should see a workflow that was created and the output of the commands that were run.
Note: The first time this runs it could take ~5 minutes. The ‘bundle install’ command for our project took a while, but don’t worry, subsequent builds should use the bundle cache.
That’s it, you’ve now successfully created a GitHub Action to build and deploy your Jekyll static site to S3 and Cloudfront. I hope you found some value in this tutorial; it’s pretty basic, but if you’re new to GitHub Actions it should provide a valuable launching pad. Make sure to follow me on twitter, and if you haven’t yet, make sure to check out PagerTree :)
In this guide, I’ll show you how to migrate away from attr_encrypted to the new Active Record encrypts.
Rails 7 has introduced Active Record Encryption, functionality to transparently encrypt and decrypt data before storing it in the database. This is awesome news for any developer who has ever had to encrypt data before storing it.
In this guide, I will walk you through an example of migrating away from the attr_encrypted gem to the new Rails 7 Active Record encrypts. We will do this using strong migrations and also maintain the ability to perform a database rollback without data loss.
If you are short on time, below is the crux of this article. If you are actually implementing this, I would highly encourage you to read on, this can be a fairly complex migration.
This article is written on 13 April 2021 - Currently Rails 7 is Edge (aka alpha). This tutorial makes certain assumptions based on that. I will publish updates when Rails 7 is officially released.
attr_encrypted and Active Record encrypts are not compatible - you’ll need to use a fork of attr_encrypted
devise - currently needs the patch-2 branch to work with Rails 7
Upgrade to Rails 7
Add dynamic attributes to model
Perform Migrations
Delete attr_encrypted gem dependency
Most applications at some point in time need to encrypt data before storing it in the database (and conversely decrypt it before using it in the application). Historically, there have been 2 gems that were fairly popular for this sort of functionality, namely attr_encrypted and lockbox. I personally have preferred lockbox, since it’s still actively maintained and uses fewer columns, but if you are like me, you can’t always choose what’s handed to you.
Unfortunately, the attr_encrypted gem is no longer maintained and has a lot of name clashes with the Rails 7 Active Record encrypts functionality. To work around this, we had to create a fork and rename many of the function calls and properties (namely encrypt, decrypt, etc.). You too will need to use the PagerTree fork of the attr_encrypted gem during your migration process (but don’t worry, you can delete it after your migration).
You’ll first need to upgrade to Rails 7. As of this writing (13 April 2021), Rails 7 is Edge. This tutorial will use syntax and functionality that is currently in alpha.
In your gem file you’ll need to change:
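A sketch of the relevant Gemfile changes (the exact git refs are assumptions - point these at the forks you are actually using):

```ruby
# Gemfile
gem "rails", github: "rails/rails", branch: "main" # Rails 7 (Edge)

# Renamed fork of attr_encrypted (hypothetical path - use your own fork)
gem "attr_encrypted", github: "<your-fork>/attr_encrypted"

# Devise needs the patch-2 branch to work with Rails 7 (per the note above)
gem "devise", github: "<your-devise-fork>", branch: "patch-2"
```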
And then make sure to install the new dependencies, and update any others.
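For example:

```bash
bundle install
bundle update
```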
At this point, we now have Rails 7 installed, with a compatible version of attr_encrypted.
Following encrypts documentation, you’ll need to add some keys to your rails credentials file.
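Current versions of Rails ship a generator that prints a fresh set of random keys (a sketch; verify the command against your Rails version):

```bash
bin/rails db:encryption:init
```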
Copy the output YAML and paste it into your credentials file. It should look something like this:
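With placeholders in place of the generated values:

```yaml
active_record_encryption:
  primary_key: <generated key>
  deterministic_key: <generated key>
  key_derivation_salt: <generated salt>
```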
You’ll need to do this once for each environment (normally development, staging, production).
At this point, the Active Record encrypts should be ready to go.
The next steps will be to migrate any data that was previously using attr_encrypted to use the new encrypts methods. Because we want to be secure and also use strong migrations our process should look like this:
Modify our model to dynamically define attributes to use during our migration
Create a new temporary column for our old encrypted data (attr_encrypted)
Copy old encrypted data to new temporary column, and delete old column
Add a new column for our Rails 7 Active Record encrypts data
Run a migration to programmatically decrypt attr_encrypted temporary column and put it in Rails 7 Active Record encrypts column
Delete temporary column
It seems like a lot of overkill, but we do it this way so we don’t perform any dangerous database operations and we keep strong migrations happy. This process will also keep our migrations backward compatible and prevent data loss in case we ever need to roll back.
I’m going to make the assumption that you are fairly dangerous when it comes to coding and that you are relatively familiar with the Rails framework. Please use this example as a guide; you’ll need to make modifications to your own code to make this work for you.
Below is what I will assume is our starting point. We have a User model that has an attribute called otp_secret (which stands for “one time password secret”, used for two factor authentication).
The otp_secret property currently uses attr_encrypted. This means in our database we should have the following columns:
We’ll take advantage of the fact that attr_encrypted prefixed its column names with “encrypted”. By copying our data into a temporary column, we can avoid name clashes, and use the encrypts functionality almost transparently (you’ll see below how the names will come full circle).
We need to add some extra code to dynamically define attributes. During the migration, only two of these columns will ever exist at a time, making it so that we can migrate our columns without name clashing.
The temporary column will just hold a copy of our existing attr_encrypted field. We move data here for strong migrations and so the Rails 7 encrypts column doesn’t conflict with the attr_encrypted accessor.
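A sketch of what that could look like (the key lookup and column names are assumptions; note that checking column_names requires a database connection at load time):

```ruby
# app/models/user.rb - temporary, only while migrating
class User < ApplicationRecord
  # Old attr_encrypted accessor, pointed at the temporary column (if present).
  if column_names.include?("encrypted_otp_secret_tmp")
    attr_encrypted :otp_secret_tmp,
      key: Rails.application.credentials.attr_encrypted_key, # assumption
      attribute: "encrypted_otp_secret_tmp"
  end

  # New Rails 7 Active Record encryption, once the new column exists.
  encrypts :otp_secret if column_names.include?("otp_secret")
end
```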
You’ll want to create a new migration that copies the original attr_encrypted column to the one we just created, but make sure you define both up and down so that you have backward compatibility.
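A sketch of that migration (class and column names are assumptions; safety_assured comes from the strong_migrations gem):

```ruby
class MoveEncryptedOtpSecretToTmp < ActiveRecord::Migration[7.0]
  def up
    add_column :users, :encrypted_otp_secret_tmp, :string
    add_column :users, :encrypted_otp_secret_tmp_iv, :string

    # Raw copy of the ciphertext and IV - nothing is decrypted here.
    execute <<~SQL
      UPDATE users
      SET encrypted_otp_secret_tmp    = encrypted_otp_secret,
          encrypted_otp_secret_tmp_iv = encrypted_otp_secret_iv
    SQL

    safety_assured do
      remove_column :users, :encrypted_otp_secret
      remove_column :users, :encrypted_otp_secret_iv
    end
  end

  def down
    add_column :users, :encrypted_otp_secret, :string
    add_column :users, :encrypted_otp_secret_iv, :string

    execute <<~SQL
      UPDATE users
      SET encrypted_otp_secret    = encrypted_otp_secret_tmp,
          encrypted_otp_secret_iv = encrypted_otp_secret_tmp_iv
    SQL

    safety_assured do
      remove_column :users, :encrypted_otp_secret_tmp
      remove_column :users, :encrypted_otp_secret_tmp_iv
    end
  end
end
```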
Now we’ll add a new column, where we will store the Rails 7 Active Record encrypts data.
It’s important that the column be of type :text. The Rails guides specify that the column should be at least 510 bytes.
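A sketch:

```ruby
class AddOtpSecretToUsers < ActiveRecord::Migration[7.0]
  def change
    # :text gives the ciphertext plenty of room (at least 510 bytes per the guides).
    add_column :users, :otp_secret, :text
  end
end
```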
In this step, we generate a migration to move data from the attr_encrypted property to the Rails 7 Active Record encrypts property. We have to do this programmatically (and can’t use a shortcut db command) because the Rails engine is what actually does the encrypt and decrypt work for us.
Additionally, we do some special reloading of the User model because of how we have dynamically defined attributes (again, this is meant to be temporary while we migrate).
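A sketch of that data migration (accessor names follow the dynamic attributes defined earlier):

```ruby
class MigrateOtpSecretToRailsEncryption < ActiveRecord::Migration[7.0]
  def up
    # Pick up the dynamically defined attributes for the current schema.
    User.reset_column_information

    User.find_each do |user|
      # attr_encrypted decrypts the tmp column; assigning to otp_secret lets
      # Rails 7 Active Record encryption re-encrypt the value on save.
      user.otp_secret = user.otp_secret_tmp
      user.save!(validate: false)
    end
  end

  def down
    User.reset_column_information

    User.find_each do |user|
      user.otp_secret_tmp = user.otp_secret
      user.save!(validate: false)
    end
  end
end
```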
Our last step is to remove our temporary column, so our database is kept nice and clean. Again, we define the up and down methods in this migration so we are backward compatible, and so that, if for any reason we need to, we can go back in time and re-create our data.
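A sketch, mirroring the copy migration above:

```ruby
class RemoveEncryptedOtpSecretTmp < ActiveRecord::Migration[7.0]
  def up
    safety_assured do
      remove_column :users, :encrypted_otp_secret_tmp
      remove_column :users, :encrypted_otp_secret_tmp_iv
    end
  end

  def down
    add_column :users, :encrypted_otp_secret_tmp, :string
    add_column :users, :encrypted_otp_secret_tmp_iv, :string
  end
end
```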
Now you should be able to run all your newly created migrations with one swift command.
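```bash
bin/rails db:migrate
```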
You can now safely remove the attr_encrypted dependency in your gem file. However, be aware that this will break the existing rails db:create db:setup process (for example, in development). You’ll likely want to use rails db:setup instead, so that it loads from the schema file, and at some point squash your migrations directory.
I hope you find some value in this tutorial and it can save you time and effort when it comes to migrating away from attr_encrypted. There’s probably a lot I missed on here, so if you have something to add you can reach out to me on twitter and I will update the article with your suggestion.
Some other notes on snags I came across during development.
Problems when creating a database with Devise and Rails 7 Active Record Encrypts
The Rails 7 Active Record encrypts seems to break db:create when used in conjunction with Devise. I didn’t dig too far into this, but Rails complains that the encrypts modifier can’t properly check the database column size. That makes sense, since there is no database yet, but it did force me to create a hack on the User model. It didn’t seem to affect other models that didn’t interact with Devise.
I assume this will get fixed at some point and is just a Devise + Edge (alpha) thing.
Ruby on Rails Cheat Sheet - A quick reference guide to common ruby on rails commands and usage.
Table of Contents:
Evaluation and Output
Evaluation can be done with the <% %> syntax, and output can be achieved with the <%= %> syntax.
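For example:

```erb
<% if user.admin? %>   <%# evaluated only - renders nothing %>
  <%= user.name %>     <%# evaluated AND rendered %>
<% end %>
```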
You can render partials like so:
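For example (paths and locals are placeholders):

```erb
<%= render "shared/menu" %>
<%= render partial: "product", locals: { product: @product } %>
```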
Common rails commands. (Note: “rails g” stands for “rails generate”)
Common rake commands for database management and finding routes.
:boolean
:date
:datetime
:decimal
:float
:integer
:primary_key
:references
:string
:text
:time
:timestamp
Before filters are registered via before_action and can halt the request cycle.
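A sketch:

```ruby
class PostsController < ApplicationController
  # Runs before the listed actions; rendering or redirecting
  # inside a before filter halts the request cycle.
  before_action :set_post, only: [:show, :edit, :update]

  private

  def set_post
    @post = Post.find(params[:id])
  end
end
```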
Application configuration should be located in config/application.rb, with specific environment configurations in config/environments/. Don’t put sensitive data in your configuration files; that’s what the secrets are for. You can access configurations in your application code with the following:
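For example, using the config.x namespace for custom configuration (names are placeholders):

```ruby
# config/application.rb
config.x.payment_processing.schedule = :daily

# Anywhere in application code:
Rails.configuration.x.payment_processing.schedule # => :daily
```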
Application secrets are just that, secret (think API keys). You can edit the secrets file using the following command: rails credentials:edit --environment=env_name. This will create files in the config/credentials/ folder. You’ll get two files:
environment.yml.enc - This is your secrets, encrypted. This can be put into git.
environment.key - This contains the key that encrypts the file. DO NOT put this into git.
Additionally, when deploying, the key inside the environment.key file will need to be placed into the RAILS_MASTER_KEY environment variable. You can then access secrets in your rails code like so:
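For example (key names are placeholders):

```ruby
Rails.application.credentials.aws[:access_key_id]
# or, nil-safe:
Rails.application.credentials.dig(:aws, :access_key_id)
```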
A short list of gems, frameworks and education materials that I have found useful in my Rails journey.
Host a publicly available form where customers or associates can input alerts outside the PagerTree ecosystem!
Today, we are excited to announce a new integration - PagerTree Forms!
From the integrations page, click the “Create Integration” button.
Find the Form Integration, then click the logo.
On the new integration form, fill in the integration details.
Click Create
In Chrome or Firefox, paste the CNAME URL into the browser URL bar.
There’s a bounty of options to set including:
Title - What appears in tab title of the browser
Header - Custom form header
Instructions - Custom instructions
Footer Text & Link - Custom link back to your site
Custom CNAMES - Link your own subdomain with self signed certificates
… and many more
This integration was really fun to build and I hope you can take full advantage of this feature today!
In this blog post I will assist you in installing a Ruby on Rails development environment with a simple step-by-step process.
Ruby on Rails is an excellent framework for web application development. For those of you who are new to RoR, like me, you will need to install several different applications (referred to as dependencies) to ensure this runs smoothly.
Here are the packages, tools, and databases we will be installing:
Create a GitHub account - Our preferred vendor that allows us to host git repositories in the cloud.
Here we will be navigating through the steps to get your Ruby on Rails development environment setup and all of the dependencies installed.
You will need to run the commands below in your terminal to install git.
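For a Debian-based system:

```bash
sudo apt-get update
sudo apt-get install git
git --version   # verify the install
```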
If there are none, you will then run the next commands to generate a new one.
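A sketch using the key type GitHub currently recommends (the email is a placeholder):

```bash
ssh-keygen -t ed25519 -C "your_email@example.com"
```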
Next, press ENTER.
Next you will be giving it your information.
Once you have the SSH key generated you will need to add it to the ssh-agent to manage. In the command line enter:
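```bash
eval "$(ssh-agent -s)"   # starts the agent in the background
```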
To add it, enter:
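```bash
ssh-add ~/.ssh/id_ed25519   # path assumes the ed25519 key generated above
```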
Now rbenv should be installed, but we also need to add some startup scripts to your bash profile, so that your terminal uses rbenv instead of the system wide Ruby version.
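A sketch for a bash shell (paths assume a standard rbenv install):

```bash
echo 'export PATH="$HOME/.rbenv/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(rbenv init -)"' >> ~/.bashrc
source ~/.bashrc
```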
For our setup, let’s run the latest and greatest (as of this writing) version of Ruby (3.0.1). To install this version of Ruby, we will use rbenv. Run the following in your terminal:
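```bash
rbenv install 3.0.1
rbenv global 3.0.1
ruby -v   # should report ruby 3.0.1
```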
Share a team’s on-call calendar with the rest of the world!
By default, all calendars are private, so to make use of this feature you must enable it.
Navigate to the Team Page you would like to share the schedule for.
In the top information box, click the globe icon.
Click confirm.
Click your new public team calendar link.
There’s a bounty of options to set including:
Password Protect? - Should this calendar require a password to access it?
Show User Emails? - Should we display user emails?
Show User Phones? - Should we display user phones?
Special Message - Show a special message at the top of the calendar?
and many more…
Start taking advantage of this feature today!
We’ve just added even more chatbots for PagerTree. Connect Slack, Mattermost, Microsoft Teams, and Google chat to PagerTree and get the most out of your on-call rotation!
This takes full advantage of the previous attr_encrypted nomenclature. After the migration, we should still be able to access the otp_secret by using user.otp_secret. See how that just came full circle?
If it worked, congrats! If for some reason it doesn’t work, check the error output. It could be a simple syntax error, or something specific to your setup. Here is where I am counting on you to be dangerous and figure out what could have happened.
We’ve been doing some Ruby on Rails development lately, and we wanted to put together a Ruby on Rails cheat sheet. This is a quick reference guide to common Ruby on Rails commands and usage.
Hashes were one of the most confusing things to me when first starting Ruby (not because they are a new concept, but because I found the syntax very hard to read). In newer versions, the syntax is very similar to JSON notation. Just know there are two versions of the syntax, an older and a newer one.
Also, you can have symbols as keys for hashes, and they do not look up the same values as strings.
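For example:

```ruby
# Older "hash rocket" syntax
old_style = { :name => "Austin", :role => "admin" }

# Newer syntax (looks like JSON)
new_style = { name: "Austin", role: "admin" }

# Symbol keys and string keys are NOT interchangeable
h = { "name" => "Austin" }
h["name"] # => "Austin"
h[:name]  # => nil
```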
Instead of checking for nil or undefined values, you can use the safe navigation operator. There’s a nice article that goes into more depth.
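For example:

```ruby
user = nil
user&.name # => nil, instead of raising NoMethodError
```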
Migration data types. Below is a list of the types I commonly reference.
Filters are methods that are run “before”, “after”, or “around” a controller action. See the official documentation for details.
Gusto has a really nice article on this.
This table references the official documentation. Check out the full documentation for other special callbacks like after_touch.
A couple of basic (and most commonly used) queries are below. You can find the full documentation in the official Rails guides.
Additionally, you are likely to want to check for the existence of a condition many times. There are many ways to do this, namely present?, any?, empty?, and exists?, but the exists? method will be the fastest way to check whether at least one record matching a query exists.
- Easy multi-tenancy for rails database models.
- Rails engine for flexible admin dashboard.
- Flexible authentication system.
- Provides “Login As” another user functionality for Devise.
- Generate fake data like names, addresses, and phone numbers. Great for test data.
- Expose a hashid instead of primary id to your users.
- Display friendly client side local time.
- Encryption for database fields (model attributes). Just use Rails 7 native encryption instead; see the migration guide above.
- Gold standard pagination gem.
- Rack middleware (before Rails) for blocking & throttling.
- A rails Google Recaptcha plugin - You’ll want this one especially for public facing forms to stop bot crawlers.
- A tiny framework for sprinkles of Javascript for your front end.
- Generate scoped ids (ex: per tenant ids for models, aka friendly id).
- Redis backed background processing for jobs.
- A scheduler for Sidekiq (think running weekly email reports).
- Makes your web app feel faster (like a single page application).
- A SaaS Framework already supporting login, payment (Stripe, Paypal, and Braintree) and multi-tenant setup.
- A utility first CSS framework. Seems a little verbose at first, but you’ll really learn to love it. Just by reading the code, you’ll know exactly what the screen will look like.
- Ruby on Rails tutorials, guides, and screencasts.
I hope you find some value in this cheat sheet. There’s probably a lot I missed on here, so if you have something to add you can reach out to me on Twitter and I will update the article with your suggestion.
PagerTree Forms are simple (PagerTree hosted) forms that can be made public so your customers can quickly create an alert outside the PagerTree ecosystem.
PagerTree Forms also support custom CNAMEs so you can host them on your own domain (ex: https://support.example.com). The CNAME option is secured via HTTPS using self-signed certificates.
PagerTree Forms are available today. If you don’t already have an account, sign up for a free-trial now.
For full details, check out the documentation.
Today we will install Ruby on Rails (RoR) on a Debian Linux operating system (Ubuntu LTS). With that said, RoR is compatible with other operating systems with just a few tweaks. This blog will assist you in installing RoR with a simple step-by-step process. Your installation may differ; for other operating systems, refer to the official installation guides.
I am new to developing and have been using Ubuntu LTS, a flavor of Debian Linux, for my projects. This blog will provide the steps and information needed to get the environment and dependencies installed for RoR so you can get your first project going.
- Git - A distributed version control system.
- SSH - Secure Shell is a protocol that allows users to control and modify their remote servers over the Internet while ensuring security.
- Homebrew - Software package manager that simplifies the installation process for Mac OSX and Linux.
- rbenv - A tool that manages, installs, and runs multiple versions of Ruby.
- My preferred code editor.
- A relational database used for long term storage.
- A key-value database used for short term storage (caching).
- NodeJS - Javascript runtime environment. Runs on the Chrome V8 engine and executes javascript code outside of a web browser.
- Yarn - A more secure npm (node package manager - gets installed with NodeJS).
Remember, git is the program for distributed version control, and GitHub is our preferred vendor. So, if you haven’t already, create an account with GitHub.
You will need to generate an SSH key and connect it to GitHub. We will first check to see if there are any existing SSH keys. Run this command to see if there are any pre-existing SSH keys:
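```bash
ls -al ~/.ssh   # lists existing keys, if the directory exists
```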
The final step is to add the SSH key to GitHub. Follow GitHub’s documentation to do so.
Homebrew is a package manager (similar to apt-get) that helps us install other packages on our system. To get the Homebrew package installed, you will have to run the below command:
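The current install command from the Homebrew project (verify against brew.sh before running):

```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```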
Remember, rbenv is a tool that will help us manage installing and running multiple versions of Ruby. To install rbenv, run the following:
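A sketch using Homebrew (installed above); ruby-build provides the rbenv install command:

```bash
brew install rbenv ruby-build
```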
You can follow the directions in the link to get the correct version installed on your device.
This will be our relational database preference for our RoR setup. To install:
This is our key-value database that RoR uses for caching.
The link will take you through the steps to get the correct version installed on your device and will give you a thorough understanding.
For the final step, we will be installing the package manager by running the command below.
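Assuming Yarn (as described above), the classic install via npm:

```bash
npm install --global yarn
yarn --version
```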
Now that your environment is ready, you can dive into your first project. All in all, Ruby on Rails is a great development environment. It is easy to navigate, scalable, and excellent for team projects. Looking for more useful information on Ruby on Rails? Check out our Ruby on Rails Cheat Sheet.
Today, we are excited to announce PagerTree has added support for public calendars! Public calendars allow you to share a team’s on-call calendar with the rest of the world.
Public Calendars are available today. If you don’t already have an account, sign up for a free-trial now.
For details, check out the full documentation.
Today, we are excited to announce PagerTree has added 3 new chatbot services: Mattermost, Microsoft Teams, and Google Chat (this is in addition to our core Slack notification channel).
Chatbots are available on all pricing tiers free of charge! If you don’t already have an account, sign up for a free-trial now.
| Uptime | Downtime (Per Year) |
|---|---|
| 99% | 3 Days : 15 Hours : 39 Minutes |
| 99.9% | 8 Hours : 45 Minutes : 56 Seconds |
| 99.99% | 52 Minutes : 35 Seconds |
| 99.999% | 5 Minutes : 15 Seconds |
| 99.9999% | 31 Seconds |
| 99.99999% | 3 Seconds |
| column_name | column_type |
|---|---|
| encrypted_otp_secret | :string |
| encrypted_otp_secret_iv | :string |
| Command | Description |
|---|---|
| rails g model | Generates model and migration files |
| rails g scaffold | Generates controller, model, migration, view, and test files. Also modifies routes |
| rails g controller | Generates controller and view files. Useful if you already have the model |

| Command | Description |
|---|---|
| rake routes | View all routes in application (pair with grep) |
| rake db:seed | Seed the database using the db/seeds.rb file |
| rake db:migrate | Run any pending migrations |
| rake db:rollback | Rollback a database migration (add STEP=2 to remove multiple migrations) |
| rake db:reset | Destroy the database, re-create it, and run migrations (useful for development) |
| New Record | Updating Record | Destroying Record |
|---|---|---|
| save | save | destroy |
| save! | save! | destroy! |
| create | update_attribute | |
| create! | update | |
| | update! | |
| before_validation | before_validation | |
| after_validation | after_validation | |
| before_save | before_save | |
| around_save | around_save | |
| before_create | before_update | before_destroy |
| around_create | around_update | around_destroy |
| after_create | after_update | after_destroy |
| after_save | after_save | |
| after_commit / after_rollback | after_commit / after_rollback | after_commit / after_rollback |
| Command Example | Description |
|---|---|
| Model.find(1) | Find model by id |
| Model.where(active: true) | Find models where conditions |
| Model.find_by(name: "Austin") | Find models where condition |
| Model.where.not(active: true) | Find models where condition not true |
| Model.first | Get the first model in the collection (ordered by primary key) |
| Model.last | Get the last model in the collection (ordered by primary key) |
| Model.order(created_at: :desc) | Order your results or query |
| Model.select(:id, :name) | Select only specific fields |
| Model.limit(10).offset(20) | Limit and offset (great for pagination) |
Learn how maintenance windows make it easy to suppress alerts from integrations during specific times.
Today, we are excited to announce PagerTree now officially supports maintenance windows! Long overdue (technically from our 2019 roadmap), with maintenance windows it’s now easier than ever to suppress alerts from integrations during specific time periods.
Maintenance Windows are available on our Pro and Elite pricing plans. If you don’t already have an account, sign up for a free-trial now.
Easily set a start/end date & time for when alerts should be suppressed
Specify all or specific integrations the maintenance window affects
Easily enable/disable/delete the maintenance window when complete
A maintenance window can be quickly created via an integration page.
Click the Maintenance Window dropdown (top right actions).
The maintenance window will be created and you will be redirected to the integration page.
You can find the full documentation on maintenance windows here.
Learn how schedule rotations make it easy to rotate a list of users that need to go on-call for 24/7 support.
Today, we are excited to announce PagerTree now officially supports schedule rotations! A long awaited feature and requested by many customers, with schedule rotations it’s now easier than ever to schedule a list (or “rotation”) of people for full coverage support.
Schedule rotations are available on our Pro and Elite pricing plans and are technically a subset of our “recurring schedules” feature. If you don’t already have an account, sign up for a free-trial now.
Schedule Rotation Features:
Easily schedule a list of users that should “rotate” through the calendar
Select a custom frequency for the rotation
Easily edit/clone/delete an entire rotation
Configuring rotations is easy!
On the calendar, use the cross hairs to select a date range for the length of the initial event.
Configure the event:
Select multiple users for the rotation.
Check the repeat flag
Check the rotation flag
Drag and drop the user to change the order of the rotation
Voila! Just like that you have a “rotating” schedule.
In the future if you ever need to update, clone, or delete the rotation:
Double click any of the rotation events.
Click the Pencil Icon.
Click All Events (in this series)
Modify any of the attributes. (This could be layer, rotation order, frequency, etc.)
Click Save button.
Route incoming phone calls to the right person with the on-call schedules and escalation policies you already use. Add real-time conversations to your support workflow!
Use the Email to Slack integration to keep an entire stakeholder channel up to date during on-going incidents.
To get started, you’ll need several things.
From your Slack workspace:
In the left hand menu, click the + button, next to the Channels.
In the Create a channel form, let’s add the name of the group to be stakeholders, then click the Create Channel button.
Click the Install Button.
Select the #stakeholders channel we just created above, then click Add Email Integration
Copy the email address that slack provides you.
Click Save Integration
Paste the Slack email address in the additional emails section of the PagerTree stakeholder.
Click Create.
🚀 Now test it out by manually creating an incident in PagerTree. You should see a notification appear in the #stakeholders channel in Slack.
I hope you found this useful, and can use it to keep stakeholders informed during ongoing incidents.
–Austin
Docker Commands Cheat Sheet- A quick reference guide to Docker CLI commands used on a daily basis: usage, examples, snippets and links.
By no means is this an extensive list of docker commands. I kept it short on purpose so you could use it as a quick reference guide. I’ve also omitted the topic of building images and the commands that are associated with that.
Command: docker ps
Description: Show running containers.
Normally I will use this command just as often as I use the ls command on a *NIX terminal. It’s especially useful when you first SSH to a machine to check what’s running. It’s also useful during the development and configuration process, since you’ll likely have containers stopping/starting/crashing.
Command: docker exec -it <container name> /bin/sh
Description: SSH into a container.
This is probably my 2nd most popular command. Normally I am using this while trying to debug a container and need to shell into it. Just note the -i flag means interactive and -t means TTY (aka a teletype terminal). Also, you can use any command instead of /bin/sh; I only put that here because I frequently SSH into an alpine image, which doesn’t support bash.
Command: docker restart <container name>
Description: Restart a container.
Command: docker stats
Description: Display a live stream of running containers usage statistics.
Command: docker system df
Description: Display information about disk space being used by your containers.
Command: docker system prune -af
Description: Remove all unused images (dangling and unreferenced), containers, networks, and volumes.
You’ll probably only use this command on a Docker build machine or on your dev box, nevertheless take note, cause you are likely to use it.
You can download this entire blog article with the "Export as PDF" link in the top right of this page (you might have to be on the desktop version). Additionally, below I've provided some PDFs from the web I have found useful.
Select the duration the maintenance window should span.
Today, we are excited to announce PagerTree now officially supports Live Call Routing! With Live Call Routing you can route incoming phone calls to the right person using your existing on-call schedules and escalation policies.
Purchase and manage global phone numbers using
Route incoming phone calls to PagerTree teams with and
Better yet, Live Call Routing can be configured to route to multiple teams. When configured to route to multiple teams, callers will be presented with team options. For example: “Press 1 for Devops, Press 2 for Network Operations, Press 3 for Security”.
Live Call Routing is available today. To get set up, make sure to follow the documentation.
Recently, while working with a customer, I saw a really cool use of the Email to Slack integration to send messages into a Slack channel. They did this by using Slack’s Email integration along with PagerTree’s Stakeholder notifications. Today I’d like to share with you how they did it.
A PagerTree subscription with the Stakeholders feature.
A Slack account with a workspace.
In your browser, navigate to the .
Enable .
group.
In this article I will highlight the 6 key Docker commands I use on a daily basis while using Docker in the real world.
At the bottom of the page, I’ll also put some good links to other Docker resources I like or frequently use, as well as other PagerTree cheat sheet documents. For a full crash course, check out the links below!
Official Docs:
Official Docs:
Official Docs:
I debated putting this command in here, since I don’t use it all that often, but it’s a nice to have. A great example of when to use this: you change a configuration file and need the container to pick up the changes.
Other commands you might use often, but that I didn’t think were worthy of their own section, are docker start and docker stop. You’ll use these commands normally when setting up or testing images, and you’ll likely use a lot of flags. I didn’t think they were so applicable because you should honestly be using docker compose or some other orchestration system (like Kubernetes or Docker Swarm) to launch your containers.
Official Docs:
I’m normally using this command when I am trying to figure out optimal resource limits for containers. You might also use this if you are debugging which container is using most of your host’s resources.
Official Docs:
This one doesn’t come up too often, but it has, especially when you are building lots of images on a box or you are storing lots of data. If you are, you might consider setting up a cron job to prune your images.
Official Docs:
Also, just like mentioned above, if this is a build box, consider setting up a cron job to prune your images. If you’re a cron syntax noob like me, you might find a cron expression helper of use in understanding the syntax and shortcuts.
- Run an IPsec VPN server. Super useful if you work from cafes like Starbucks.
- Out of the box Prometheus and Grafana setup. We actually use a fork of this for monitoring the platform.
- Slim OS image for Node.js apps in production.
- A GitLab Runner inside a docker container. We use this at PagerTree to build our images.
- A quick and concise overview of all the pieces of Docker you need to know.
- If you’re new to Docker this is a great crash course. Starts from installing Docker all the way to docker compose. It can be a lot to take in so you might have to read it a couple times.
- Super helpful writeup on setting up the GitLab Runner for your own CI/CD pipeline. Used this tutorial extensively when setting up PagerTree’s CI/CD.
| Command | Short Description |
|---|---|
| docker ps | List running containers |
| docker exec -it <container name> /bin/sh | SSH into container |
| docker restart <container name> | Restart a container |
| docker stats | Show running container stats |
| docker system df | Check docker daemon disk space usage |
| docker system prune -af | Remove images, networks, containers, and volumes |
Over the past decade, multiple scientific studies have confirmed what we in DevOps have known for ages, being on-call is a pain! But just how bad is it?
Over the past decade, multiple scientific studies have confirmed what we in DevOps have known for ages: Being on-call is a pain! But just how bad is it?
After a long night on-call, we’re bound to be just a little bit on edge. A bit snappier with the kids, a little bit snarkier with our colleagues. Some people think we’re just grumpy and ornery, but as it turns out, there’s a pretty legitimate reason for it. Studies show that when we’re on call, we tend to start the day with increased cortisol levels. That’s right. Cortisol. Our favorite stress hormone.
Now to be fair, Cortisol can be a good thing. It’s the hormone that drives our fight or flight response and gets us off our butts and moving throughout the day. But too much cortisol, and it puts us in a fightin’ mood. Studies have even shown that heightened cortisol levels over extended periods of time can contribute to some pretty unpleasant health issues!
As if waking up stressed wasn’t bad enough, being on-call also affects our mood. Participants in the study were more likely to feel unpleasant, restless, and without energy after a night on-call. It’s a bit of a paradox; that feeling of being restless AND without energy at the same time. But we’ve all been there, haven’t we? We’re too exhausted to collaborate, but when we actually sit down at our desks we’re too restless to focus. Wired but tired, we bounce back and forth all day trying to figure out what the issue is, and we finally just chalk it up to an off day at work.
Of course getting an alert at three in the morning is going to disrupt your sleep. But what you probably didn’t know is that getting that call might actually be the preferred scenario. Sure it’s going to ruin your night, but at least your manager (and your team) knows you were up late resolving an incident. And hopefully you’re getting appropriate compensation, kudos for saving the day, and a bit of a pass for being a bit on edge the following day.
But the painfully unappreciated scenario is actually what happens every other night when your rest gets ruined by the mere anticipation of getting a call. Studies repeatedly show that on-call employees experience disrupted sleep and poor quality rest regardless of whether or not a call is actually received. But alas, no one says “thanks for anticipating a call last night”...
When you put it all together, on-call is even worse than we thought! If it was just the actual incident that was disruptive, at least those don’t happen too often. But the science is conclusively telling us that the mere possibility that you might get a call, regardless of whether or not it happens, is painful! Just the anticipation of an incident is enough to keep us on our toes, in work-mode, and unable to rest and refresh. The lingering effects of on-call spill over to the next day, and the next, and the next, leaving us stressed out, restless, and exhausted.
Chances are, either you or your team is currently suffering through the effects of on-call scheduling. But systems don’t wait until the morning shift to crash, and they certainly don’t fix themselves! So what can we do?
The studies indicate that employees who were able to detach themselves from work demonstrated the ability to rest and refresh even while on-call. Since the mere anticipation of a call is enough to increase stress, decrease energy, and disrupt sleep, empowering employees to truly disconnect until they’re needed frees them from the dreaded anticipation. It’s common sense, really. When employees are free to take their eyes off the phone and actually be present with family and friends, they’re more likely to feel refreshed even after a night on-call.
This ability to detach affects sleep quality too. For example, how well do you sleep when you’re anxious about missing an alarm? Chances are, you’re subconsciously hesitant to enter into deep sleep, and instead, you drift in and out constantly glancing at the clock. But what happens if you set a backup alarm, or better yet stagger three alarms? The redundancy allows for peace of mind, which allows you to detach: worry less, sleep more.
It’s the same idea with on-call scheduling. When you’re the only guy on-call and you’re one missed email away from a SEV-1 production outage, of course you’re going to be anxiously tethered to your phone. But add in multi-channel notifications and smart escalation rules, and all of a sudden you’re not feeling so alone. You’ve got redundancy, and you’ve got backup. As it turns out, multi-channel notifications and smart escalation rules not only improve mean time to resolution (MTTR), but can also help your teams get a better night’s rest.
The second mitigating factor to offset the anxiety of being on-call was that of control. When on-call employees are confident they’ll be able to resolve an incident, they’re less likely to expend energy dreading the call. If it comes, it comes - they’ve got it handled. Similar to detachment, the feeling of control allows on-call employees to spend more time enjoying their evenings and less time worrying.
Short of constantly assigning your most senior developers, how do you empower your employees to be in control? Intelligent call routing with configurable teaming allows you to send the right incidents to the right teams at the right time. No need to have a one-developer-fix-all model any longer. Getting the right incidents to the right teams not only ensures higher quality work, but as studies show, on-call employees recover more quickly from a night on-call when they’re confident they’ll be operating within their area of expertise.
Lastly, it’s important to know who’s on call and how often they’re being asked to jump in and help. Maintaining clear lines of communication with your team and evenly distributing on-call shifts not only promote transparency and a sense of shared camaraderie, but also help to reduce developer burnout over time.
Recent studies have clearly demonstrated the negative effects of being on-call, and the results aren’t pretty. Studies show that the mere anticipation of receiving a call is enough to increase stress, decrease energy, and disrupt sleep. When you’re on-call, your inability to rest and refresh can have severe consequences when sustained over time.
Dev managers can help their employees better recover from a night on-call by empowering them to detach and be confident during their on-call shifts. On-call scheduling done well can provide the necessary infrastructure to help mitigate the negative effects of on-call.
When you’re operating within your realm of expertise with added layers of redundancy and backup, you can finally put down that phone, enjoy dinner with family, and get some much needed rest.
Discover what serverless technology is, what it is not, and some of the pros & cons of a serverless architecture.
In this post we’ll answer the following questions:
What is serverless architecture? (and what it’s not)
What are the pros & cons of serverless?
If you already know these things, feel free to skip ahead to other posts in this series:
Depending on where you look on the internet, you’ll get different answers. For example:
Wikipedia defines serverless computing as a “cloud computing execution model in which the cloud provider dynamically manages the allocation of machine resources. Pricing is based on the actual amount of resources consumed by an application, rather than on pre-purchased units of capacity”.
Amazon Web Services defines serverless computing and applications as “Build and run applications without thinking about servers”.
In my opinion, serverless is best summarized as:
Serverless is a cloud architecture in which resource allocation, maintenance, and high availability are managed by the cloud provider.
Serverless is not Docker, nor a virtual machine, and it’s not code that runs without a server. The term “serverless” is a misnomer, since any application still has to run on some sort of computing machine. The name is catchy, but what it really conveys is that serverless abstracts away server resource allocation. It’s a nice thought, to be able to run applications without servers, but as of this writing, the technology isn’t there yet. One must still think about certain structural components when developing serverless apps.
I found this Twitter post by @kelseyhightower helpful in visualizing where serverless actually falls.
There are many pros & cons to the serverless architecture, and whether you are a startup or a large organization, you can benefit from a serverless application.
Zero downtime deployment - This is perhaps one of the biggest pros to serverless: you don’t have to think about or architect highly available services. High availability is baked in. In fact, I think this is so important I wrote another blog post just about it. Make sure to check out Part 2: Serverless Scales.
Faster deployments and quicker time to market - Because you don’t have to worry about infrastructure or maintenance, you can focus more of your time on business logic and quick iterations. This means a quicker time to market and, naturally, a leaner development lifecycle.
Reduced costs - This is a double-edged sword. In most cases you can reduce your costs by several factors, but if you have a consistent load, serverless could actually be more expensive. Make sure to read Part 3: Serverless Costs to understand the total cost of a serverless application.
Less Infrastructure & Maintenance - Another really big pro is that you don’t have to maintain the infrastructure. Your cloud provider handles updates and network management. In most cases you’ll get security updates before you even know the vulnerabilities exist. For example, many serverless applications were protected against the Spectre vulnerability before their owners knew it existed.
Great for event-driven applications - Serverless is a perfect use case for event-driven applications. By chaining events, you only pay for the execution time in response to those events (see the sketch below). In a classic setup, you would pay for a server to be available 24/7 until an event needed processing.
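To make that concrete, here is a minimal sketch of an event-driven function on AWS Lambda in Node.js; the S3 trigger and bucket wiring are assumptions for illustration:

// Minimal event-driven serverless function (AWS Lambda, Node.js).
// The S3 trigger is assumed for illustration - the function only runs
// (and only bills) when an object is uploaded.
exports.handler = async (event) => {
  for (const record of event.Records) {
    // Each record describes one uploaded object
    console.log(`Processing ${record.s3.object.key} from ${record.s3.bucket.name}`);
  }
  return { statusCode: 200 };
};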
Multi-Tenancy (Security) - For many businesses and applications this can be a big drawback. When running serverless, you are never on a dedicated machine; you share physical resources with other customers. This can be a big deal, especially for sensitive data.
Vendor lock-in - You’ll want to make sure your application is not dependent on any one vendor. In Part 4: Serverless Tools & Best Practices I’ll talk about how to keep your application insulated from specific vendors. You want to make sure your application can run anywhere, both in a serverless and a classic environment.
System-wide limits - Depending on your application, it can be easy to reach system-wide limits, such as concurrent serverless executions. This is especially common when using the same cloud account for development and production. Many people have accidentally DDoSed themselves by running load tests in the development environment, effectively starving the production environment of resources.
No dedicated hardware options - If you need specific hardware for your application, serverless does not offer you any choices beyond the amount of RAM.
Debugging - While not impossible, debugging can be challenging, especially if you rely on monitoring agents (see the sketch below).
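Since there is no long-lived process to attach a debugger or agent to, structured logging to the platform’s log stream (CloudWatch Logs on AWS) is often the most practical workaround. A rough sketch, with illustrative field names:

// Emit one JSON log line per event so the platform's log search can filter them.
exports.handler = async (event, context) => {
  const log = (level, msg, extra = {}) =>
    console.log(JSON.stringify({ level, msg, requestId: context.awsRequestId, ...extra }));

  log('info', 'invocation started', { eventKeys: Object.keys(event) });
  try {
    // ... business logic goes here ...
    return { statusCode: 200 };
  } catch (err) {
    log('error', 'invocation failed', { error: err.message });
    throw err; // rethrow so the platform records the failure
  }
};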
While there are many pros & cons, don’t feel daunted by the task of evaluating whether serverless is right for you.
In conclusion, serverless can be defined as a cloud architecture in which resource allocation, maintenance, and high availability are managed by the cloud provider. There are many pros & cons to a serverless architecture, and as we’ll see in my next post, Part 2: Serverless Scales, serverless can actually simplify many elements of a traditional architecture.
Learn how serverless architectures scale and handle high availability. Compare serverless architectures to classic N-tier architectures.
In Part 1: What is Serverless? I talked about how one of the biggest pros to a serverless architecture is how well it scales and how high availability is baked in.
In this post I’ll go over:
How a traditional highly available scalable architecture works
How a scalable serverless architecture works
How you can benefit from a serverless architecture
First, I would like to be clear that I am not advocating that a serverless architecture is always better than a traditional architecture. Each has its purpose. In this article I will be highlighting how you can benefit from a serverless architecture.
In a traditional highly available, scalable architecture, a developer or architect will have to think about many key components:
Networks & Availability Zones
Load Balancers
Scaling Triggers
Security (ex: DMZ or Web layer that handles authentication/authorization)
Most commonly, these components are put together to create what is known as an N-Tier or Multitier Architecture. Below I will talk about 2- and 3-tier architectures to highlight their differences and complexities. This will help us better understand how a serverless architecture can simplify a highly available, scalable architecture.
In this simple 2-tier architecture, you’ll notice we already have to account for a load balancer and at least 2 availability zones to make this a highly available application. To be even safer, many would argue you want 3 availability zones, with at least 3 servers always running. This ensures that even during a zero downtime deployment your application stays highly available.
This 2-tier architecture still doesn’t address security concerns like a DMZ/Web layer. For this reason, most organizations will implement what is known as a 3-tier architecture. To implement it, we must add another load balancer and another layer of applications.
In this common 3-tier architecture we have added 1 more load balancer and at least 3 more servers. By doing so, we have added key benefits to our architecture, including a DMZ/Web layer and scaling at both the Web & App layers. However, we have also roughly doubled the complexity of our system.
By using a serverless architecture (like the one shown above), we have removed the complexity of availability zones, load balancers, scaling triggers, and a DMZ. The nature of a serverless function is that the high availability responsibility is now managed by the cloud provider.
Now you might be asking where the security layer went. The API gateway will handle authentication and authorization with some extra configuration (a sketch follows below). Because the serverless functions are created and destroyed on each execution, attackers cannot infect your servers, since you essentially have none.
Granted, there still might be vulnerabilities that have not been made public or patched that attackers can exploit; however, keeping the platform secure is now the cloud provider’s responsibility.
Further, you still could have security flaws in your application that allow it to leak data, but this is an application issue, not an architecture issue.
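For example, on AWS the extra configuration could be a Lambda authorizer that API Gateway invokes before your application code runs. A minimal sketch (the token comparison is a placeholder, not a real validation scheme):

// API Gateway TOKEN authorizer - the gateway calls this before your app code.
exports.authorizer = async (event) => {
  const allowed = event.authorizationToken === process.env.API_SECRET; // placeholder check
  return {
    principalId: 'caller',
    policyDocument: {
      Version: '2012-10-17',
      Statement: [{
        Action: 'execute-api:Invoke',
        Effect: allowed ? 'Allow' : 'Deny',
        Resource: event.methodArn,
      }],
    },
  };
};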
As we can see from above, by using a serverless architecture, we can benefit in many ways:
Baked in high availability
Reduced complexity
Costs
Costs are something I have not yet covered, since they can be a double-edged sword. To get a deep dive on serverless costs, make sure you read Part 3: Serverless Costs, where I analyze when to use a serverless architecture for cost benefits. However, for most web applications, you will see a significant cost reduction, especially if usage is infrequent or sporadic.
In conclusion, there are many benefits such as reduced complexity, baked in high availability, and reduced costs when using a serverless architecture. Make sure to check out the next post in this series Part 3: Serverless Costs where I dive deep into the costs of serverless.
Learn by example in this tutorial creating a serverless slack command using Node.js and Up. Learn best practices for developing serverless applications.
Throughout this series we have been exploring how to use serverless architectures to our advantage. In this article I will show you:
When I first started experimenting with serverless technology, I was amazed at the complexity of managing individual functions and environments. At first I scratched my head and thought, “How can the cloud providers actually believe people will adopt this?” But that was 2015; documentation was light and tools were in their infancy. Since then, things have changed dramatically. Today, I am excited to show you just how much easier the serverless development experience has become.
I love Up for many reasons:
Learning curve is ultra low
Common use cases like static sites or REST APIs are its bread and butter
For existing projects, this is the simplest “Lift & Shift” operation I have seen
It’s Free & Open Source
For this tutorial you will need accounts on these two platforms: Slack and AWS (Up deploys to AWS Lambda and API Gateway).
And for brevity of this blog post, I’ll just trust that you can get these setup items done: Node.js installed, the Up CLI installed, and your AWS credentials configured.
Once your app is created, click the Slack Commands from the left hand navigation menu. Then click the Create New Command button. Fill in the details like so:
Keep this window open; we’ll come back to it.
Once you have downloaded the code, let’s make sure we install its dependencies:
npm install
You’ll now want to run the following commands:
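# A sketch of the standard Up workflow (both are real Up CLI commands;
# the exact invocation in the original may have differed):
up       # builds and deploys the app to AWS Lambda + API Gateway
up url   # prints the deployed API endpoint - copy it for the next step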
Now that we have our API endpoint copied, go back to the Slack slash command page and paste the url in the Request URL field, and then click the Save button.
Next click the Install App from the left hand menu and click the Install App to Workspace button. You’ll be redirected to an authorization page. Click the Authorize button.
Now go to your Slack workspace and give it a try. The first time it might fail because of a “cold start” taking too long to respond, but the invocations after that should work just fine.
From my experience, here are some tips and best practices when writing your serverless code:
Write libraries, not functions - This compartmentalizes logic in a library that you can take to other projects. The serverless code should really only be an adapter to the serverless platform (see the sketch after this list)
Don’t count on background processing - If you’re an async fan, make sure all your deferred executions have finished before your function returns. (This is a common mistake during “Lift & Shift” operations)
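A minimal sketch of both tips together, with hypothetical file and function names: the business logic lives in a plain module, and the handler is a thin adapter that awaits its deferred work before returning.

// lib/ping.js (hypothetical module) - plain Node code with no serverless
// dependencies, so it can move to any project or run on a classic server.
const https = require('https');

function timeRequest(url) {
  return new Promise((resolve, reject) => {
    const start = Date.now();
    https.get(url, (res) => {
      res.resume(); // drain the body so the connection can close
      res.on('end', () => resolve(Date.now() - start));
    }).on('error', reject);
  });
}

module.exports = { timeRequest };

// handler.js - the only platform-specific code is this thin adapter, and it
// awaits the async work so nothing is left running when the function returns:
// const { timeRequest } = require('./lib/ping');
// exports.handler = async (event) => ({ ms: await timeRequest(event.url) });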
Below I have put together a curated list of resources for your serverless knowledge:
In this tutorial you should have successfully created a slack command using Node.js and Up, learned best practices when it comes to developing serverless applications, and now have resources to do further reading.
I really hope you have enjoyed reading this series and were able to learn something. If you liked this one, and didn’t get a chance to read the others in the series make sure you do.
Learn about the hidden costs of serverless, and how to perform a cost analysis to understand the total cost of a serverless application.
How to approach and analyze the cost of serverless
Two detailed examples of a cost analysis
As we’ll see below, there are several questions you’ll need to ask yourself when making the choice to go serverless. Some costs are easy to associate with dollars and cents, others not so much. Make sure you consider these four things when analyzing the total cost of serverless.
This has to be by far the most important question when performing an analysis. Does it make sense for your application to go serverless? If it’s a sporadic process (e.g., sending a weekly report), it might. However, if the application runs a consistent load (e.g., a Bitcoin miner), it might not.
The simple chart below shows the break-even cost analysis.
The blue line represents the cost of an EC2 (512MB)
The grey line represents the cost of a Lambda function (512MB)
The orange line represents the inverse Lambda cost. (This is just the reverse grey line. We use this to solve for the intersection.)
As you can see, if we compare apples to apples, a 512MB server ($0.0058/hr) vs. a 512MB Lambda ($0.03/hr), the EC2 server is always cheaper (blue vs. grey line). However, if you consider the sporadic nature of a serverless function, you can actually run a serverless function for up to 4 days of execution time and still be more cost-effective than running a server full time (blue vs. orange line).
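If you want to run the same break-even estimate with your own prices, here is a quick sketch. The figures are the illustrative ones from the chart, and this is compute cost only; request and data transfer charges will pull the break-even point down.

// Back-of-the-envelope break-even: how many hours of Lambda execution per
// month cost the same as one small always-on server?
const ec2PerHour = 0.0058;   // 512MB EC2-equivalent, per hour
const lambdaPerHour = 0.03;  // 512MB Lambda, per hour of execution time

const serverMonthly = ec2PerHour * 730;                // ~$4.23/month, always on
const breakEvenHours = serverMonthly / lambdaPerHour;  // execution hours/month
console.log(`Break even at ~${breakEvenHours.toFixed(0)} Lambda execution hours/month`);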
For smaller applications, you won’t have to worry about this and can most likely run inside the cloud provider’s free tier. However, for request-intensive applications or applications serving big files, the costs can get out of hand fairly quickly. Make sure to do your own due diligence: estimate what your request load could look like, then forecast your costs accordingly.
API Gateway Cost = 3M requests * $3.50/1M requests = $10.50
API Gateway Data Transfer = (3M requests * 1KB) * $0.09/GB = $0.27
Lambda Charges = 3M executions * 300ms * $0.000000417/100ms = $3.75
Total Cost = $14.52
That gets you a fully functioning voting application that is highly available.
EC2 Cost = 3 servers * $0.0058/hour * 730 hours = $12.70
Load Balancer = $0.0252/hour * 730 hours = $18.39
LCU (Data flow) = $0.008/hour * 730 hours = $5.84
Total Cost (2 tier) = $36.93
Total Cost (3 tier) = $73.86
As you can see, it is actually more cost-effective to run the voting application in a serverless environment. Depending on the architecture, it can be anywhere from 2x to 5x more cost-effective to run serverless.
You have a legacy application (a meme generator) that your boss wants you to take serverless. The application receives some text, overlays it on an image, and returns it to the user. The application needs a lot of memory (2GB) to run and takes approximately 2 seconds to generate a meme that’s about 1MB in size. Your application is fairly popular and receives 30M requests per month.
API Gateway Cost = 30M requests * $3.50/1M requests = $105
API Gateway Data Transfer = (30M requests * 1MB) * $0.09/GB = $2,700
Lambda Charges = 30M executions * 2s * $0.000003334/100ms = $2,000.40
Total Cost = $4,805.40
Notice how quickly the data transfer charges added up.
This setup will successfully run your legacy meme generator, be highly available and load balanced.
In order to architect a classic setup, we’ll need to figure out how much compute power we’ll need. On average, we’ll receive 11.57 requests per second (1M requests per day / 86,400 seconds per day).
Since each request takes approximately 2 seconds and 2GB of memory, we’ll need roughly the capacity to handle double that (about 23 concurrent requests at 2GB each). Looking at the EC2 pricing page, we will use 23 t2.small instances to handle our load.
EC2 Cost = 23 servers * $0.023/hour * 730 hours = $386.17
Load Balancer = $0.0252/hour * 730 hours = $18.39
LCU (Data flow) = (~41GB/hour) * $0.008/hour * 730 hours = $239.44
Total Cost (2 tier) = $644
Total Cost (3 tier) = $1,288
In this example, it’s actually more expensive to run the meme generator application in a serverless environment. Depending on your architecture choice, it is between 4x and 7x more expensive.
In summary, a serverless architecture can save you money, especially if your application has little traffic or is sporadic in nature. A serverless architecture can also be more expensive, especially for network-heavy or compute-intensive applications. Always make sure you do your due diligence by analyzing your needs and the total cost of ownership for a serverless application.
How to create a serverless slack command using Node.js & Up
Best practices when developing serverless applications
A curated list of serverless resources
In the following tutorial we will be creating a serverless slack command that pings a url and checks how long the website takes to respond. We will be writing the code using Node.js and using a tool called Up to manage our serverless application deployment.
Supports common languages like Node.js, Golang, & Python
The project is maintained by TJ Holowaychuk; he’s built a ton of other tools that you have most likely either directly or indirectly used. He offers a Pro version for $20/month ($10/month forever with this coupon code: am-376E79B073F3) that I think is well worth the money. I specifically like the active warming feature (it combats “cold starts”); it comes with a slew of other features like encrypted environment variables, instant rollbacks, asset acceleration, and alerting that the free version just doesn’t have. You can find all the details on the project’s site.
The first thing we’ll do is create our Slack app. For this you’ll want to create a new app by going to https://api.slack.com/apps and clicking the Create New App button. Give your app a name, and select the workspace it should live in. Then click Create App.
To keep this tutorial simple, I have posted all the code on GitHub. If you really enjoy reading the code, the two main files are:
app.js - code that actually handles the http request from slack and pings the url
up.json - configuration for our project
Minimize cold starts - If your application is customer facing, minimize cold starts by investing in a warming solution (like Up Pro’s active warming). It’s a terrible customer experience to have a request time out or be slow.
Up - Deploy infinitely scalable serverless apps, apis, and sites in seconds.
Serverless Framework - This was the first tool I used; it’s fairly quick to get set up, but IMHO you have to configure too many settings.
Apex - Another tool by TJ, but this is a toolset more for individual functions rather than entire apps.
In Part 2: Serverless Scales I briefly touched on how a serverless architecture can have a cost benefit. In this post, I will go over:
Note: I will make assumptions on costs based on AWS pricing as of 3/19/2018 (see Lambda, API Gateway, and EC2 pricing). Different cloud providers offer different pricing for different solutions and services. It’s always a good idea to explore your options. Here is a comparison of costs across 4 of the big cloud providers.
You’ll also want to consider what acceptable performance is for your application. There are some gotchas to serverless, specifically “cold starts” (AKA the time it takes to boot your code) that affect the performance of your application. There are ways to counter the gotchas, but you will have to incorporate these into your serverless design. I’ll talk more on how to combat these in Part 4: Serverless Tools & Best Practices.
If you are hosting a serverless web application, you’ll want to take a look at the total cost of ownership. If we look at a quick cost breakdown between API Gateway and Lambda, it’s clear that if you use serverless in conjunction with an API gateway, the API gateway service will be your biggest cost, especially if your app serves large files:
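Roughly, using the AWS prices from the examples below (exact rates vary by region and change over time):

API Gateway = $3.50 per 1M requests + $0.09/GB data transfer out
Lambda = $0.20 per 1M requests + compute time (e.g., $0.000000417 per 100ms at 256MB)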
As much as the cloud providers want to advertise that the migration to serverless is a simple “Lift & Shift”, there are many gotchas when it comes to going serverless. Application code will most likely need to be refactored, especially if the application relies on background processing. You will also need to factor in the cost of testing and of having developers with a serverless skillset to diagnose any issues that might arise. In Part 4: Serverless Tools & Best Practices, I will show you how to minimize code maintenance by using some tools and working through a tutorial.
You want to host a small web application that counts votes. On average, per month, you receive 3M votes, each resulting in a 1KB response that takes 300ms on a 256MB Lambda.
A similar setup with a classic architecture that includes 1 load balancer and 3 of the smallest EC2 (t2.nano) instances (for high availability, see Part 2: Serverless Scales) would cost you:
Make sure to check out the next post in this series, Part 4: Serverless Tools & Best Practices, where I’ll show you how to build a serverless slack command.
A simple 10 minute tutorial to setup a Prometheus monitoring stack. Create a Docker stack that includes Prometheus, Grafana, and AlertManager with a PagerTree integration.
In this post, I will walk you through creating a simple Prometheus monitoring stack, connecting it to Grafana for pretty dashboards, and finally configuring alerts via PagerTree.
If you would like a video to follow along instead, you can see it on YouTube. You can find all the code for this stack on Github.
The first thing we’ll do is get a machine up and running for this solution. This tutorial assumes you will be using Ubuntu 16.04.
I like Digital Ocean for small tutorials like this one.
If you don’t already have an account, use this link to create an account and get $10 in credits.
If you don’t know how to create a Digital Ocean droplet or SSH into the machine you can follow this article on Medium.
Once you’ve created the Ubuntu server, run the following command in the shell terminal:
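# Hypothetical sketch - the exact repo URL, compose file, and stack name come
# from the GitHub link above; Docker and Swarm mode are assumed to be installed.
git clone https://github.com/<the-repo-from-above>.git prometheus-stack
cd prometheus-stack
docker swarm init                              # the stack runs on Docker Swarm
docker stack deploy -c docker-stack.yml prom   # deploys Prometheus, Grafana & Alertmanager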
At this point you’ll have automagically deployed the entire Prometheus, Grafana, and Alert Manager stack. You can now access the Grafana dashboard from your browser at:
Address: http://<Host IP Address>:3000
Username: admin
Password: 9uT46ZKE
Since the release of Grafana 5.x, Grafana supports auto provisioning data sources and dashboards. We’ve updated the repo for Grafana to auto provision the Prometheus data source and dashboards. Please continue to the next section, Grafana Dashboards.
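For reference, a Grafana 5.x datasource provisioning file looks roughly like this (the Prometheus URL assumes the stack’s internal Docker DNS name; the repo’s actual file may differ):

# grafana/provisioning/datasources/prometheus.yml (illustrative path)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true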
Awesome! Now if you navigate to the Dashboards in Grafana, you will see data populating and some nice-looking graphs.
At this point you’ll see 2 dashboards. They are pretty cool. Check them out. When you’re ready, head down to the Configure Alerts section.
Ping Dashboard
This dashboard monitors a couple websites for uptime.
System Monitor Dashboard
This dashboard monitors the load on the machine that is running your Prometheus stack.
Now while the dashboards are cool, it would be even cooler if we were able to get alerted when something went wrong. Luckily for us, this project will create an alert after 30 seconds of high CPU. So let’s try to make use of it.
Create a new integration.
Click the Prometheus Logo.
Fill out the following:
Name
Appropriate urgency for the Prometheus alerts
A team alerts from Prometheus should be assigned to
Click Create button
Copy the endpoint URL
Ensure that for the team you are assigning alerts to, you are the Layer 1 on-call and that you have at least 1 notification method set up.
Now we want to modify the alert manager configuration to make use of our PagerTree Webhook. Run the following command and make sure to replace <Your PagerTree Webhook URL>
with the endpoint URL you copied.
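# Hypothetical - the repo's configuration script substitutes the webhook URL
# into the Alertmanager config; the script name here is a placeholder.
./configure-alertmanager.sh "<Your PagerTree Webhook URL>"

Whatever the script’s exact name, the end state is an Alertmanager config with a webhook receiver pointing at PagerTree, roughly (receiver name illustrative):

route:
  receiver: pagertree
receivers:
  - name: pagertree
    webhook_configs:
      - url: '<Your PagerTree Webhook URL>'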
After you have run the configuration script, restart the stack with the following command:
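# Hypothetical - re-deploying the stack picks up the new Alertmanager config;
# "prom" is the stack name assumed in the deploy step above.
docker stack rm prom && docker stack deploy -c docker-stack.yml prom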
Sometimes this command fails. If it does, just run the command again.
In order for us to get an alert, we’ll want to simulate some sort of Alert Worthy Incident. From the shell terminal, run the following command:
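# Hypothetical load generator - any command that pins the CPU will do.
# This pegs one core at 100%; stop it with Ctrl+C once the alert fires.
yes > /dev/null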
Now we’ll wait for 30 seconds or so, and if you’ve followed all the steps correctly you should get a notification saying something like Instance {{ $labels.instance }} under high load.
If you are reading this, give yourself a pat on the back. Good job! You’ve successfully deployed a Prometheus monitoring system, hooked it up to Grafana, and configured alerts to go to your PagerTree account.
This project is intended just to be a quick tutorial. Before it is production worthy, several security considerations should be addressed.