Learn about what makes a great incident management tool and about 5 alternatives to the market leader, PagerDuty.
TL;DR: Here’s the shortlist of the top 5 best PagerDuty alternatives in 2024: PagerTree, OpsGenie, iLert, Splunk On-Call, and OnPage.
But PagerDuty can be expensive, has a steep learning curve, and its customer support can be lacking. We’ve compiled a list of the top 5 best PagerDuty alternatives, looking specifically at features, reliability, and pricing.
PagerDuty offers a robust incident management solution. However, there are many reasons why people might explore alternatives.
Cost - PagerDuty paid plans start at $25/user/month. Cost-conscious organizations will be looking for more cost-effective alternatives. (Note: when comparing pricing below, we will compare plans that support Single Sign-On.)
Feature Set - PagerDuty offers many features. Some might even call it feature bloat. Most organizations only need the core subset, like on-call, escalations, and notifications. Why pay for features you don’t use?
Learning Curve - PagerDuty can be complex to set up and learn. Most teams will want a solution that is easy to set up and simple to understand. Time to go live is critical for small and agile teams.
Customer Service - PagerDuty is the largest and most well-known brand for on-call incident management software. However, their customer service and support are lacking, especially for small accounts. Businesses will be looking for the best customer support possible.
When evaluating PagerDuty alternatives, consider the following criteria to make the best decision possible:
Ease of Use - Is the tool easy to use and understandable? How easy is it to onboard new users? You don’t want to be confused when figuring out who is on-call, especially during an incident.
Integration Capabilities - Ensure the tool integrates with your team’s existing toolset.
Scalability - Can the tool scale with your organization? Price, feature set, and onboarding should all be weighed when evaluating scalability.
Support and Training - Does the tool have easy-to-read support and training documentation?
By carefully assessing these criteria, you can identify the best PagerDuty alternative for your team's needs.
Here's our list of the top 5 best PagerDuty alternatives in 2024.
PagerTree’s Tagline: “On-Call. Simplified. - PagerTree empowers teams to share on-call responsibility and respond faster when incidents occur.”
PagerTree’s value proposition: On-call scheduling, escalations, and alert notifications starting at $15/user/month.
OpsGenie has core features like on-call scheduling, escalations, and notifications. Their “Standard” price plan ($23/user/month) includes features like integrations and reports. Advanced features like live call routing are provided for a $10 upcharge per number.
OpsGenie was acquired by Atlassian in 2018. Since then, its reliability has been called into question. In April 2022, OpsGenie had a 2-week (that’s not a misprint, yes, 14 days) outage. Unlucky customers were forced to move to an alternative solution. Otherwise, they had to wait until their account was prioritized for restoration.
OpsGenie Tagline: “On-call and alert management to keep services always on.”
OpsGenie Value Proposition: “Centralized alert management starting at $23/user/month”
iLert offers all the core features like on-call scheduling, escalations, and notifications. Additionally, iLert offers status pages and live call routing for an additional fee (+$5/user/month).
If you are a European customer or have strict data storage requirements, iLert might be for you. iLert is made and supported in Germany. If you are looking for US-based customer support, you could wait 12+ hours for responses.
iLert tagline: “One platform for alerting, on-call management, and status pages.”
iLert value proposition: “On-call management and status pages starting at $24/user/month”
Splunk offers all the core features like on-call schedules, escalations, and notifications. Splunk also bundles enterprise-focused features like Real User Monitoring (RUM), Log Observability, and Application Performance Monitoring (APM). The extra features come at the extra cost of complexity. Splunk can have a higher learning curve than other tools. Splunk doesn’t publish pricing, so you know it will be expensive.
OnPage's primary benefit is that it offers HIPAA-compliant notifications. If you are looking for a healthcare-centric tool, OnPage could be for you. If you are not in the healthcare space, we suggest looking at another tool. OnPage's user interface and on-call scheduler lack modern ease of use. Only their marketing has adapted to target the IT space. Additionally, pricing is only offered in yearly installments.
OnPage tagline: “Rise Above the Clutter® Elevate urgent notifications and facilitate secure team collaboration in critical situations.”
OnPage value proposition: “HIPAA compliant on-call scheduling and notifications starting at $29/user/month paid annually.”
To keep this list short, we only reviewed the PagerDuty alternatives we thought were best. We also compiled a list of a couple of other PagerDuty alternatives. You may find them interesting, but they didn’t quite make the cut.
On-call alert and notification tools focus on the core feature set of on-call scheduling. They handle escalations and notifications.
Incident Management and Analysis tools will generally be integrated into on-call alert and notification tools. They will also provide tools for retrospectives and postmortem analysis.
There are a few open-source alternatives that we want to mention. As always, with open source, these options might (or might not) be supported. They will have to be self-hosted.
In conclusion, choosing the right PagerDuty alternative depends on your team's specific needs. Each alternative mentioned above brings its own strengths to the table. Evaluate your requirements. Explore product features. Make an informed decision that’s best for your team.
Note: This list is based on features, user feedback, and industry trends as of 2024. Always check for the latest updates and reviews before making a decision.
PagerDuty is a leading incident management platform that has been around since 2009. PagerDuty aims to streamline an organization's incident response process. It has features like on-call scheduling, escalation policies, and alerts.
Core Features - The tool, at a minimum, should have on-call scheduling, escalation policies, and notifications. Additional features like live call routing and reporting are also a plus.
Reliability - Check the tool’s historical uptime on its status page (usually found at https://status.domain.com). The tool should have minimal downtime. The tool’s vendor should also communicate clearly and effectively during an outage.
PagerTree is the best overall alternative to PagerDuty. It is efficient for on-call scheduling and reliable for notifications. Additionally, it’s the most cost-effective solution on our list at $15/user/month.
PagerTree excels at the core features. These include drag-and-drop on-call scheduling, escalation layers, and reliable multi-channel notifications. Extra features like live call routing and reports are provided at no extra charge. PagerTree has ample documentation, scalable pricing, and a reliable track record. It provides the best all-around functionality at a fraction of the price.
You can start a 14-day free trial here:
OpsGenie is the best-known competitor to PagerDuty. They offer a streamlined approach to alert management.
iLert is the new kid on the block, but we actually really like this tool. iLert offers the modern user interface and reliable product you would come to expect from a German startup.
Formerly known as VictorOps, Splunk On-Call is an incident management tool. It caters to enterprise organizations.
Splunk’s tagline: “Splunk On-Call - Make expensive service outages a thing of the past. Remediate issues faster, reduce alert fatigue, and keep your services up and running.”
OnPage offers “incident alert management”. They primarily target the healthcare industry, including hospitals, doctors, and nurses.
Start your 14-day free trial of PagerTree today!
Welcome to an in-depth exploration of the Linux file system! In this comprehensive guide, we'll demystify the various directories found in a typical Linux distribution, explaining their purposes and functionalities. Whether you're a seasoned sysadmin or a curious newcomer, this article will enhance your understanding of the backbone of Linux's structure and operation.
The /bin directory is a fundamental part of the Linux file system, playing a crucial role in system functionality. It contains essential user binary files, the basic programs and utilities necessary for the system to operate and for users to interact with it. These binaries include common commands like ls, cp, and mv, which are indispensable for file management, and bash, the default command-line shell for many Linux distributions. Unlike other directories that house more specialized or user-installed software, /bin is reserved for these core components, ensuring that the system remains operational and accessible, even in single-user modes or when other file systems are not yet mounted.
Similar to /bin, but this directory contains applications that only the super user (hence the prefixed s) will need. Applications in this directory need to be run with the sudo command. Typically this directory contains tools that can install, delete, and format. As you can imagine, some of these programs can cause system damage if used improperly.
The /etc directory is crucial for system configuration. It contains all the configuration files required by the system and other applications. Unlike /bin or /sbin, /etc does not hold executable programs, but rather static configuration files. Here, you'll find everything from user account information (in /etc/passwd), to network configurations, to the services started at boot. It's like the settings menu of your operating system, but in a folder. You can remember it with "everything to configure".
Short for "device", the /dev directory is a bit unique. It's where Linux stores device files, representing hardware components or drivers. For example, /dev/sda typically represents the first hard disk in your system. These are not regular files - they are special files that help the system communicate with its hardware. It's a crucial part of the Linux file system, though not something a regular user would interact with directly.
This is a virtual directory, meaning it doesn’t exist on your disk. It’s dynamically created by the system. The /proc directory contains information about system resources and the status of the operating system. Each running process has a folder here named by its process ID. You can peek into these folders to see detailed information about each process, but remember, it's mostly for viewing, not modifying.
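For example, you can explore /proc with ordinary file tools (PID 1 is typically the init process; exact file contents vary by distribution):

```bash
# Kernel and memory info exposed as virtual files
cat /proc/version
cat /proc/meminfo

# Inspect a single process by its PID
ls /proc/1
cat /proc/1/status
```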
Stands for "variable files". /var is where files that are expected to grow are stored. This includes things like logs (/var/log), mail (/var/mail), and spool files. It’s a dynamic folder, and its contents change as the system runs. This is where you'll look when troubleshooting or when trying to understand more about what's happening on your system.
Just as it sounds, /tmp is for temporary files. When applications or the system need to store a file temporarily, it goes here. These files are usually deleted upon reboot or after a set period. It’s a scratch space for the system and applications.
Short for "Unix System Resources," /usr is one of the largest directories. It contains additional user applications and their files. Think of it as a secondary hierarchy for user utilities and applications. /usr/bin, for example, has many user commands, while /usr/lib contains libraries for /usr/bin and /usr/sbin.
Back in the early days of UNIX, the /usr directory was where users’ home directories were kept. However, now /home is where users keep their stuff.
This is where the personal folders of each user on the system are located. If your username is “user”, you'll find your personal files in /home/user. It’s akin to the Users directory in Windows. This is where you'll spend most of your time: your documents, downloads, pictures, and personal data reside here.
The /boot directory contains the files needed to start up the system - the boot loader, the kernel, and other files needed during the boot process. It's small but essential. Without these files, the system can't start properly.
Similar to /usr/lib, /lib contains essential libraries needed for the binaries in /bin and /sbin. These libraries are fundamental to the operation of the system and the applications running on it.
An abbreviation for "optional", /opt is used for storing additional software and packages that are not part of the default installation. Often, third-party applications are installed here. It's a common directory for software that’s not included in the standard distribution.
Short for "mount". This directory is a temporary mount point where system administrators can mount file systems before they are integrated into the system's file structure. Think of it as a place to plug in external resources temporarily.
Similar to /mnt, /media is used for mounting removable media like USB drives, CD-ROMs, etc. It’s a more modern version of /mnt, and many systems automatically mount external drives here.
This stands for "service". /srv contains data related to services offered by the system. For instance, if your Linux machine is hosting a website, the website data might live here. It's not used in all distributions, but when it is, it's for service data.
And there you have it, a brief rundown of the most common directories in the Linux file system. Each serves a distinct purpose, and understanding them can be crucial for effective system management and navigation.
In this article we will help you understand system monitoring, what you should look for in your system monitoring tool, and give you our top 7 best APM tools.
Monitored systems can include:
Servers
Networks
Applications
Configurations
APM tools allow system administrators to instrument, collect, and analyze crucial operational insights to keep systems operating at their peak.
We have compiled a list of the top 7 industry-leading system monitoring tools using 5 key factors to evaluate and scrutinize APM tools.
Here are the top 7 APM tools using our key criteria:
When evaluating an APM tool, consider these 5 key factors to ensure you are picking the right tool to monitor your systems.
Reliability: Monitoring software's primary role is to provide consistent and accurate data on your system's health and performance. The software should have a proven track record of minimal downtime and accurate measurement, reporting, and alerting.
Scalability: APM tools must be able to grow with your service, handling an increasing number of devices and metrics without a drop in performance.
Integrations: The monitoring system you choose should easily fit into your existing technology stack. Its ability to integrate smoothly with other tools and platforms can significantly enhance its utility and the benefits you gain from it.
Pricing: APM tools range in pricing anywhere from free to hundreds of dollars. Pricing should align with your budget and the value it delivers. Look for transparent pricing models that scale sensibly with your usage.
Ease of use: System monitoring software should provide a user-friendly interface, clear documentation, and responsive customer support. This will dramatically improve the experience of setting up and maintaining your monitoring solution.
99.8% availability
700+ built-in integrations
Watchdog - a built-in intelligence layer that continuously analyzes billions of data points
“Advertised” starting price of $54/month per host.
PRTG's APM tool at a glance:
On-premises and hosted options
Failover node system for high availability
Excellent usability on desktop, mobile, and web.
Maintenance plans to maintain services
$159/month for hosted services
$2,149 one-time fee for a perpetual license
LogicMonitor's APM tool at a glance:
2000+ integrations
Industry-leading 99.9% availability
Active Discovery feature monitors changes in your environment
Host of prebuilt dashboards
Plans start at $22/month per resource (device)
New Relic's pricing starts at $10/month for the first user, with each additional user costing $99/month.
New Relic's APM tool at a glance:
99.8% uptime
750+ integrations
The APM tool gives access to all of New Relic's tools, like AIOps and Security Monitoring
New Relic AI integrated into the system to assist users
Pricing starts at $10/month but quickly scales up.
AppDynamics’ APM tool at a glance:
99.5% uptime
Business-oriented service monitoring
Access to Cisco University
Starting price of $33/month
Elastic's APM at a glance:
99.96% historical uptime
Self-managed and cloud-managed services
Application dependency mapping
Starting price “as low as $95/month”
AppOptics offers full-stack visibility into your application as well as an auto-instrumenting application service, allowing you to quickly diagnose issues within your environment. They present data in a simple, easy-to-understand way that allows you to find issues quickly and dig deeper into them with more detailed views. AppOptics boasts integrations into multiple on-call software tools, allowing you to be notified of any issues 24/7.
SolarWinds does not offer a publicly available uptime SLA, nor do they have a public historic uptime.
AppOptics' service for application monitoring includes infrastructure system monitoring and starts at $24.99/month per host. They sell hosts in packs of 10, meaning the minimum monthly cost for AppOptics is $249.90/month.
AppOptics APM tool at a glance:
Simple and efficient data presentation
150+ integrations
Built-in on-call alerting integrations
Non-public SLA and historic uptime
$24.99/month per host
Performing ping tests involves checking your internet connectivity. Reliable addresses to ping include:
Failure to receive a response from these addresses may indicate a problem on your end.
Interpreting ping results is crucial. Analyzing server hostnames, response times, Time to Live (TTL), and packet loss provides insights into network performance. Troubleshooting connection issues becomes more effective when armed with this information.
Below is what the ping command will return:
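Illustrative output from a Windows machine (the hostname, IP address, and timings here are examples):

```
C:\> ping google.com

Pinging google.com [142.250.72.46] with 32 bytes of data:
Reply from 142.250.72.46: bytes=32 time=14ms TTL=57
Reply from 142.250.72.46: bytes=32 time=13ms TTL=57
Reply from 142.250.72.46: bytes=32 time=15ms TTL=57
Reply from 142.250.72.46: bytes=32 time=14ms TTL=57

Ping statistics for 142.250.72.46:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 13ms, Maximum = 15ms, Average = 14ms
```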
Host name confirmation: The first line displays the server's hostname translated to an IP address. It also confirms an active connection to the server was made.
Bytes sent to server - The number of bytes sent to the server.
Response times - Total roundtrip time for the response to return.
TTL (Time to Live) - The number of router hops the packet may traverse before being discarded.
Ping Statistics - Overall statistics for the ping test. It includes number of packets sent, received, and lost.
Approximate Round Trip Times - Minimum, maximum, and average times for the ping test. Higher times indicate a poorer quality connection or servers that are far away.
Request Timed Out: There is a problem in establishing a connection. This occurs when the destination host is either non-existent, powered off, or disconnected from the network.
Firewall Impact: Firewalls, based on port numbers and IP addresses, may permit or restrict traffic. In some instances, ping might be blocked as a precaution against potential reconnaissance by malicious actors.
Destination Host Unreachable: The "destination host unreachable" error points to a failure in finding the route to the intended destination. This could emanate from issues within the local host or the default gateway. To resolve this, check your IP settings and verify the default gateway address.
Unknown Host: This error indicates a challenge in translating the hostname to the corresponding IP address, suggesting a potential DNS server problem.
In addition to the above errors, there are instances of packet data loss, where some requests receive replies while others do not. Possible culprits for this issue include malfunctioning network cards, damaged cables, or problems with switches and routers.
A postmortem describing the issue, root cause, and remediation of our outage on July 30, 2023 00:30 - 01:15 (UTC)
During the migration to Fly.io's v2 platform, the provided command (migrate-to-v2) times out if a Postgres cluster doesn't replicate and fail over fast enough.
The migrate-to-v2 command first puts the database in a read-only state. When the timeout occurs, the command fails to remember to put the database back in a writable state.
After the database is read-only, new and existing connections will not be able to write to the database. This caused PagerTree to functionally fail for approximately 45 minutes on July 30, 2023 from 00:30 -> 01:15 UTC.
The Postgres cluster can be put back into a writable state with the following commands:
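A sketch of those commands, pieced together from the timeline below (application_db_name is a placeholder for your application's database name):

```sql
-- Applies only to the current connection, not to new ones:
SET default_transaction_read_only TO off;

-- The authoritative fix: applies to all new connections.
ALTER DATABASE application_db_name SET default_transaction_read_only = off;

-- Optionally terminate lingering app sessions so they reconnect in read/write mode:
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE application_name LIKE '/app%' OR application_name LIKE 'side%';
```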
Note that the SET command will only work for the current connection. You need the ALTER DATABASE command to solve the root cause.
To fix this, you need to connect to the Postgres cluster and tell it to forget the orphan VM.
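Fly's Nomad-era Postgres clusters were orchestrated by stolon, so the cleanup might look something like the sketch below (the exact stolonctl flags depend on your cluster configuration; the keeper ID comes from the sentinel warning logs):

```bash
# From a shell on one of the Postgres VMs (e.g. fly ssh console):
stolonctl status                    # identify the dead keeper
stolonctl removekeeper <keeper_id>  # unregister the orphaned VM
```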
All times are UTC and any references to communication or actions taken by PagerTree were performed by Austin Miller.
Tuesday, July 25th, 2023 at 15:30 we attempted to migrate our staging Postgres cluster and found errors. We ran into the timeout error (but didn't understand it as the root cause), and once the cluster had replicated, we used the migrate-to-v2 troubleshoot command to kill the Nomad VMs and mark the application as v2. Additionally, we started troubleshooting why the staging database was left in a read-only state. The psql command SHOW default_transaction_read_only; showed the application's database in a read-only state. We restarted the database using fly postgres restart -a <postgres_fly_app_name> and killed any active sessions using select pg_terminate_backend(pid) from pg_stat_activity where application_name like '/app%' or application_name like 'side%';. The database went back into a read/write state and we thought everything had been fixed.
Tuesday, July 25, 2023 at 16:32 (almost in parallel to the previous bullet point) we reached out to the Fly support team stating the issue we had found with the migration of the staging Postgres cluster. At 17:06 we replied, reporting that killing the sessions had fixed the issue. Sam Wilson from Fly support responded 5 hours later at 21:37, reporting they were glad we were able to work around it, that our app had been successfully migrated to v2, and that they were also seeing read/write enabled on the staging primary.
Thursday, July 27 at 03:30 we attempted to perform the migrate-to-v2 command, which failed (unspecified error code). Applications were turned off between 03:30 and 03:40, resulting in 10 minutes of application downtime.
Thursday, July 27, 2023 at 23:15 we were reminded via an automatic email that our production Postgres cluster was still scheduled for an automatic upgrade the following week.
Friday, July 28, 2023 at 22:30 we attempted the migration again without success, but now with an error "Page Not Found". There was approximately 5 minutes of downtime for the PagerTree app.
Friday, July 28, 2023 at 22:52 we emailed the Fly support team with the new "Page not found" error. We asked if they could look into their logs to see what could be happening. We also expressed concern for the automatic migration of apps the following week when the migration command was failing.
Saturday, July 29 at 05:09 Brian Li from the Fly support staff suggested we try deleting a volume from an orphaned VM, then using LOG_LEVEL=debug on the migrate-to-v2 command. We deleted the orphan volume.
Sunday, July 30 at 00:30 we attempted to migrate the cluster with Brian Li's suggestion. The PagerTree application was taken offline and a service outage began. The migration looked to be going smoothly now that the orphaned volume had been deleted.
Sunday, July 30 at 00:35 the migrate command had timed out. At this point the new v2 machines had been created but were still replicating from the leader. We decided to wait until replication had completed before running the troubleshooting command and deleting the v1 VMs (similar to what we had done with our staging Postgres cluster, bullet #2).
Sunday, July 30 at 00:40 replication had completed, so we tried to bring the PagerTree application back online. We immediately started to see errors in Honeybadger: ActiveRecord::StatementInvalid: PG::ReadOnlySqlTransaction: ERROR: cannot execute INSERT in a read-only transaction.
The following describes the actions taken between Sunday July 30 at 00:40 and 01:16 without specific timestamps.
We tried killing the connections using select pg_terminate_backend(pid) from pg_stat_activity where application_name like '/app%' or application_name like 'side%'; in hopes that a new connection would be in read/write mode. This failed, and at this point we knew the database was in a read-only mode.
Logging into the Postgres cluster and running SHOW default_transaction_read_only; on the application database confirmed our suspicion. We tried running SET default_transaction_read_only TO off; to fix the issue. With a successful command run, we believed the database would now be in a read/write mode. We would later learn this only sets the option for the current connection.
We restarted the PagerTree application but again saw the errors regarding the read-only transactions.
We searched the internet for how to make the appropriate change for all new connections. After searching for 5 or 10 minutes we found a working solution: alter database application_db_name set default_transaction_read_only=off;
We restarted the PagerTree application and confirmed that the database was now in a writable state.
Sunday, July 30 at 01:43 we declared the incident resolved.
Impact - The PagerTree application was down for 46 minutes and impacted all customers and integrations. Incoming requests, alerts, and notifications were all impacted during these 46 minutes.
Root Cause - migrate-to-v2 timeout and database left in read-only state.
Recurrence - This also happened in our staging Postgres upgrade, but we thought a Postgres restart and killing existing connections fixed the issue.
Corrective actions - alter database application_db_name set default_transaction_read_only=off; is the authoritative fix for the read-only state of the database.
Future Monitoring - We have added a database write check to our monitoring. A write test is now performed every minute.
Tutorial showing how to implement multi-tenant single sign-on (SSO) using Ruby on Rails, Devise, and SAML. Works with identity providers like Okta, Google, Azure, etc.
Recently while scrolling on Twitter I saw this tweet by John Nunemaker.
In this blog post, I want to describe how we implemented multi-tenant SSO at PagerTree to work with any SAML2 identity provider (Okta, Google, Azure, etc.).
STOP HERE - This is not a Copy Pasta™ blog post. Some things are very specific to the PagerTree implementation. You'll need to adapt the code to work for your project. This post is to help do most of the heavy lifting.
This blog post will make a lot of assumptions about its implementation (it's a highly niche implementation).
This implementation uses the emailAddress attribute of SAML as the primary identifier for Users.
We've snipped a lot of PagerTree specific code for the purposes of brevity and staying focused.
One of the most confusing things in SSO implementations is that there is no "standard" naming convention. I have seen many aliases and synonyms all over the web.
idp - Identity Provider (IdP) - Your customer's authentication provider (ex: Okta, Google, Azure, etc.)
idp_entity_id - The unique tenant identifier in the IdP's database.
idp_sso_service_url - The URL your app needs to redirect the user to with the AuthNRequest. It will be at the IdP's domain.
sp - Service Provider (SP) - Your app, the one you are building (ex: PagerTree)
sp_entity_id - A unique tenant identifier in the SP's database.
assertion_consumer_service_url - The endpoint on the SP where the IdP should send the user after they have authenticated.
authnrequest - Programmatic authentication request.
slo - Single Logout
saml - Security Assertion Markup Language - An XML-based standard for exchanging authentication and authorization data between parties, enabling Single Sign-On (SSO) functionality.
If you are not familiar with SSO that's ok, I am going to go over the basic ideas (a full explanation is outside the scope of this article).
If you've ever logged in to an app using your Microsoft, Google, or work account, it likely used SAML to exchange information about your authentication. The IdP is responsible for the authentication of users (aka verifying users are who they say they are).
The basic workflow looks like this:
The user comes to the SP application (aka your application).
The user provides the SP application with the authentication email (usually their work email).
The SP looks up the user and the IdP configuration this user is associated with. The user is then redirected to the IdP (idp_sso_service_url) with an authentication request in the format of an AuthNRequest.
At this point, the user must provide valid credentials to the IdP. Once valid credentials are provided and the IdP confirms the user should have access to the SP application, the user is redirected back to the SP application at the assertion_consumer_service_url.
The SP is then responsible for granting access to the application based on the trusted response.
SP initiated - When a user comes to your app and clicks "Login using SSO" providing you their email address. This is probably the most common workflow and was described above.
IdP initiated - When a user logs in via their "app portal" from the IdP. Not very common (I've never used it myself), but we need to support it. It doesn't change the code, but I am including it here for completeness.
We need to add a model to hold each tenant's SSO configuration(s). I will briefly explain what each property is:
account_id - The tenant this belongs to.
meta - Free form hash where we can store any future data.
sp_entity_id - The unique identifier for this configuration.
name - A user friendly name so they can remember this configuration (ex: "Okta Config", "Okta Dev Config")
vendor - Enum identifying the IdP vendor. When debugging with customers why their configuration doesn't work, it's helpful to know the vendor (some vendors do some wonky stuff).
metadata_url - The URL to the IdP's metadata XML.
metadata_xml - The raw metadata XML (some vendors don't provide a metadata URL). The user should be able to copy and paste it into our app.
settings - A JSON representation of the parsed XML.
assertion_response_options - A hash of configurable options (per tenant) that we can pass into the Ruby SAML library.
Our IdPConfig model will hold an SSO configuration. Each account can have many IdPConfigs, but there will only ever be 0 or 1 active IdPConfig for an account at a time.
A couple of important notes:
Line 78 - We use SecureRandom.hex and not a UUID. Azure does not like dashes in the sp_entity_id; a hex key will work across all known providers.
Line 95 - We use OneLogin::RubySaml::IdpMetadataParser to parse the XML provided by the user or the IdP's metadata_url.
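Here's a sketch of what the model can look like, assuming ActiveRecord and the ruby-saml gem (method names and validations are illustrative, not PagerTree's exact code):

```ruby
# app/models/idp_config.rb
class IdpConfig < ApplicationRecord
  belongs_to :account

  before_validation :generate_sp_entity_id, on: :create

  validates :name, presence: true
  validates :sp_entity_id, presence: true, uniqueness: true

  private

  # Azure rejects dashes in the sp_entity_id, so prefer a hex key over a UUID.
  def generate_sp_entity_id
    self.sp_entity_id ||= SecureRandom.hex(16)
  end

  # Parse the IdP metadata (pasted XML or fetched from metadata_url)
  # into the settings hash consumed by ruby-saml.
  def parse_metadata
    parser = OneLogin::RubySaml::IdpMetadataParser.new
    self.settings =
      if metadata_xml.present?
        parser.parse_to_hash(metadata_xml)
      else
        parser.parse_remote_to_hash(metadata_url)
      end
  end
end
```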
The important paths are as follows:
/sso - Where the user comes in the SP initiated workflow. We ask them for their email here.
/public/saml/consume - Where the IdP redirects the user after they have provided their credentials to the IdP. This is the assertion_consumer_service_url. The payload of the request will be the assertion of who the user is.
/public/saml/metadata - A convenience endpoint for users to get information in XML format about the SP. IdPs sometimes will ask for this. It's a programmatic way for the SP to provide the IdP with details like the assertion_consumer_service_url.
/public/saml/slo - The IdP will make a request here if the user is logged out. This is known as single logout. We need to destroy the user's session when this URL is called.
You'll need to read through the sessions controller, but I will give a brief summary:
Line 7 - before_action :set_idp_config - Set the IdP Config for SSO methods.
Line 10 - def destroy - Override the Devise destroy method. Send the IdP a logout request if our user logs out from our app.
Line 27 - def sso - Render the SSO page to capture the user's email.
Line 82 - def saml_callback - Process the IdP response. This is the assertion_consumer_service_url.
Line 91 - if !user - Create a user if they don't exist in our database but were authenticated by the trusted IdP. This can occur when an SSO administrator grants access to your application and it's the user's first time logging in to your app.
Line 118 - def saml_metadata - The convenience method providing metadata that describes the SP configuration.
Line 126 - def saml_logout - Process the IdP initiated single logout request.
Line 164 - def verify_can_username_password - SSO users should be forced to use SSO.
So in /app/controllers/accounts_controller.rb we have something like this:
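A sketch of that guard, under the assumption that switching into an SSO-enforced account requires a session that was authenticated via that account's IdP (all names here are illustrative):

```ruby
# app/controllers/accounts_controller.rb
class AccountsController < ApplicationController
  before_action :verify_sso_session, only: [:switch]

  private

  # Users on SSO-enabled accounts must always authenticate via SSO,
  # so block username/password sessions from switching into them.
  def verify_sso_session
    account = current_user.accounts.find(params[:id])
    return if account.active_idp_config.blank?
    return if session[:sso_account_ids]&.include?(account.id)

    redirect_to sso_path, alert: "This account requires single sign-on."
  end
end
```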
As service providers, we understand that 100% uptime for our service isn't an achievable goal, but we do everything in our power to provide our customers with the best service and uptime possible. We implement tools and processes to allow ourselves the ability to respond to issues before they affect our customers.

One type of tool we implement is system monitoring tools. Having access to all of our systems in a clean, easy-to-read dashboard helps us see trends and issues before they become serious problems. Understanding our systems and resolving issues before our customers see them helps improve customer satisfaction and service uptime and helps us meet our SLAs. But what is system monitoring? System monitoring, also known as “application performance monitoring” (APM), is the process of tracking and evaluating the performance and health of a service provider’s infrastructure.
Datadog is a widely used application performance monitoring tool, thanks to its extensive list of over 700 integrations that make it highly adaptable for any stack. The company promises 99.8% availability, ensuring its tools are available when you require them the most. Datadog is unmatched in the services it provides, encompassing everything from log management to full infrastructure monitoring.
They champion their “code level distributed tracing tool, Watchdog,” which enables users to “detect and resolve root causes faster,” improving application performance and security. This Watchdog AI automates root cause analysis, helps detect anomalies, and minimizes downtime.
The Datadog APM plan advertises a cost of $36/month per host, but in order to utilize the APM tool you are required to also have an infrastructure monitoring plan. These start at $18/month per host. This brings the cost of Datadog (on its face) to $54/month per host at its base level. Datadog's APM tool at a glance:
PRTG offers multiple solutions for system monitoring and APM, with both on-premises and hosted options. In addition to these solutions, PRTG offers different payment methods to meet your needs, standing out as one of the only tools to offer a one-time perpetual license.
The PRTG Monitoring system stands out from many APM tools due to its “failover node” system, which allows 4 simultaneous PRTG core servers to be active on any one machine at a time. This failover system could potentially offer the highest level of availability, and PRTG proclaims its system as “high availability” thanks to this feature. PRTG boasts its “excellent usability” on all platforms, including web, desktop, and mobile, making it one of the few to actually brag about how well its user interface works.
PRTG offers two pricing options: perpetual licensing for its on-premises services and per-month/annual pricing for its hosted services. We will look at one price point for each.
Hosted: Services start at $159/month and cover (according to PRTG) 50 devices.
Perpetual license: One-time fee starting at $2,149 per license, covering 50 devices.
Maintenance plans: To stay up to date with services and tech support, PRTG offers maintenance plans that cost ¼ of the original price of your selected plan. This is not required.
When it comes to integrating an APM tool into your system, LogicMonitor's 2000+ integrations and self-proclaimed “lightning-fast implementation” make it the go-to system monitoring tool for any business looking for fast deployment and smooth integrations. LogicMonitor’s Active Discovery feature actively monitors changes in your environment, automatically discovering and monitoring new virtual machines, volumes, or devices.
LogicMonitor comes with a host of pre-built dashboards offering immediate insight into the data that is most relevant to you and your industry. With an industry-leading 99.9% availability, LogicMonitor is the choice for those looking for the highest level of promised availability from a hosted service.
Pricing for LogicMonitor's infrastructure monitoring plan starts at $22/month per resource (device) with a minimum of 30 resources. Additional add-on plans are available for log retention, AI-powered anomaly detection, and application traces.
New Relic's APM tool, APM 360, not only monitors your applications with a 99.8% uptime SLA but also gives you access to a slew of New Relic tools, including security monitoring and AIOps. With over 750 integrations and an industry-leading 8 supported programming languages, APM 360 dramatically increases usability for developers. It features an automatic service map that visualizes dependencies within your systems, simplifying the understanding of complex data structures. APM 360 excels in transforming difficult-to-comprehend data into easily digestible visuals, ensuring that insights are accessible and actionable. New Relic AI is one of their newest features, allowing users to find and fix issues, get quick insights, translate data and queries, and instrument their system with ease.
New Relic offers dynamic pricing to meet the needs of every business, large or small. They utilize a usage-based formula to determine pricing on all plans: users + data ingested = price.
AppDynamics, the self-proclaimed leader in the APM tool sphere, offers a unique business-oriented system monitoring service to its clients, connecting data, development teams, and IT teams to business results. Given their mission statement includes “AppDynamics is on a mission to help companies see their technology through the lens of the business,” this would make sense. By offering observability across entire organizations, implementing end-to-end observability on a code and transaction level, and presenting data at a level at which C-suite and devs can both understand, AppDynamics upholds that mission.
AppDynamics promises one of the lowest availability rates in the industry: 99.5% uptime. While this number pales in comparison to LogicMonitor's 99.9% uptime, it gives AppDynamics more room to test new features and expand its offering for customers.
As an added bonus to AppDynamics, all paid plans give users “unlimited standard access” to Cisco University.
Pricing for the AppDynamics APM tool starts at $33/month per core. Users should be aware of additional add-on features when researching their tool.
Elastic offers both self-managed and cloud-managed services to its customers, giving them the flexibility to choose the service they want. Elastic offers a clean and simple view of Application Performance Monitoring, with clean and easy-to-understand dashboards. Their application dependency mapping helps teams identify application problems quickly by automatically visualizing the relationships between services inside your application ecosystem. Partnered with the rest of the Elastic Stack, you will gain a level of observability like none other.
While Elastic does not present a public availability percentage, it does have a historically amazing uptime. Within the last six months, its tool has not fallen below 99.96% uptime, which shows its ability to deliver.
Elastic APM offers a free plan for self-hosted monitoring but requires you to contact their sales team for licensing. For cloud-hosted services, Elastic's pricing is advertised as “as low as $95/month.” Pricing is based on your cloud production configuration, but Elastic does offer a pricing calculator to help you determine your actual cost.
SolarWinds' APM tool, AppOptics, delivers effective and efficient application performance monitoring for the needs of any business, small or large. They boast cost-effective scaling to match the growth of your business.
System monitoring or APM tools are essential for maintaining uptimes, detecting service failures, and evaluating performance. Whether you’re looking for a service with high uptimes, every integration under the sun, or unmatched services, there is a tool that fits your needs. When selecting an APM tool, factors like reliability, scalability, integration, cost, and usability must be considered to meet the current and future needs of your business. System monitors are only one tool in your arsenal to maintain service levels and improve overall performance. Most organizations partner these tools with on-call and incident alerting tools like PagerTree to ensure their teams are being notified when incidents occur.
The ping network test, a core utility since the 80s, plays a crucial role in confirming connectivity between IP-networked devices. In this guide, we'll delve into what the ping command is, how to run a ping network test, common IP addresses to ping, interpreting results, and troubleshooting errors.
Ping is a command available on Windows, macOS, and Linux that sends data packets to a specific IP address, gauging the existence of connectivity between devices. Originating from sonar technology, where a sound wave is emitted and an echo is awaited, ping measures the round-trip time for data requests, revealing network health and potential issues.
Executing a ping test varies by operating system. For Windows, open the Command Prompt, type "ping," and enter the desired IP address or domain. Mac users can use Network Utility, while Linux users employ the Terminal and the ping command for more in-depth analysis.
Cloudflare (1.1.1.1 and 1.0.0.1)
Google DNS (8.8.8.8 and 8.8.4.4)
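For example (flag syntax differs slightly between operating systems):

```
# Windows (sends 4 echo requests by default)
ping 1.1.1.1

# macOS / Linux (-c limits the request count)
ping -c 4 8.8.8.8
```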
Additionally, we ran into another issue of a growing WAL log. The WAL log is what allows replica databases to catch up to the leader. If the leader believes a replica has not caught up, it will continue to keep the WAL log around; this can fill up the database's entire hard drive, causing the database cluster to fail.
In the monitoring page of your app, you might see output like this: 2023-07-31T16:52:25.689Z WARN cmd/sentinel.go:276 no keeper info available {"db": <db_id>, "keeper": <keeper_id>}. What this is trying to tell you is that the leader thinks there is a replica somewhere out there that hasn't been updated. In our case, an orphan (dead) VM was never unregistered.
Monday, July 24, 2023 at 22:04 the PagerTree team was notified by an automated email that Fly.io would need to migrate our staging and production Postgres apps from the deprecated Nomad (v1) to the Machines (v2) platform the following week. It was advertised that the command flyctl migrate-to-v2 would handle the migration, but in our experience we had run into issues during upgrades on Fly.io. We decided to proactively upgrade the application so we could address any issues ahead of time.
Thursday, July 27 at 03:58 we reported our findings to Fly. The Postgres cluster was also left in a strange state (orphaned VM) with lots of errors. (This would later be found to be the cause of the WAL log issue.) We asked the Fly support team to look into the issue and advise.
Thursday, July 27 at 17:50 UTC we asked for an update on the ticket since we noted our database hard drive filling up and we had not yet received a response from Fly support.
Thursday, July 27, 2023 at 18:38 Nina Vyedin from the Fly support staff responded with a very generic answer and referred us to try running the migrate-to-v2 command again and troubleshooting with debug logs.
Sunday, July 30 at 01:16 we posted an update to our status page that the incident had been recovered and we were monitoring.
Since we had implemented multi-tenant SSO at PagerTree, I thought I could help out. After all, I had done this once or twice (turns out it is closer to 400). After sharing a raw gist, I realized a blog post would be more helpful to the community.
Ruby on Rails is the framework (7.0.4)
The Devise gem is used for authentication (4.8.1)
The ruby-saml gem for SAML parsing (1.14)
The acts_as_tenant gem for tenant management (0.5.1)
Check out our documentation on how this looks in practice.
/saml_callback - Alias for /public/saml/consume (see below). We had to support some legacy URLs when we migrated platforms.
Line 6 - skip_before_action :verify_authenticity_token - On requests from the IdP, don't verify the CSRF token.
Line 28 - user = User.find_by_email(email) - The email address is the link between IdP and SP.
In PagerTree, users can be part of multiple accounts. However, we don't want users to be able to have a personal account and login via username and password and then switch to an SSO enabled account. For SSO enabled accounts, a user should always be required to authenticate via SSO.
The Multi-Tenant SSO setup is a fairly advanced topic. Having done this several times before, I am sure I missed some things and could likely make other things clearer. If you have any constructive feedback you can reach out. I can't address every comment, but with your input I will try my best to update this content to make it even clearer for others in the community.
PowerShell is a powerful scripting language and command-line shell that is widely used for automation, administration, and managing Windows environments.
PowerShell is a powerful scripting language and command-line shell that is designed specifically for system administration and automation tasks in Windows environments. Whether you're a seasoned sysadmin or just starting with PowerShell, having a cheat sheet of essential commands at your fingertips can greatly enhance your productivity. In this blog post, we will cover some fundamental PowerShell commands, starting from the basics and gradually progressing to more advanced concepts.
The Set-ExecutionPolicy command allows you to manage the script execution policy on your system. It determines whether PowerShell scripts can be run and helps ensure system security. Here's an example of setting the execution policy to allow running scripts.
If you have not yet run PowerShell on your computer and are getting errors because of permissions, you likely need to run the Set-ExecutionPolicy command.
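For instance, RemoteSigned (one common choice) allows locally created scripts to run while requiring downloaded scripts to be signed:

```powershell
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
```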
PowerShell piping is a powerful feature that allows you to take the output of one command (or cmdlet) and use it as input for another command. It enables you to chain together multiple commands to perform complex operations with ease.
To understand piping, let's consider a simple example. Suppose you want to retrieve a list of running processes on your computer using the Get-Process command. By default, running Get-Process will display a table showing various details about the processes. However, what if you only want to see the processes related to a specific application, such as "chrome"?
In a non-piping scenario, you might need to run a separate command to filter the results manually. However, with PowerShell piping, you can achieve this in a more straightforward way. Here's an example:
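```powershell
# Show only processes named "chrome"
Get-Process | Where-Object { $_.Name -eq "chrome" }
```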
Let's break down this example step by step:
Get-Process: This command retrieves a list of all running processes on your computer.
| : The vertical pipe character (|) is the piping operator in PowerShell. It takes the output from the left side and passes it as input to the command on the right side.
Where-Object: This command is used to filter objects based on specific criteria. In this case, we want to filter the processes based on their name.
{ $_.Name -eq "chrome" }: This is a script block, which is essentially a piece of code enclosed within curly braces. It specifies the condition we want to use for filtering. Here, we're checking if the process name ($_.Name) is equal to "chrome". The $_ is an automatic variable referencing the current object in the pipeline.
By using the piping operator, we can take the output of Get-Process and directly pass it to Where-Object for further processing. As a result, only the processes with the name "chrome" will be displayed.
Piping can be used with multiple commands, allowing you to perform complex operations in a single line. You can chain together as many commands as needed, each building upon the output of the previous one.
Get-Alias retrieves the list of aliases (shortcuts) for PowerShell commands. It helps you understand and use PowerShell shortcuts effectively. Here's an example of retrieving all the aliases:
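```powershell
# List every alias defined in the current session
Get-Alias
```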
Where-Object filters objects based on specified criteria. It's handy for selecting specific data from a collection. Here's an example (using the Where-Object alias) of filtering processes based on their name:
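```powershell
# "?" is a built-in alias for Where-Object
Get-Process | ? { $_.Name -eq "chrome" }
```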
To set environment variables within a PowerShell session, you can use the $env: notation. Here's an example of how to set an environment variable:
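```powershell
# Scoped to this session (and child processes); the variable name is a placeholder
$env:MY_APP_ENV = "staging"
```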
Get-Process retrieves information about the running processes on your computer. It's an excellent command for monitoring and managing processes. Here's an example of listing the top 10 processes by CPU usage:
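```powershell
Get-Process | Sort-Object CPU -Descending | Select-Object -First 10
```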
Stop-Process terminates a running process. It allows you to end processes gracefully or forcefully if needed. Here's an example of forcefully stopping a process using its process ID (PID):
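```powershell
# 1234 is a placeholder PID
Stop-Process -Id 1234 -Force
```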
Get-Service retrieves information about services running on your system. It's helpful for managing and monitoring services. Here's an example of retrieving all the running services:
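```powershell
Get-Service | Where-Object { $_.Status -eq "Running" }
```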
Stop-Service stops a running service. It helps you manage services effectively. Here's an example (using Powershell piping) of stopping a service by its display name:
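```powershell
# "Print Spooler" is an example display name
Get-Service -DisplayName "Print Spooler" | Stop-Service
```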
Get-EventLog allows you to access event logs on your computer. It helps in analyzing system events and troubleshooting. Here's an example of retrieving the Application event log:
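```powershell
# The 50 most recent Application log entries
# (Get-EventLog ships with Windows PowerShell 5.1; PowerShell 7+ uses Get-WinEvent)
Get-EventLog -LogName Application -Newest 50
```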
Invoke-WebRequest allows you to send HTTP requests and retrieve web content. It's useful for automating web interactions. Here's an example of downloading a file from a URL:
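```powershell
# The URL and output path are placeholders
Invoke-WebRequest -Uri "https://example.com/file.zip" -OutFile "C:\Temp\file.zip"
```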
Export-CSV enables you to export PowerShell objects to a CSV (Comma-Separated Values) file. It's useful for storing and analyzing data. Here's an example of exporting process information to a CSV file:
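```powershell
Get-Process | Export-Csv -Path "C:\Temp\processes.csv" -NoTypeInformation
```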
Format-Table allows you to format and display PowerShell output in a tabular form. It's handy for better readability and presentation. Here's an example of formatting the output of the Get-Process command:
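```powershell
Get-Process | Format-Table Name, Id, CPU -AutoSize
```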
Invoke-Command enables you to execute commands on remote systems or run commands in the background. It provides a way to manage and automate tasks across multiple machines. Here's an example of how to execute a command on a remote system:
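```powershell
# Requires PowerShell Remoting to be enabled; Server01 is a placeholder
Invoke-Command -ComputerName Server01 -ScriptBlock { Get-Service }
```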
The ForEach-Object -Parallel command allows for parallel processing of data, making your scripts more efficient. It splits the input into multiple threads and processes them concurrently. Here's an example of how to parallelize a loop to perform actions on multiple computers simultaneously:
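```powershell
# Requires PowerShell 7+; computer names are placeholders
"Server01", "Server02", "Server03" | ForEach-Object -Parallel {
    Test-Connection -ComputerName $_ -Count 1
} -ThrottleLimit 3
```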
PowerShell is a versatile tool for automating tasks, managing systems, and performing various administrative tasks in Windows environments. This cheat sheet covered essential commands, from basic system information retrieval to more advanced concepts like parallel processing and remote command execution. By familiarizing yourself with these commands and their usage, you can become more efficient and effective in your PowerShell scripting journey. Happy scripting!
(Note: While this blog post aims to provide a comprehensive overview of the mentioned commands, it is essential to refer to the official PowerShell documentation for in-depth explanations and additional examples.)
Prometheus is an open-source monitoring and alerting toolkit that has gained significant popularity in DevOps and systems monitoring. At the core of Prometheus lies PromQL (Prometheus Query Language), a powerful and flexible query language used to extract valuable insights from the collected metrics. In this guide, we will explore the basics of PromQL and provide query examples for an example use case.
You have a high availability web app that you maintain. You'd like to have some observability into the traffic of your application. Your environment consists of 3 production web servers and 1 staging web server. Below is a table of instance vectors for your servers.
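For illustration, assume instance vectors like these (the metric name, labels, and values are hypothetical):

```
http_requests_total{instance="web-1", env="production"}  100
http_requests_total{instance="web-2", env="production"}  120
http_requests_total{instance="web-3", env="production"}  110
http_requests_total{instance="web-4", env="staging"}      10
```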
PromQL allows you to query time series data, which consists of metrics and their corresponding labels. The basic syntax for querying time series is as follows:
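```
metric_name{label1="value1", label2="value2"}
```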
Example:
To query the total HTTP requests metric for your fleet of servers, you would use:
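```
http_requests_total
```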
The above example would return an instance vector for each server in your fleet.
Instance vector selectors allow you to filter and focus on specific labels to extract relevant metrics. To filter the time series, append a comma-separated list of label matchers in curly braces {}
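For example, using the labels from the illustrative table above:

```
http_requests_total{env="production"}
```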
The above example would return an instance vector for each production server in your fleet.
Additionally, PromQL provides the following label matching operators:
= : Select labels that are exactly equal to the provided string.
!= : Select labels that are not equal to the provided string.
=~ : Select labels that regex-match the provided string.
!~ : Select labels that do not regex-match the provided string.
Regex matches are fully anchored: a match of env=~"foo" is treated as env=~"^foo$". You can test your regex matches using the Golang regex flavor.
So, to select all of our staging servers, we could use the following query:
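```
# An equality matcher (env="staging") would also work here
http_requests_total{env=~"stag.*"}
```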
PromQL provides various aggregation functions to summarize and aggregate time series data. Here are a few commonly used functions:
sum: Calculates the sum of all matching time series.
avg: Computes the average value of matching time series.
min: Returns the minimum value among all matching time series.
max: Returns the maximum value among all matching time series.
Example:
To calculate the average HTTP requests across all production instances, you can use:
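```
avg(http_requests_total{env="production"})

# Given the illustrative values above: (100 + 120 + 110) / 3 = 110
```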
The above would first return the instance vectors and then generate the average.
PromQL allows you to work with range vectors, representing time series data over a specified time range. This is particularly useful for analyzing trends and patterns. Here are a few important range functions:
rate: Calculates the "per-second rate of increase" of a time series over a specified time range.
irate: Similar to rate, but calculates the "instantaneous per-second rate of increase" of a time series over a specified time range by only considering the last 2 points.
increase: Computes the "absolute increase" in a time series value over a specified time range.
Enjoying this content? Check out our full article on Counter Rates and Increases here: https://pagertree.com/learn/prometheus/promql/counter-rates-and-increases
Example:
To calculate the number of HTTP requests you are getting for your entire production fleet.
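```
# The [5m] lookback window is an arbitrary example
sum(increase(http_requests_total{env="production"}[5m]))
```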
The above would first return the instance vectors, then calculate the difference between the vector values (t1 - t0), then sum them.
PromQL is a versatile and powerful query language that empowers users to extract valuable insights from Prometheus metrics. By mastering the basics covered in this cheat sheet, you'll be well-equipped to explore and analyze your monitoring data effectively. Remember, this blog post only scratches the surface. Experiment with different functions and operators to make the most of PromQL's capabilities.
By keeping this cheat sheet handy, you'll be able to navigate PromQL queries efficiently and unlock the full potential of Prometheus for monitoring and alerting in your systems.
At PagerTree we monitor our systems extensively; here are some of the common queries we use. The metrics (and metric names) we use are provided by the discord/prometheus_exporter gem or our own metric label name.
⭐ Link to full Prometheus Knowledge Hub
New features including reduced pricing, a crisp UI, and better alert aggregation are finally here.
Today I am excited to announce we have officially shipped PagerTree 4.0!
Here are the highlights:
Better UI - Both desktop and mobile, your eyes (and mine) can be relieved.
Multiple Accounts Per User - Users can now be a part of multiple accounts.
This effort has been a year and a half in development, and I sincerely want to thank each and every one of our customers for the constructive feedback, ideas, and countless hours on Zoom calls. Without you this journey wouldn’t be possible.
If you have any other features you think we could add or improve on, make sure to give us a shout! We love it when customers suggest new ideas.
Sincerely,
Austin
General
Search - Now powered by Elasticsearch and way more relevant.
I18n - We now support English, Spanish, French, German, and Dutch languages in the UI.
Tagging - Most models now support tags that can be used when searching.
Better UI - Both desktop and mobile have been redone. Better organization. Clear call to action.
Billing - Now handled through Stripe Billing Portal.
Pricing
Reduced to our original pricing model (Elite $15 - Pro $10 - Basic $0).
We’ve introduced a new “Enterprise” package (Enterprise $25).
Authentication and Security
Alerts
Public Pages - You can now make alerts public to the internet (default: false)
Users
Multiple emails per user
Multiple phones per user
The PagerTree iOS (iPhone) app now supports Critical Alerts bypassing do not disturb and the mute switch!
By default, Critical Alerts will be created for all critical urgency alerts in PagerTree.
In this blog post you’ll learn how to build a polymorphic select box in Ruby on Rails. Seems trivial, but isn’t. Let me save you some time.
Today I want to show you how to build a polymorphic select box in Ruby on Rails. Seems trivial, but it’s not. Let me show you the way and save you some time.
Now to the PagerTree specific issue at hand. In PagerTree, we have many models (think alerts, broadcasts, etc.). Those objects can be assigned/routed to many other objects. So for this example, a broadcast message can be sent to users, teams, and stakeholders; we’ll call these “broadcast recipients”.
The solution involves using signed global IDs and a utility method to set the broadcast recipients. I write this blog post in hopes that it can save you time and help you implement a polymorphic select box in Rails in a clean and secure manner.
My initial setup looked something like this: A broadcast can be created with many broadcast recipients. The broadcast recipient could be a user, team, etc. The broadcast controller only accepts “permitted” params, builds the broadcast, and saves it to the database.
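A minimal sketch of that starting point (model, attribute, and param names are assumptions):

```ruby
class Broadcast < ApplicationRecord
  has_many :broadcast_recipients
end

# The polymorphic "linker" model.
class BroadcastRecipient < ApplicationRecord
  belongs_to :broadcast
  belongs_to :recipient, polymorphic: true # a User, Team, etc.
end

class BroadcastsController < ApplicationController
  def create
    @broadcast = Broadcast.new(broadcast_params)
    @broadcast.save!
    redirect_to @broadcast
  end

  private

  def broadcast_params
    params.require(:broadcast).permit(:subject, :body, recipient_ids: [])
  end
end
```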
It’s a pretty standard setup. I would have expected that Rails would know what to do when our end-user selected a user or a team, but it didn’t. What was actually sent back by the form were the ids of the User/Team. That’s a big problem: server-side we don’t know what kind of object it was, and therefore don’t know whether it was User #1 or Team #1. You get the point.
I also made the assumption that Rails would automatically know the type of object, and it probably would have in a normal situation. The problem stems from the fact that the broadcast recipient is a linker model, and Rails didn’t know how to populate the polymorphic linker model.
A Global ID is an app wide URI that uniquely identifies a model instance
So essentially, I could make a select box with Global Ids, then the user could pick what they wanted and submit them back to our server. That’s a great start, but what happens if they modify the HTML and inject a different kind of model that shouldn’t be able to receive a broadcast (ex: an integration)? Even worse, what if the user is malicious, and starts injecting random models to see what they can poke around with (ex: AdminUser)? Eeek! We don’t want that.
Luckily though, there is also a version of Global IDs that are signed, namely Signed Global IDs (SGIDs). This makes it really hard for a malicious user to figure out the global ID or to inject their own. What’s even better is that SGIDs can be signed with an expiration time. This means they are not valid forever, and repeated calls will never generate the same SGID.
It’s worth noting that expiring SGIDs are not idempotent because they encode the current timestamp; repeated calls to to_sgid will produce different results.
The final solution looks like the following (notice the utility function used to set the recipient_users and recipient_teams).
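A sketch of that solution (method and purpose names are assumptions); locate_signed returns nil for anything expired, tampered with, or signed for a different purpose:

```ruby
class Broadcast < ApplicationRecord
  has_many :broadcast_recipients

  def recipient_users=(sgids)
    set_recipients(sgids)
  end

  def recipient_teams=(sgids)
    set_recipients(sgids)
  end

  private

  # Resolve SGIDs back into trusted models, then build the linker records.
  def set_recipients(sgids)
    records = Array(sgids).filter_map do |sgid|
      GlobalID::Locator.locate_signed(sgid, for: :broadcast_recipient)
    end
    records.each { |record| broadcast_recipients.build(recipient: record) }
  end
end
```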
We also need to add a helper for our view that sets the selected property. Because the SGIDs change on every call, we have to manually check if the broadcast had them already selected. Yes, this is not very efficient, but it’s a sacrifice we are willing to make for security.
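A sketch of such a helper (names are assumptions); since SGIDs differ on every call, we compare against the associated records rather than the SGID strings:

```ruby
module BroadcastsHelper
  def broadcast_recipient_options(broadcast)
    options = (User.all + Team.all).map do |record|
      sgid = record.to_sgid(expires_in: 1.hour, for: :broadcast_recipient).to_s
      selected = broadcast.broadcast_recipients.any? { |br| br.recipient == record }
      [record.name, sgid, { selected: selected }]
    end
    options_for_select(options)
  end
end
```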
In short, a polymorphic select box using linker models in Ruby on Rails is not so trivial. Using SGIDs and a utility accessor method, we can make a polymorphic select box that is simple and secure. I hope you’ve found value in this article and it has saved you some time :)
Reduced Pricing - has been reduced from $20 to $15.
Better Docs - New docs (with improved search) that are more clear and concise than before.
2FA - will keep your account safer.
We are excited to get this major release shipped, just in time for the holidays. You can check out the full details of the upgrade. Over the coming months we will continue to add features from our roadmap.
Documentation - Better documentation. Less redundant, clearer, and more concise.
2FA - has been added.
SSO - has been simplified.
Accounts -
- Alerts can now be commented on by users.
- Open Source - Check out .
- Notes and Attachments Section
- Many scheduling bugs have been fixed
- Better messaging - known as “Communications”
- Easier and Faster Filtering
- Option to enable auto recharge forever
A new version of the PagerTree iOS app is available today that adds support for Apple's Critical Alerts.
Critical Alerts are special notifications that bypass the mute switch, focus, and Do Not Disturb settings to generate an audible notification in emergency situations. Critical Alerts are only available to applications that have applied and been approved by Apple.
If you don’t have time to spare and read the background, jump down to the solution. The solution involves SGIDs and a utility accessor method, plus a form helper.
It’s no secret that as we build out PagerTree we’ve made the design decision to use Ruby on Rails as our framework of choice. Why, you ask? Ruby on Rails is a stable, battle-tested, and relatively simple MVC framework. Some of the largest companies use Ruby on Rails for their applications. As we think about the next chapter of PagerTree, we want it to follow standard conventions and tools so that it’s easy for new developers to work on.
As mature as the Ruby on Rails framework is, one would assume there is a standard and simple way to implement this. However, I found that there was not (at least in this use-case, with a linker model, namely our broadcast recipient). I spent a couple hours researching and fumbling through some code before I decided to ask Chris Oliver for some help on this issue (Chris, also known as excid3 on the internet, is the founder of GoRails). I thought this would be a 10 minute call; it turned into a 45 minute call with a solution that was way more complex than I expected.
Enter Global IDs (I believe this is already included in Rails 6.2+). A Global ID looks like gid://YourApp/Some::Model/id. It has the ability to uniquely identify a model by keeping its type and id.
| Server Instance | environment | http_requests_total (t0) | http_requests_total (t1) |
|---|---|---|---|
| web_prod_1 | production | 100 | 110 |
| web_prod_2 | production | 200 | 220 |
| web_prod_3 | production | 300 | 330 |
| web_stg_1 | staging | 10 | 20 |
Understanding SRE metrics and how they impact your platform's availability is fundamental to Site Reliability Engineering.
How available is your website, service, or platform? What must you monitor and measure to ensure availability? How do you translate uptime into availability? This chart has numbers that every Site Reliability Engineer (SRE) should know. Below the chart, you will find answers to commonly asked questions about SRE and associated metrics.
There is a saying in the NFL that goes, “A player’s best ability is his availability”. The same thing is true for websites, applications, and platforms. You can have a great website or the “best” cloud platform, but if it is not available for your customers when they need it, then your business and your reputation will suffer.
In this day and age, availability is everything, and it comes at a cost. Availability comes in many different forms, like redundancy, load balancing, multiple data centers, and engineering response, to name a few. To calculate availability, we typically look at how long a service was unavailable during a specified period of time, taking into account planned maintenance and other planned downtime.
Industry jargon refers to the number of “9’s” related to availability. For instance, one 9 would be 90%, while five 9’s would be 99.999%.
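For example, three 9’s (99.9%) works out to roughly 0.001 × 365 days × 24 hours ≈ 8.76 hours of allowable downtime per year, which lines up with the chart in this article.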
Metrics have become the lifeblood of many organizations. Deciding what to and what not to monitor can be just as important as the monitoring tools themselves (Prometheus, Grafana, systems, etc.). In many instances, there can be an overwhelming urge to gather metrics on every available function, potentially leading to information overload. To keep monitoring manageable and actionable, consider the following methods when determining your needs.
For hardware-related monitoring, consider the USE Method.
Utilization (% time that the resource was busy)
Saturation (amount of work resource has to do, often queue length)
Errors (count of error events)
For services-related monitoring consider the RED Method.
Rate (the number of requests per second)
Errors (the number of those requests that are failing)
Duration (the amount of time those requests take)
For Kubernetes-related monitoring of services, consider the Four Golden Signals.
Latency (time taken to serve a request)
Traffic (how much demand is placed on your system)
Errors (rate of requests that are failing)
Saturation (how “full” your service is)
Tom Wilkie of GrafanaLabs did a great talk on these at GrafanaCon EU 2018. For more information on these methodologies watch the video below or check out this article by Grafana Labs.
Site Reliability Engineering (SRE) dates back to 2003, when Google assigned a team of software engineers to design a concept that would make certain Google websites efficient, scalable, and reliable. The concepts they used were so successful that other technology companies, like Netflix and Amazon, began using similar concepts as well as improving upon them. In short order, SRE became its own tower within the IT architecture domain. SRE is meant to work in concert with DevOps but focuses on such things as capacity planning and disaster recovery and response. Ultimately, SRE focuses on the automation of operations, endeavoring to remove the human element so that sites, applications, and platforms can be optimized.
Understanding how availability impacts the delivery of your chosen platform starts with knowing what those numbers look like. For instance, the difference between 2 9’s and 5 9’s goes from days to minutes per year. Choosing the proper methodology, such as RED, USE, or the Four Golden Signals, will allow you to deliver high availability for your specific service. A good starting point to help you define your SRE operations is Google’s guide to SRE Operations.
You have identified a data breach, now what? In this blog post I’ll teach you how to streamline your incident response during a data breach with best practices.
You have identified a data breach, now what?
Your Incident Response Playbook is up to date. You have drilled for this; you know who the key players on your team are, and you have their home phone numbers, mobile phone numbers, and email addresses, so you get to work. It is seven o’clock in the evening, so you are sure everyone is available and ready to respond. You begin typing “that” email and making phone calls, one at a time.
There are a number of things wrong with this scenario:
How often do you drill and practice incident response?
Are we lucky enough for these incidents to happen at a decent time of the day?
How long does it take to write “that” email?
How long does it take to contact every person on your list?
The average cost of a data breach in 2020 was $3.86 million, according to a new report from IBM and the Ponemon Institute. – Dan Swinhoe, CSO Online
Regardless of where you identify a data breach on the Cyber Kill Chain or MITRE ATT&CK frameworks, internal notification and incident response are crucial, as every second counts.
Manual processes can be the single point of failure in our ever-evolving automated world. Passive communication channels, like email, leave the sender wondering if the recipient has received and read said email. This assumes that a person is sending an email alert. Today many security appliances and platforms are configured to send emails to static email addresses. In most instances, these appliances and platforms are outbound only with no way of confirming delivery or read receipts.
The proliferation of instant messaging applications allows us to look at our screen and see those three dots scrolling which in turn tells our brain the message has been delivered, read, and is being responded to, in real-time. It is this type of technology that continues to transform the digital workplace forcing companies to find solutions that allow us to “work as we live”.
Justine Phillips, a partner with Sheppard Mullin specializing in Privacy and Cybersecurity and a thought leader in data breach response, offered some insight into some of the challenges organizations face when responding to data breaches and other cybersecurity events.
Automation and real-time alert routing saves valuable time to contain and remediate a cyber event. It also gets the right people engaged at the right time to begin the forensic investigation. Many laws impose time-sensitive deadlines and the clock starts running when the event is discovered. – Justine Phillips
Spending on cybersecurity prevention, detection, and incident response has increased exponentially over the last decade, and that trend continues. As good as many of these products are, the number one notification channel continues to be electronic mail. E-mail is a passive form of communication, with the sender often having no idea if the email was received, let alone read. Forget about people changing email addresses or leaving organizations; once an email address has been entered into an appliance or platform, it is often forgotten. If your sentry sends up a signal and nobody sees it, your incident response will be delayed, or worse, never started.
To bridge this gap we should be looking at automation and alert routing platforms. An alert routing platform should have the ability to tie in your monitoring systems along with your preferred channel of communication: Voice, SMS, Push, Instant Messaging, and yes, even email. For instance, PagerTree allows you to take a single email address and transform it from a serial communication channel into a powerful multi-channel mechanism that triggers multiple communications to multiple people across many channels. In addition to multi-channel communications, PagerTree requires users to actively accept or reject a notification. This allows incident commanders to know, in real-time, who has acknowledged or rejected a given alert notification.
Regardless of your chosen alert routing platform, you should be looking for some of the following characteristics:
Configuration options that give you the ability to customize how you reach your team: Voice Call, SMS, Push Notification, Instant Messaging, Email
Easy to use scheduling calendars for one or many team/members
Configure how often a communication channel is repeated
Configure how long to try reaching a team member before moving on to the next person on the schedule
Utilize escalation layers in the event a team member is unavailable
Alerts initiated via email, webhooks, or other custom integrations
Redundant telecommunication channels
Key Performance Indicators (KPIs) like Mean Time to Respond
API and other integration opportunities
Ultimately, your alert routing platform should provide you with the confidence to move on to the next steps in your incident response playbook and not focus on who has and has not been alerted. This will allow you to focus your team’s efforts on mitigating the data breach or cybersecurity event, mere seconds after the notification process is initiated, saving you time, frustration, and money.
If you are still searching for an intelligent alert routing platform, check out PagerTree. Click here to start a fully functional, risk-free trial that could help your organization if it faces a data breach.
GitHub Actions are a great way to automate the build and deploy process for your repos.
In this tutorial, I will show you how to build and deploy a Jekyll static site to AWS S3 + Cloudfront using GitHub Actions. At PagerTree we use GitHub Actions to automate the building and deploying of our marketing site pagertree.com.
These days, if you have to do anything manually more than a couple of times, you should probably be automating it. GitHub Actions make it easy to automate software workflows. At PagerTree, we use GitHub Actions to deploy our marketing site in a continuous and reliable way.
For this tutorial, I’ll make the assumption that you are fairly familiar with git and Jekyll and already have a static website hosted on AWS S3 + Cloudfront.
Below I’ve listed what you’ll need for this tutorial. I’ll assume you are dangerous enough to create the following on your own and won’t cover how to create these, as it’s out of the scope of this post.
Jekyll static site
GitHub Account and Repo
Our desired workflow should look something like the following:
On push to our repo’s main branch or when manually clicked in GitHub:
Build the main branch.
Deploy the generated static site files to AWS S3.
Create an AWS Cloudfront invalidation.
This is pretty minimal, and you can get waaay fancier, but for the purpose of this tutorial it should help us understand how to use GitHub Actions.
Your GitHub Actions definitions live in a special directory in your repo (<repo>/.github/workflows/). Inside this directory, you’ll have all your workflow files (yml format).
Workflows will trigger off events (aka specific activities) that happen in GitHub. There are quite a few, but for this tutorial we will focus on the push and workflow_dispatch events.
In your <repo>/.github/workflows/ directory, create a new file called build_and_deploy.yml. Copy and paste the following into your newly created GitHub Action workflow:
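Here’s a sketch of what that workflow could look like (action versions and the AWS region are assumptions - adjust them to your setup):

```yaml
name: CI / CD

on:
  push:
    branches: [main]
  workflow_dispatch:

jobs:
  build_and_deploy:
    runs-on: ubuntu-latest
    steps:
      # Checkout the main branch
      - uses: actions/checkout@v4

      # Installs Ruby (respecting .ruby-version) and runs `bundle install`
      - uses: ruby/setup-ruby@v1
        with:
          bundler-cache: true

      # Build the site with the production environment
      - name: Build
        run: bundle exec jekyll build
        env:
          JEKYLL_ENV: production

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      # Upload the generated static files to S3
      - name: Deploy
        run: aws s3 sync _site "s3://${{ secrets.AWS_S3_BUCKET_NAME }}" --delete

      # Invalidate Cloudfront so we can see the new site immediately
      - name: Invalidate
        run: >
          aws cloudfront create-invalidation
          --distribution-id "${{ secrets.AWS_CLOUDFRONT_DISTRIBUTION_ID }}"
          --paths "/*"
```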
This workflow file is responsible for building and deploying the site. We’ve named it “CI / CD”. It’s pretty self-explanatory, but I’ll explain the process:
When a push is made into the main branch (or manual button click in GitHub), run this workflow.
The job - Use the Ubuntu latest virtual environment (see all environment options here)
Checkout our main branch.
Setup our Ruby environment (see docs) - Installs Ruby (with specified Ruby version if you have a .ruby-version file) and runs ‘bundle install’.
Build the site (with the production environment).
Uploads output files from our _site directory to our S3 bucket.
Creates a Cloudfront invalidation (so we can see our new site immediately).
Pretty straight forward, but we still need to create a few resources in AWS and configure secrets in our GitHub repository.
We’ll need to create 2 AWS resources, namely an IAM Policy and User.
IAM Policy - will grant restricted access to deploy to our S3 bucket and create an invalidation on our Cloudfront distribution. You’ll attach this policy to the IAM User.
IAM User - will be the credentials the GitHub Action uses to run its aws-cli commands.
Below is the AWS IAM Policy you’ll need to create. You must modify it by replacing a couple of the items below (make sure to replace the ‘<’ and ‘>’ too).
<your-bucket-name> - Your S3 bucket name (ex: www.acme.com)
<your-aws-account-number> - The 12 numeric characters of your AWS account.
<your-distribution-id> - The 14 alphanumeric characters of your associated Cloudfront distribution.
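A sketch of that policy, using the placeholders above:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::<your-bucket-name>"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::<your-bucket-name>/*"
    },
    {
      "Effect": "Allow",
      "Action": ["cloudfront:CreateInvalidation"],
      "Resource": "arn:aws:cloudfront::<your-aws-account-number>:distribution/<your-distribution-id>"
    }
  ]
}
```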
In AWS, create a new IAM Policy
Create a new IAM User with programmatic access and attach the IAM Policy you just created above.
Copy the AWS access key ID and Secret access key to somewhere safe, as we will need these in our next step.
In order to use the special variables like ${{ secrets.AWS_ACCESS_KEY_ID }}, we’ll need to configure them in the GitHub Actions Secrets. To do this:
In GitHub, navigate to Your Repo > Settings > Secrets > Actions
For each secret below, click the New repository secret button, fill out the form, and click Add Secret
AWS_ACCESS_KEY_ID - What you copied in the previous step as AWS access key ID.
AWS_SECRET_ACCESS_KEY - What you copied in the previous step as Secret access key.
AWS_S3_BUCKET_NAME - The bucket name you set previously in your IAM Policy (ex: www.acme.com).
AWS_CLOUDFRONT_DISTRIBUTION_ID - The Cloudfront distribution id you set previously in your IAM Policy.
The easiest way to test your new GitHub action, is to:
Make a small change to your Jekyll site
Commit the change, and push to main.
Navigate to your website (https://www.acme.com), do a hard refresh (Ctrl + F5) and then you should see the changes you just made.
In the GitHub Actions panel, you should see a workflow that was created and the output of the commands that were run.
Note: The first time this runs it could take ~5 minutes. The ‘bundle install’ command for our project took a while, but don’t worry, subsequent builds should use the bundle cache.
That’s it, you’ve now successfully created a GitHub Action to build and deploy your Jekyll static site to S3 and Cloudfront. I hope you found some value in this tutorial; it’s pretty basic, but if you’re new to GitHub Actions it should provide a valuable launching pad. Make sure to follow me on twitter, and if you haven’t yet, make sure to check out PagerTree :)
In this guide, I’ll show you how to migrate away from attr_encrypted to the new Active Record encrypts.
Rails 7 has introduced Active Record Encryption, functionality to transparently encrypt and decrypt data before storing it in the database. This is awesome news for any developer who has ever had to encrypt data before storing it.
In this guide, I will walk you through an example of migrating away from the attr_encrypted gem to the new Rails 7 Active Record encrypts. We will do this using strong migrations and also maintain the ability to perform a database rollback without data loss.
If you are short on time, below is the crux of this article. If you are actually implementing this, I would highly encourage you to read on, this can be a fairly complex migration.
This article is written on 13 April 2021 - Currently Rails 7 is Edge (aka alpha). This tutorial makes certain assumptions based on that. I will publish updates when Rails 7 is officially released.
attr_encrypted and Active Record encrypts are not compatible - you’ll need to use a fork of attr_encrypted
devise - currently needs the patch-2 branch to work with Rails 7
Upgrade to Rails 7
Add dynamic attributes to model
Perform Migrations
Delete attr_encrypted gem dependency
Most applications at some point in time need to encrypt data before storing it in the database (and conversely decrypt it before using it in the application). Historically, there have been 2 gems that were fairly popular for this sort of functionality, namely attr_encrypted and lockbox. I personally have preferred lockbox, since it’s still actively maintained and uses fewer columns, but if you are like me, you can’t always choose what’s handed to you.
Unfortunately, the attr_encrypted gem is no longer maintained and has a lot of name clashes with the Rails 7 Active Record encrypts functionality. To work around this, we had to create a fork and rename many of the function calls and properties (namely encrypt, decrypt, etc.). You too will need to use the PagerTree fork of the attr_encrypted gem during your migration process (but don’t worry, you can delete it after your migration).
You’ll first need to upgrade to Rails 7. As of this writing (13 April 2021), Rails 7 is Edge. This tutorial will use syntax and functionality that is currently in alpha.
In your gem file you’ll need to change:
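A sketch of the relevant Gemfile changes (the exact git refs are assumptions - point these at the forks you are actually using):

```ruby
# Gemfile
gem "rails", github: "rails/rails", branch: "main" # Rails 7 (Edge)

# Renamed fork of attr_encrypted (hypothetical path - use your own fork)
gem "attr_encrypted", github: "<your-fork>/attr_encrypted"

# Devise needs the patch-2 branch to work with Rails 7 (per the note above)
gem "devise", github: "<your-devise-fork>", branch: "patch-2"
```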
And then make sure to install the new dependencies, and update any others.
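For example:

```bash
bundle install
bundle update
```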
At this point, we now have Rails 7 installed, with a compatible version of attr_encrypted.
Following encrypts documentation, you’ll need to add some keys to your rails credentials file.
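Current versions of Rails ship a generator that prints a fresh set of random keys (a sketch; verify the command against your Rails version):

```bash
bin/rails db:encryption:init
```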
Copy the output YAML and paste it into your credentials file. It should look something like this:
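With placeholders in place of the generated values:

```yaml
active_record_encryption:
  primary_key: <generated key>
  deterministic_key: <generated key>
  key_derivation_salt: <generated salt>
```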
You’ll need to do this once for each environment (normally development, staging, production).
At this point, the Active Record encrypts should be ready to go.
The next steps will be to migrate any data that was previously using attr_encrypted to use the new encrypts methods. Because we want to be secure and also use strong migrations our process should look like this:
Modify our model to dynamically define attributes to use during our migration
Create a new temporary column for our old encrypted data (attr_encrypted)
Copy old encrypted data to new temporary column, and delete old column
Add a new column for our Rails 7 Active Record encrypts data
Run a migration to programmatically decrypt attr_encrypted temporary column and put it in Rails 7 Active Record encrypts column
Delete temporary column
It seems like a lot of overkill, but we do it this way so we don’t perform any dangerous database operations and we keep strong migrations happy. This process will also keep our migrations backward compatible and prevent data loss in case we ever need to roll back.
I’m going to make the assumption that you are fairly dangerous when it comes to coding and that you are relatively familiar with the Rails framework. Please use this example as a guide; you’ll need to make modifications to your own code to make this work for you.
Below is what I will assume is our starting point. We have a User model that has an attribute called otp_secret (which stands for “one time password secret”, used for two factor authentication).
The otp_secret property currently uses attr_encrypted. This means in our database we should have the following columns:
We’ll take advantage of the fact that attr_encrypted prefixed its column names with “encrypted”. By copying our data into a temporary column, we can avoid name clashes, and use the encrypts functionality almost transparently (you’ll see below how the names will come full circle).
We need to add some extra code to dynamically define attributes. During the migration, only two of these columns will ever exist at a time, making it so that we can migrate our columns without name clashing.
The temporary column will just hold a copy of our existing attr_encrypted field. We move data here for strong migrations and so the Rails 7 encrypts column doesn’t conflict with the attr_encrypted accessor.
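A sketch of what that could look like (the key lookup and column names are assumptions; note that checking column_names requires a database connection at load time):

```ruby
# app/models/user.rb - temporary, only while migrating
class User < ApplicationRecord
  # Old attr_encrypted accessor, pointed at the temporary column (if present).
  if column_names.include?("encrypted_otp_secret_tmp")
    attr_encrypted :otp_secret_tmp,
      key: Rails.application.credentials.attr_encrypted_key, # assumption
      attribute: "encrypted_otp_secret_tmp"
  end

  # New Rails 7 Active Record encryption, once the new column exists.
  encrypts :otp_secret if column_names.include?("otp_secret")
end
```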
You’ll want to create a new migration that copies the original attr_encrypted column to the one we just created, but make sure you define both up and down so that you have backward compatibility.
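A sketch of that migration (class and column names are assumptions; safety_assured comes from the strong_migrations gem):

```ruby
class MoveEncryptedOtpSecretToTmp < ActiveRecord::Migration[7.0]
  def up
    add_column :users, :encrypted_otp_secret_tmp, :string
    add_column :users, :encrypted_otp_secret_tmp_iv, :string

    # Raw copy of the ciphertext and IV - nothing is decrypted here.
    execute <<~SQL
      UPDATE users
      SET encrypted_otp_secret_tmp    = encrypted_otp_secret,
          encrypted_otp_secret_tmp_iv = encrypted_otp_secret_iv
    SQL

    safety_assured do
      remove_column :users, :encrypted_otp_secret
      remove_column :users, :encrypted_otp_secret_iv
    end
  end

  def down
    add_column :users, :encrypted_otp_secret, :string
    add_column :users, :encrypted_otp_secret_iv, :string

    execute <<~SQL
      UPDATE users
      SET encrypted_otp_secret    = encrypted_otp_secret_tmp,
          encrypted_otp_secret_iv = encrypted_otp_secret_tmp_iv
    SQL

    safety_assured do
      remove_column :users, :encrypted_otp_secret_tmp
      remove_column :users, :encrypted_otp_secret_tmp_iv
    end
  end
end
```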
Now we’ll add a new column, where we will store the Rails 7 Active Record encrypts data.
It’s important that the column be of type :text. The Rails guides specify that the column should be at least 510 bytes.
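A sketch:

```ruby
class AddOtpSecretToUsers < ActiveRecord::Migration[7.0]
  def change
    # :text gives the ciphertext plenty of room (at least 510 bytes per the guides).
    add_column :users, :otp_secret, :text
  end
end
```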
In this step, we generate a migration to move data from the attr_encrypted property to the Rails 7 Active Record encrypts property. We have to do this programmatically (and can’t use a shortcut db command) because the Rails engine is what actually does the encrypt and decrypt work for us.
Additionally, we do some special reloading of the User model because of how we have dynamically defined attributes (again, this is meant to be temporary while we migrate).
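A sketch of that data migration (accessor names follow the dynamic attributes defined earlier):

```ruby
class MigrateOtpSecretToRailsEncryption < ActiveRecord::Migration[7.0]
  def up
    # Pick up the dynamically defined attributes for the current schema.
    User.reset_column_information

    User.find_each do |user|
      # attr_encrypted decrypts the tmp column; assigning to otp_secret lets
      # Rails 7 Active Record encryption re-encrypt the value on save.
      user.otp_secret = user.otp_secret_tmp
      user.save!(validate: false)
    end
  end

  def down
    User.reset_column_information

    User.find_each do |user|
      user.otp_secret_tmp = user.otp_secret
      user.save!(validate: false)
    end
  end
end
```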
Our last step is to remove our temporary column, so our database is kept nice and clean. Again, we define the up and down methods in this migration so we are backward compatible, and so that, if for any reason we need to, we can go back in time and re-create our data.
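A sketch, mirroring the copy migration above:

```ruby
class RemoveEncryptedOtpSecretTmp < ActiveRecord::Migration[7.0]
  def up
    safety_assured do
      remove_column :users, :encrypted_otp_secret_tmp
      remove_column :users, :encrypted_otp_secret_tmp_iv
    end
  end

  def down
    add_column :users, :encrypted_otp_secret_tmp, :string
    add_column :users, :encrypted_otp_secret_tmp_iv, :string
  end
end
```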
Now you should be able to run all your newly created migrations with one swift command.
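```bash
bin/rails db:migrate
```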
You can now safely remove the attr_encrypted dependency in your gem file. However, be aware that this will break the existing rails db:create db:setup process (for example, in development). You’ll likely want to use rails db:setup instead, so that it loads from the schema file, and at some point squash your migrations directory.
I hope you find some value in this tutorial and it can save you time and effort when it comes to migrating away from attr_encrypted. There’s probably a lot I missed on here, so if you have something to add you can reach out to me on twitter and I will update the article with your suggestion.
Some other notes on snags I came across during development.
Problems when creating a database with Devise and Rails 7 Active Record Encrypts
The Rails 7 Active Record encrypts seems to break db:create when used in conjunction with Devise. I didn’t dig too far into this, but Rails complains that the encrypts modifier can’t properly check the database column size. That makes sense, since there is no database yet, but it did force me to create a hack on the User model. It didn’t seem to affect other models that didn’t interact with Devise.
I assume this will get fixed at some point and is just a Devise + Edge (alpha) thing.
Ruby on Rails Cheat Sheet - A quick reference guide to common ruby on rails commands and usage.
Table of Contents:
Evaluation and Output
Evaluation can be done with the <% %> syntax, and output can be achieved with the <%= %> syntax.
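For example:

```erb
<% if user.admin? %>   <%# evaluated only - renders nothing %>
  <%= user.name %>     <%# evaluated AND rendered %>
<% end %>
```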
You can render partials like so:
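For example (paths and locals are placeholders):

```erb
<%= render "shared/menu" %>
<%= render partial: "product", locals: { product: @product } %>
```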
Common rails commands. (Note: “rails g” stands for “rails generate”)
Common rake commands for database management and finding routes.
:boolean
:date
:datetime
:decimal
:float
:integer
:primary_key
:references
:string
:text
:time
:timestamp
Before filters are registered via before_action and can halt the request cycle.
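A sketch:

```ruby
class PostsController < ApplicationController
  # Runs before the listed actions; rendering or redirecting
  # inside a before filter halts the request cycle.
  before_action :set_post, only: [:show, :edit, :update]

  private

  def set_post
    @post = Post.find(params[:id])
  end
end
```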
Application configuration should be located in config/application.rb, with specific environment configurations in config/environments/. Don’t put sensitive data in your configuration files; that’s what the secrets are for. You can access configurations in your application code with the following:
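For example, using the config.x namespace for custom configuration (names are placeholders):

```ruby
# config/application.rb
config.x.payment_processing.schedule = :daily

# Anywhere in application code:
Rails.configuration.x.payment_processing.schedule # => :daily
```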
Application secrets are just that, secret (think API keys). You can edit the secrets file using the following command: rails credentials:edit --environment=env_name. This will create files in the config/credentials/ folder. You’ll get two files:
environment.yml.enc - This is your secrets, encrypted. This can be put into git.
environment.key - This contains the key that encrypts the file. DO NOT put this into git.
Additionally, when deploying, the key inside the environment.key file will need to be placed into the RAILS_MASTER_KEY environment variable. You can then access secrets in your rails code like so:
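For example (key names are placeholders):

```ruby
Rails.application.credentials.aws[:access_key_id]
# or, nil-safe:
Rails.application.credentials.dig(:aws, :access_key_id)
```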
A short list of gems, frameworks and education materials that I have found useful in my Rails journey.
Host a publicly available form where customers or associates can input alerts outside the PagerTree ecosystem!
Today, we are excited to announce a new integration - PagerTree Forms!
From the integrations page, click the “Create Integration” button.
Find the Form Integration, then click the logo.
On the new integration form, fill in the integration details.
Click Create
In Chrome or Firefox, paste the CNAME URL into the browser URL bar.
There’s a bounty of options to set including:
Title - What appears in tab title of the browser
Header - Custom form header
Instructions - Custom instructions
Footer Text & Link - Custom link back to your site
Custom CNAMES - Link your own subdomain with self signed certificates
… and many more
This integration was really fun to build and I hope you can take full advantage of this feature today!
In this blog post I will assist you in installing a Ruby on Rails development environment with a simple step-by-step process.
Ruby on Rails is an excellent framework for web application development. For those of you who are new to RoR, like me, you will need to install several different applications (referred to as dependencies) to ensure this runs smoothly.
Here are the packages, tools, and databases we will be installing:
Create a GitHub account - Our preferred vendor that allows us to host git repositories in the cloud.
Here we will be navigating through the steps to get your Ruby on Rails development environment setup and all of the dependencies installed.
You will need to run the commands below in your terminal to install git.
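For a Debian-based system:

```bash
sudo apt-get update
sudo apt-get install git
git --version   # verify the install
```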
If there are none, you will then run the next commands to generate a new one.
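A sketch using the key type GitHub currently recommends (the email is a placeholder):

```bash
ssh-keygen -t ed25519 -C "your_email@example.com"
```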
Next, press ENTER.
Next you will be giving it your information.
Once you have the SSH key generated you will need to add it to the ssh-agent to manage. In the command line enter:
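```bash
eval "$(ssh-agent -s)"   # starts the agent in the background
```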
To add it, enter:
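```bash
ssh-add ~/.ssh/id_ed25519   # path assumes the ed25519 key generated above
```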
Now rbenv should be installed, but we also need to add some startup scripts to your bash profile, so that your terminal uses rbenv instead of the system wide Ruby version.
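A sketch for a bash shell (paths assume a standard rbenv install):

```bash
echo 'export PATH="$HOME/.rbenv/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(rbenv init -)"' >> ~/.bashrc
source ~/.bashrc
```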
For our setup, let’s run the latest and greatest (as of this writing) version of Ruby (3.0.1). To install this version of Ruby, we will use rbenv. Run the following in your terminal:
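```bash
rbenv install 3.0.1
rbenv global 3.0.1
ruby -v   # should report ruby 3.0.1
```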
Share a team’s on-call calendar with the rest of the world!
By default, all calendars are private, so to make use of this feature you must enable it.
Navigate to the Team Page you would like to share the schedule for.
In the top information box, click the globe icon.
Click confirm.
Click your new public team calendar link.
There’s a bounty of options to set including:
Password Protect? - Should this calendar require a password to access it?
Show User Emails? - Should we display user emails?
Show User Phones? - Should we display user phones?
Special Message - Show a special message at the top of the calendar?
and many more…
Start taking advantage of this feature today!
We’ve just added even more chatbots for PagerTree. Connect Slack, Mattermost, Microsoft Teams, and Google chat to PagerTree and get the most out of your on-call rotation!
This takes full advantage of the previous attr_encrypted nomenclature. After the migration, we should still be able to access the otp_secret by using user.otp_secret. See how that just came full circle?
If it worked, congrats! If for some reason it doesn’t work, check the error output. It could be a simple syntax error, or something specific to your setup. Here is where I am counting on you to be dangerous and figure out what could have happened.
We’ve been doing some Ruby on Rails development lately, and we wanted to put together a Ruby on Rails cheat sheet. This is a quick reference guide to common Ruby on Rails commands and usage.
Hashes were one of the most confusing things to me when first starting Ruby (not because they are a new concept, but because I found the syntax very hard to read). In newer versions, the syntax is very similar to JSON notation. Just know there are two versions of the syntax, an older and a newer one.
Also, you can have symbols as keys for hashes, and they do not look up the same values as strings.
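For example:

```ruby
# Older "hash rocket" syntax
old_style = { :name => "Austin", :role => "admin" }

# Newer syntax (looks like JSON)
new_style = { name: "Austin", role: "admin" }

# Symbol keys and string keys are NOT interchangeable
h = { "name" => "Austin" }
h["name"] # => "Austin"
h[:name]  # => nil
```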
Instead of checking for nil or undefined values, you can use the safe navigation operator. There’s a nice article that goes into more depth.
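For example:

```ruby
user = nil
user&.name # => nil, instead of raising NoMethodError
```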
Migration data types. Below is a list of the types I commonly reference.
Filters are methods that are run “before”, “after”, or “around” a controller action. See the official documentation for details.
Gusto has a really nice article on this.
This table references the official documentation. Check out the full documentation for other special callbacks like after_touch.
A couple of basic (and most commonly used) queries are below. You can find the full documentation in the official Rails guides.
Additionally, you are likely to want to check for the existence of a condition many times. There are many ways to do this, namely present?, any?, empty?, and exists?, but the exists? method will be the fastest way to check whether at least one record matching a query exists.
- Easy multi-tenancy for rails database models.
- Rails engine for flexible admin dashboard.
- Flexible authentication system.
- Provides “Login As” another user functionality for Devise.
- Generate fake data like names, addresses, and phone numbers. Great for test data.
- Expose a hashid instead of primary id to your users.
- Display friendly client side local time.
- Encryption for database fields (model attributes). Just use Rails 7 native encryption instead; see the migration guide above.
- Gold standard pagination gem.
- Rack middleware (before Rails) for blocking & throttling.
- A rails Google Recaptcha plugin - You’ll want this one especially for public facing forms to stop bot crawlers.
- A tiny framework for sprinkles of Javascript for your front end.
- Generate scoped ids (ex: per tenant ids for models, aka friendly id).
- Redis backed background processing for jobs.
- A scheduler for Sidekiq (think running weekly email reports).
- Makes your web app feel faster (like a single page application).
- A SaaS Framework already supporting login, payment (Stripe, Paypal, and Braintree) and multi-tenant setup.
- A utility first CSS framework. Seems a little verbose at first, but you’ll really learn to love it. Just by reading the code, you’ll know exactly what the screen will look like.
- Ruby on Rails tutorials, guides, and screencasts.
I hope you find some value in this cheat sheet. There’s probably a lot I missed on here, so if you have something to add you can reach out to me on Twitter and I will update the article with your suggestion.
PagerTree Forms are simple (PagerTree hosted) forms that can be made public so your customers can quickly create an alert outside the PagerTree ecosystem.
PagerTree Forms also support custom CNAMEs so you can host them on your own domain (ex: https://support.example.com). The CNAME option is secured via HTTPS using self-signed certificates.
PagerTree Forms are available today. If you don’t already have an account, sign up for a free-trial now.
For full details, check out the documentation.
Today we will install Ruby on Rails (RoR) on a Debian Linux operating system (Ubuntu LTS). With that said, RoR is compatible with other operating systems with just a few tweaks. This blog will assist you in installing RoR with a simple step-by-step process. Your installation may differ; for other operating systems, refer to the official installation guides.
I am new to developing and have been using Ubuntu LTS, a flavor of Debian Linux, for my projects. This blog will provide the steps and information needed to get the environment and dependencies installed for RoR so you can get your first project going.
- Git - A distributed version control system.
- SSH - Secure Shell is a protocol that allows users to control and modify their remote servers over the Internet while ensuring security.
- Homebrew - Software package manager that simplifies the installation process for Mac OSX and Linux.
- rbenv - A tool that manages, installs, and runs multiple versions of Ruby.
- My preferred code editor.
- A relational database used for long term storage.
- A key-value database used for short term storage (caching).
- NodeJS - Javascript runtime environment. Runs on the Chrome V8 engine and executes javascript code outside of a web browser.
- Yarn - A more secure npm (node package manager - gets installed with NodeJS).
Remember, git is the program for distributed version control, and GitHub is our preferred vendor. So, if you haven’t already, create an account with GitHub.
You will need to generate an SSH key and connect it to GitHub. We will first check to see if there are any existing SSH keys. Run this command to see if there are any pre-existing SSH keys:
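```bash
ls -al ~/.ssh   # lists existing keys, if the directory exists
```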
The final step is to add the SSH key to GitHub. Follow GitHub’s documentation to do so.
Homebrew is a package manager (similar to apt-get) that helps us install other packages on our system. To get the Homebrew package installed, you will have to run the below command:
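The current install command from the Homebrew project (verify against brew.sh before running):

```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```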
Remember, rbenv is a tool that will help us manage installing and running multiple versions of Ruby. To install rbenv, run the following:
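A sketch using Homebrew (installed above); ruby-build provides the rbenv install command:

```bash
brew install rbenv ruby-build
```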
You can follow the directions in the link to get the correct version installed on your device.
This will be our relational database preference for our RoR setup. To install:
This is our key-value database that RoR uses for caching.
The link will take you through the steps to get the correct version installed on your device and will give you a thorough understanding.
For the final step, we will be installing the package manager by running the command below.
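Assuming Yarn (as described above), the classic install via npm:

```bash
npm install --global yarn
yarn --version
```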
Now that your environment is ready, you can dive into your first project. All in all, Ruby on Rails is a great development environment. It is easy to navigate, scalable, and excellent for team projects. Looking for more useful information on Ruby on Rails? Check out our Ruby on Rails Cheat Sheet.
Today, we are excited to announce PagerTree has added support for public calendars! Public calendars allow you to share a team’s on-call calendar with the rest of the world.
Public Calendars are available today. If you don’t already have an account, sign up for a free-trial now.
For details, check out the full documentation.
Today, we are excited to announce PagerTree has added 3 new chatbot services: Mattermost, Microsoft Teams, and Google Chat (this is in addition to our core Slack notification channel).
Chatbots are available on all pricing tiers free of charge! If you don’t already have an account, sign up for a free-trial now.
| Uptime | Downtime (Per Year) |
|---|---|
| 99% | 3 Days : 15 Hours : 39 Minutes |
| 99.9% | 8 Hours : 45 Minutes : 56 Seconds |
| 99.99% | 52 Minutes : 35 Seconds |
| 99.999% | 5 Minutes : 15 Seconds |
| 99.9999% | 31 Seconds |
| 99.99999% | 3 Seconds |
| column_name | column_type |
|---|---|
| encrypted_otp_secret | :string |
| encrypted_otp_secret_iv | :string |
| Command | Description |
|---|---|
| rails g model | Generates model and migration files |
| rails g scaffold | Generates controller, model, migration, view, and test files. Also modifies routes |
| rails g controller | Generates controller and view files. Useful if you already have the model |

| Command | Description |
|---|---|
| rake routes | View all routes in application (pair with grep) |
| rake db:seed | Seed the database using the db/seeds.rb file |
| rake db:migrate | Run any pending migrations |
| rake db:rollback | Rollback a database migration (add STEP=2 to remove multiple migrations) |
| rake db:reset | Destroy the database, re-create it, and run migrations (useful for development) |
| New Record | Updating Record | Destroying Record |
|---|---|---|
| save | save | destroy |
| save! | save! | destroy! |
| create | update_attribute | |
| create! | update | |
| | update! | |
| before_validation | before_validation | |
| after_validation | after_validation | |
| before_save | before_save | |
| around_save | around_save | |
| before_create | before_update | before_destroy |
| around_create | around_update | around_destroy |
| after_create | after_update | after_destroy |
| after_save | after_save | |
| after_commit / after_rollback | after_commit / after_rollback | after_commit / after_rollback |
| Command Example | Description |
|---|---|
| Model.find(1) | Find model by id |
| Model.where(active: true) | Find models where conditions |
| Model.find_by(name: "Austin") | Find models where condition |
| Model.where.not(active: true) | Find models where condition not true |
| Model.first | Get the first model in the collection (ordered by primary key) |
| Model.last | Get the last model in the collection (ordered by primary key) |
| Model.order(created_at: :desc) | Order your results or query |
| Model.select(:id, :name) | Select only specific fields |
| Model.limit(10).offset(20) | Limit and offset (great for pagination) |
Learn how maintenance windows make it easy to suppress alerts from integrations during specific times.
Today, we are excited to announce PagerTree now officially supports maintenance windows! Long overdue (technically from our 2019 roadmap), with maintenance windows it’s now easier than ever to suppress alerts from integrations during specific time periods.
Maintenance Windows are available on our Pro and Elite pricing plans. If you don’t already have an account, sign up for a free-trial now.
Easily set a start/end date & time for when alerts should be suppressed
Specify all or specific integrations the maintenance window affects
Easily enable/disable/delete the maintenance window when complete
A maintenance window can be quickly created via an integration page.
Click the Maintenance Window dropdown (top right actions).
The maintenance window will be created and you will be redirected to the integration page.
You can find the full documentation on maintenance windows here.
Learn how schedule rotations make it easy to rotate a list of users that need to go on-call for 24/7 support.
Today, we are excited to announce PagerTree now officially supports schedule rotations! A long awaited feature and requested by many customers, with schedule rotations it’s now easier than ever to schedule a list (or “rotation”) of people for full coverage support.
Schedule rotations are available on our Pro and Elite pricing plans and are technically a subset of our “recurring schedules” feature. If you don’t already have an account, sign up for a free-trial now.
Schedule Rotation Features:
Easily schedule a list of users that should “rotate” through the calendar
Select a custom frequency for the rotation
Easily edit/clone/delete an entire rotation
Configuring rotations is easy!
On the calendar, use the cross hairs to select a date range for the length of the initial event.
Configure the event:
Select multiple users for the rotation.
Check the repeat flag
Check the rotation flag
Drag and drop the user to change the order of the rotation
Voila! Just like that you have a “rotating” schedule.
In the future if you ever need to update, clone, or delete the rotation:
Double click any of the rotation events.
Click the Pencil Icon.
Click All Events (in this series)
Modify any of the attributes. (This could be layer, rotation order, frequency, etc.)
Click Save button.
Route incoming phone calls to the right person with the on-call schedules and escalation policies you already use. Add real-time conversations to your support workflow!
Use the Email to Slack integration to keep an entire stakeholder channel up to date during on-going incidents.
To get started, you’ll need several things.
From your Slack workspace:
In the left hand menu, click the + button, next to the Channels.
In the Create a channel form, let’s add the name of the group to be stakeholders, then click the Create Channel button.
Click the Install Button.
Select the #stakeholders channel we just created above, then click Add Email Integration
Copy the email address that slack provides you.
Click Save Integration
Paste the Slack email address in the additional emails section of the PagerTree stakeholder.
Click Create.
🚀 Now test it out by manually creating an incident in PagerTree. You should see a notification appear in the #stakeholders channel in Slack.
I hope you found this useful, and can use it to keep stakeholders informed during ongoing incidents.
–Austin
Docker Commands Cheat Sheet- A quick reference guide to Docker CLI commands used on a daily basis: usage, examples, snippets and links.
By no means is this an extensive list of docker commands. I kept it short on purpose so you could use it as a quick reference guide. I’ve also omitted the topic of building images and the commands that are associated with that.
Command: docker ps
Description: Show running containers.
Normally I will use this command just as often as I use the ls command on a *NIX terminal. It’s especially useful when you first SSH to a machine to check what’s running. It’s also useful during the development and configuration process, since you’ll likely have containers stopping/starting/crashing.
Command: docker exec -it <container name> /bin/sh
Description: SSH into a container.
This is probably my 2nd most popular command. Normally I am using this while trying to debug a container and need to shell into it. Just note the -i flag means interactive and -t means TTY (aka a teletype terminal). Also, you can use any command instead of /bin/sh; I only put that here because I frequently SSH into an alpine image, which doesn’t support bash.
Command: docker restart <container name>
Description: Restart a container.
Command: docker stats
Description: Display a live stream of running containers usage statistics.
Command: docker system df
Description: Display information about disk space being used by your containers.
Command: docker system prune -af
Description: Remove all unused images (dangling and unreferenced), containers, networks, and volumes.
You’ll probably only use this command on a Docker build machine or on your dev box, nevertheless take note, cause you are likely to use it.
You can download this entire blog article with the "Export as PDF" link in the top right of this page (you might have to be on the desktop version). Additionally, below I've provided some PDFs from the web I have found useful.
Select the duration the maintenance window should span.
Today, we are excited to announce PagerTree now officially supports Live Call Routing! With Live Call Routing you can route incoming phone calls to the right person using your existing on-call schedules and escalation policies.
Purchase and manage global phone numbers using
Route incoming phone calls to PagerTree teams with and
Better yet, Live Call Routing can be configured to route to multiple teams. When configured to route to multiple teams, callers will be presented with team options. For example: “Press 1 for Devops, Press 2 for Network Operations, Press 3 for Security”.
Live Call Routing is available today. To get set up, make sure to follow the documentation.
Recently, while working with a customer, I saw a really cool use of the Email to Slack integration to send messages into a Slack channel. They did this by using Slack’s Email integration along with PagerTree’s Stakeholder notifications. Today I’d like to share with you how they did it.
A PagerTree subscription with the Stakeholders feature.
A Slack account with a workspace.
In your browser, navigate to the .
Enable .
group.
In this article I will highlight the 6 key Docker commands I use on a daily basis while using Docker in the real world.
At the bottom of the page, I’ll also put some good links to other Docker resources I like or frequently use, as well as other PagerTree cheat sheet documents. For a full crash course, check out the links below!
Official Docs:
Official Docs:
Official Docs:
I debated putting this command in here, since I don’t use it all that often, but it’s a nice to have. A great example of when to use this: you change a configuration file and need the container to pick up the changes.
Other commands you might use often, but that I didn’t think were worthy of their own section, are docker start and docker stop. You’ll use these commands normally when setting up or testing images, and you’ll likely use a lot of flags. I didn’t think they were so applicable because you should honestly be using docker compose or some other orchestration system (like Kubernetes or Docker Swarm) to launch your containers.
Official Docs:
I’m normally using this command when I am trying to figure out optimal resource limits for containers. You might also use this if you are debugging which container is using most of your host’s resources.
Official Docs:
This one doesn’t come up too often, but it has, especially when you are building lots of images on a box or you are storing lots of data. If you are, you might consider setting up a cron job to prune your images.
Official Docs:
Also, just like mentioned above, if this is a build box, consider setting up a cron job to prune your images. If you’re a cron syntax noob like me, you might find a cron expression helper of use in understanding the syntax and shortcuts.
- Run an IPsec VPN server. Super useful if you work from cafes like Starbucks.
- Out of the box Prometheus and Grafana setup. We actually use a fork of this for monitoring the platform.
- Slim OS image for Node.js apps in production.
- A GitLab Runner inside a docker container. We use this at PagerTree to build our images.
- A quick and concise overview of all the pieces of Docker you need to know.
- If you’re new to Docker this is a great crash course. Starts from installing Docker all the way to docker compose. It can be a lot to take in so you might have to read it a couple times.
- Super helpful writeup on setting up the GitLab Runner for your own CI/CD pipeline. Used this tutorial extensively when setting up PagerTree’s CI/CD.
| Command | Short Description |
|---|---|
| docker ps | List running containers |
| docker exec -it <container name> /bin/sh | SSH into container |
| docker restart <container name> | Restart a container |
| docker stats | Show running container stats |
| docker system df | Check docker daemon disk space usage |
| docker system prune -af | Remove images, networks, containers, and volumes |
Over the past decade, multiple scientific studies have confirmed what we in DevOps have known for ages, being on-call is a pain! But just how bad is it?
Over the past decade, multiple scientific studies have confirmed what we in DevOps have known for ages: Being on-call is a pain! But just how bad is it?
After a long night on-call, we’re bound to be just a little bit on edge. A bit snappier with the kids, a little bit snarkier with our colleagues. Some people think we’re just grumpy and ornery, but as it turns out, there’s a pretty legitimate reason for it. Studies show that when we’re on call, we tend to start the day with increased cortisol levels. That’s right. Cortisol. Our favorite stress hormone.
Now to be fair, Cortisol can be a good thing. It’s the hormone that drives our fight or flight response and gets us off our butts and moving throughout the day. But too much cortisol, and it puts us in a fightin’ mood. Studies have even shown that heightened cortisol levels over extended periods of time can contribute to some pretty unpleasant health issues!
As if waking up stressed wasn’t bad enough, being on-call also affects our mood. Participants in the study were more likely to feel unpleasant, restless, and without energy after a night on-call. It’s a bit of a paradox; that feeling of being restless AND without energy at the same time. But we’ve all been there, haven’t we? We’re too exhausted to collaborate, but when we actually sit down at our desks we’re too restless to focus. Wired but tired, we bounce back and forth all day trying to figure out what the issue is, and we finally just chalk it up to an off day at work.
Of course getting an alert at three in the morning is going to disrupt your sleep. But what you probably didn’t know is that getting that call might actually be the preferred scenario. Sure it’s going to ruin your night, but at least your manager (and your team) knows you were up late resolving an incident. And hopefully you’re getting appropriate compensation, kudos for saving the day, and a bit of a pass for being a bit on edge the following day.
But the painfully unappreciated scenario is actually what happens every other night when your rest gets ruined by the mere anticipation of getting a call. Studies repeatedly show that on-call employees experience disrupted sleep and poor quality rest regardless of whether or not a call is actually received. But alas, no one says “thanks for anticipating a call last night”...
When you put it all together, on-call is even worse than we thought! If it was just the actual incident that was disruptive, at least those don’t happen too often. But the science is conclusively telling us that the mere possibility that you might get a call, regardless of whether or not it happens, is painful! Just the anticipation of an incident is enough to keep us on our toes, in work-mode, and unable to rest and refresh. The lingering effects of on-call spill over to the next day, and the next, and the next, leaving us stressed out, restless, and exhausted.
Chances are, either you or your team is currently suffering through the effects of on-call scheduling. But systems don’t wait until the morning shift to crash, and they certainly don’t fix themselves! So what can we do?
The studies indicate that employees who were able to detach themselves from work demonstrated the ability to rest and refresh even while on-call. Since the mere anticipation of a call is enough to increase stress, decrease energy, and disrupt sleep, empowering employees to truly disconnect until they’re needed frees them from the dreaded anticipation. It’s common sense, really. When employees are free to take their eyes off the phone and actually be present with family and friends, they’re more likely to feel refreshed even after a night on-call.
This ability to detach affects sleep quality too. For example, how well do you sleep when you’re anxious about missing an alarm? Chances are, you’re subconsciously hesitant to enter into deep sleep, and instead, you drift in and out constantly glancing at the clock. But what happens if you set a backup alarm, or better yet stagger three alarms? The redundancy allows for peace of mind, which allows you to detach: worry less, sleep more.
It’s the same idea with on-call scheduling. When you’re the only guy on-call and you’re one missed email away from a SEV-1 production outage, of course you’re going to be anxiously tethered to your phone. But add in multi-channel notifications and smart escalation rules, and all of a sudden you’re not feeling so alone. You’ve got redundancy, and you’ve got backup. As it turns out, multi-channel notifications and smart escalation rules not only improve mean time to resolution (MTTR), but can also help your teams get a better night’s rest.
The second mitigating factor to offset the anxiety of being on-call was that of control. When on-call employees are confident they’ll be able to resolve an incident, they’re less likely to expend energy dreading the call. If it comes, it comes - they’ve got it handled. Similar to detachment, the feeling of control allows on-call employees to spend more time enjoying their evenings and less time worrying.
Short of constantly assigning your most senior developers, how do you empower your employees to be in control? Intelligent call routing with configurable teaming allows you to send the right incidents to the right teams at the right time. No need to have a one-developer-fix-all model any longer. Getting the right incidents to the right teams not only ensures higher quality work, but as studies show, on-call employees recover more quickly from a night on-call when they’re confident they’ll be operating within their area of expertise.
Lastly, it’s important to know who’s on call and how often they’re being asked to jump in and help. Maintaining clear lines of communication with your team and evenly distributing on-call shifts not only promote transparency and a sense of shared camaraderie, but also help to reduce developer burnout over time.
Recent studies have clearly demonstrated the negative effects of being on-call, and the results aren’t pretty. Studies show that the mere anticipation of receiving a call is enough to increase stress, decrease energy, and disrupt sleep. When you’re on-call, your inability to rest and refresh can have severe consequences when sustained over time.
Dev managers can help their employees better recover from a night on-call by empowering them to detach and be confident during their on-call shifts. On-call scheduling done well can provide the necessary infrastructure to help mitigate the negative effects of on-call.
When you’re operating within your realm of expertise with added layers of redundancy and backup, you can finally put down that phone, enjoy dinner with family, and get some much needed rest.
Discover what serverless technology is, what it is not, and some of the pros & cons of a serverless architecture.
In this post we’ll answer the following questions:
What is serverless architecture? (and what it’s not)
What are the pros & cons of serverless?
If you already know these things, feel free to skip ahead to other posts in this series:
Depending on where you look on the internet, you’ll get different answers. For example:
Wikipedia defines serverless computing as a “cloud computing execution model in which the cloud provider dynamically manages the allocation of machine resources. Pricing is based on the actual amount of resources consumed by an application, rather than on pre-purchased units of capacity”.
Amazon Web Services defines serverless computing and applications as “Build and run applications without thinking about servers”.
In my opinion, serverless is best summarized as:
Serverless is a cloud architecture in which resource allocation, maintenance, and high availability are managed by the cloud provider.
Serverless is not Docker, nor a virtual machine, and it’s not code that runs without a server. The term “serverless” is a misnomer, since any application still has to run on some sort of computing machine. The name is catchy, but what it really conveys is that serverless abstracts away server resource allocation. It’s a nice thought, to be able to run applications without servers, but as of this writing, the technology isn’t there yet. One must still think about certain structural components when developing serverless apps.
I found this Twitter post by @kelseyhightower helpful in visualizing where serverless actually falls.
There are many pros & cons to the serverless architecture, and whether you are a startup or a large organization, you can benefit from a serverless application.
Zero downtime deployment - This is perhaps one of the biggest pros to serverless: you don’t have to think about or architect highly available services. High availability is baked in. In fact, I think this is so important I wrote another blog post just about it. Make sure to check out Part 2: Serverless Scales.
Faster deployments and quicker time to market - Because you don’t have to worry about infrastructure or maintenance, you can focus more of your time on business logic and quick iterations. This means a quicker time to market and, naturally, a leaner development lifecycle.
Reduced costs - This is a double-edged sword. In most cases you can reduce your costs by several factors, but if you have a consistent load, serverless could actually be more expensive. Make sure to read Part 3: Serverless Costs to understand the total cost of a serverless application.
Less Infrastructure & Maintenance - Another really big pro is that you don’t have to maintain the infrastructure. Your cloud provider handles updates and network management. In most cases you’ll get security updates before you even know the vulnerabilities exist. For example, many serverless applications were protected against the Spectre vulnerability before their owners knew it existed.
Great for event-driven applications - Serverless is a perfect use case for event-driven applications. By chaining events, you only pay for the execution time in response to those events (see the sketch below). In a classic setup, you would pay for a server to be available 24/7 until an event needed processing.
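To make that concrete, here is a minimal sketch of an event-driven function on AWS Lambda in Node.js; the S3 trigger and bucket wiring are assumptions for illustration:

// Minimal event-driven serverless function (AWS Lambda, Node.js).
// The S3 trigger is assumed for illustration - the function only runs
// (and only bills) when an object is uploaded.
exports.handler = async (event) => {
  for (const record of event.Records) {
    // Each record describes one uploaded object
    console.log(`Processing ${record.s3.object.key} from ${record.s3.bucket.name}`);
  }
  return { statusCode: 200 };
};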
Multi-Tenancy (Security) - For many businesses and applications this can be a big drawback. When running serverless, you are never on a dedicated machine; you share physical resources with other customers. This can be a big deal, especially for sensitive data.
Vendor lock-in - You’ll want to make sure your application is not dependent on any one vendor. In Part 4: Serverless Tools & Best Practices I’ll talk about how to keep your application insulated from specific vendors. You want to make sure your application can run anywhere, both in a serverless and a classic environment.
System-wide limits - Depending on your application, it can be easy to reach system-wide limits, such as concurrent serverless executions. This is especially common when using the same cloud account for development and production. Many people have accidentally DDoSed themselves by running load tests in the development environment, effectively starving the production environment of resources.
No dedicated hardware options - If you need specific hardware for your application, serverless does not offer you any choices beyond the amount of RAM.
Debugging - While not impossible, debugging can be challenging, especially if you rely on monitoring agents (see the sketch below).
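Since there is no long-lived process to attach a debugger or agent to, structured logging to the platform’s log stream (CloudWatch Logs on AWS) is often the most practical workaround. A rough sketch, with illustrative field names:

// Emit one JSON log line per event so the platform's log search can filter them.
exports.handler = async (event, context) => {
  const log = (level, msg, extra = {}) =>
    console.log(JSON.stringify({ level, msg, requestId: context.awsRequestId, ...extra }));

  log('info', 'invocation started', { eventKeys: Object.keys(event) });
  try {
    // ... business logic goes here ...
    return { statusCode: 200 };
  } catch (err) {
    log('error', 'invocation failed', { error: err.message });
    throw err; // rethrow so the platform records the failure
  }
};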
While there are many pros & cons, don’t feel daunted by the task of evaluating whether serverless is right for you.
In conclusion, serverless can be defined as a cloud architecture in which resource allocation, maintenance, and high availability are managed by the cloud provider. There are many pros & cons to a serverless architecture, and as we’ll see in my next post, Part 2: Serverless Scales, serverless can actually simplify many elements of a traditional architecture.
Learn how serverless architectures scale and handle high availability. Compare serverless architectures to classic N-tier architectures.
In Part 1: What is Serverless? I talked about how one of the biggest pros to a serverless architecture is how well it scales and how high availability is baked in.
In this post I’ll go over:
How a traditional highly available scalable architecture works
How a scalable serverless architecture works
How you can benefit from a serverless architecture
First, I would like to be clear that I am not advocating that a serverless architecture is always better than a traditional architecture. Each has its purpose. In this article I will be highlighting how you can benefit from a serverless architecture.
In a traditional highly available, scalable architecture, a developer or architect will have to think about many key components:
Networks & Availability Zones
Load Balancers
Scaling Triggers
Security (ex: DMZ or Web layer that handles authentication/authorization)
Most commonly, these components are put together to create what is known as an N-Tier or Multitier Architecture. Below I will talk about 2- and 3-tier architectures to highlight their differences and complexities. This will help us better understand how a serverless architecture can simplify a highly available, scalable architecture.
In this simple 2-tier architecture, you’ll notice we already have to account for a load balancer and at least 2 availability zones to make this a highly available application. To be even safer, many would argue you want 3 availability zones, with at least 3 servers always running. This ensures that even during a zero downtime deployment your application stays highly available.
This 2-tier architecture still doesn’t address security concerns like a DMZ/Web layer. For this reason, most organizations will implement what is known as a 3-tier architecture. To implement it, we must add another load balancer and another layer of applications.
In this common 3-tier architecture we have added 1 more load balancer and at least 3 more servers. By doing so, we have added key benefits to our architecture, including a DMZ/Web layer and scaling at both the Web & App layers. However, we have also roughly doubled the complexity of our system.
By using a serverless architecture (like the one shown above), we have removed the complexity of availability zones, load balancers, scaling triggers, and a DMZ. The nature of a serverless function is that the high availability responsibility is now managed by the cloud provider.
Now you might be asking where the security layer went. The API gateway will handle authentication and authorization with some extra configuration (a sketch follows below). Because the serverless functions are created and destroyed on each execution, attackers cannot infect your servers, since you essentially have none.
Granted, there still might be vulnerabilities that have not been made public or patched that attackers can exploit; however, keeping the platform secure is now the cloud provider’s responsibility.
Further, you still could have security flaws in your application that allow it to leak data, but this is an application issue, not an architecture issue.
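For example, on AWS the extra configuration could be a Lambda authorizer that API Gateway invokes before your application code runs. A minimal sketch (the token comparison is a placeholder, not a real validation scheme):

// API Gateway TOKEN authorizer - the gateway calls this before your app code.
exports.authorizer = async (event) => {
  const allowed = event.authorizationToken === process.env.API_SECRET; // placeholder check
  return {
    principalId: 'caller',
    policyDocument: {
      Version: '2012-10-17',
      Statement: [{
        Action: 'execute-api:Invoke',
        Effect: allowed ? 'Allow' : 'Deny',
        Resource: event.methodArn,
      }],
    },
  };
};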
As we can see from above, by using a serverless architecture, we can benefit in many ways:
Baked in high availability
Reduced complexity
Costs
Costs are something I have not yet covered, since they can be a double-edged sword. To get a deep dive on serverless costs, make sure you read Part 3: Serverless Costs, where I analyze when to use a serverless architecture for cost benefits. However, for most web applications, you will see a significant cost reduction, especially if usage is infrequent or sporadic.
In conclusion, there are many benefits such as reduced complexity, baked in high availability, and reduced costs when using a serverless architecture. Make sure to check out the next post in this series Part 3: Serverless Costs where I dive deep into the costs of serverless.
Learn by example in this tutorial creating a serverless slack command using Node.js and Up. Learn best practices for developing serverless applications.
Throughout this series we have been exploring how to use serverless architectures to our advantage. In this article I will show you:
When I first started experimenting with serverless technology, I was amazed at the complexity of managing individual functions and environments. At first I scratched my head and thought, “How can the cloud providers actually believe people will adopt this?” But that was 2015; documentation was light and tools were in their infancy. Since then, things have changed dramatically. Today, I am excited to show you just how much easier the serverless development experience has become.
I love Up for many reasons:
Learning curve is ultra low
Common use cases like static sites or REST APIs are its bread and butter
For existing projects, this is the simplest “Lift & Shift” operation I have seen
It’s Free & Open Source
For this tutorial you will need accounts on these two platforms: Slack and AWS (Up deploys to AWS Lambda and API Gateway).
And for brevity of this blog post, I’ll just trust that you can get these setup items done: Node.js installed, the Up CLI installed, and your AWS credentials configured.
Once your app is created, click the Slack Commands from the left hand navigation menu. Then click the Create New Command button. Fill in the details like so:
Keep this window open; we’ll come back to it.
Once you have downloaded the code, let’s make sure we install its dependencies:
npm install
You’ll now want to run the following commands:
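# A sketch of the standard Up workflow (both are real Up CLI commands;
# the exact invocation in the original may have differed):
up       # builds and deploys the app to AWS Lambda + API Gateway
up url   # prints the deployed API endpoint - copy it for the next step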
Now that we have our API endpoint copied, go back to the Slack slash command page and paste the url in the Request URL field, and then click the Save button.
Next click the Install App from the left hand menu and click the Install App to Workspace button. You’ll be redirected to an authorization page. Click the Authorize button.
Now go to your Slack workspace and give it a try. The first time it might fail because of a “cold start” taking too long to respond, but the invocations after that should work just fine.
From my experience, here are some tips and best practices when writing your serverless code:
Write libraries, not functions - This compartmentalizes logic in a library that you can take to other projects. The serverless code should really only be an adapter to the serverless platform (see the sketch after this list)
Don’t count on background processing - If you’re an async fan, make sure all your deferred executions have finished before your function returns. (This is a common mistake during “Lift & Shift” operations)
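A minimal sketch of both tips together, with hypothetical file and function names: the business logic lives in a plain module, and the handler is a thin adapter that awaits its deferred work before returning.

// lib/ping.js (hypothetical module) - plain Node code with no serverless
// dependencies, so it can move to any project or run on a classic server.
const https = require('https');

function timeRequest(url) {
  return new Promise((resolve, reject) => {
    const start = Date.now();
    https.get(url, (res) => {
      res.resume(); // drain the body so the connection can close
      res.on('end', () => resolve(Date.now() - start));
    }).on('error', reject);
  });
}

module.exports = { timeRequest };

// handler.js - the only platform-specific code is this thin adapter, and it
// awaits the async work so nothing is left running when the function returns:
// const { timeRequest } = require('./lib/ping');
// exports.handler = async (event) => ({ ms: await timeRequest(event.url) });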
Below I have put together a curated list of resources for your serverless knowledge:
In this tutorial you should have successfully created a slack command using Node.js and Up, learned best practices when it comes to developing serverless applications, and now have resources to do further reading.
I really hope you have enjoyed reading this series and were able to learn something. If you liked this one, and didn’t get a chance to read the others in the series make sure you do.
Learn about the hidden costs of serverless, and how to perform a cost analysis to understand the total cost of a serverless application.
How to approach and analyze the cost of serverless
Two detailed examples of a cost analysis
As we’ll see below, there are several questions you’ll need to ask yourself when making the choice to go serverless. Some costs are easy to associate with dollars and cents, others not so much. Make sure you consider these four things when analyzing the total cost of serverless.
This has to be by far the most important question when performing an analysis. Does it make sense for your application to go serverless? If it’s a sporadic process (e.g., sending a weekly report), it might. However, if the application runs a consistent load (e.g., a Bitcoin miner), it might not.
The simple chart below shows the break-even cost analysis.
The blue line represents the cost of an EC2 (512MB)
The grey line represents the cost of a Lambda function (512MB)
The orange line represents the inverse Lambda cost. (This is just the reverse grey line. We use this to solve for the intersection.)
As you can see, if we compare apples to apples, a 512MB server ($0.0058/hr) vs. a 512MB Lambda ($0.03/hr), the EC2 server is always cheaper (blue vs. grey line). However, if you consider the sporadic nature of a serverless function, you can actually run a serverless function for up to 4 days of execution time and still be more cost-effective than running a server full time (blue vs. orange line).
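If you want to run the same break-even estimate with your own prices, here is a quick sketch. The figures are the illustrative ones from the chart, and this is compute cost only; request and data transfer charges will pull the break-even point down.

// Back-of-the-envelope break-even: how many hours of Lambda execution per
// month cost the same as one small always-on server?
const ec2PerHour = 0.0058;   // 512MB EC2-equivalent, per hour
const lambdaPerHour = 0.03;  // 512MB Lambda, per hour of execution time

const serverMonthly = ec2PerHour * 730;                // ~$4.23/month, always on
const breakEvenHours = serverMonthly / lambdaPerHour;  // execution hours/month
console.log(`Break even at ~${breakEvenHours.toFixed(0)} Lambda execution hours/month`);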
For smaller applications, you won’t have to worry about this and can most likely run inside the cloud provider’s free tier. However, for request-intensive applications or applications serving big files, the costs can get out of hand fairly quickly. Make sure to do your own due diligence: estimate what your request load could look like, then forecast your costs accordingly.
API Gateway Cost = 3M requests * $3.50/1M requests = $10.50
API Gateway Data Transfer = (3M requests * 1KB) * $0.09/GB = $0.27
Lambda Charges = 3M executions * 300ms * $0.000000417/100ms = $3.75
Total Cost = $14.52
That gets you a fully functioning voting application that is highly available.
EC2 Cost = 3 servers * $0.0058/hour * 730 hours = $12.70
Load Balancer = $0.0252/hour * 730 hours = $18.39
LCU (Data flow) = $0.008/hour * 730 hours = $5.84
Total Cost (2 tier) = $36.93
Total Cost (3 tier) = $73.86
As you can see, it is actually more cost-effective to run the voting application in a serverless environment. Depending on the architecture, it can be anywhere from 2x to 5x more cost-effective to run serverless.
You have a legacy application (a meme generator) that your boss wants you to take serverless. The application receives some text, overlays it on an image, and returns it to the user. The application needs a lot of memory (2GB) to run and takes approximately 2 seconds to generate a meme that’s about 1MB in size. Your application is fairly popular and receives 30M requests per month.
API Gateway Cost = 30M requests * $3.50/1M requests = $105
API Gateway Data Transfer = (30M requests * 1MB) * $0.09/GB = $2,700
Lambda Charges = 30M executions * 2s * $0.000003334/100ms = $2,000.40
Total Cost = $4,805.40
Notice how quickly the data transfer charges added up.
This setup will successfully run your legacy meme generator, be highly available and load balanced.
In order to architect a classic setup, we’ll need to figure out how much compute power we’ll need. On average, we’ll receive 11.57 requests per second (1M requests per day / 86,400 seconds per day).
Since each request takes approximately 2 seconds and 2GB of memory, we’ll need roughly the capacity to handle double that (about 23 concurrent requests at 2GB each). Looking at the EC2 pricing page, we will use 23 t2.small instances to handle our load.
EC2 Cost = 23 servers * $0.023/hour * 730 hours = $386.17
Load Balancer = $0.0252/hour * 730 hours = $18.39
LCU (Data flow) = (~41GB/hour) * $0.008/hour * 730 hours = $239.44
Total Cost (2 tier) = $644
Total Cost (3 tier) = $1,288
In this example, it’s actually more expensive to run the meme generator application in a serverless environment. Depending on your architecture choice, it is between 4x and 7x more expensive.
In summary, a serverless architecture can save you money, especially if your application has little traffic or is sporadic in nature. A serverless architecture can also be more expensive, especially for network-heavy or compute-intensive applications. Always make sure you do your due diligence by analyzing your needs and the total cost of ownership for a serverless application.
How to create a serverless slack command using Node.js & Up
Best practices when developing serverless applications
A curated list of serverless resources
In the following tutorial we will be creating a serverless slack command that pings a url and checks how long the website takes to respond. We will be writing the code using Node.js and using a tool called Up to manage our serverless application deployment.
Supports common languages like Node.js, Golang, & Python
The project is maintained by TJ Holowaychuk; he’s built a ton of other tools that you have most likely either directly or indirectly used. He offers a Pro version for $20/month ($10/month forever with this coupon code: am-376E79B073F3) that I think is well worth the money. I specifically like the active warming feature (it combats “cold starts”); it comes with a slew of other features like encrypted environment variables, instant rollbacks, asset acceleration, and alerting that the free version just doesn’t have. You can find all the details on the project’s site.
The first thing we’ll do is create our Slack app. For this you’ll want to create a new app by going to https://api.slack.com/apps and clicking the Create New App button. Give your app a name, and select the workspace it should live in. Then click Create App.
To keep this tutorial simple, I have posted all the code on GitHub. If you really enjoy reading the code, the two main files are:
app.js - code that actually handles the http request from slack and pings the url
up.json - configuration for our project
Minimize cold starts - If your application is customer facing, minimize cold starts by investing in a warming solution (like Up Pro’s active warming). It’s a terrible customer experience to have a request time out or be slow.
Up - Deploy infinitely scalable serverless apps, apis, and sites in seconds.
Serverless Framework - This was the first tool I used; it’s fairly quick to get set up, but IMHO you have to configure too many settings.
Apex - Another tool by TJ, but this is a toolset more for individual functions rather than entire apps.
In Part 2: Serverless Scales I briefly touched on how a serverless architecture can have a cost benefit. In this post, I will go over:
Note: I will make assumptions on costs based on AWS pricing as of 3/19/2018 (see Lambda, API Gateway, and EC2 pricing). Different cloud providers offer different pricing for different solutions and services. It’s always a good idea to explore your options. Here is a comparison of costs across 4 of the big cloud providers.
You’ll also want to consider what acceptable performance is for your application. There are some gotchas to serverless, specifically “cold starts” (AKA the time it takes to boot your code) that affect the performance of your application. There are ways to counter the gotchas, but you will have to incorporate these into your serverless design. I’ll talk more on how to combat these in Part 4: Serverless Tools & Best Practices.
If you are hosting a serverless web application, you’ll want to take a look at the total cost of ownership. If we look at a quick cost breakdown between API Gateway and Lambda, it’s clear that if you use serverless in conjunction with an API gateway, the API gateway service will be your biggest cost, especially if your app serves large files:
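Roughly, using the AWS prices from the examples below (exact rates vary by region and change over time):

API Gateway = $3.50 per 1M requests + $0.09/GB data transfer out
Lambda = $0.20 per 1M requests + compute time (e.g., $0.000000417 per 100ms at 256MB)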
As much as the cloud providers want to advertise that the migration to serverless is a simple “Lift & Shift”, there are many gotchas when it comes to going serverless. Application code will most likely need to be refactored, especially if the application relies on background processing. You will also need to factor in the cost of testing and of having developers with a serverless skillset to diagnose any issues that might arise. In Part 4: Serverless Tools & Best Practices, I will show you how to minimize code maintenance by using some tools and working through a tutorial.
You want to host a small web application that counts votes. On average, per month, you receive 3M votes, each resulting in a 1KB response that takes 300ms on a 256MB Lambda.
A similar setup with a classic architecture that includes 1 load balancer and 3 of the smallest EC2 (t2.nano) instances (for high availability, see Part 2: Serverless Scales) would cost you:
Make sure to check out the next post in this series, Part 4: Serverless Tools & Best Practices, where I’ll show you how to build a serverless slack command.
A simple 10 minute tutorial to setup a Prometheus monitoring stack. Create a Docker stack that includes Prometheus, Grafana, and AlertManager with a PagerTree integration.
In this post, I will walk you through creating a simple Prometheus monitoring stack, connecting it to Grafana for pretty dashboards, and finally configuring alerts via PagerTree.
If you would like a video to follow along instead, you can see it on YouTube. You can find all the code for this stack on Github.
The first thing we’ll do is get a machine up and running for this solution. This tutorial assumes you will be using Ubuntu 16.04.
I like Digital Ocean for small tutorials like this one.
If you don’t already have an account, use this link to create an account and get $10 in credits.
If you don’t know how to create a Digital Ocean droplet or SSH into the machine you can follow this article on Medium.
Once you’ve created the Ubuntu server, run the following command in the shell terminal:
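# Hypothetical sketch - the exact repo URL, compose file, and stack name come
# from the GitHub link above; Docker and Swarm mode are assumed to be installed.
git clone https://github.com/<the-repo-from-above>.git prometheus-stack
cd prometheus-stack
docker swarm init                              # the stack runs on Docker Swarm
docker stack deploy -c docker-stack.yml prom   # deploys Prometheus, Grafana & Alertmanager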
At this point you’ll have automagically deployed the entire Prometheus, Grafana, and Alert Manager stack. You can now access the Grafana dashboard from your browser at:
Address: http://<Host IP Address>:3000
Username: admin
Password: 9uT46ZKE
Since the release of Grafana 5.x, Grafana supports auto provisioning data sources and dashboards. We’ve updated the repo for Grafana to auto provision the Prometheus data source and dashboards. Please continue to the next section, Grafana Dashboards.
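For reference, a Grafana 5.x datasource provisioning file looks roughly like this (the Prometheus URL assumes the stack’s internal Docker DNS name; the repo’s actual file may differ):

# grafana/provisioning/datasources/prometheus.yml (illustrative path)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true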
Awesome! Now if you navigate to the Dashboards in Grafana, you will see data populating and some nice-looking graphs.
At this point you’ll see 2 dashboards. They are pretty cool. Check them out. When you’re ready, head down to the Configure Alerts section.
Ping Dashboard
This dashboard monitors a couple websites for uptime.
System Monitor Dashboard
This dashboard monitors the load on the machine that is running your Prometheus stack.
Now while the dashboards are cool, it would be even cooler if we were able to get alerted when something went wrong. Luckily for us, this project will create an alert after 30 seconds of high CPU. So let’s try to make use of it.
Create a new integration.
Click the Prometheus Logo.
Fill out the following:
Name
Appropriate urgency for the Prometheus alerts
A team alerts from Prometheus should be assigned to
Click Create button
Copy the endpoint URL
Ensure that for the team you are assigning alerts to, you are the Layer 1 on-call and that you have at least 1 notification method set up.
Now we want to modify the alert manager configuration to make use of our PagerTree Webhook. Run the following command and make sure to replace <Your PagerTree Webhook URL>
with the endpoint URL you copied.
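# Hypothetical - the repo's configuration script substitutes the webhook URL
# into the Alertmanager config; the script name here is a placeholder.
./configure-alertmanager.sh "<Your PagerTree Webhook URL>"

Whatever the script’s exact name, the end state is an Alertmanager config with a webhook receiver pointing at PagerTree, roughly (receiver name illustrative):

route:
  receiver: pagertree
receivers:
  - name: pagertree
    webhook_configs:
      - url: '<Your PagerTree Webhook URL>'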
After you have run the configuration script, restart the stack with the following command:
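# Hypothetical - re-deploying the stack picks up the new Alertmanager config;
# "prom" is the stack name assumed in the deploy step above.
docker stack rm prom && docker stack deploy -c docker-stack.yml prom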
Sometimes this command fails. If it does, just run the command again.
In order for us to get an alert, we’ll want to simulate some sort of Alert Worthy Incident. From the shell terminal, run the following command:
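# Hypothetical load generator - any command that pins the CPU will do.
# This pegs one core at 100%; stop it with Ctrl+C once the alert fires.
yes > /dev/null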
Now we’ll wait for 30 seconds or so, and if you’ve followed all the steps correctly you should get a notification saying something like Instance {{ $labels.instance }} under high load.
If you are reading this, give yourself a pat on the back. Good job! You’ve successfully deployed a Prometheus monitoring system, hooked it up to Grafana, and configured alerts to go to your PagerTree account.
This project is intended just to be a quick tutorial. Before it is production worthy, several security considerations should be addressed.