Email Parser Consuming 100% CPU

Executive Summary

Between Wednesday 17 July, 2019 10:08:02 UTC and Wednesday 17 July, 2019 19:11:27 UTC, PagerTree experienced a degradation in service that potentially affected all customers. Incident creation, notifications, and the Web UI were all impacted.

Outage Description

The PagerTree email parser has a feature that transforms plain-text emails to HTML. The parser/linker runs as a synchronous process. For very large text emails this can become quite expensive, especially if they contain many links.

We received some very large text emails that exposed this bug in the email parser, at a rate of about one every 4 minutes. On average, each of these emails pushed CPU usage to 100% for about 2 minutes.

This in turn caused other requests made to the affected machine to be effectively dropped (our web proxy server would time the requests out).
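The actual parser code is not included in this postmortem, but the sketch below (in TypeScript, with a hand-rolled linker; both the language and the pattern are our assumptions) shows the general shape of a synchronous text-to-HTML conversion and why a single very large email with many link-like strings can monopolize a CPU core.

```typescript
// Illustrative sketch only — PagerTree's actual parser code and library are
// not shown in this postmortem; TypeScript and the regex below are assumptions.

// Rough pattern for things that "look like links".
const LINK_PATTERN = /\bhttps?:\/\/[^\s<>"]+|\bwww\.[^\s<>"]+/g;

// Escape HTML special characters in the plain-text body.
function escapeHtml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Convert a plain-text email body to HTML, wrapping anything link-like in an
// anchor tag. Because this runs synchronously on the web worker handling the
// request, a very large body with many link-like strings keeps that worker's
// CPU busy for the entire conversion, and other requests queue up behind it.
function textToHtml(body: string): string {
  return escapeHtml(body).replace(
    LINK_PATTERN,
    (match) => `<a href="${match}">${match}</a>`
  );
}
```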

This affected all users of PagerTree, and resulted in dropped incoming integration requests, dropped user interface requests, and the possibility of missed notifications.

Affected Systems & Users

  • Users Affected
    • Potentially all customers
  • Systems Affected
    • Web UI
    • HTTP Requests
    • Notifications

Start Date/Time

Wednesday 17 July, 2019 10:08:02 UTC

End Date/Time

Wednesday 17 July, 2019 19:11:27 UTC

Duration

9 hours 3 minutes 25 seconds

Status

Resolved

Timeline

  1. Wednesday 17 July, 2019 10:08:02 UTC - Initial alert is fired by AWS CloudWatch.
  2. Approximately 13:45 UTC - Austin Miller reads the SMS notification from AWS CloudWatch.
  3. Approximately 13:48 UTC - Austin Miller begins the investigation, pulling logs from the EC2 instances.
  4. Approximately 14:04 UTC - Austin Miller sends the command to replace the instances, since they have been up 152 days. He suspects the issue could be caused by the machines having been up for so long.
  5. 14:14:25 UTC - Austin Miller opens an incident on the status page: “Some users might be experiencing dropped requests to our API.”
  6. Approximately 14:20 UTC - Austin Miller determines this is not the issue after seeing 100% CPU on a recently replaced machine. He begins investigating the logs and code more deeply.
  7. Approximately 14:54 UTC - Austin Miller identifies that the incoming email requests are causing the 100% CPU usage.
  8. Between 14:54 UTC and 16:26 UTC - Austin Miller investigates which part of the code could be causing the 100% CPU usage.
  9. 15:10:48 UTC - Austin Miller posts an update to the status page: “We have identified the cause of the issue.”
  10. 16:26:48 UTC - Austin Miller posts an update to the status page: “We have fully identified the issue and causes. We are still working on a proper solution to the problem.”
  11. Approximately 16:38 UTC - Austin Miller adds 3 more EC2 instances to reduce the number of requests being dropped.
  12. Approximately 17:00 UTC - Austin Miller deploys debugging code to confirm that it is in fact the email parsing code causing the problem (see the instrumentation sketch after this timeline).
  13. Approximately 18:15 UTC - Austin Miller deploys the code fix.
  14. Approximately 18:15 UTC - Deployment of the code fix is complete. Austin Miller monitors.
  15. 18:22:36 UTC - Austin Miller posts an update to the status page: “We’ve implemented a fix and deployed it to the production environment. We will continue to monitor.”
  16. Approximately 19:00 UTC - No other emails have caused 100% CPU usage. The incident is resolved. Austin Miller sends the scale-down command.
  17. 19:05:22 UTC - Austin Miller posts an update to the status page: “The issue has been resolved. We will publish a postmortem by the end of the week.”
  18. 19:11:27 UTC - The EC2 fleet completes the scale-down command.
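The debugging code deployed at approximately 17:00 UTC is not reproduced in this postmortem. A minimal sketch of the kind of timing instrumentation that could confirm the email parser as the hotspot might look like the following (the handler name is hypothetical, and textToHtml refers to the earlier sketch, not PagerTree's actual code):

```typescript
// Hypothetical timing wrapper used to confirm where the CPU time is going.
function timed<T>(label: string, fn: () => T): T {
  const start = process.hrtime.bigint();
  try {
    return fn();
  } finally {
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    console.log(`[timing] ${label} took ${elapsedMs.toFixed(1)} ms`);
  }
}

// If this log line regularly reports tens of seconds for large inbound emails,
// the text-to-HTML conversion is confirmed as the hotspot.
function handleInboundEmail(rawBody: string): string {
  return timed("email text-to-html", () => textToHtml(rawBody));
}
```

The scale-out at 16:38 UTC and the scale-down at 19:00 UTC would correspond to adjusting the Elastic Beanstalk instance count (for example, `eb scale <count>` with the EB CLI), though the exact mechanism used is not stated here.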

Contributing Conditions Analysis

  1. A very long text email containing text that looked like links
  2. An option, enabled by default in the email parsing library, to convert text to HTML
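The actual code fix is not described in this postmortem. One mitigation consistent with these contributing conditions, purely as a sketch that reuses escapeHtml and textToHtml from the earlier example and an arbitrary size threshold, is to skip or defer the conversion for unusually large bodies:

```typescript
// Illustrative guard, not the actual PagerTree fix. The threshold is an
// arbitrary assumption for this sketch.
const MAX_BODY_BYTES_FOR_INLINE_CONVERSION = 256 * 1024;

function safeTextToHtml(body: string): string {
  if (Buffer.byteLength(body, "utf8") > MAX_BODY_BYTES_FOR_INLINE_CONVERSION) {
    // Fall back to escaped plain text (or hand the work to a background job)
    // so one huge email cannot pin a web worker's CPU for minutes.
    return `<pre>${escapeHtml(body)}</pre>`;
  }
  return textToHtml(body);
}
```

Disabling the library's default text-to-HTML option for oversized messages, or moving the conversion off the synchronous request path entirely, would both address the second contributing condition.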

Resources

  • Elastic Beanstalk EC2 CPU Usage
  • Gap in request time
  • SES Call

What went well?

  • Identification of the issue
  • Technical resolution of the issue
  • Mitigation of the issue
  • Communication via our status page

What could have gone better?

  • Responding to the initial AWS notification could have been faster.
  • Communication via our status page. We need all users subscribed to (or at least aware of) an outage.
  • The duration of the incident needs to be shortened significantly.

Recommendations

  • Make use of the SorryApp banner on our application site.
  • Have CloudWatch alarms go to the Pushover app to create an urgent notification (one possible wiring is sketched below).
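The second recommendation could be wired up in several ways; one common pattern (our assumption, not a description of PagerTree's setup) is to point the CloudWatch alarm at an SNS topic and subscribe a small Lambda function that forwards the alarm to Pushover's message API with emergency priority. The sketch below uses the SNSEvent type from @types/aws-lambda and assumes PUSHOVER_TOKEN and PUSHOVER_USER environment variables.

```typescript
// Hypothetical AWS Lambda (Node.js 18+) subscribed to the SNS topic targeted by
// the CloudWatch alarm. PUSHOVER_TOKEN and PUSHOVER_USER are assumed environment
// variables holding Pushover API credentials.
import type { SNSEvent } from "aws-lambda";

export const handler = async (event: SNSEvent): Promise<void> => {
  for (const record of event.Records) {
    const alarm = JSON.parse(record.Sns.Message); // CloudWatch alarm payload
    await fetch("https://api.pushover.net/1/messages.json", {
      method: "POST",
      headers: { "Content-Type": "application/x-www-form-urlencoded" },
      body: new URLSearchParams({
        token: process.env.PUSHOVER_TOKEN ?? "",
        user: process.env.PUSHOVER_USER ?? "",
        title: `CloudWatch alarm: ${alarm.AlarmName}`,
        message: alarm.NewStateReason ?? record.Sns.Message,
        priority: "2", // emergency priority: repeats until acknowledged
        retry: "60",   // re-notify every 60 seconds
        expire: "3600" // give up after 1 hour if unacknowledged
      })
    });
  }
};
```

Pushover's emergency priority repeats the notification until it is acknowledged, which directly targets the slow response to the initial alert noted above.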

Names of people involved

  • Austin Miller