How to Create and Send AWS CloudWatch Alarm Notifications to PagerDuty

Introduction:

AWS CloudWatch provides monitoring and observability services for AWS resources. It is built with Site Reliability Engineers, DevOps Engineers, and developers in mind. CloudWatch collects data and turns it into actionable insights: it can alert on and help resolve operational issues, and it gives system-wide visibility into resource utilization. For more information about AWS CloudWatch, please refer to the AWS documentation.

PagerDuty is an on-call management and incident response system. The main goal of this tutorial is to walk you through configuring a CloudWatch alarm, setting a threshold, and sending alarm notifications to create and resolve incidents in PagerDuty. I chose 'AutoScaling Group Terminating Instances' as the example for this tutorial. When an EC2 instance in an AutoScaling group is terminated for any unforeseen reason, a CloudWatch alarm is triggered and, as a result, an incident is created in PagerDuty for the on-call engineer to inspect and resolve.

With that being said, the same concept and steps can be applied to send other CloudWatch alarm notifications to PagerDuty for any CloudWatch metric in an AWS namespace or a custom namespace. Please refer to the AWS documentation about AWS services that publish CloudWatch metrics.
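If you would like to explore which metrics are available to alarm on, here is a minimal boto3 (AWS SDK for Python) sketch; the region and the AWS/AutoScaling namespace are assumptions, so adjust them to your environment:

```python
import boto3

# Assumed region; change it to wherever your resources live.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# List the Auto Scaling group metrics published in this account and region.
# Swap (or omit) the Namespace to discover other AWS or custom metrics you
# could alarm on with the exact same steps.
paginator = cloudwatch.get_paginator("list_metrics")
for page in paginator.paginate(Namespace="AWS/AutoScaling"):
    for metric in page["Metrics"]:
        print(metric["MetricName"], metric["Dimensions"])
```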

Alright, since there is a lot of material to cover, the tutorial is short, succinct, and to the point. Enjoy it. 😉

The Architectural Diagram

(Architectural diagram: a CloudWatch alarm publishes to an SNS topic, which notifies the PagerDuty CloudWatch integration over HTTPS to create or resolve an incident.)


Step 1: Creating SNS Topic

  1. From the SNS console, select Topics. Then, click on Create topic.


  2. Select Standard for the type of the SNS topic and give it a name. Then, click Create topic.


Now, we have successfully created an SNS topic for our AutoScaling Group Alarms (or any other alarm of your choosing). We will subscribe to the topic once we create an alarm.
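If you prefer to script this step, here is a minimal boto3 sketch of the same thing; the topic name and region below are assumptions from this tutorial, not required values:

```python
import boto3

sns = boto3.client("sns", region_name="us-east-1")  # assumed region

# Create (or return the existing) standard SNS topic for the alarm notifications.
response = sns.create_topic(Name="myAutoScalingGroupAlarms")  # hypothetical topic name
topic_arn = response["TopicArn"]
print(topic_arn)  # keep this ARN; we reference it in the next steps
```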


Step 2: Creating a CloudWatch Alarm:

  1. From the CloudWatch console, click on All alarms under the Alarms section. Then, click Create alarm.


  2. On Specify metric and conditions, click on Select metric.
  3. In the search box under metrics, type in autoscaling and hit enter.


  4. Select Auto Scaling > Group Metrics from the options.

  5. Scroll down to select GroupTerminatingInstances for your AutoScaling group. In my case, the name of my AutoScaling group is myAutoScaling. Then, click on Select metric.


  6. The Specify metric and conditions screen is where we customize the metric alarm by specifying the threshold, period, and other conditions.


A. Change the period to 1 minute.

B. Set the threshold value to 0, so that we are alerted whenever an EC2 instance within the Auto Scaling group is terminated.

C. Set Datapoints to alarm to 1 out of 1.

D. Set the Missing data treatment to Treat missing data as ignore. This is fine for a dev environment, but it's not recommended for a prod environment. No further changes are needed; nevertheless, you may configure the alert however you like. Next, click Next.


  7. On the Configure actions screen, leave the Alarm state trigger at its default, which is In alarm.
  8. Under Select an SNS topic, choose Select an existing SNS topic. If you place the cursor in the Send a notification to textbox, you can select the SNS topic that we created in step 1. If it's not showing up for any reason, paste the Amazon Resource Name (ARN) of your SNS topic.


  9. We will add a second Alarm state trigger by clicking on Add notification. Note: we will have one trigger that sends notifications for the In alarm state and a second trigger for the OK state, which resolves the first alarm. Therefore, select OK and select the same SNS topic as before. Then, click Next.


  10. Give the alarm a name and description.


  11. Finally, review and click on Create alarm. CloudWatch will start gathering more data about the metric, as shown in the alarm's State. It will take a minute or two for the state to change to OK.


As of right now, we have successfully created a CloudWatch alarm for AutoScaling Group Terminating Instances. Any terminated EC2 instance within the specified AutoScaling group will trigger the alarm.
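For reference, the same alarm can be created programmatically. Below is a minimal boto3 sketch that mirrors the console settings above (1-minute period, threshold of 0, 1 out of 1 datapoints, missing data ignored, and both the In alarm and OK actions pointing at the SNS topic). The alarm name, Auto Scaling group name, topic ARN, and the Maximum statistic are assumptions here; adjust them to match your setup.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # assumed region

# Placeholder ARN; use the ARN of the topic created in step 1.
topic_arn = "arn:aws:sns:us-east-1:123456789012:myAutoScalingGroupAlarms"

cloudwatch.put_metric_alarm(
    AlarmName="myAutoScaling-GroupTerminatingInstances",  # hypothetical alarm name
    AlarmDescription="An instance in the Auto Scaling group is terminating",
    Namespace="AWS/AutoScaling",
    MetricName="GroupTerminatingInstances",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "myAutoScaling"}],
    Statistic="Maximum",                         # assumption: alarm on the peak value per period
    Period=60,                                   # A. 1-minute period
    EvaluationPeriods=1,                         # C. Datapoints to alarm: 1 out of 1
    DatapointsToAlarm=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",   # B. alert when the count rises above 0
    TreatMissingData="ignore",                   # D. treat missing data as ignore
    AlarmActions=[topic_arn],                    # In alarm -> notify SNS (creates the incident)
    OKActions=[topic_arn],                       # OK -> notify SNS again (resolves the incident)
)
```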


Step 3: Creating the PagerDuty-CloudWatch Integration:

According to PagerDuty, there are two ways to integrate AWS CloudWatch:

A. Integrate with a PagerDuty Event Rule.

B. Integrate with a PagerDuty Service.

For this tutorial, we will integrate with a PagerDuty service.

  1. On the PagerDuty console, click on Services to locate the service you would like to add the CloudWatch integration to. Then, click on the service name.


  2. Select Integrations and click on Add an Integration.


  3. Type in CloudWatch in the search box under Select the integration(s) you use to send alerts to this service. Once AWS CloudWatch is selected, click Add.


  4. Copy the Integration URL. We will use this endpoint to subscribe to our SNS topic.


So far, we have successfully created a CloudWatch integration with our PagerDuty service. Now, it's time to head back to the AWS console, the SNS console to be specific.


Step 4: Subscribing PagerDuty-CloudWatch Integration to SNS topic:

  1. On the AWS SNS console, click on the name of the SNS topic we have created in step 1.
  2. Click on Create subscription.


  3. Select HTTPS as the protocol and paste the Integration URL/endpoint that we saved from step 3. Then, click Create subscription.

Note: make sure that Enable raw message delivery is unchecked. It should be unchecked by default.


  4. Refresh the page until the Status is shown as Confirmed.

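The subscription in this step can also be scripted. Here is a minimal boto3 sketch, assuming the topic ARN from step 1 and the Integration URL you copied from PagerDuty in step 3 (both are placeholders below):

```python
import boto3

sns = boto3.client("sns", region_name="us-east-1")  # assumed region

topic_arn = "arn:aws:sns:us-east-1:123456789012:myAutoScalingGroupAlarms"  # placeholder ARN
pagerduty_endpoint = "https://<your-pagerduty-cloudwatch-integration-url>"  # from step 3

# Subscribe the PagerDuty endpoint over HTTPS. Raw message delivery stays
# disabled (the default), as noted above, and the subscription status should
# move to Confirmed on its own shortly afterwards.
sns.subscribe(
    TopicArn=topic_arn,
    Protocol="https",
    Endpoint=pagerduty_endpoint,
)
```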

We have successfully created an SNS topic, a CloudWatch alarm, and a PagerDuty integration for CloudWatch. We have also subscribed the PagerDuty integration endpoint to the SNS topic. Don't do your victory dance yet; let's test the alarm and the integration first. 😉


Step 5: Testing CloudWatch Alarm to PagerDuty Integration:

It has been a journey, and I would like to congratulate you for getting this far. It's the moment of truth. The time has come to test our integration, fingers crossed.

  1. In order to trigger a CloudWatch alarm for the AutoScaling Group Terminating Instances metric, we need to terminate an instance within our AutoScaling group. I'm sure you're not doing this in a prod environment 😉 I will go ahead and terminate one of the two EC2 instances I have in the AutoScaling group. (A scripted way to run this same test is sketched at the end of this step.)


  2. Let's head to the CloudWatch console to monitor the alarm status and progress. At first, the alarm status is still OK. The delay before the alarm goes off is due to the Health check grace period, which is 300 seconds (5 minutes) by default; we can lower the health check grace period to deem the EC2 instance unhealthy more quickly. Another important term is the scaling cooldown period, which prevents AutoScaling groups from launching or terminating instances before the effects of previous activities are visible. We can control the cooldown period by using different scaling policies. Note that the default cooldown period is also 300 seconds (5 minutes) and can be changed. If you would like to learn more about the scaling cooldown period, please refer to the AWS documentation. If the alarm does not go off, lower the Health check grace period for your AutoScaling group to 60 seconds instead of the default 300 seconds.


  3. Once the instance was deemed unhealthy by the AutoScaling group, the CloudWatch alarm went off.


  4. We should have also received a PagerDuty incident notification by now. This is an indication that we have completed the integration properly.

Note: the PagerDuty notification method, whether it's an SMS, an email, or a phone call, depends on how we configured our escalation policy and notification preferences in PagerDuty.


  5. The last portion of testing is to wait for the AutoScaling group to replace the instance and for the CloudWatch alarm to send a notification for the OK state, which should resolve the PagerDuty incident automatically.
  • The alarm state then transitions back to OK.


  • The PagerDuty incident is then resolved automatically.

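If you would rather script this test, here is a rough boto3 sketch of the same exercise. The instance ID, Auto Scaling group name, and alarm name are placeholders, and please only run this against a non-production group:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")  # assumed region
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Optional: shorten the health check grace period (default 300s) so the group
# notices unhealthy instances sooner, as discussed above.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="myAutoScaling",        # hypothetical group name
    HealthCheckGracePeriod=60,
)

# Terminate one instance without lowering the desired capacity, so the group
# launches a replacement and the alarm can return to OK afterwards.
autoscaling.terminate_instance_in_auto_scaling_group(
    InstanceId="i-0123456789abcdef0",            # placeholder instance ID
    ShouldDecrementDesiredCapacity=False,
)

# Poll the alarm and watch its state go OK -> In alarm -> OK.
alarms = cloudwatch.describe_alarms(
    AlarmNames=["myAutoScaling-GroupTerminatingInstances"]  # hypothetical alarm name
)
for alarm in alarms["MetricAlarms"]:
    print(alarm["AlarmName"], alarm["StateValue"])
```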


Conclusion:

During this tutorial, we have successfully created an SNS topic and a CloudWatch alarm for the AutoScaling Group Terminating Instances metric, and integrated them with the PagerDuty CloudWatch integration.
We have tested the configuration and confirmed the following:

  1. Once an instance was terminated, a CloudWatch alarm went off and a PagerDuty incident was created. And the on-call engineer freaked out. 😂
  2. As soon as the AutoScaling group replaced the instance and deemed it healthy, the CloudWatch alarm state went from In alarm back to OK. As a result, the PagerDuty incident was automatically resolved and the on-call engineer went back to bed 😂

Thank you for taking this journey with me, and I hope it was beneficial to you. Oh, one last thing, now you can do your victory dance 😜

