Monitoring EC2 instances deployed with Blue/Green deployment

#eventbridge #lambda #cloudwatch

Introduction:

In this post, we set up monitoring for EC2 instances deployed with the Blue/Green deployment strategy. The configuration consists of the following resources:
A Lambda function with the necessary permissions, which does the following:

gets the instance IDs based on the instance names,
gets the metric values for each instance from the AWS/EC2 CloudWatch namespace,
enriches the metric data with additional information (the instance name),
puts the modified metric data into the custom ws-deployment namespace;

An EventBridge schedule rule, which runs the Lambda function every 5 minutes;
CloudWatch alarms, which monitor the EC2 instances based on the metrics from the custom ws-deployment namespace.

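The reason for this configuration is that per-instance metrics in the AWS/EC2 namespace have InstanceId as a dimension, not InstanceName. With Blue/Green deployment a newly created instance gets a new ID, so alarms need a dimension that stays the same between deployments, such as the instance name. More information about CloudWatch metrics is here. The list-metrics output shows the available dimension: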

    aws cloudwatch list-metrics --namespace AWS/EC2 --metric-name CPUUtilization   
    {
        "Metrics": [
            {
                "Namespace": "AWS/EC2",
                "MetricName": "CPUUtilization",
                "Dimensions": [
                    {
                        "Name": "InstanceId",
                        "Value": "i-1234567890abcde"
                    }
                ]
            },
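
The same check can be scripted. The snippet below is only a minimal sketch and is not part of the repository: the helper name `dimension_names` is hypothetical, and the sample dictionary is copied from the CLI output above.

```python
def dimension_names(list_metrics_response):
    """Collect the distinct dimension names found in a list-metrics response."""
    names = set()
    for metric in list_metrics_response.get("Metrics", []):
        for dimension in metric.get("Dimensions", []):
            names.add(dimension["Name"])
    return sorted(names)


# Sample response copied from the CLI output above.
sample = {
    "Metrics": [
        {
            "Namespace": "AWS/EC2",
            "MetricName": "CPUUtilization",
            "Dimensions": [{"Name": "InstanceId", "Value": "i-1234567890abcde"}],
        }
    ]
}

print(dimension_names(sample))  # ['InstanceId']
```

With boto3, the input would come from `boto3.client('cloudwatch').list_metrics(Namespace='AWS/EC2', MetricName='CPUUtilization')`.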

This post is the third part of a series of posts about Blue/Green deployment on AWS EC2 instances with the Systems Manager Automation runbook. The first part is here, and the second part is here.

About the project:

All infrastructure is created with the CloudFormation template infrastructure/ec2_monitoring.yaml and is deployed independently from the ec2-bluegreen-deployment stack. The Systems Manager Automation runbook configuration creates only one EC2 instance, but the Lambda function can work with many instances: we only need to specify the instance names as a comma-separated list in the “InstanceNames” parameter.
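
The comma-separated parameters are easy to get wrong, because the Lambda function pairs the “MetricNames” and “MetricUnits” lists by position and `zip` silently drops the extra items when the lengths differ. A minimal sketch of a check that could be run before deployment is shown below. The helper names `parse_csv` and `pair_metrics_with_units` and the second instance name `ws-instance-2` are hypothetical and exist only for illustration.

```python
def parse_csv(value):
    """Split a comma-separated parameter value and drop surrounding spaces."""
    return [item.strip() for item in value.split(",") if item.strip()]


def pair_metrics_with_units(metric_names, metric_units):
    """Pair each metric name with its unit and fail loudly on a length mismatch."""
    names = parse_csv(metric_names)
    units = parse_csv(metric_units)
    if len(names) != len(units):
        raise ValueError(
            f"{len(names)} metric names but {len(units)} units were given"
        )
    return list(zip(names, units))


# 'ws-instance-2' is a made-up second instance name.
print(parse_csv("ws-instance, ws-instance-2"))
print(pair_metrics_with_units("CPUUtilization,StatusCheckFailed_System", "Percent,Count"))
```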

ec2_monitoring.yaml template:

    AWSTemplateFormatVersion: '2010-09-09'
    Description: 'CloudWatch metrics and alarms for monitoring the deployed EC2 instances'

    Parameters:
      TransformMetricsLfName:
        Type: String
        Default: 'TransformEc2Metrics'
      CustomNamespace:
        Type: String
        Default: 'ws-deployment'
      InstanceNames:
        Type: String
        Description: 'Comma-separated list of the instance names'
        Default: 'ws-instance'
      MetricNames:
        Type: String
        Description: 'Comma-separated list of the metric names (the alarms select them by position)'
        Default: 'CPUUtilization,StatusCheckFailed_System,StatusCheckFailed_Instance'
      MetricUnits:
        Type: String
        Description: 'Comma-separated list of the metric units'
        Default: 'Percent,Count,Count'
      SnsTopicName:
        Type: String
        Default: 'blue-green-deployment-notifications'

    Resources:
    #####################################
    #  Lambda Function configuration
    #####################################
      TransformEc2MetricsRole:
        Type: AWS::IAM::Role
        Properties:
          RoleName: TransformEc2MetricsRole
          AssumeRolePolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Principal:
                  Service: lambda.amazonaws.com
                Action: sts:AssumeRole
          ManagedPolicyArns:
            - 'arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole'
          Policies:
            - PolicyName: TransformEc2Metricspolicy
              PolicyDocument:
                Version: '2012-10-17'
                Statement:
                  - Effect: Allow
                    Action:
                      - ec2:DescribeInstances
                      - cloudwatch:GetMetricStatistics
                      - cloudwatch:PutMetricData
                      - cloudwatch:GetMetricData
                    Resource: '*'

      LambdaLogsGroup:
        Type: AWS::Logs::LogGroup
        DeletionPolicy: Delete
        UpdateReplacePolicy: Retain
        Properties:
          LogGroupName: !Sub '/aws/lambda/${TransformMetricsLfName}'
          RetentionInDays: '7'

      TransformCustomMetrics:
        Type: AWS::Lambda::Function
        Properties:
          FunctionName: !Ref TransformMetricsLfName
          Description: 'transforming and processing metrics from AWS/EC2 namespace'
          Runtime: python3.12
          Handler: index.lambda_handler
          Timeout: 30
          Role: !GetAtt TransformEc2MetricsRole.Arn
          LoggingConfig:
            LogGroup: !Sub '/aws/lambda/${TransformMetricsLfName}'
          Environment:
            Variables:
              instance_names: !Ref InstanceNames
              metric_names: !Ref MetricNames
              metric_units: !Ref MetricUnits
              custom_namespace: !Ref CustomNamespace
          Code:
            ZipFile: |
              import boto3
              import os
              from datetime import datetime, timedelta

              def lambda_handler(event, context):
                  try:
                      instance_names = [name.strip() for name in os.environ['instance_names'].split(',')]
                      metric_names =  [metric.strip() for metric in os.environ['metric_names'].split(',')]
                      metric_units =  [unit.strip() for unit in os.environ['metric_units'].split(',')]
                      custom_namespace = os.environ['custom_namespace']

                      # Initialize EC2 and CloudWatch clients
                      ec2_client = boto3.client('ec2')
                      cloudwatch_client = boto3.client('cloudwatch')

                      # Get instance IDs based on the instance names
                      instance_ids, instance_names_result = get_instance_ids(instance_names, ec2_client)
                      if not instance_ids:
                          print("[INFO] No instances found for transforming metrics.")
                          return

                      # Get metrics for each instance
                      for instance_id, instance_name in zip(instance_ids, instance_names_result):
                          metrics = get_instance_metrics(instance_id, metric_names, metric_units, instance_name, cloudwatch_client)

                          # Put formatted metrics to custom CloudWatch namespace
                          put_custom_metrics(metrics, custom_namespace, cloudwatch_client)

                  except Exception as e:
                      print(f"[ERROR] Metrics transformation failed: {str(e)}")

              def get_instance_ids(instance_names, ec2_client):
                  instance_ids = []
                  instance_names_result = []

                  for instance_name in instance_names:
                      response = ec2_client.describe_instances(
                          Filters=[
                              {'Name': 'tag:Name', 'Values': [instance_name]},
                              {'Name': 'instance-state-name', 'Values': ['running']}
                          ]
                      )

                      # Extract instance IDs from the response
                      ids = [instance['InstanceId'] for reservation in response['Reservations'] for instance in reservation['Instances']]

                      # Record the instance name once per matching ID so both lists stay aligned
                      if ids:
                          instance_ids.extend(ids)
                          instance_names_result.extend([instance_name] * len(ids))

                  return instance_ids, instance_names_result

              def get_instance_metrics(instance_id, metric_names, metric_units, instance_name, cloudwatch_client):
                  end_time = datetime.utcnow()
                  start_time = end_time - timedelta(minutes=10)

                  metrics_dict = {}

                  for metric_name, unit in zip(metric_names, metric_units):
                      # take necessary metric values
                      id_for_query = metric_name.lower()

                      query = {
                          "Id": id_for_query,
                          "MetricStat": {
                              "Metric": {
                                  "Namespace": "AWS/EC2",
                                  "MetricName": metric_name,
                                  "Dimensions": [
                                      {"Name": "InstanceId", "Value": instance_id}
                                  ]
                              },
                              "Period": 300,
                              "Stat": "Average",
                              "Unit": unit
                          },
                          "ReturnData": True
                      }

                      response = cloudwatch_client.get_metric_data(
                          MetricDataQueries=[query],
                          StartTime=start_time,
                          EndTime=end_time
                      )

                      # Extract data from the response
                      metric_data_results = response.get('MetricDataResults', [])

                      if not metric_data_results:
                          print(f"No data available for metric: {metric_name} related to {instance_name}")
                          continue

                      values = metric_data_results[0].get('Values', [])

                      if not values:
                          print(f"No values available for metric: {metric_name} related to {instance_name}")
                          continue

                      # Get the latest value (get_metric_data returns the newest datapoint first by default)
                      latest_value = values[0]

                      # Combine metric name, value, unit, instance name into a dictionary
                      metrics_dict[id_for_query] = {
                          'MetricName': metric_name,
                          'Value': latest_value,
                          'Unit': unit,
                          'InstanceId': instance_id,
                          'InstanceName': instance_name
                      }
                  return metrics_dict

              def put_custom_metrics(metrics_dict, custom_namespace, cloudwatch_client):
                  for metric_id, metric_info in metrics_dict.items():
                      metric_name = metric_info['MetricName']
                      value = metric_info['Value']
                      dimensions = [
                          {'Name': 'InstanceName', 'Value': metric_info['InstanceName']}
                      ]
                      unit = metric_info['Unit']
                      instance_name = metric_info['InstanceName']

                      response = cloudwatch_client.put_metric_data(
                          Namespace=custom_namespace,
                          MetricData=[
                              {
                                  'MetricName': metric_name,
                                  'Dimensions': dimensions,
                                  'Value': value,
                                  'Unit': unit
                              }
                          ]
                      )

                      # Print information about the success or failure process
                      if response['ResponseMetadata']['HTTPStatusCode'] == 200:
                          print(f"Successfully put metric data for {metric_name} in {custom_namespace} related to {instance_name}")
                      else:
                          print(f"Failed to put metric data for {metric_name} in {custom_namespace}. Response: {response}")

      LambdaInvokePermission:
        Type: AWS::Lambda::Permission
        Properties:
          Action: lambda:InvokeFunction
          FunctionName: !Ref TransformCustomMetrics
          Principal: events.amazonaws.com
          SourceArn: !GetAtt ScheduleRule.Arn

      ScheduleRule:
        Type: AWS::Events::Rule
        Properties:
          Name: TransformCustomMetricsScheduleRule
          ScheduleExpression: 'rate(5 minutes)'
          Targets:
            - Arn: !GetAtt TransformCustomMetrics.Arn
              Id: TransformCustomMetricsTarget

    #####################################
    #  CloudWatch Alarms
    #####################################
      CPUAlarm: 
        Type: AWS::CloudWatch::Alarm
        Properties:
          AlarmName: 
            !Sub 
              - '${InstanceName} - High CPU Usage'
              - InstanceName: !Select [0, !Split [",", !Ref InstanceNames]]
          AlarmDescription: 'High CPU Usage'
          AlarmActions:
          - !Sub 'arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:${SnsTopicName}'
          OKActions:
          - !Sub 'arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:${SnsTopicName}'
          MetricName: !Select [0, !Split [",", !Ref MetricNames]]
          Unit: !Select [0, !Split [",", !Ref MetricUnits]]
          Namespace: !Ref CustomNamespace
          Statistic: Average
          Period: 300
          EvaluationPeriods: 3
          Threshold: 90
          ComparisonOperator: GreaterThanOrEqualToThreshold
          Dimensions:
          - Name: InstanceName
            Value: !Select [0, !Split [",", !Ref InstanceNames]]

      SystemStatusAlarm:
        Type: AWS::CloudWatch::Alarm
        Properties:
          AlarmName: 
            !Sub 
              - '${InstanceName} - System Status Check Failed'
              - InstanceName: !Select [0, !Split [",", !Ref InstanceNames]]
          AlarmDescription: 'System Status Check Failed'
          Namespace: !Ref CustomNamespace
          MetricName: !Select [1, !Split [",", !Ref MetricNames]]
          Unit: !Select [1, !Split [",", !Ref MetricUnits]]
          Statistic: Minimum
          Period: 300
          EvaluationPeriods: 1
          ComparisonOperator: GreaterThanThreshold
          Threshold: 0
          AlarmActions:
          - !Sub 'arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:${SnsTopicName}'
          OKActions:
          - !Sub 'arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:${SnsTopicName}'
          Dimensions:
          - Name: InstanceName
            Value: !Select [0, !Split [",", !Ref InstanceNames]]

      InstanceStatusAlarm:
        Type: AWS::CloudWatch::Alarm
        Properties:
          AlarmName: 
            !Sub 
              - '${InstanceName} - Instance Status Check Failed'
              - InstanceName: !Select [0, !Split [",", !Ref InstanceNames]]
          AlarmDescription: 'Instance Status Check Failed'
          Namespace: !Ref CustomNamespace
          MetricName: !Select [2, !Split [",", !Ref MetricNames]]
          Unit: !Select [2, !Split [",", !Ref MetricUnits]]
          Statistic: Minimum
          Period: 300
          EvaluationPeriods: 1
          ComparisonOperator: GreaterThanThreshold
          Threshold: 0
          AlarmActions:
          - !Sub 'arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:${SnsTopicName}'
          OKActions:
          - !Sub 'arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:${SnsTopicName}'
          Dimensions:
          - Name: InstanceName
            Value: !Select [0, !Split [",", !Ref InstanceNames]]
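
The order of the `Values` list returned by `get_metric_data` depends on the `ScanBy` setting of the request, so picking the newest datapoint by index relies on that order. A minimal sketch of an order-independent alternative is shown below. It is not taken from the repository: the helper name `latest_value` is hypothetical and the sample data is hand-written in the shape of one `MetricDataResults` entry.

```python
from datetime import datetime, timezone


def latest_value(metric_data_result):
    """Return the value with the newest timestamp, or None if there is no data."""
    timestamps = metric_data_result.get("Timestamps", [])
    values = metric_data_result.get("Values", [])
    if not timestamps or not values:
        return None
    # Timestamps and Values are parallel lists, so pair them and take the newest.
    _, newest_value = max(zip(timestamps, values), key=lambda pair: pair[0])
    return newest_value


# Hand-written sample shaped like one entry of MetricDataResults.
sample_result = {
    "Id": "cpuutilization",
    "Timestamps": [
        datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc),
        datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    ],
    "Values": [42.0, 17.5],
}

print(latest_value(sample_result))  # 42.0
```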

Infrastructure schema and list of metrics from the ws-deployment namespace:

Deployment:

Clone the repository (if you have not already cloned it).

    git clone https://gitlab.com/Andr1500/ssm_runbook_bluegreen.git

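Put “dummy” metric data into the custom namespace with the AWS CLI, once for each metric. This is necessary because without this data the CloudWatch alarms were not created correctly. The command below shows the CPUUtilization metric; repeat it for the other metric names with their units.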

    aws cloudwatch put-metric-data \
        --namespace "ws-deployment" \
        --metric-name "CPUUtilization" \
        --dimensions "InstanceName=ws-instance" \
        --value 70 --unit Percent
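
Instead of repeating the CLI command for every metric, the placeholder datapoints can be sent in one call. This is only a rough sketch under a few assumptions: the helper names `build_seed_metric_data` and `seed_namespace` are hypothetical, the placeholder values are arbitrary, and the metric names and units are the template defaults.

```python
def build_seed_metric_data(instance_name, metrics):
    """Build the MetricData payload with one placeholder datapoint per metric."""
    return [
        {
            "MetricName": name,
            "Dimensions": [{"Name": "InstanceName", "Value": instance_name}],
            "Value": value,
            "Unit": unit,
        }
        for name, unit, value in metrics
    ]


def seed_namespace(namespace, instance_name, metrics):
    """Send the placeholder datapoints to CloudWatch (needs AWS credentials)."""
    import boto3  # imported lazily so the payload builder works without boto3

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(
        Namespace=namespace,
        MetricData=build_seed_metric_data(instance_name, metrics),
    )


# Placeholder values are arbitrary; 0 stays below the status-check alarm threshold.
SEED_METRICS = [
    ("CPUUtilization", "Percent", 70),
    ("StatusCheckFailed_Instance", "Count", 0),
    ("StatusCheckFailed_System", "Count", 0),
]

payload = build_seed_metric_data("ws-instance", SEED_METRICS)
print([item["MetricName"] for item in payload])
# To actually send the data:
# seed_namespace("ws-deployment", "ws-instance", SEED_METRICS)
```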

Create the CloudFormation stack.

    aws cloudformation create-stack \
        --stack-name ws-ec2-monitoring \
        --template-body file://ec2_monitoring.yaml \
        --capabilities CAPABILITY_NAMED_IAM --disable-rollback
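
If the stack is created from a script, for example to override the “InstanceNames” parameter, the same call can be made with boto3. The snippet is a sketch, not the way the repository deploys the stack: the helper names `build_stack_request` and `create_stack` and the second instance name `ws-instance-2` are hypothetical.

```python
STACK_NAME = "ws-ec2-monitoring"


def build_stack_request(template_body, instance_names):
    """Assemble the keyword arguments for CloudFormation create_stack."""
    return {
        "StackName": STACK_NAME,
        "TemplateBody": template_body,
        "Capabilities": ["CAPABILITY_NAMED_IAM"],
        "DisableRollback": True,
        "Parameters": [
            {
                "ParameterKey": "InstanceNames",
                "ParameterValue": ",".join(instance_names),
            },
        ],
    }


def create_stack(template_path, instance_names):
    """Create the stack and wait until it is complete (needs AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    with open(template_path) as template_file:
        template_body = template_file.read()

    cloudformation = boto3.client("cloudformation")
    cloudformation.create_stack(**build_stack_request(template_body, instance_names))
    cloudformation.get_waiter("stack_create_complete").wait(StackName=STACK_NAME)


# 'ws-instance-2' is a made-up second instance name.
request = build_stack_request("<template body>", ["ws-instance", "ws-instance-2"])
print(request["Parameters"])
# To actually create the stack:
# create_stack("ec2_monitoring.yaml", ["ws-instance", "ws-instance-2"])
```

Note that the alarms in the template use `!Select [0, ...]`, so they are created only for the first instance name in the list, while the Lambda function publishes metrics for all of them.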

Conclusion:

In this post, we showed how to monitor EC2 instances deployed with the Blue/Green deployment strategy. To make sure that the Lambda function works correctly, we can add the “InsufficientDataActions” parameter to the CloudWatch alarms, so that a notification is sent when an alarm changes its state to “Insufficient data”. If you need more specific CloudWatch metrics from the EC2 instances, here is my post about Monitoring Disk Space (as an example) with the CloudWatch agent.
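
In the template, this would be one more `InsufficientDataActions` property on each alarm. As a rough illustration in Python, the sketch below builds the equivalent `put_metric_alarm` arguments for the CPU alarm. The helper name `cpu_alarm_arguments` is hypothetical and the topic ARN is a placeholder with a made-up account ID and region. Changing a stack-managed alarm outside of CloudFormation causes drift, so treat this as an illustration only.

```python
def cpu_alarm_arguments(instance_name, topic_arn):
    """Build put_metric_alarm arguments that also notify on missing data."""
    return {
        "AlarmName": f"{instance_name} - High CPU Usage",
        "AlarmDescription": "High CPU Usage",
        "Namespace": "ws-deployment",
        "MetricName": "CPUUtilization",
        "Unit": "Percent",
        "Dimensions": [{"Name": "InstanceName", "Value": instance_name}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 3,
        "Threshold": 90,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [topic_arn],
        "OKActions": [topic_arn],
        # Runs when the alarm has no datapoints, e.g. the Lambda stopped publishing.
        "InsufficientDataActions": [topic_arn],
    }


# Placeholder ARN: the account ID and region are made up for illustration.
TOPIC_ARN = "arn:aws:sns:eu-central-1:123456789012:blue-green-deployment-notifications"

arguments = cpu_alarm_arguments("ws-instance", TOPIC_ARN)
print(arguments["InsufficientDataActions"])
# To apply (needs AWS credentials):
# boto3.client("cloudwatch").put_metric_alarm(**arguments)
```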

If you found this post helpful and interesting, please click the reaction button below to show your support for the author. Feel free to use and share this post!
