
Muhammad Ahmad Khan


Monitoring & Alerting for AWS EKS Using Grafana, Prometheus & Alertmanager with SNS Integration

In this blog post, we'll walk through setting up a robust monitoring and alerting system for a Kubernetes cluster on AWS EKS. We'll use the kube-prometheus-stack to deploy Prometheus and Grafana, configure an ALB Ingress for external access, and set up Amazon SNS as the receiver for Alertmanager to handle alerts.

Using Amazon SNS as the alert receiver is particularly useful when an organization does not use Slack or wants a centralized way to distribute alerts via email. While Alertmanager supports email notifications natively, using email as a direct receiver requires configuring an SMTP server, which adds administrative overhead. With SNS, you can subscribe multiple email addresses to a single topic without any SMTP setup, making it a simpler and more scalable solution.

However, note that SNS does not render HTML content: alerts arrive as plain text rather than formatted HTML, which may affect readability.


Prerequisites

Before we begin, ensure you have the following:

  1. An AWS account with a running EKS cluster.
  2. kubectl and helm installed on your local machine.
  3. AWS CLI configured with the necessary credentials.
  4. AWS Load Balancer Controller installed in the EKS cluster.
  5. kubectl pointing to the correct cluster context.
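Before proceeding, you can sanity-check that the required CLI tools are on your PATH with a small shell loop (a convenience sketch, not part of the original setup):

```shell
# Report whether each required CLI tool is installed
for tool in kubectl helm aws; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```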

Step 1: Install the kube-prometheus-stack Using Helm

The kube-prometheus-stack is a collection of Kubernetes manifests, Grafana dashboards, and Prometheus rules that provide easy deployment and management of Prometheus and Grafana on Kubernetes.

1.1 Add the Prometheus Community Helm Repository

First, add the Prometheus Community Helm repository and update it:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

1.2 Deploy the kube-prometheus-stack

Create a prom-operator-values.yaml file to customize the deployment. For example, set the Grafana admin password, provision custom dashboards, and configure the Grafana ingress:

grafana:
  adminPassword: "your-secure-password"
  ### Provision grafana-dashboards-kubernetes ###
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
      - name: 'grafana-dashboards-kubernetes'
        orgId: 1
        folder: 'Kubernetes'
        type: file
        disableDeletion: true
        editable: true
        options:
          path: /var/lib/grafana/dashboards/grafana-dashboards-kubernetes
  dashboards:
    grafana-dashboards-kubernetes:
      k8s-system-api-server:
        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-system-api-server.json
        token: ''
      k8s-system-coredns:
        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-system-coredns.json
        token: ''
      k8s-views-global:
        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-views-global.json
        token: ''
      k8s-views-namespaces:
        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-views-namespaces.json
        token: ''
      k8s-views-nodes:
        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-views-nodes.json
        token: ''
      k8s-views-pods:
        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-views-pods.json
        token: ''
  ingress:
    annotations:
      kubernetes.io/ingress.class: alb
      alb.ingress.kubernetes.io/load-balancer-name: grafana-alb
      alb.ingress.kubernetes.io/scheme: internet-facing
      alb.ingress.kubernetes.io/target-type: ip
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
      alb.ingress.kubernetes.io/ssl-redirect: '443'
      alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:<AWS_REGION>:<ACCOUNT_ID>:certificate/a0ff498e-62fd-4397-8d4a-626360465d32
    enabled: true
    hosts:
    - grafana-test.apps.xyz-company.com
    labels: {}
    path: /

Deploy the kube-prometheus-stack using Helm:

helm upgrade --install k-prom-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --values prom-operator-values.yaml

This command installs Prometheus, Grafana, and Alertmanager in the monitoring namespace.
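To confirm the release came up, list the pods in the monitoring namespace; with the release name k-prom-stack you should see Prometheus, Grafana, and Alertmanager pods reach a Running state:

```shell
# List all monitoring pods; everything should eventually be Running
kubectl get pods --namespace monitoring
```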

1.3 Access Prometheus and Alertmanager UIs

To access the Prometheus and Alertmanager web UIs, use kubectl port-forward:

  • Prometheus UI:
  kubectl port-forward svc/prometheus-operated 9090:9090 --namespace monitoring

Open your browser and go to http://localhost:9090.

  • Alertmanager UI:
  kubectl port-forward svc/alertmanager-operated 9093:9093 --namespace monitoring

Open your browser and go to http://localhost:9093.

To access the Grafana web UI:

  • Create a DNS entry mapping the domain defined in the ingress to the DNS name of the created load balancer.

  • Use admin as the username and the password defined in prom-operator-values.yaml to log in to Grafana.
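The ALB hostname to point the DNS entry at can be read from the ingress status. The ingress name below assumes the release name k-prom-stack; adjust it if yours differs:

```shell
# Print the ALB hostname assigned to the Grafana ingress
kubectl get ingress k-prom-stack-grafana --namespace monitoring \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
```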


Step 2: Set Up Amazon SNS for Alertmanager

To receive alerts via email or other channels, we'll configure Amazon SNS as the receiver for Alertmanager.

2.1 Create an SNS Topic

Create an SNS topic to receive alerts:

aws sns create-topic --name alertTopic

Note the ARN (Amazon Resource Name) from the output, as it will be needed in later steps.
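To avoid copying the ARN by hand, you can capture it into a shell variable with a JMESPath query (a convenience sketch, assuming the AWS CLI has a default region configured; create-topic is idempotent, so rerunning it is safe):

```shell
# Create the topic (idempotent) and capture its ARN for later commands
TOPIC_ARN=$(aws sns create-topic --name alertTopic --query 'TopicArn' --output text)
echo "Topic ARN: $TOPIC_ARN"
```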

2.2 Create an Email Subscription for SNS

Subscribe your email to the SNS topic to receive notifications:

aws sns subscribe \
    --topic-arn arn:aws:sns:<AWS_REGION>:<ACCOUNT_ID>:alertTopic \
    --protocol email \
    --notification-endpoint <your-email@example.com>

Check your email inbox for a confirmation message and click the link to activate the subscription.
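You can verify the subscription was confirmed by listing the topic's subscriptions; a still-pending subscription shows PendingConfirmation in place of a subscription ARN:

```shell
# List subscriptions on the topic; confirmed ones show a full subscription ARN
aws sns list-subscriptions-by-topic \
    --topic-arn "arn:aws:sns:<AWS_REGION>:<ACCOUNT_ID>:alertTopic"
```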

2.3 Create a New IAM Role for EKS Node Group to Assume and Publish to SNS

2.3.1 Create a Trust Relationship

Create a trust relationship policy to allow the EKS node group role to assume the new role:

cat << EOF > trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<ACCOUNT_ID>:role/eks-node-group-role"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

Create the IAM role:

aws iam create-role \
    --role-name alertmanager_role \
    --assume-role-policy-document file://trust-policy.json

2.3.2 Attach Permissions to Publish to SNS

Create a policy to allow the role to publish messages to the SNS topic:

cat << EOF > sns-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sns:Publish",
      "Resource": "arn:aws:sns:<AWS_REGION>:<ACCOUNT_ID>:alertTopic"
    }
  ]
}
EOF

Attach the policy to the new role:

aws iam put-role-policy \
    --role-name alertmanager_role \
    --policy-name SNSPublishPolicy \
    --policy-document file://sns-policy.json

2.4 Create a Resource-Based Policy for SNS

Allow this new role named alertmanager_role to publish alerts to the SNS topic by attaching a resource-based policy:

cat << EOF > sns-resource-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<ACCOUNT_ID>:role/alertmanager_role"
      },
      "Action": "sns:Publish",
      "Resource": "arn:aws:sns:<AWS_REGION>:<ACCOUNT_ID>:alertTopic"
    }
  ]
}
EOF

Apply the policy to the SNS topic:

aws sns set-topic-attributes \
    --topic-arn arn:aws:sns:<AWS_REGION>:<ACCOUNT_ID>:alertTopic \
    --attribute-name Policy \
    --attribute-value file://sns-resource-policy.json
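Before wiring up Alertmanager, it's worth publishing a test message directly to the topic to confirm email delivery end to end:

```shell
# Publish a test message; confirmed subscribers should receive it by email
aws sns publish \
    --topic-arn "arn:aws:sns:<AWS_REGION>:<ACCOUNT_ID>:alertTopic" \
    --subject "Test alert" \
    --message "If you can read this, SNS email delivery works."
```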

2.5 Update the IAM Role for the EKS Node Group

Allow the EKS node group to assume this new role named alertmanager_role by attaching a policy:

cat << EOF > eks-assume-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::<ACCOUNT_ID>:role/alertmanager_role"
    }
  ]
}
EOF

Attach the policy to the EKS node group role:

aws iam put-role-policy \
    --role-name eks-node-group-role \
    --policy-name assume_alertmanager_role_policy \
    --policy-document file://eks-assume-policy.json

Note:

  • Your EKS node group role name may differ from the one used here (eks-node-group-role); substitute your own in the commands above.
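If you're unsure of your node group's role, you can look it up; the cluster and node group names below are placeholders:

```shell
# Print the IAM role ARN attached to the managed node group
aws eks describe-nodegroup \
    --cluster-name "<CLUSTER_NAME>" \
    --nodegroup-name "<NODEGROUP_NAME>" \
    --query 'nodegroup.nodeRole' --output text
```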

Step 3: Configure Alertmanager to Send Alerts to SNS

Update the Alertmanager configuration in the same Helm values file (prom-operator-values.yaml) to send alerts to the SNS topic:

alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['job', 'alertname', 'priority']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'sns-receivers'
    receivers:
    - name: 'null'  # No-op receiver; useful for silencing alerts you don't want delivered
    - name: sns-receivers
      sns_configs:
        - api_url: https://sns.<AWS_REGION>.amazonaws.com
          topic_arn: arn:aws:sns:<AWS_REGION>:<ACCOUNT_ID>:alertTopic
          sigv4:
            region: <AWS_REGION>
            role_arn: arn:aws:iam::<ACCOUNT_ID>:role/alertmanager_role
          subject: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]'
          message: |-
            {{ if gt (len .Alerts.Firing) 0 }}
            Alerts Firing:
            {{ template "__text_alert_list_markdown" .Alerts.Firing }}
            {{ end }}
            {{ if gt (len .Alerts.Resolved) 0 }}
            Alerts Resolved:
            {{ template "__text_alert_list_markdown" .Alerts.Resolved }}
            {{ end }}
          send_resolved: true  # Sends notification when alert is resolved
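The kube-prometheus-stack ships a Watchdog alert that fires continuously as a liveness heartbeat. If you don't want it emailed out through SNS, one common pattern (optional, adjust to your needs) is a child route that sends it to the null receiver:

```yaml
alertmanager:
  config:
    route:
      # ...existing route settings from above...
      routes:
      - match:
          alertname: Watchdog
        receiver: 'null'
```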

Apply the updated configuration:

helm upgrade --install k-prom-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --values prom-operator-values.yaml

Step 4: Configure Custom Rules for Custom Alerts

Update prom-operator-values.yaml to add custom Prometheus alerting rules:

additionalPrometheusRulesMap:
  custom-rules:
    groups:
    - name: customGroupA.rules
      rules:
      - alert: CustomAlertInstanceHighCpuUtilization
        annotations:
          description: CPU usage has been over 75% for 5 minutes | Current usage is {{ $value | printf "%.2f" }}%
          summary: CPU usage is over 75% (instance {{ $labels.instance }})
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 75
        for: 5m
        labels:
          severity: critical
      - alert: CustomAlertInstanceHighMemoryUtilization
        annotations:
          description: Memory usage has been over 75% for 5 minutes | Current usage is {{ $value | printf "%.2f" }}%
          summary: Memory usage is over 75% (instance {{ $labels.instance }})
        expr: 100 - (sum by(instance) (node_memory_MemAvailable_bytes) / sum by(instance) (node_memory_MemTotal_bytes) * 100) > 75
        for: 5m
        labels:
          severity: critical

Apply the updated configuration:

helm upgrade --install k-prom-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --values prom-operator-values.yaml

The complete Helm values file can be found on GitHub here.

Conclusion

In this blog post, we set up a comprehensive monitoring and alerting system for a Kubernetes cluster running on AWS EKS. We deployed Prometheus and Grafana using the kube-prometheus-stack, configured an ALB Ingress for external access to Grafana, and set up Amazon SNS as the receiver for Alertmanager to receive alerts.

With this setup, you can now monitor your Kubernetes cluster, visualize metrics using Grafana, and receive alerts via Alertmanager and SNS. This ensures that your cluster is both observable and resilient to issues.
