A CloudWatch alarm is triggered. Now what?
I am not the first person to tell you that observability is essential to your cloud infrastructure. You are not done when you have set up CloudWatch alarms!
Who will act on those alarms?
Observing metrics and raising an alarm when a threshold is breached is just the start! You must also consider who will act when the alarm is triggered. In some cases, you can automate the response. For example, think of a high CPU load on your application servers: the CloudWatch alarm is triggered, and in response you scale out your cluster.
These auto-remediation actions work very well for known issues. You can detect them and build remediation actions that are triggered by these CloudWatch alarms.
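To make the CPU example a bit more concrete, here is a minimal sketch of the parameters such an alarm could use. The group name, the threshold of 80%, the evaluation window, and the scaling policy ARN are all assumptions for illustration; the function only builds the arguments and does not call AWS.

```python
def cpu_scale_out_alarm(asg_name: str, scaling_policy_arn: str) -> dict:
    """Build keyword arguments for CloudWatch's PutMetricAlarm call."""
    return {
        "AlarmName": f"{asg_name}-high-cpu",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
        "Statistic": "Average",
        "Period": 300,              # seconds per datapoint (assumed)
        "EvaluationPeriods": 2,     # two breaching datapoints in a row (assumed)
        "Threshold": 80.0,          # percent CPU (assumed)
        "ComparisonOperator": "GreaterThanThreshold",
        # The alarm action is the auto-remediation: a scaling policy.
        "AlarmActions": [scaling_policy_arn],
    }


# Hypothetical names; replace with your own group and policy ARN.
cpu_alarm = cpu_scale_out_alarm(
    "app-servers",
    "arn:aws:autoscaling:eu-west-1:123456789012:scalingPolicy:example",
)
# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("cloudwatch").put_metric_alarm(**cpu_alarm)
```

Keeping the scaling policy in AlarmActions means no human is involved for this known failure mode.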
But in some cases, you just need a human to jump online and figure out what happened. When this happens, you don’t want to start a scavenger hunt in your AWS account to find the problem. You want a clear starting point and, from there, guidance in the right direction.
CloudWatch Dashboards
CloudWatch dashboards can guide you in the right direction during production issues. For example, assume we have an SQS queue that contains messages. A Lambda function is processing these messages. If processing fails, the message is retried, and after three attempts, it is delivered to the dead-letter queue.
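The "three attempts, then dead-letter queue" behavior is configured on the source queue through its redrive policy. A small sketch of that attribute, assuming a hypothetical dead-letter queue ARN:

```python
import json


def redrive_policy_attributes(dlq_arn: str, max_receive_count: int = 3) -> dict:
    """Build the SQS queue attributes that send failed messages to a DLQ."""
    policy = {
        "deadLetterTargetArn": dlq_arn,
        # After this many receives without a delete, SQS moves the message.
        "maxReceiveCount": str(max_receive_count),
    }
    # SQS expects the policy as a JSON string inside the attribute map.
    return {"RedrivePolicy": json.dumps(policy)}


# Hypothetical ARN for illustration.
queue_attributes = redrive_policy_attributes(
    "arn:aws:sqs:eu-west-1:123456789012:my-queue-dlq"
)
# With boto3, this could be applied to the source queue with:
#   boto3.client("sqs").set_queue_attributes(
#       QueueUrl=source_queue_url, Attributes=queue_attributes)
```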
When one or more messages are in the dead-letter queue, a CloudWatch alarm is triggered, and we know something went wrong in our application.
This alarm can publish to an SNS topic, and from there, you can reach an engineer who can investigate the issue. A dashboard that shows these components makes it easy to navigate to the issue.
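As a sketch, the alarm on the dead-letter queue could look like the following. The queue name, the topic ARN, the one-minute period, and the choice to treat missing data as not breaching are my assumptions; the function only builds the arguments.

```python
def dlq_alarm(dlq_name: str, sns_topic_arn: str) -> dict:
    """Build PutMetricAlarm arguments that fire on any message in the DLQ."""
    return {
        "AlarmName": f"{dlq_name}-not-empty",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": dlq_name}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",  # one or more messages
        "TreatMissingData": "notBreaching",            # an idle queue is fine (assumed)
        "AlarmActions": [sns_topic_arn],               # notify the engineer
    }


# Hypothetical names for illustration.
dlq_alarm_params = dlq_alarm(
    "my-queue-dlq",
    "arn:aws:sns:eu-west-1:123456789012:on-call",
)
# With boto3, this could be applied with:
#   boto3.client("cloudwatch").put_metric_alarm(**dlq_alarm_params)
```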
Here, you can see the dead-letter queues from our application. MyLambda is in an alarm state, and on the right, you see a link that brings you directly to that function's LogGroup.
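A dashboard like this can be described as a JSON body. The sketch below is not the exact dashboard from this post; it assumes one alarm, one dead-letter queue, and the default Lambda log group name, and the console link format is an assumption based on how the console currently encodes log group names.

```python
import json


def dlq_dashboard_body(region: str, dlq_name: str, alarm_arn: str,
                       function_name: str) -> str:
    """Build a CloudWatch dashboard body with alarm, metric, and log link."""
    log_group = f"/aws/lambda/{function_name}"  # default Lambda log group
    # Assumed console URL format: slashes are encoded as $252F.
    logs_url = (
        f"https://{region}.console.aws.amazon.com/cloudwatch/home"
        f"?region={region}#logsV2:log-groups/log-group/"
        + log_group.replace("/", "$252F")
    )
    widgets = [
        {   # Alarm status: the clear starting point.
            "type": "alarm", "x": 0, "y": 0, "width": 12, "height": 3,
            "properties": {"title": "Dead-letter queue alarms",
                           "alarms": [alarm_arn]},
        },
        {   # Link on the right that leads straight to the logs.
            "type": "text", "x": 12, "y": 0, "width": 12, "height": 3,
            "properties": {"markdown": f"[Logs for {function_name}]({logs_url})"},
        },
        {   # Queue depth over time, for context.
            "type": "metric", "x": 0, "y": 3, "width": 24, "height": 6,
            "properties": {
                "title": "Messages in dead-letter queue",
                "region": region,
                "stat": "Maximum",
                "period": 60,
                "metrics": [["AWS/SQS", "ApproximateNumberOfMessagesVisible",
                             "QueueName", dlq_name]],
            },
        },
    ]
    return json.dumps({"widgets": widgets})


# Hypothetical names for illustration.
dashboard_body = dlq_dashboard_body(
    "eu-west-1", "my-queue-dlq",
    "arn:aws:cloudwatch:eu-west-1:123456789012:alarm:my-queue-dlq-not-empty",
    "MyLambda",
)
# With boto3, this could be published with:
#   boto3.client("cloudwatch").put_dashboard(
#       DashboardName="my-application", DashboardBody=dashboard_body)
```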
It turns out that someone pushed code that throws an exception. We fixed the code and started re-driving the messages in the dead-letter queue back to the original queue. The Lambda function is now processing the messages as it should, and the alarm will go back to an OK state.
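The redrive can be started from the console, but it can also be scripted. A minimal sketch of the request for the SQS StartMessageMoveTask API, assuming hypothetical queue ARNs:

```python
def redrive_request(dlq_arn: str, source_queue_arn: str) -> dict:
    """Build StartMessageMoveTask arguments to move DLQ messages back."""
    return {
        "SourceArn": dlq_arn,                 # where the failed messages are now
        "DestinationArn": source_queue_arn,   # the original queue
    }


# Hypothetical ARNs for illustration.
redrive = redrive_request(
    "arn:aws:sqs:eu-west-1:123456789012:my-queue-dlq",
    "arn:aws:sqs:eu-west-1:123456789012:my-queue",
)
# With boto3, the task could be started with:
#   boto3.client("sqs").start_message_move_task(**redrive)
```

Only start the redrive after the fix is deployed; otherwise the same messages will fail again and end up back in the dead-letter queue.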
Conclusion
When you build cloud infrastructure, you also need to think about what can go wrong. Design for failure, build auto-remediation for all known failures, and loop in a person when unknown issues appear. But be kind to your fellow engineer and don’t send him on a scavenger hunt. Guide him to the problem and give context about the failures in a CloudWatch dashboard.
Photo by Gill Heward.