Top AWS CloudWatch Anti-Patterns that could derail your Observability strategy

#awscloudoperations #cloudwatch #awsobservability #aws

AWS CloudWatch provides a comprehensive set of services for full-stack observability of your applications, regardless of whether they are deployed as server-based (EC2), container-based (ECS), or serverless (Lambda, EKS/ECS with Fargate). When you start a new project, you can follow a standard approach to enable full-stack observability via CloudWatch.

First, you need to enable telemetry for your application. CloudWatch can’t do anything unless your application emits logs, metrics, and traces. At a high level, you can use the CloudWatch agent or AWS Distro for OpenTelemetry (ADOT) to instrument your workloads and collect that telemetry. You can also enable Real User Monitoring (RUM) to capture digital experience metrics. The good thing about CloudWatch is that, whether you run server-based or serverless workloads, AWS services publish many infrastructure metrics automatically; a few, such as EC2 memory and disk usage, need the CloudWatch agent. It’s also a best practice to use the built-in insights features, such as Container Insights, Lambda Insights, and Application Insights, which are a convenient way to obtain curated telemetry.
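To make "emitting metrics" concrete, here is a minimal sketch of a custom metric written as a structured log line in the CloudWatch Embedded Metric Format (EMF), which CloudWatch can extract into a metric once the line reaches CloudWatch Logs. The namespace `MyApp`, the `Service` dimension, and the metric name `CheckoutLatency` are made-up placeholders for illustration, not names from this article.

```python
import json
import time


def build_emf_record(namespace, service, metric_name, value,
                     unit="Milliseconds", timestamp_ms=None):
    """Return one EMF log line (a JSON string) carrying a single metric value."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # EMF expects epoch milliseconds
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # Each inner list is one dimension set; values live at the top level.
                    "Dimensions": [["Service"]],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        "Service": service,    # dimension value (hypothetical dimension name)
        metric_name: value,    # metric value, keyed by the metric name
    }
    return json.dumps(record)


if __name__ == "__main__":
    # Printing to stdout is enough in environments that ship stdout to CloudWatch Logs.
    print(build_emf_record("MyApp", "checkout", "CheckoutLatency", 182.5))
```

In environments where stdout is not forwarded to CloudWatch Logs, the same line would have to be delivered by an agent or logging library, which this sketch does not cover.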

Second, build your basic observability layer: alerting, dashboards, and automation. These days, AIOps capabilities such as anomaly detection can be used as well. You can create synthetic monitors, define Service Level Objectives (SLOs), and create intelligent alerts.

In general, AWS provides the building blocks for you to develop a comprehensive full-stack observability solution.

Let’s now look at some of the anti-patterns related to CloudWatch that you need to avoid:

Not having an observability plan and relying on pockets of good practices – a “feel-good” approach.
Not keeping track of CloudWatch updates – you’re going to miss a lot of important information.
Not embracing automatic instrumentation when designing and developing your systems.
Not moving away from static thresholds.
Configuring alerts without considering the customer’s needs.
Underestimating the role of GenAI in operations, especially in AWS.

Let's dive into the details of each of these anti-patterns:

1). Focusing on pockets of observability instead of full-stack observability. Yes, full-stack observability is what you should strive for.

Full-stack observability involves:
Frontend performance monitoring: Use Real User Monitoring (RUM).
Synthetic monitoring: Configure Synthetics canaries for additional proactive monitoring.
APM: Enable Application Performance Monitoring (APM) to discover services and build service maps. This also gives you traces and trace maps.
Logs: Send logs to CloudWatch for centralized monitoring.
Metrics: With the above in place, you should have the metrics needed for comprehensive observability of your application. Follow the four golden signals when defining your metrics: traffic, error rate, latency, and saturation (capacity-related).
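As a rough illustration of the four golden signals, the sketch below derives them from a batch of request records. The `Request` type, the choice of HTTP 5xx as "error", the nearest-rank p95, and CPU utilization as the saturation proxy are all assumptions made for this example; CloudWatch computes its own statistics and this is not how the service does it.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def golden_signals(requests, window_seconds, cpu_utilization):
    """Summarize traffic, error rate, latency (p95), and saturation for one window."""
    total = len(requests)
    if total == 0:
        return {"traffic_rps": 0.0, "error_rate": 0.0,
                "latency_p95_ms": None, "saturation": cpu_utilization}
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx == error
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank percentile, using integer math to avoid float rounding surprises.
    rank = max(1, (95 * total + 99) // 100)
    return {
        "traffic_rps": total / window_seconds,
        "error_rate": errors / total,
        "latency_p95_ms": latencies[rank - 1],
        "saturation": cpu_utilization,  # assumption: CPU as the capacity signal
    }


if __name__ == "__main__":
    sample = [Request(latency_ms=10.0 * i, status=500 if i <= 2 else 200)
              for i in range(1, 21)]
    print(golden_signals(sample, window_seconds=10, cpu_utilization=0.6))
```

Whatever the data source, the point is that each service should expose all four signals, not just the one that happens to be easy to collect.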

2). Not keeping track of what AWS is doing with CloudWatch. It’s the easiest way to miss out on some great capabilities. Yes, you may wonder why I'm mentioning this, but it's based on my experience. AWS frequently releases new updates, and I often tend to miss them.
In case you missed it, recent CloudWatch changes include the ability to maintain context in observability data, near real-time network monitoring, database insights for Amazon Aurora, enhanced observability for ECS, enhanced application signals providing transaction spans, and centralized telemetry configuration and visibility.

3). Not using automatic instrumentation. Enabling telemetry can be challenging, but it doesn’t have to be. Failing to leverage the automatic telemetry instrumentation provided by CloudWatch Application Signals is a major mistake.
CloudWatch Application Signals is a great feature that allows you to automatically instrument your application. This removes the burden of manually enabling metrics, logs, and traces. Application Signals supports various platforms such as EKS, ECS, EC2, Lambda, Kubernetes, and some custom hosting setups.

Every service that is automatically discovered gets metrics aligned with the four golden signals, a service page, and a place on the service map. This is a great way to fast-track your observability implementation.

4). When it comes to alerting, relying only on static thresholds is probably the biggest anti-pattern of all. Not embracing built-in AIOps capabilities, like metric and log anomaly detection, is a costly mistake.
Static thresholds are a thing of the past. We are now in the era of AI, and anomaly detection plays a major role. Instead of a fixed number, you learn what normal looks like for a metric and get alerted when the value leaves that baseline (either upwards or downwards). CloudWatch provides metric anomaly detection, which is a great capability you should use. CloudWatch log anomaly detection is another excellent feature to stay on top of your logs. Let CloudWatch alert you whenever a new error appears or an existing error condition increases. You can enable this for all your log groups.
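As a sketch of what replacing a static threshold looks like in practice, the function below builds the request parameters for a CloudWatch alarm that fires when a metric leaves its anomaly detection band in either direction. The helper name `anomaly_alarm_params`, the alarm name, the load balancer dimension value, and the defaults (band width 2, three 5-minute periods) are illustrative assumptions; tune them for your workload.

```python
def anomaly_alarm_params(alarm_name, namespace, metric_name, dimensions,
                         band_width=2, period=300, evaluation_periods=3):
    """Build keyword arguments for CloudWatch's PutMetricAlarm using an anomaly band."""
    return {
        "AlarmName": alarm_name,
        # Alarm when the metric goes above OR below the expected band.
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": evaluation_periods,
        # Points at the band expression below instead of a fixed Threshold value.
        "ThresholdMetricId": "band",
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": namespace,
                        "MetricName": metric_name,
                        "Dimensions": [{"Name": k, "Value": v}
                                       for k, v in dimensions.items()],
                    },
                    "Period": period,
                    "Stat": "Average",
                },
            },
            {
                "Id": "band",
                # band_width is the number of standard deviations around the model.
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "Expected range",
                "ReturnData": True,
            },
        ],
    }


if __name__ == "__main__":
    params = anomaly_alarm_params(
        "alb-latency-anomaly",                      # hypothetical alarm name
        "AWS/ApplicationELB", "TargetResponseTime",
        {"LoadBalancer": "app/demo/0123456789abcdef"},  # hypothetical dimension value
    )
    print(params["Metrics"][1]["Expression"])
```

With boto3 installed and credentials configured, these parameters could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`; the sketch stops short of that call so it runs without an AWS account.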

5). Developing a bunch of random alerts covering all corners of the application but still struggling to identify customer experience-related failures.
We are very good at creating a lot of alerts, but some of them don’t make sense. For example, if a workload goes down, autoscaling will bring it back up. Yes, you need to find the root cause, but it might have minimal impact on the customer experience. Similarly, high resource utilization is often not a problem, since the system runs just fine. Don’t get me wrong, I’m not saying we shouldn’t monitor these things, but time is precious, and we need to focus on what really matters. Instead of being flooded with non-actionable alerts, we need to focus on issues that directly impact users.

This is where Service Level Objectives (SLOs) come in. SLOs define what "good" means for your systems and are closely correlated with end-user experience. CloudWatch provides the ability to create and track SLOs. You should focus more on developing SLOs and then build an alert framework around them.
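The arithmetic behind an SLO is simple enough to sketch. The example below assumes a request-based availability SLO and computes how much of the error budget has been used in a window; the function name `slo_report` and the sample numbers are hypothetical, and this is generic SLO math rather than the CloudWatch SLO API.

```python
def slo_report(target, total_requests, failed_requests):
    """Summarize SLO attainment and error budget use for one measurement window.

    target is the SLO as a fraction, e.g. 0.999 for "99.9% of requests succeed".
    """
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The error budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - target) * total_requests
    return {
        "attainment": (total_requests - failed_requests) / total_requests,
        "allowed_failures": allowed_failures,
        # 1.0 means the budget is exactly spent; above 1.0 the SLO is breached.
        "budget_consumed": failed_requests / allowed_failures,
        "slo_met": failed_requests <= allowed_failures,
    }


if __name__ == "__main__":
    report = slo_report(target=0.999, total_requests=1_000_000, failed_requests=400)
    print(report)
```

Alerting on how fast the budget is being consumed, rather than on every individual failure, is what keeps the alert stream tied to customer impact.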

6). Skipping GenAI in cloud operations is a mistake. GenAI is here to stay, and AWS has already integrated it to provide you with AI Operations capabilities.
CloudWatch is now integrated with Amazon Q Developer. With the new GenAI integration, you'll be able to tap into Q to provide intelligent insights when troubleshooting issues using the telemetry data already residing within CloudWatch. It may take a little time to get used to, but once you get past the initial phase, it will definitely help expedite root cause identification. The world is moving towards AIOps, and this is a great feature to experiment with to reduce SME dependencies as well.

That’s it! Those are the top 6 CloudWatch anti-patterns I think you should avoid. AWS CloudWatch is one of the best observability suites available in the market. While AWS provides great services, we need to use them correctly to get the best results.
