DEV Community

Cover image for Learn Serverless Tracability with "The X-Ray Tracer" CDK Pattern - Lambda, SNS, SQS, DynamoDB
Matt Coulter for CDK Patterns

Posted on • Edited on

Learn Serverless Tracability with "The X-Ray Tracer" CDK Pattern - Lambda, SNS, SQS, DynamoDB

The X-Ray Tracer

High Level Image

This is a pattern not defined by the components used but how they send information back to the AWS X-Ray service to help you make your application perform better when viewed through the Serverless Well-Architected lens. A fully well architected solution would use embedded metric format for the logs like in the Julian Wood reference below but I am saving that for another pattern so as not to confuse the concepts.

Some useful references:

Author Link
AWS X-Ray Developer Guide X-Ray Developer Guide
AWS X-Ray Concepts X-Ray Concepts
AWS Serverless Lens Whitepaper Serverless Lens Whitepaper
AWS Well Architected Whitepaper Well Architected Whitepaper
Julian Wood Building Well Architected Applications
AWS Developer Blog Category: AWS X-Ray
AWS Training Introduction to AWS X-Ray

Available CDK Versions

Alternatively use the cdkp command line tool to download the code:

TypeScript

npx cdkp init the-xray-tracer
Enter fullscreen mode Exit fullscreen mode

Python

npx cdkp init the-xray-tracer --lang=python
Enter fullscreen mode Exit fullscreen mode

Well Architected

The AWS Well-Architected Framework helps you understand the pros and cons of
decisions you make while building systems on AWS. By using the Framework, you will learn architectural best practices for designing and operating reliable, secure, efficient, and cost-effective systems in the cloud. It provides a way for you to consistently measure your architectures against best practices and identify areas for improvement.

We believe that having well-architected systems greatly increases the likelihood of business success.

Serverless Lens Whitepaper

Well Architected Whitepaper

The Operational Excellence Pillar

Note - The content for this section is a subset of the Serverless Lens Whitepaper with some minor tweaks.

The operational excellence pillar includes the ability to run and monitor systems to deliver business value and to continually improve supporting processes and procedures.

OPS 1: How do you understand the health of your Serverless application?

Distributed Tracing

Similar to non-serverless applications, anomalies can occur at larger scale in distributed systems. Due to the nature of serverless architectures, it’s fundamental to have distributed tracing.

Making changes to your serverless application entails many of the same principles of deployment, change, and release management used in traditional workloads. However, there are subtle changes in how you use existing tools to accomplish these principles.

Active tracing with AWS X-Ray should be enabled to provide distributed tracing capabilities as well as to enable visual service maps for faster troubleshooting.

X-Ray helps you identify performance degradation and quickly understand anomalies, including latency distributions.

Example Trace

Service Maps are helpful to understand integration points that need attention and resiliency practices. For integration calls, retries, backoffs, and possibly circuit breakers are necessary to prevent faults from propagating to downstream services.

Another example is networking anomalies. You should not rely on default timeouts and retry settings. Instead, tune them to fail fast if a socket read/write timeout happens where the default can be seconds if not minutes in certain clients.

X-Ray also provides two powerful features that can improve the efficiency on identifying anomalies within applications: Annotations and Subsegments.
Subsegments are helpful to understand how application logic is constructed and what external dependencies it has to talk to. Annotations are key-value pairs with string, number, or Boolean values that are automatically indexed by AWS X-Ray.

Combined, they can help you quickly identify performance statistics on specific
operations and business transactions, for example, how long it takes to query a
database, or how long it takes to process pictures with large crowds.

Alt Text

What is Included In This Pattern?

I wanted to make this pattern as "real" as possible for people so I included most of the serverless components you will use everyday. I have included:

  • API Gateway -> SNS -> Lambda (not SQS for reasons documented later)
  • Lambda -> DynamoDB
  • Lambda -> SQS -> Lambda
  • Lambda -> External Http Endpoint
  • Lambda -> SNS -> Lambda

When I map the flow in a high level conceptual image it looks like this:

Flow

After deployment, the X-Ray service map looks something like (You get two circles per Lambda):

x-ray service map

Or if you look at CloudWatch Service Lens:

cloudwatch service lens

You can see that these two diagrams aren't a massive distance away from my high level conceptual flow, the difference is the X-Ray generated diagram is 100% accurate because it is created from real traces based on user flow. When viewed through the AWS Console, it cannot become the out of date diagram you found on a wiki last updated 12 months ago - it is always accurate. If a developer checks in a piece of code that changes the flow, you will see it immediately.

Testing This Pattern

After you deploy this pattern you will have an API Gateway with the URL being output in the CDK Deploy logs.

Any URL you hit on that gateway will trigger this flow, it uses your URL as the message sent to SNS.

This URL is inserted into DynamoDB with a counter of how many times it was hit, the SNS and the SQS consumer lambdas both log the message to CloudWatch.

To see the random service errors try to hit the URL at least 10 times then you can either navigate to the X-Ray section of the AWS Console followed by clicking on "Service Map" in the sidebar. Please be aware that sometimes there can be a 30 second or so delay before your calls show up in service map.

Alternatively you can go to CloudWatch in the AWS Console and click "Service Lens" in the sidebar. Both offer views onto the system.

Deliberate Random Errors

I introduced a random SSL Cert error into the Lambda that connects to the External Http Endpoint to let you experiment with using X-Ray to source an error

Service map showing something isn't healthy:


Unhealthy service map

Trace details showing error:



error showing in hover in console

Enabling X-Ray

Depending on the component you are using and what it is integrating with you need to enable X-Ray in a different way.

API Gateway

This is done by simply setting a property of tracingEnabled: true on deployOptions:

let gateway = new apigw.RestApi(this, 'xrayTracerAPI', {
    deployOptions: {
    metricsEnabled: true,
    loggingLevel: apigw.MethodLoggingLevel.INFO,
    dataTraceEnabled: true,
    tracingEnabled: true,
    stageName: 'prod'
    }
});
Enter fullscreen mode Exit fullscreen mode

Lambda

You set a tracing property to lambda.Tracing.ACTIVE

this.httpLambda = new lambda.Function(this, 'httpLambdaHandler', {
    runtime: lambda.Runtime.NODEJS_12_X,
    code: lambda.Code.asset('lambdas'),
    handler: 'http.handler',
    tracing: lambda.Tracing.ACTIVE
});
Enter fullscreen mode Exit fullscreen mode

SNS/SQS/DynamoDB

These just pick up when being called from a component with tracing enabled, they do not need a specific setting to enable it.

AWS SDK Calls

You need to make sure your AWS SDK code is wrapped with X-Ray during invocation. This is true of any SDK calls e.g. Lambda to Lambda direct invoke, DynamoDB queries, Publishing to SNS etc

const AWSXRay = require('aws-xray-sdk');
// Wrap AWS SDK with X-Ray
const AWS = AWSXRay.captureAWS(require('aws-sdk'));

exports.handler = async function(event:any) {

  // Create an SQS service object as normal
  var sqs = new AWS.SQS({apiVersion: '2012-11-05'});
Enter fullscreen mode Exit fullscreen mode

External HTTP Requests

You need to wrap the https module with X-Ray:

const AWSXRay = require('aws-xray-sdk');
// Wrap HTTPS module with X-Ray
var https = AWSXRay.captureHTTPs(require('https'));

exports.handler = async function(event:any) {

    // Make a call to a webservice as normal
    const req = https.get("https://url.com", (res:any) => {
Enter fullscreen mode Exit fullscreen mode

Subsegments, Metadata and Annotations

Subsegments

A segment can break down the data about the work done into subsegments. Subsegments provide more granular timing information and details about downstream calls that your application made to fulfill the original request. A subsegment can contain additional details about a call to an AWS service, an external HTTP API, or an SQL database. You can even define arbitrary subsegments to instrument specific functions or lines of code in your application.

I have included some custom subsegments in this pattern, like "external HTTP Request" below:



Subsegment showing in console

These are easy to create inside the Lambda Functions:

const subsegment = segment.addNewSubsegment('external HTTP Request');

let response = await new Promise((resolve:any, reject:any) => {
    // Make a call to a webservice
    const req = https.get("https://url.com", (res:any) => {
        ... //resolve promise
    });
    ... //reject promise
});

subsegment.addMetadata("response", response)
subsegment.close();
Enter fullscreen mode Exit fullscreen mode

Metadata

You are allowed to put whole objects inside metadata, this is brilliant for showing things like the response from a webservice.

Metadata are key-value pairs that can have values of any type, including objects and lists, but are not indexed for use with filter expressions. Use metadata to record additional data that you want stored in the trace but don't need to use with search.



metadata on segment in console

Annotations

Annotations are key-value pairs with string, number, or Boolean values. Annotations are indexed for use with filter expressions. Use annotations to record data that you want to use to group traces in the console, or when calling the GetTraceSummaries API.



annotations on segment in console

X-Ray Groups

X-Ray groups enable customers to slice and dice their X-Ray service graph and focus on certain workflows, applications, or routes.

Customers can create a group by setting a filter expression. All the traces that match the set filter expression will be part of that group. Customers can then view service graphs for the selected group, and understand performance bottlenecks, errors, or faults in services belonging to that service graph.

Deep dive into AWS X-Ray groups and use cases

A useful tip is that Annotations can be used in filter expressions so you can easily create groups for smaller bounded contexts within a larger service map and then create CloudWatch alerts per group.

Known X-Ray "Quirks"

There are a couple of X-Ray quirks that I need to document, I thought it better to show them than refactor the pattern to hide them then you hit one later. If these are a deal breaker for you there are other tools out there for tracing that I have been promised will integrate with no extra code changes like Epsagon

SQS -> Lambda Traces

There is a known bug where this doesn't connect and you end up with two paths on your service map.

I have included some logic inside the SQS subscriber lambda to move an X-Ray custom subsegment trace circle from the new second flow to where it should be but this is a workaround and hopefully that bug gets closed sooner than later.

API Gateway -> SNS through direct integration

X-Ray does work as expected with SNS when using the AWS SDK but for some reason when I do a direct integration with API Gateway through VTL the service map shows the subscribers of the SNS topic as being connected to API GW rather than SNS which is fine because I am not missing information but it's not correct. If I workout a fix for this I will update the pattern.

sns bug

Service Map Changes Design Randomly

Unlike other tracing solutions I have used that let you position all of the circles where they give a feeling of inner calmness based on personal OCD; X-Ray randomly positions the circles on every refresh which can lead to some interesting map layouts. The important thing is being able to spot anomalies which you can definitely still do so this is purely aesthetic.

service map positioning

Top comments (0)