Are you happy with your CloudWatch bill? If you have one or more high-traffic Lambda functions, you pay for their duration and number of requests per month. However, there is a chance these functions are also increasing your CloudWatch costs - and this is why you need Advanced Logging Controls.
In this article, I explain how to optimize CloudWatch costs while respecting compliance requirements, leveraging a simple AWS Systems Manager Automation Runbook to achieve full control of your logs.
Introduction
CloudWatch log groups have two storage classes, and their costs vary from region to region. According to the CloudWatch pricing page, in Ireland (eu-west-1):
- Standard class, which supports all log group features (a subscription filter, for example), costs $0.57 per GB ingested.
- Infrequent-Access class, which has limited features (you cannot have a subscription filter in place, and the data can only be queried using Logs Insights), costs $0.285 per GB ingested.
Knowing that you pay per GB of data ingested into a log group, how can you determine the size of a message?
The short answer: measuring every message before sending it to CloudWatch is not very practical, and there are situations where you cannot (or do not want to) reduce its size. Instead, you can control what kind of data is ingested into log groups, when, and for how long - and this is where the benefits are.
It is all about bytes
Whenever your Lambda function sends logs to a log group, it uses the CloudWatch sub-feature of Collect (data ingestion). Under the hood, your Lambda calls the PutLogEvents API from CloudWatch, generating a metric called DataProcessing-Bytes for the Standard class and DataProcessingIA-Bytes for Infrequent-Access. Then, based on these two metrics, AWS creates your bill (more about monitoring these metrics later in this article).
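For illustration, this is roughly the call Lambda makes on your behalf, sketched with the AWS SDK for JavaScript v3 (the log group and stream names are hypothetical; in practice the Lambda service does this for you):

import {
  CloudWatchLogsClient,
  PutLogEventsCommand,
} from '@aws-sdk/client-cloudwatch-logs';

const client = new CloudWatchLogsClient({ region: 'eu-west-1' });

// Every byte of `message` below counts towards DataProcessing-Bytes.
await client.send(
  new PutLogEventsCommand({
    logGroupName: '/aws/lambda/advanced-logging-control', // hypothetical
    logStreamName: '2025/01/01/[$LATEST]abcdef123456', // hypothetical
    logEvents: [{ timestamp: Date.now(), message: 'Hello from Lambda!' }],
  })
);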
To put this into perspective, let's imagine that you have a high-traffic Lambda function that executes 5 million times a day. Below are some examples showing how many bytes a message can generate and how much it can cost you at scale.
Application logs
Application logs are custom messages generated by your Lambda function: the event received, debug messages you add to your code along the way, and so on.
Imagine that your high-traffic Lambda function handles API Gateway requests. For some reason, you want to log the event received. Let's have a look at an example of that payload:
{
"resource": "/my/path",
"path": "/my/path",
"httpMethod": "GET",
"headers": {
"header1": "value1",
"header2": "value1,value2"
},
"multiValueHeaders": {
"header1": [
"value1"
],
"header2": [
"value1",
"value2"
]
},
"queryStringParameters": {
"parameter1": "value1,value2",
"parameter2": "value"
},
"multiValueQueryStringParameters": {
"parameter1": [
"value1",
"value2"
],
"parameter2": [
"value"
]
},
"requestContext": {
"accountId": "123456789012",
"apiId": "id",
"authorizer": {
"claims": null,
"scopes": null
},
"domainName": "id.execute-api.us-east-1.amazonaws.com",
"domainPrefix": "id",
"extendedRequestId": "request-id",
"httpMethod": "GET",
"identity": {
"accessKey": null,
"accountId": null,
"caller": null,
"cognitoAuthenticationProvider": null,
"cognitoAuthenticationType": null,
"cognitoIdentityId": null,
"cognitoIdentityPoolId": null,
"principalOrgId": null,
"sourceIp": "IP",
"user": null,
"userAgent": "user-agent",
"userArn": null,
"clientCert": {
"clientCertPem": "CERT_CONTENT",
"subjectDN": "www.example.com",
"issuerDN": "Example issuer",
"serialNumber": "a1:a1:a1:a1:a1:a1:a1:a1:a1:a1:a1:a1:a1:a1:a1:a1",
"validity": {
"notBefore": "May 28 12:30:02 2019 GMT",
"notAfter": "Aug 5 09:36:04 2021 GMT"
}
}
},
"path": "/my/path",
"protocol": "HTTP/1.1",
"requestId": "id=",
"requestTime": "04/Mar/2020:19:15:17 +0000",
"requestTimeEpoch": 1583349317135,
"resourceId": null,
"resourcePath": "/my/path",
"stage": "$default"
},
"pathParameters": null,
"stageVariables": null,
"body": "Hello from Lambda!",
"isBase64Encoded": false
}
This message generates 1,903 bytes. If this Lambda executes an average of 5 million times a day, by the end of the month the ingested data will be 274.66GB (1,903 x 5,000,000 = 8.86GB a day, times 31 = 274.66GB).
Looking at the pricing for the Standard class, by the end of the month this single log statement from a single Lambda costs you $156.55 (274.66GB x $0.57).
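If you want to reproduce these numbers, here is a small back-of-the-envelope helper in TypeScript (a sketch; the price and the 31-day month are assumptions taken from the figures above):

// Estimate the monthly ingestion cost of a recurring log message.
const GIB = 1024 ** 3;

function monthlyIngestionCostUsd(
  bytesPerInvocation: number,
  invocationsPerDay: number,
  pricePerGb = 0.57, // Standard class, eu-west-1
  daysPerMonth = 31,
): number {
  const gb = (bytesPerInvocation * invocationsPerDay * daysPerMonth) / GIB;
  return gb * pricePerGb;
}

console.log(monthlyIngestionCostUsd(1903, 5_000_000)); // ~$156/month (this example)
console.log(monthlyIngestionCostUsd(262, 5_000_000)); // ~$21.6/month (system logs, below)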
If you think this is an undesired cost - and somehow you want to keep logging these messages for debugging purposes, you need Advanced Logging Controls.
System logs
System logs are log messages generated by the Lambda service itself. For instance, Lambda reports the duration, billed duration, memory size, and the start and end of each invocation. Even if your Lambda code does not generate any application logs, by default the system logs will always appear in your log group.
Let's have a look at a report message generated by a warm Lambda function in plain text (a cold Lambda has a different report message):
START RequestId: 1716e630-0997-4bd6-aae3-0f681ef1e69c Version: $LATEST
END RequestId: 1716e630-0997-4bd6-aae3-0f681ef1e69c
REPORT RequestId: 1716e630-0997-4bd6-aae3-0f681ef1e69c Duration: 1.80 ms Billed Duration: 2 ms Memory Size: 128 MB Max Memory Used: 68 MB
The logs created by one warm Lambda execution generate 262 bytes. If this Lambda executes an average of 5 million times a day (not considering cold starts), by the end of the month the ingested data will be 37.82GB (262 x 5,000,000 = 1.22GB a day, times 31 = 37.82GB).
Looking at the pricing for the Standard class, by the end of the month these report messages from a single Lambda cost you $21.55 (37.82GB x $0.57).
Not a fortune, right? The important question to ask yourself is: do you need these log messages? Commonly, you would want to analyze a Lambda's duration and memory consumption to reduce execution costs, for example. But if you don't actively look at these messages, it is wise to turn them down to optimize your costs at scale.
Measuring the size of a log message
How can you calculate the size, in bytes, of a message sent to CloudWatch, as I did for the previous examples?
You can use the same query I've used in CloudWatch Logs Insights (uncomment the filter line to measure only the REPORT messages, for example):
FIELDS @message
# | filter @message like 'REPORT'
| STATS sum(strlen(@message)) AS ingested_bytes
ATTENTION: Querying data using Logs Insights can also be expensive. For each GB of data scanned, you pay $0.0057 (eu-west-1). Make sure you run this query on a log group that does not retain much data, or only over a short time window.
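Alternatively, you can measure a message locally, before it is ever sent, and avoid the scan cost entirely. A minimal sketch in TypeScript (the sample payload below is hypothetical; ingestion is billed on UTF-8 bytes):

// Rough local size check for a log payload before it reaches CloudWatch.
// Buffer.byteLength counts UTF-8 bytes, which is what ingestion bills on,
// while strlen in Logs Insights counts characters.
const sampleEvent = { httpMethod: 'GET', path: '/my/path' }; // hypothetical payload
const message = JSON.stringify(sampleEvent);
console.log(`~${Buffer.byteLength(message, 'utf8')} bytes per log line`);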
Finding the perfect balance
There is always a balance between compliance and cost optimization. This is especially true in large organizations, where there might be constraints related to:
- Retention period, as in how long the data must be stored to be compliant with the company's policies
- The type of data ingested, which is more often related to the company's technical guidelines for specific AWS services
The important aspect of this trade-off is being able to negotiate. By clearly identifying these constraints alongside the analysis of cost reduction, you have better arguments to start a negotiation that benefits all parties involved.
Setting up a retention policy
Knowing how long you have to keep the data in a log group, you can start by setting up a retention policy for your log groups.
Besides DataProcessing-Bytes, there is another, less expensive metric called TimedStorage-ByteHrs, which measures how long the data is stored - something AWS also charges you for. The cost for the Ireland region is $0.03 per GB of compressed data (AWS assumes a 0.15 compression ratio for each uncompressed byte).
Having a retention policy can help your application optimize some costs and also be more sustainable. This is especially relevant when using the AWS CDK (Cloud Development Kit) as your IaC (Infrastructure as Code) tool, where a Lambda function's log group defaults to never expiring the data stored. If you don't use the data after some period, it's best to set up a retention policy.
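To gauge what retention actually costs, here is a rough sketch reusing the assumptions above (0.15 compression ratio, $0.03 per compressed GB-month in eu-west-1):

// Rough monthly storage cost for data retained in a log group.
function monthlyStorageCostUsd(
  ingestedGbPerMonth: number,
  compressionRatio = 0.15, // AWS's assumed compression
  pricePerGbMonth = 0.03, // eu-west-1, Standard class
): number {
  return ingestedGbPerMonth * compressionRatio * pricePerGbMonth;
}

// The 274.66GB application-log example adds roughly $1.24/month while retained.
console.log(monthlyStorageCostUsd(274.66));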
Below you can find an example, in TypeScript, of how to override that default when creating a log group, setting the retention policy you define:
import { LogGroup, RetentionDays } from 'aws-cdk-lib/aws-logs';

new LogGroup(this, 'MyLogGroup', {
  logGroupName: '/aws/lambda/advanced-logging-control',
  retention: RetentionDays.ONE_WEEK,
});
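If the log group already exists (for example, one Lambda created automatically), a sketch of applying the same policy with the AWS SDK for JavaScript v3:

import {
  CloudWatchLogsClient,
  PutRetentionPolicyCommand,
} from '@aws-sdk/client-cloudwatch-logs';

const logs = new CloudWatchLogsClient({ region: 'eu-west-1' });

// Keep only the last 7 days of data in an existing log group.
await logs.send(
  new PutRetentionPolicyCommand({
    logGroupName: '/aws/lambda/advanced-logging-control',
    retentionInDays: 7,
  })
);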
Instrumenting your code
One common mistake when developing a Lambda function in JavaScript (whether or not you use TypeScript) is to log everything using the console.log method. To fully leverage Advanced Logging Controls and be able to manipulate the log level, it is important to use the correct methods when logging with JavaScript.
Something not very well-known is that, natively, the Lambda logging configuration has 6 application log levels you can define: TRACE, DEBUG, INFO, WARN, ERROR, and FATAL.
I have tested all the JavaScript console methods against a CloudWatch log group, and I've listed below a few examples from the console object documentation and how they match the Lambda application log-level configuration. Depending on the log level you set, different console methods will reach your CloudWatch log group. Using them properly makes your Lambda function ready to switch the log level on the fly:
- INFO level shows messages from all JavaScript console methods, except for console.debug.
- WARN level shows messages from console.warn, console.assert, and console.error.
- ERROR level shows messages from console.error only.
- TRACE level shows messages from all JavaScript console methods.
- DEBUG level shows messages from all JavaScript console methods.
- FATAL level does not show any messages from any JavaScript console method.
Using the proper JavaScript console methods for specific situations not only makes your Lambda function ready for Advanced Logging Controls but also keeps your code aligned with the semantics of the console API.
Switching log level on the fly
Now, let's get down to business and see how to natively switch the log level of your Lambda function when needed.
Initial setup
I've created a very simple Lambda function with a few examples of using the JavaScript console methods:
export const handler = async (
event: any,
): Promise<string> => {
console.log('Received event:', JSON.stringify(event, null, 2));
console.info('Info: Processing event');
console.debug('Debug: Event details', event);
console.warn('Warning: This is a sample warning message');
console.error('Error: This is a sample error message');
console.assert(false, 'Assert: This is a sample assert message');
return 'Hello World!';
};
Using CDK, I set the System Log Level to WARN and the Application Log Level to ERROR:
import { ApplicationLogLevel, LoggingFormat, SystemLogLevel } from 'aws-cdk-lib/aws-lambda';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';

new NodejsFunction(this, 'AdvancedLoggingControlFunction', {
  functionName: 'advanced-logging-control',
  entry: 'src/hello-world/handler.ts',
  handler: 'handler',
  // set application log level to ERROR and system log level to WARN
  applicationLogLevelV2: ApplicationLogLevel.ERROR,
  systemLogLevelV2: SystemLogLevel.WARN,
  // logging format must be set to JSON
  loggingFormat: LoggingFormat.JSON,
});
IMPORTANT: When changing the System Log Level and Application Log Level, the logging format must be set to JSON (defaults to Text).
This means that I am overriding CDK's default logging configuration and reducing the messages sent to the CloudWatch log group. According to the explanation in the section above, now only console.error messages will show up in my CloudWatch log group.
If I execute this Lambda function, this is what I have in my CloudWatch Log Group:
Note that there are no system log messages (for instance, the Lambda report) because I set the System Log Level to WARN, and only the console.error method reaches the log group because the Application Log Level is ERROR.
By only logging what you need, you can significantly reduce the costs of the DataProcessing-Bytes metric for a high-traffic Lambda Function and stop polluting your log group.
However, there will be situations where you want your Lambda function to be more verbose - to debug something, or to have extra information in your log group. Instead of always logging everything from the start, you might want to log the least detail possible and change it afterward, updating your logging configuration to DEBUG (the most detail) when needed:
aws lambda update-function-configuration \
--function-name advanced-logging-control \
--logging-config LogFormat=JSON,ApplicationLogLevel=DEBUG
NOTE: By not specifying the System Log Level in the parameters, AWS will automatically set it to INFO.
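If you prefer doing the same programmatically, a sketch with the AWS SDK for JavaScript v3 (the function name comes from the example above):

import {
  LambdaClient,
  UpdateFunctionConfigurationCommand,
} from '@aws-sdk/client-lambda';

const lambda = new LambdaClient({ region: 'eu-west-1' });

// Temporarily switch the function to its most verbose application logging.
await lambda.send(
  new UpdateFunctionConfigurationCommand({
    FunctionName: 'advanced-logging-control',
    LoggingConfig: {
      LogFormat: 'JSON', // required for log-level controls
      ApplicationLogLevel: 'DEBUG',
      SystemLogLevel: 'INFO', // the value AWS defaults to when omitted
    },
  })
);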
After the change, executing the Lambda function now gives us more messages in the log group:
As mentioned before, the DEBUG level will show messages from all JavaScript console methods.
Automating the process
Executing a CLI command every time you need to debug something is not very handy. More than that, you might want to execute this change in a more controlled and auditable way, especially when you do not have (or should not have) elevated privileges to do so.
To improve operational excellence, I have created a simple Systems Manager Automation Runbook, so this change can be executed in the AWS environment without manually updating the Lambda configuration using your own IAM permissions. Instead, I have created an IAM role to be assumed by the SSM Automation Runbook. Plus, by doing that, the changes are recorded by CloudTrail for compliance reasons.
More importantly, you want to analyze the Lambda function for a short period and reset it to the previous configuration to avoid extra costs with the DataProcessing-Bytes metric.
First of all, let's define the IAM role to be assumed by the Automation:
import { Aws } from 'aws-cdk-lib';
import { PolicyStatement, Role, ServicePrincipal } from 'aws-cdk-lib/aws-iam';

const automationIamRole = new Role(this, 'AutomationIamRole', {
  assumedBy: new ServicePrincipal('ssm.amazonaws.com'),
});

automationIamRole.addToPolicy(
  new PolicyStatement({
    actions: [
      'lambda:GetFunctionConfiguration',
      'lambda:UpdateFunctionConfiguration',
    ],
    resources: [`arn:aws:lambda:${Aws.REGION}:${Aws.ACCOUNT_ID}:function:*`],
  })
);
NOTE: Achieving least privilege here can be a challenge because the Automation does not know upfront which Lambda function will be manipulated. Do you have an idea of how this can be solved? Let me know in the comments!
The Automation Runbook definition in CDK looks like this:
new CfnDocument(this, 'ModifyLambdaLogLevelDocument', {
documentType: "Automation",
name: 'ModifyLambdaLogLevelDocument',
documentFormat: "YAML",
updateMethod: "NewVersion",
content: {
schemaVersion: "0.3",
description: "Modify the log level of a Lambda function temporarily. After 10 minutes, the log level will be reset to the original value.",
assumeRole: automationIamRole.roleArn,
parameters: {
FunctionName: {
type: "String",
description: "The name of the Lambda Function",
},
LogLevel: {
type: "String",
description: "The log level to set",
allowedValues: [
"DEBUG",
"INFO",
"WARN",
],
},
Reason: {
type: "String",
description: "The reason for the change",
},
},
mainSteps: [
{
name: "GetCurrentLoggingConfig",
action: "aws:executeAwsApi",
inputs: {
Service: "Lambda",
Api: "getFunctionConfiguration",
FunctionName: "{{FunctionName}}",
},
outputs: [
{
Name: "CurrentLoggingConfig",
Selector: "$.LoggingConfig",
Type: "StringMap",
},
],
},
{
name: "ModifyLogLevel",
action: "aws:executeAwsApi",
inputs: {
Service: "Lambda",
Api: "updateFunctionConfiguration",
FunctionName: "{{FunctionName}}",
Description: "Update log level to {{LogLevel}}",
LoggingConfig: {
ApplicationLogLevel: "{{LogLevel}}",
LogFormat: "JSON",
SystemLogLevel: "{{LogLevel}}",
},
},
},
{
name: "Wait10Minutes",
action: "aws:sleep",
inputs: {
Duration: "PT10M",
},
},
{
name: "ResetLogLevel",
action: "aws:executeAwsApi",
inputs: {
Service: "Lambda",
Api: "updateFunctionConfiguration",
FunctionName: "{{FunctionName}}",
Description: "Reset log level to original value",
LoggingConfig: "{{GetCurrentLoggingConfig.CurrentLoggingConfig}}",
},
}
]
},
});
This document executes the following steps:
- Gets the current logging configuration of the provided Lambda function and saves it in a variable. This value is used in the last step to reset the logging configuration to its original value.
- Calls the API to update the Lambda logging configuration with the provided log level. It accepts DEBUG, INFO, and WARN, which are the most relevant levels to change on the fly.
- Sleeps for 10 minutes before changing it back. If 10 minutes is not enough time to debug, consider increasing the time or receiving the duration as a parameter in the document.
- Resets the log level to its original value.
NOTE: It is also possible to add an approval step to be approved by one of your colleagues, although I have not included that in this example.
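To trigger the runbook, you can use the Systems Manager console or start it programmatically. A sketch with the AWS SDK for JavaScript v3, using the document name from the CDK definition above (the parameter values are examples):

import {
  SSMClient,
  StartAutomationExecutionCommand,
} from '@aws-sdk/client-ssm';

const ssm = new SSMClient({ region: 'eu-west-1' });

// Start the runbook; it runs under the automation IAM role defined earlier.
const { AutomationExecutionId } = await ssm.send(
  new StartAutomationExecutionCommand({
    DocumentName: 'ModifyLambdaLogLevelDocument',
    Parameters: {
      FunctionName: ['advanced-logging-control'],
      LogLevel: ['DEBUG'],
      Reason: ['Investigating intermittent errors'], // example reason
    },
  })
);
console.log(`Automation started: ${AutomationExecutionId}`);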
The full example, including the Lambda function and the SSM Automation Runbook, can be found here.
Monitoring CloudWatch costs
Using the Cost Explorer console can give you an overview of the DataProcessing-Bytes and TimedStorage-ByteHrs metrics and how their costs increase or decrease over time.
You can select both of them under the "Usage Type" filter:
Conclusion
Managing CloudWatch costs efficiently is crucial, especially for high-traffic Lambda functions. By implementing Advanced Logging Controls, you can significantly reduce unnecessary log ingestion and storage costs while maintaining compliance.
Key takeaways:
- Monitor metrics like DataProcessing-Bytes and TimedStorage-ByteHrs to track expenses.
- Ensure logs are only stored for the necessary period to avoid excessive storage fees.
- Leverage Lambda’s built-in log levels to filter out unnecessary logs and avoid polluting CloudWatch.
- Utilize AWS Systems Manager Automation Runbooks to temporarily adjust log levels when debugging, without requiring constant manual intervention.
- Use Cost Explorer to track trends and make informed decisions on further optimizations.
By adopting these practices, you can strike the right balance between compliance and cost efficiency, ensuring that your CloudWatch bills remain manageable while still providing the insights your applications need.