DEV Community

Habil BOZALİ
Habil BOZALİ

Posted on • Originally published at habil.Medium on

Analyzing CloudFront Logs with Amazon Athena


Photo by Adrien on Unsplash

You are actively using the CloudFront service and noticed an increase in 4xx records in CloudFront statistics. This is usually due to incorrect configurations. However, for more details, you can open CloudFront logs and examine incoming requests. In this article, we will focus on how to perform log review most practically. Let’s start.

Why Athena for CloudFront Logs?

CloudFront generates detailed access logs that contain valuable information about requests made to your distribution. However, these logs are:

  • Delivered as compressed files to your S3 bucket
  • Generated in large volumes
  • Written in a specific format that’s not immediately queryable

Amazon Athena provides a serverless solution to analyze these logs using standard SQL, without the need to set up complex data processing pipelines.

Setting Up the Environment

Let’s break down the process into clear steps:

Step 1: Enable CloudFront Logging

First, ensure that CloudFront logging is enabled for your distribution:

  1. Navigate to the CloudFront console
  2. Select your distribution
  3. Click on the “Behaviors” tab
  4. Edit the behaviour settings and enable logging
  5. Specify an S3 bucket for your logs
{  
 “Logging”: {  
 “Enabled”: true,  
 “IncludeCookies”: false,  
 “Bucket”: “your-logs-bucket.s3.amazonaws.com”,  
 “Prefix”: “cloudfront-logs/”  
 }  
}  
Enter fullscreen mode Exit fullscreen mode

Step 2: Create Athena Database and Table

Once logs are being delivered to your S3 bucket, you need to create a database and table in Athena:

  1. Navigate to the Athena console
  2. Create a new database:
CREATE DATABASE cloudfront_logs;  
Enter fullscreen mode Exit fullscreen mode
  1. Create a table that maps to the CloudFront log format:
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs.cf_access_logs (  
 `date` DATE,  
 time STRING,  
 location STRING,  
 bytes BIGINT,  
 request_ip STRING,  
 method STRING,  
 host STRING,  
 uri STRING,  
 status INT,  
 referrer STRING,  
 user_agent STRING,  
 query_string STRING,  
 cookie STRING,  
 result_type STRING,  
 request_id STRING,  
 host_header STRING,  
 request_protocol STRING,  
 request_bytes BIGINT,  
 time_taken FLOAT,  
 xforwarded_for STRING,  
 ssl_protocol STRING,  
 ssl_cipher STRING,  
 response_result_type STRING,  
 http_version STRING,  
 fle_status STRING,  
 fle_encrypted_fields INT,  
 c_port INT,  
 time_to_first_byte FLOAT,  
 x_edge_detailed_result_type STRING,  
 sc_content_type STRING,  
 sc_content_len BIGINT,  
 sc_range_start BIGINT,  
 sc_range_end BIGINT  
)  
ROW FORMAT DELIMITED   
FIELDS TERMINATED BY ‘\t  
LOCATION s3://your-logs-bucket/cloudfront-logs/  
TBLPROPERTIES (skip.header.line.count=2');  
Enter fullscreen mode Exit fullscreen mode

Note: The table structure follows CloudFront’s log format. Adjust the LOCATION to match your S3 bucket path.

Analyzing 4xx Errors

Now that your environment is set up, let’s write some queries to analyze those 4xx errors:

Query 1: Count of 4xx Errors by Status Code

SELECT   
 status,  
 COUNT(*) as error_count  
FROM   
 cloudfront_logs.cf_access_logs  
WHERE   
 status BETWEEN 400 AND 499  
 AND date BETWEEN DATE 20230101 AND DATE 20230131  
GROUP BY   
 status  
ORDER BY   
 error_count DESC;  
Enter fullscreen mode Exit fullscreen mode

This query will show you the distribution of different 4xx error codes.

Query 2: Top URIs Generating 4xx Errors

SELECT   
 uri,  
 status,  
 COUNT(*) as error_count  
FROM   
 cloudfront_logs.cf_access_logs  
WHERE   
 status BETWEEN 400 AND 499  
 AND date BETWEEN DATE 20230101 AND DATE 20230131  
GROUP BY   
 uri, status  
ORDER BY   
 error_count DESC  
LIMIT 20;  
Enter fullscreen mode Exit fullscreen mode

This helps identify problematic endpoints or resources.

Query 3: Error Distribution by Time

SELECT   
 date,  
 HOUR(from_iso8601_timestamp(concat(date, T, time, Z))) as hour,  
 COUNT(*) as error_count  
FROM   
 cloudfront_logs.cf_access_logs  
WHERE   
 status BETWEEN 400 AND 499  
 AND date BETWEEN DATE 20230101 AND DATE 20230131  
GROUP BY   
 date, HOUR(from_iso8601_timestamp(concat(date, T, time, Z)))  
ORDER BY   
 date, hour;  
Enter fullscreen mode Exit fullscreen mode

This query helps identify patterns or spikes in errors by time.

Query 4: 4xx Errors by User Agent

SELECT   
 REGEXP_EXTRACT(user_agent, ([^/]+)) as browser,  
 COUNT(*) as error_count  
FROM   
 cloudfront_logs.cf_access_logs  
WHERE   
 status BETWEEN 400 AND 499  
 AND date BETWEEN DATE 20230101 AND DATE 20230131  
GROUP BY   
 REGEXP_EXTRACT(user_agent, ([^/]+))  
ORDER BY   
 error_count DESC  
LIMIT 10;  
Enter fullscreen mode Exit fullscreen mode

This can help identify if errors are related to specific browsers or bots.

Query 5: Geographic Distribution of Errors

If you have enabled location fields in your logs:

SELECT   
 location,  
 COUNT(*) as error_count  
FROM   
 cloudfront_logs.cf_access_logs  
WHERE   
 status BETWEEN 400 AND 499  
 AND date BETWEEN DATE 20230101 AND DATE 20230131  
GROUP BY   
 location  
ORDER BY   
 error_count DESC  
LIMIT 10;  
Enter fullscreen mode Exit fullscreen mode

Advanced Analysis: Finding Root Causes

Let’s dig deeper to find patterns that might explain the increase in 4xx errors:

Query 6: Analyze Referrer Patterns

SELECT   
 referrer,  
 COUNT(*) as error_count  
FROM   
 cloudfront_logs.cf_access_logs  
WHERE   
 status BETWEEN 400 AND 499  
 AND date BETWEEN DATE 20230101 AND DATE 20230131  
 AND referrer != -  
GROUP BY   
 referrer  
ORDER BY   
 error_count DESC  
LIMIT 20;  
Enter fullscreen mode Exit fullscreen mode

This can identify broken links from other websites.

Query 7: Identify Common Query Parameters in Failed Requests

SELECT   
 query_string,  
 COUNT(*) as error_count  
FROM   
 cloudfront_logs.cf_access_logs  
WHERE   
 status BETWEEN 400 AND 499  
 AND date BETWEEN DATE 20230101 AND DATE 20230131  
 AND query_string != -  
GROUP BY   
 query_string  
ORDER BY   
 error_count DESC  
LIMIT 20;  
Enter fullscreen mode Exit fullscreen mode

This may reveal issues with specific parameters being passed to your application.

Creating Dashboards with QuickSight

For ongoing monitoring, you can connect Amazon QuickSight to your Athena queries:

  1. Navigate to the QuickSight console
  2. Create a new analysis
  3. Select Athena as the data source
  4. Choose your cloudfront_logs database and cf_access_logs table
  5. Build visualizations based on the queries we’ve explored

Some useful visualizations include:

  • Line chart of 4xx errors over time
  • Bar chart of top error-generating URIs
  • Pie chart of error status code distribution
  • Heat map of errors by hour and day

Optimizing Athena Queries

CloudFront logs can grow large, making Athena queries expensive. Here are some optimization tips:

Partitioning Your Table

For more efficient queries, consider partitioning your table by date:

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs.cf_access_logs_partitioned (  
 `time` STRING,  
 location STRING,  
 bytes BIGINT,  
 — other fields as in the previous definition  
)  
PARTITIONED BY (`date` DATE)  
ROW FORMAT DELIMITED   
FIELDS TERMINATED BY ‘\t  
LOCATION s3://your-logs-bucket/cloudfront-logs/  
TBLPROPERTIES (skip.header.line.count=2');  
Enter fullscreen mode Exit fullscreen mode

Then load partitions:

MSCK REPAIR TABLE cloudfront_logs.cf_access_logs_partitioned;  
Enter fullscreen mode Exit fullscreen mode

Or add partitions manually:

ALTER TABLE cloudfront_logs.cf_access_logs_partitioned ADD  
PARTITION (`date`=20230101') LOCATION ‘s3://your-logs-bucket/cloudfront-logs/2023–01–01/’  
PARTITION (`date`=’2023–01–02') LOCATION s3://your-logs-bucket/cloudfront-logs/20230102/;  
Enter fullscreen mode Exit fullscreen mode

Convert to Columnar Format

Convert your data to a columnar format like Parquet for better performance:

CREATE TABLE cloudfront_logs.cf_access_logs_parquet  
WITH (  
 format = PARQUET,  
 parquet_compression = SNAPPY,  
 external_location = s3://your-logs-bucket/cloudfront-logs-parquet/  
) AS  
SELECT * FROM cloudfront_logs.cf_access_logs  
WHERE date BETWEEN DATE 20230101 AND DATE 20230131;  
Enter fullscreen mode Exit fullscreen mode

Common Troubleshooting Patterns

Based on the analysis of 4xx errors, here are common issues to check:

  1. 404 errors: Check for recently removed resources or broken links in your application
  2. 403 errors: Review CloudFront distribution settings, especially origin access identity configuration
  3. 400 errors: Look for malformed requests, possibly from outdated clients or bots
  4. 401 errors: Check authentication mechanisms and token expiration settings

Conclusion

By using Amazon Athena to analyze CloudFront logs, you can quickly identify the root causes of 4xx errors and take corrective actions. This serverless approach eliminates the need for complex ETL processes while providing powerful SQL-based analysis capabilities.

Remember to optimize your queries and table structure as your log volume grows to keep costs manageable and queries performant.

I hope this article helps you diagnose and resolve your CloudFront 4xx errors efficiently. Happy troubleshooting! 👻

Top comments (0)