MD. Zeaul Hoque Shuvo

Unlock CloudFront's New Logging Potential with Athena Partition Projection

When I first set out to use Athena for querying CloudFront logs, I thought it would be a breeze. Just set up the Glue database, create a table, and start querying, right? Wrong!! The problem hit me when I realized the log files had a flat structure—no partitions, no hierarchy. Every query ended up scanning massive amounts of data, even when I just needed a small slice of information from the logs.

To make things worse, manually adding partitions for each batch of logs felt like an endless chore. But then came AWS's announcement: Apache Parquet support for CloudFront logs, along with Hive-compatible folder structures in S3. The new logging capabilities provide native log configurations, eliminating the need for custom log processing. That’s when it clicked—if I combined this with Athena Partition Projection, it would be a total breakthrough.


How CloudFront Logs Work: From Then to Now

Previously, CloudFront logs were delivered in plain text (CSV) format. While this format was simple, it wasn’t optimized for querying large datasets. On top of that, logs were delivered to S3 in a flat structure, so organizing the data into partitions required manually re-processing the files into a more structured layout.

Flat file-name structure:

s3://cloudfront--logs/E123ABC456DEF-2024-11-25-12-00-00-abcdef0123456789.gz

You can find the distribution ID and date-time in the file name, but extracting that information and restructuring the log files into partitions requires setting up an automation process behind the scenes.

--

Now, CloudFront can deliver logs in Apache Parquet format.

Parquet is a columnar storage format that improves query performance and reduces storage space significantly.

CloudFront also supports Hive-style partitioning when delivering logs to S3.

Hive-style refers to a folder structure where your data will be organized into directories named after partition keys and their values, like key=value/.

This means your logs are stored in a folder structure like this:

s3://cloudfront--logs/year=2024/month=11/day=25/hour=15/

Even better, you can customize the partitioning field to match your needs. For example, partition by year, month, day, or even by DistributionId:

Example : `{DistributionId}/{yyyy}/{MM}/{dd}/{HH}/`

This flexibility makes querying faster and perfectly tailored to your use case.
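
To make that concrete, here's roughly how the example suffix path maps to S3 prefixes, reusing the sample distribution ID from earlier. As I understand it, enabling Hive-compatible prefixes rewrites each token as a key=value pair:

Without Hive-compatible prefixes:
s3://cloudfront--logs/E123ABC456DEF/2024/11/25/15/

With Hive-compatible prefixes:
s3://cloudfront--logs/DistributionId=E123ABC456DEF/year=2024/month=11/day=25/hour=15/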


Why does "Partition Projection" matter?

Partition Projection is a feature in Athena that automatically understands how your data is organized in S3.

Without partition projection, Athena requires partitions to be loaded explicitly, for example with the MSCK REPAIR TABLE command, which can take several minutes if your bucket contains a lot of data. Instead of manually loading partitions before each query, you simply define your data structure during table creation, and Athena handles the rest automatically.
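
For context, this is the kind of manual housekeeping projection eliminates. A minimal sketch, assuming a cloudfront_logs table partitioned by year, month, day, and hour:

-- Scan S3 and register any missing partitions (slow on large buckets)
MSCK REPAIR TABLE cloudfront_logs;

-- Or register a single partition by hand
ALTER TABLE cloudfront_logs ADD IF NOT EXISTS
  PARTITION (year = '2024', month = '11', day = '25', hour = '15');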

Besides that, developers who query Athena expect to see the latest logs, but sometimes forget to load the partitions, causing logs to appear "missing." This creates chaos and confusion inside your team. As a DevOps engineer, you need to provide a platform that is automated and hassle-free for developers, reducing manual steps and ensuring they can access the latest data without worrying about partition management. That’s where Partition Projection becomes your new friend.


How does "Partition Projection" work?

When you create a table in Athena, you simply describe how your data is structured in your log bucket. For example, if your logs are stored by year, you let Athena know upfront like this:

CREATE EXTERNAL TABLE cloudfront_logs (
  <col_name> <col_type>, ....
)
PARTITIONED BY (year STRING)
STORED AS PARQUET
LOCATION 's3://cloudfront--logs/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2020,2030',
  'storage.location.template' = 's3://cloudfront--logs/year=${year}/'
);

Let’s say you’ve got CloudFront logs stored in your bucket like this:

s3://cloudfront--logs/year=2024/
s3://cloudfront--logs/year=2025/

Then, whenever you query the logs for the year 2024, Athena will only scan the data inside the year=2024 folder and return the matching logs in the query result.
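
A query against that table would look something like this; the year filter is what lets Athena prune everything outside the year=2024 prefix:

-- Only the year=2024 prefix is scanned thanks to partition projection
SELECT *
FROM cloudfront_logs
WHERE year = '2024'
LIMIT 100;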

Let’s talk about the setup:

To set up Hive-style partitions in CloudFront:

When you enable logging, choose Amazon S3 as the destination. Then, enable Hive-compatible prefixes for your log files and choose Parquet as the output format. Here's an example suffix path for partitioning CloudFront logs:

{DistributionId}/{yyyy}/{MM}/{dd}/{HH}/
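
Putting it all together, here's a rough sketch of an Athena table matching that layout. It reuses the bucket and distribution ID from the earlier examples, the columns are placeholders (replace them with the log fields you actually enabled), and the partition-key naming assumes the Hive-compatible option writes each token as a key=value pair, as shown earlier:

CREATE EXTERNAL TABLE cloudfront_logs_v2 (
  -- placeholder columns; replace with your selected log fields
  log_timestamp STRING,
  c_ip STRING,
  sc_status INT,
  cs_uri_stem STRING
)
PARTITIONED BY (
  distributionid STRING,
  year STRING,
  month STRING,
  day STRING,
  hour STRING
)
STORED AS PARQUET
LOCATION 's3://cloudfront--logs/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.distributionid.type' = 'enum',
  'projection.distributionid.values' = 'E123ABC456DEF',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2020,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2',
  'projection.hour.type' = 'integer',
  'projection.hour.range' = '0,23',
  'projection.hour.digits' = '2',
  'storage.location.template' = 's3://cloudfront--logs/DistributionId=${distributionid}/year=${year}/month=${month}/day=${day}/hour=${hour}/'
);

The digits properties keep the projected values zero-padded (month=01 rather than month=1), matching how the {MM}, {dd}, and {HH} tokens are written.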

To effortlessly set up an Athena database and table with Partition Projection enabled using IaC, check out this GitHub repo.


Wrapping It Up

CloudFront logs just got a lot easier to work with. Whether you're using the new Apache Parquet format with Hive-compatible folders or combining it with Athena Partition Projection, you can now query your logs faster, cheaper, and with way less hassle. It’s been a game-changer for me, and I hope it will be for you too. However, it does have a few limitations, so make sure you're aware of them and that they don't clash with your use case. You'll find them in Athena's official AWS documentation.

But that’s not all. You can also deliver CloudFront logs to CloudWatch in JSON or text format, or even use Kinesis Data Firehose to process logs on the fly. AWS has made it super flexible to work with CloudFront logs, so you can choose the setup that works best for you.

Happy logging! 😊
