Pizofreude

Posted on Feb 11

Study Notes: Data Query on S3 Bucket Using Athena

#dataengineering #dezoomcamp #athena #aws

Introduction to Athena

Athena is a serverless interactive query service that allows users to query data stored in S3 using standard SQL. It eliminates the need for setting up a database and is highly scalable. Athena is ideal for analyzing structured, semi-structured, and unstructured data stored in S3 without requiring complex ETL (Extract, Transform, Load) processes.

Key Features of Athena

Serverless (no need for infrastructure setup or management).
Uses standard SQL for querying.
Works directly with data stored in S3.
Scales automatically for queries.
Charges based on the amount of data scanned per query.
Supports a wide range of data formats, including CSV, JSON, Parquet, ORC, and Avro.

Steps to Query S3 Data Using Athena

1. Preparing the Data

Before querying data with Athena, it must be stored in an S3 bucket. For this example, we use a Netflix TV Shows and Movies Dataset downloaded from Kaggle. The dataset is in CSV format and contains details such as show_id, title, director, country, release_year, etc.

2. Uploading Data to S3

Log in to AWS Console and navigate to S3.
Create an S3 Bucket (if not already available).
Inside the bucket, create a new folder (e.g., netflix-data).
Click on Upload, select the CSV file, and confirm the upload.
Verify that the file is successfully uploaded.

3. Setting Up Athena

Go to the AWS Athena Console and click Launch Query Editor.
Before querying, a table must be created in Athena. The best way to create a table is through AWS Glue Crawler, which automatically infers the schema and creates a table.

4. Creating a Table Using AWS Glue Crawler

Navigate to the AWS Glue Console.
Click on Crawlers > Add Crawler.
Name the crawler (e.g., netflix_data_crawler).
Select Data Source as S3 and specify the folder path (e.g., s3://your-bucket/netflix-data/).
Choose IAM Role:
- If you have an existing IAM role with S3 and Glue permissions, select it.
- If not, create a new IAM role ensuring it has permissions for S3 read access and Glue service role policies.
Choose Run on Demand as the frequency.
Create a new database (e.g., netflix_db) to store the table.
Click Finish and Run the Crawler.
Once completed, verify that a table has been created inside the netflix_db database.

5. Configuring Athena for Query Execution

Before running queries, Athena requires an output location in S3 to store query results:

Click on Settings in the Athena Console.
Under Query result location, specify an S3 bucket path (e.g., s3://your-bucket/query-results/).
Save changes.

6. Running SQL Queries in Athena

Now that the table is created, you can start querying data using SQL commands.

Basic Queries

Preview Data:

SELECT * FROM netflix_db.netflix_data LIMIT 10;

Filter by Director:

SELECT * FROM netflix_db.netflix_data WHERE director = 'Shamsul Yusof';

Filter by Country:

SELECT * FROM netflix_db.netflix_data WHERE country = 'Malaysia';

Count Total Shows and Movies:

SELECT type, COUNT(*) AS total_count
FROM netflix_db.netflix_data
GROUP BY type;

Find the Most Recent Releases:

SELECT title, release_year
FROM netflix_db.netflix_data
ORDER BY release_year DESC
LIMIT 5;

7. Optimizing Athena Queries

To reduce query costs and improve performance:

Partitioning: Partition data by frequently filtered columns like release_year to improve query speed.
Use Columnar Formats: Store data in Parquet or ORC instead of CSV/JSON for efficient queries.
Compress Data: Use GZIP or Snappy compression for storage efficiency.
Avoid SELECT **: Instead of selecting all columns, query only required fields to reduce data scanning.
Preprocess Data: Clean and normalize data before querying using AWS Glue ETL.

8. Key Benefits of Using Athena

No Infrastructure Management: Fully managed, no need to provision or maintain databases.
Cost-Effective: Pay only for the data scanned.
Fast Queries: Optimized for quick execution with proper indexing and partitioning.
Integration with AWS Services: Works seamlessly with AWS Glue, S3, QuickSight, and Redshift Spectrum.
Secure & Scalable: Uses IAM roles and policies for secure access control.

Conclusion

Amazon Athena is a powerful tool for analyzing data stored in S3 without requiring database migrations. By leveraging AWS Glue Crawlers, users can automatically infer schema and query large datasets efficiently using standard SQL. For best performance and cost efficiency, consider using data partitioning, columnar storage, and compression techniques.

With this guide, even beginners can start using Athena for data analysis on S3 without complex infrastructure setups.

Next Steps

Explore Amazon QuickSight for visualizing Athena query results.
Learn about AWS Glue ETL for preprocessing large datasets.
Experiment with Athena Federated Queries to query data across multiple sources like RDS and DynamoDB.

DEV Community