In sports analytics, the ability to process and analyze vast amounts of data in real time has become a game-changer. Being able to ingest, store, and query large datasets of NBA statistics seamlessly, while enjoying the scalability and cost-efficiency of a serverless architecture, is a powerful combination.
In this project, we’ll explore how to build a Serverless NBA Data Lake Application using API Gateway, AWS Lambda, Amazon S3, AWS Glue, and Amazon Athena — all orchestrated with Terraform.
System Architecture Overview
The architecture leverages the following components:
• Amazon S3: Serves as the central data lake for storing raw, processed, and curated NBA data in JSON format.
• AWS Lambda: A Lambda function fetches NBA data from sportsdata.io, formats it, and uploads it to Amazon S3.
• Amazon API Gateway: Provides a RESTful API that triggers the Lambda function to fetch NBA data from sportsdata.io and upload it to an S3 bucket.
• AWS Glue: Automatically discovers the data stored in S3 and catalogs it into a schema, using the Glue Data Catalog and a Glue crawler, for efficient querying.
• Amazon Athena: Enables serverless querying of the data lake using standard SQL, allowing users to retrieve insights from the curated NBA data, and stores query results in an Amazon S3 bucket.
Prerequisites:
• AWS account with the required access and permissions to configure services such as Lambda, S3, Glue, API Gateway, and Athena.
• Experience with programming languages supported by AWS Lambda, such as Python.
• Terraform installed on your local machine.
• AWS CLI Installed and configured on your local machine.
Define Your Lambda Function
We will develop a Python script for our Lambda function that retrieves NBA data from sportsdata.io, processes it, and uploads it to Amazon S3. The complete Python code is available in the repository; a simplified sketch is shown below.
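The sketch below shows the general shape of such a handler. It is a minimal illustration, not the repository's actual script: the environment variable names, the S3 object key, and the sportsdata.io endpoint path are all assumptions here.

```python
import json
import os
import urllib.request

import boto3  # included in the AWS Lambda Python runtime

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Illustrative names: the real script in the repo may use different
    # environment variables, endpoint, and object key.
    api_key = os.environ["SPORTSDATA_API_KEY"]
    bucket = os.environ["BUCKET_NAME"]
    url = f"https://api.sportsdata.io/v3/nba/scores/json/Players?key={api_key}"

    # Fetch raw NBA data from sportsdata.io.
    with urllib.request.urlopen(url) as response:
        players = json.loads(response.read())

    # Format the payload and upload it to the data lake bucket.
    body = json.dumps(players, indent=2)
    key = "raw-data/nba_player_data.json"
    s3.put_object(Bucket=bucket, Key=key, Body=body)

    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Uploaded {len(players)} records to s3://{bucket}/{key}"}),
    }
```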
Terraform Configuration
We will use Terraform modules for this deployment to ensure modularity, reusability, and maintainability in our infrastructure as code. Each folder in the modules directory defines the infrastructure configuration required to deploy a specific AWS service, as outlined below (a sketch of how the root module wires these together follows the list).
• API Gateway Module: Deploys an API Gateway that serves as the trigger for the Lambda function to retrieve data from sportsdata.io and upload it to Amazon S3.
• iam_role Module: Defines the permissions that allow Lambda to retrieve NBA data and upload it to Amazon S3, and that allow API Gateway to invoke the Lambda function.
• Lambda Module: Archives the Python code into a zip file and creates the Lambda function that retrieves NBA data from sportsdata.io, processes it, and uploads it to Amazon S3.
• S3 Module: Creates the Amazon S3 bucket used to store the data retrieved from sportsdata.io by the Lambda function.
• Glue Module: Creates the Glue Catalog database, Glue crawler, and Glue table, which automatically discover the data stored in S3 and catalog it into a schema for efficient querying.
• Athena Module: Creates an Athena workgroup that enables serverless querying of the sports data lake stored in S3 using standard SQL.
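To show how these modules fit together, here is a minimal sketch of what the root main.tf wiring might look like. The module inputs and outputs below are illustrative assumptions; the actual names live in the repository.

```hcl
# Illustrative root-module wiring; variable and output names are assumptions.
module "s3" {
  source      = "./modules/s3"
  bucket_name = var.bucket_name
}

module "iam_role" {
  source     = "./modules/iam_role"
  bucket_arn = module.s3.bucket_arn
}

module "lambda" {
  source      = "./modules/lambda"
  role_arn    = module.iam_role.lambda_role_arn
  bucket_name = module.s3.bucket_name
  api_key     = var.api_key
}

module "api_gateway" {
  source               = "./modules/api_gateway"
  lambda_invoke_arn    = module.lambda.invoke_arn
  lambda_function_name = module.lambda.function_name
}

module "glue" {
  source      = "./modules/glue"
  bucket_name = module.s3.bucket_name
}

module "athena" {
  source          = "./modules/athena"
  output_location = "s3://${module.s3.bucket_name}/athena-results/"
}
```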
Check the link below for the full Terraform configuration:
https://github.com/OjoOluwagbenga700/sport-data-lake.git
Step 1: Clone the Terraform Code
By cloning the Terraform code, we'll have access to the infrastructure-as-code configurations needed for our deployment process.
Clone Repository: Use the git clone command to clone the Terraform code repository to your local machine. Ensure that you have Git installed and configured on your system.
https://github.com/OjoOluwagbenga700/sport-data-lake.git
Change directory into the sport-data-lake folder, then update the terraform.tfvars file with your API key from sportsdata.io.
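Put together, the steps look like this (the terraform.tfvars variable name shown is an assumption; check the file in the repository for the exact name):

```bash
git clone https://github.com/OjoOluwagbenga700/sport-data-lake.git
cd sport-data-lake

# Set your sportsdata.io API key in terraform.tfvars, for example:
# api_key = "YOUR_SPORTSDATA_API_KEY"
```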
Step 2: Running Terraform Commands
terraform init: Initialize Terraform in the project directory to download the necessary plugins and modules.
terraform plan: Generate an execution plan to preview the changes Terraform will make to the infrastructure.
terraform apply: Run terraform apply --auto-approve to deploy the infrastructure on AWS.
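Run the commands in sequence from the project root:

```bash
terraform init
terraform plan
terraform apply --auto-approve
```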
Step 3: Confirm resources deployed on AWS
Once the apply completes, confirm the following resources in the AWS console:
• Lambda function
• Glue crawler
• Glue Catalog database and table
• S3 bucket (empty for now, since no data has been uploaded yet)
• Athena workgroup
• API Gateway
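If you prefer checking from the terminal, a quick spot-check with the AWS CLI (assuming your default profile and region match the deployment):

```bash
aws lambda list-functions --query 'Functions[].FunctionName'
aws glue list-crawlers
aws s3 ls
aws athena list-work-groups
aws apigateway get-rest-apis --query 'items[].name'
```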
Step 4: Testing the Application
To trigger the Lambda function to retrieve, process, and upload NBA data to S3, we will send a GET request through the API Gateway invoke URL.
Copy the API Gateway invoke URL into your browser, append /dev/data (the API stage and resource path), and press Enter.
https://r3zks22udh.execute-api.us-east-1.amazonaws.com/dev/data
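Alternatively, you can trigger it from the command line:

```bash
curl https://r3zks22udh.execute-api.us-east-1.amazonaws.com/dev/data
```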
NBA Data Uploaded into S3
Preview data table in Athena
Performing Simple SQL query in Athena
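As an example, a simple query against the crawled table might look like the following. The database, table, and column names here are illustrative; use the names your Glue crawler actually created.

```sql
-- Hypothetical names: substitute your Glue database and table.
SELECT firstname, lastname, team, position
FROM nba_data_lake.nba_player_data
LIMIT 10;
```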
Athena Query Result
Query results are stored in a designated folder in the S3 bucket and can be downloaded from there.
Conclusion: Congratulations! We have successfully built a Serverless NBA Data Lake Application by leveraging AWS services such as API Gateway, Lambda, S3, Glue, and Athena. Terraform adds to the elegance by ensuring the infrastructure is provisioned consistently and can be replicated or modified with ease. This architecture not only showcases the potential of serverless computing but also opens up possibilities for expanding into other domains, such as real-time analytics, machine learning, or personalized user experiences.
To Clean Up: Run terraform destroy to delete all infrastructure deployed by this Terraform configuration.