Building an NBA Stats Pipeline with AWS, Python, and DynamoDB

Ever wanted to build a data pipeline that automatically fetches and stores NBA statistics? In this tutorial, I'll walk you through how I created a robust pipeline using AWS services, Python, and DynamoDB. Whether you're a sports enthusiast or just looking to learn more about AWS integration, this project offers hands-on experience with real-world data processing.

Project Overview

This pipeline automatically fetches NBA statistics from the SportsData API, processes the data, and stores it in DynamoDB. We'll be using several AWS services:

  • DynamoDB for data storage
  • Lambda for serverless execution
  • CloudWatch for monitoring and logging

Prerequisites

Before we dive in, make sure you have:

  • Basic Python knowledge
  • An AWS account
  • The AWS CLI installed and configured
  • A SportsData API key

Setting Up the Project

First, clone the repository and install the required dependencies:

git clone https://github.com/nolunchbreaks/nba-stats-pipeline.git
cd nba-stats-pipeline
pip install -r requirements.txt

Environment Configuration

Create a .env file in your project root with the following variables:

SPORTDATA_API_KEY=your_api_key_here
AWS_REGION=us-east-1
DYNAMODB_TABLE_NAME=nba-player-stats
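
If you're loading these values in Python, python-dotenv is a common choice (add it to requirements.txt if it isn't there already). A minimal sketch of reading the config at startup:

import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root

API_KEY = os.getenv("SPORTDATA_API_KEY")
AWS_REGION = os.getenv("AWS_REGION", "us-east-1")
TABLE_NAME = os.getenv("DYNAMODB_TABLE_NAME", "nba-player-stats")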

Project Structure

The project is organized as follows:

nba-stats-pipeline/
├── src/
│   ├── __init__.py           # Package initialization
│   ├── nba_stats.py          # Main pipeline script
│   └── lambda_function.py    # AWS Lambda handler
├── tests/                    # Test cases
├── requirements.txt          # Dependencies
├── README.md                 # Documentation
└── .env                      # Environment variables

Data Structure and Storage

DynamoDB Schema

The pipeline stores NBA team statistics in DynamoDB with the following structure:

  • Partition Key: TeamID
  • Sort Key: Timestamp
  • Attributes: Team statistics including:
    • Win/Loss records
    • Points per game
    • Conference standings
    • Division rankings
    • Historical performance metrics
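
To make that concrete, here's roughly what a single item looks like when written with boto3. The attribute names are illustrative, not the exact ones from the repo; note that boto3 requires Decimal rather than float for non-integer numbers:

from decimal import Decimal

item = {
    "TeamID": "MIA",                    # partition key
    "Timestamp": 1705276800,            # sort key: Unix epoch seconds
    "Wins": 24,
    "Losses": 15,
    "PointsPerGame": Decimal("110.4"),  # boto3 rejects floats; use Decimal
    "Conference": "Eastern",
    "DivisionRank": 2,
}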

AWS Infrastructure Setup

DynamoDB Table

The table is designed for efficient querying of team statistics over time. Here's what you need to configure:

  • Table Name: nba-player-stats
  • Partition Key: TeamID (String)
  • Sort Key: Timestamp (Number)
  • Provisioned Capacity: Adjust based on your needs
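
If you'd rather script the table creation than click through the console, a boto3 sketch looks like this (5 read/write capacity units is an arbitrary starting point; adjust it, or switch to on-demand billing):

import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.create_table(
    TableName="nba-player-stats",
    AttributeDefinitions=[
        {"AttributeName": "TeamID", "AttributeType": "S"},
        {"AttributeName": "Timestamp", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "TeamID", "KeyType": "HASH"},      # partition key
        {"AttributeName": "Timestamp", "KeyType": "RANGE"},  # sort key
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)

This key design is what makes the time-series access pattern cheap: one team's history is a single Query on TeamID with a range condition on Timestamp:

from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb", region_name="us-east-1").Table("nba-player-stats")
history = table.query(
    KeyConditionExpression=Key("TeamID").eq("MIA") & Key("Timestamp").gt(1704067200)
)["Items"]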

Lambda Function Configuration

If you're using Lambda to trigger the pipeline:

  • Runtime: Python 3.9
  • Memory: 256MB
  • Timeout: 30 seconds
  • Handler: lambda_function.lambda_handler
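
The handler itself stays thin: it delegates to the pipeline module and returns a summary. Something like this, where run_pipeline is a stand-in for whatever entry point nba_stats.py actually exposes:

import json

from nba_stats import run_pipeline  # hypothetical entry point in src/nba_stats.py

def lambda_handler(event, context):
    # Run one fetch-and-store cycle and report how much was processed
    stats = run_pipeline()
    return {
        "statusCode": 200,
        "body": json.dumps({"teams_processed": len(stats)}),
    }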

Error Handling and Monitoring

The pipeline includes comprehensive error handling for:

  • API failures
  • DynamoDB throttling
  • Data transformation issues
  • Invalid API responses
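
As a sketch of what one of those cases looks like in practice, here's how DynamoDB throttling can be handled with backoff (the retry details are my own illustration, not copied from the repo):

import time

from botocore.exceptions import ClientError

def put_with_retry(table, item, retries=3):
    # Retry only on throttling; anything else (validation errors, bad
    # permissions) should fail fast instead of being retried blindly.
    for attempt in range(retries):
        try:
            table.put_item(Item=item)
            return
        except ClientError as err:
            if err.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
    raise RuntimeError("DynamoDB still throttling after retries")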

All events are logged to CloudWatch in structured JSON format, making it easy to:

  • Monitor pipeline performance
  • Track and debug issues
  • Ensure successful data processing
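
A small helper is enough to get structured logs; in Lambda, anything written through the standard logging module lands in CloudWatch automatically. A minimal sketch:

import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def log_event(message, **fields):
    # Emit one JSON object per log line so CloudWatch Logs Insights
    # can filter and aggregate on individual fields.
    logger.info(json.dumps({"message": message, **fields}))

log_event("pipeline_run_complete", teams_processed=30, duration_ms=1840)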

Cleanup Process

When you're done experimenting, clean up your AWS resources:

# Delete DynamoDB table
aws dynamodb delete-table --table-name nba-player-stats

# Remove Lambda function
aws lambda delete-function --function-name nba-stats-function

# Clean up CloudWatch logs
aws logs delete-log-group --log-group-name /aws/lambda/nba-stats-function

Key Learnings

Building this pipeline taught me several valuable lessons:

  1. AWS Service Integration: Understanding how different AWS services work together to create a robust data pipeline.
  2. Error Handling: The importance of comprehensive error handling in production systems.
  3. Monitoring: Setting up proper logging and monitoring is crucial for maintaining data pipelines.
  4. Cost Management: Being mindful of AWS resource usage and cleaning up unused resources.

Next Steps

Want to extend this project? Here are some ideas:

  • Add real-time game statistics
  • Implement data visualization
  • Create API endpoints for accessing the stored data
  • Add more sophisticated data analysis

Conclusion

This NBA stats pipeline project demonstrates how to combine AWS services with Python to create a functional data pipeline. It's a great starting point for anyone interested in sports analytics or learning about AWS data processing.

Have you built something similar? I'd love to hear about your experiences and any suggestions for improving this pipeline!


Follow me for more AWS and Python tutorials! If you found this helpful, don't forget to leave a ❤️ and a 🦄!