Building an NBA Stats Pipeline with AWS, Python, and DynamoDB

Ever wanted to build a data pipeline that automatically fetches and stores NBA statistics? In this tutorial, I'll walk you through how I created a robust pipeline using AWS services, Python, and DynamoDB. Whether you're a sports enthusiast or just looking to learn more about AWS integration, this project offers hands-on experience with real-world data processing.

Project Overview

This pipeline automatically fetches NBA statistics from the SportsData API, processes the data, and stores it in DynamoDB. We'll be using several AWS services:

  • DynamoDB for data storage
  • Lambda for serverless execution
  • CloudWatch for monitoring and logging

Prerequisites

Before we dive in, make sure you have:

  • Basic Python knowledge
  • An AWS account
  • The AWS CLI installed and configured
  • A SportsData API key

Setting Up the Project

First, clone the repository and install the required dependencies:

git clone https://github.com/nolunchbreaks/nba-stats-pipeline.git
cd nba-stats-pipeline
pip install -r requirements.txt

Environment Configuration

Create a .env file in your project root with the following variables:

SPORTDATA_API_KEY=your_api_key_here
AWS_REGION=us-east-1
DYNAMODB_TABLE_NAME=nba-player-stats
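
If you're loading these values in Python, python-dotenv is a common choice (add it to requirements.txt if it isn't there already). A minimal sketch of reading the config at startup:

import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root

API_KEY = os.getenv("SPORTDATA_API_KEY")
AWS_REGION = os.getenv("AWS_REGION", "us-east-1")
TABLE_NAME = os.getenv("DYNAMODB_TABLE_NAME", "nba-player-stats")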

Project Structure

The project is organized as follows:

nba-stats-pipeline/
├── src/
│   ├── __init__.py           # Package initialization
│   ├── nba_stats.py          # Main pipeline script
│   └── lambda_function.py    # AWS Lambda handler
├── tests/                    # Test cases
├── requirements.txt          # Dependencies
├── README.md                 # Documentation
└── .env                      # Environment variables

Data Structure and Storage

DynamoDB Schema

The pipeline stores NBA team statistics in DynamoDB with the following structure:

  • Partition Key: TeamID
  • Sort Key: Timestamp
  • Attributes: Team statistics including:
    • Win/Loss records
    • Points per game
    • Conference standings
    • Division rankings
    • Historical performance metrics
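
To make that concrete, here's roughly what a single item looks like when written with boto3. The attribute names are illustrative, not the exact ones from the repo; note that boto3 requires Decimal rather than float for non-integer numbers:

from decimal import Decimal

item = {
    "TeamID": "MIA",                    # partition key
    "Timestamp": 1705276800,            # sort key: Unix epoch seconds
    "Wins": 24,
    "Losses": 15,
    "PointsPerGame": Decimal("110.4"),  # boto3 rejects floats; use Decimal
    "Conference": "Eastern",
    "DivisionRank": 2,
}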

AWS Infrastructure Setup

DynamoDB Table

The table is designed for efficient querying of team statistics over time. Here's what you need to configure:

  • Table Name: nba-player-stats
  • Partition Key: TeamID (String)
  • Sort Key: Timestamp (Number)
  • Provisioned Capacity: Adjust based on your needs
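
If you'd rather script the table creation than click through the console, a boto3 sketch looks like this (5 read/write capacity units is an arbitrary starting point; adjust it, or switch to on-demand billing):

import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.create_table(
    TableName="nba-player-stats",
    AttributeDefinitions=[
        {"AttributeName": "TeamID", "AttributeType": "S"},
        {"AttributeName": "Timestamp", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "TeamID", "KeyType": "HASH"},      # partition key
        {"AttributeName": "Timestamp", "KeyType": "RANGE"},  # sort key
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)

This key design is what makes the time-series access pattern cheap: one team's history is a single Query on TeamID with a range condition on Timestamp:

from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb", region_name="us-east-1").Table("nba-player-stats")
history = table.query(
    KeyConditionExpression=Key("TeamID").eq("MIA") & Key("Timestamp").gt(1704067200)
)["Items"]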

Lambda Function Configuration

If you're using Lambda to trigger the pipeline:

  • Runtime: Python 3.9
  • Memory: 256MB
  • Timeout: 30 seconds
  • Handler: lambda_function.lambda_handler
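
The handler itself stays thin: it delegates to the pipeline module and returns a summary. Something like this, where run_pipeline is a stand-in for whatever entry point nba_stats.py actually exposes:

import json

from nba_stats import run_pipeline  # hypothetical entry point in src/nba_stats.py

def lambda_handler(event, context):
    # Run one fetch-and-store cycle and report how much was processed
    stats = run_pipeline()
    return {
        "statusCode": 200,
        "body": json.dumps({"teams_processed": len(stats)}),
    }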

Error Handling and Monitoring

The pipeline includes comprehensive error handling for:

  • API failures
  • DynamoDB throttling
  • Data transformation issues
  • Invalid API responses
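
As a sketch of what one of those cases looks like in practice, here's how DynamoDB throttling can be handled with backoff (the retry details are my own illustration, not copied from the repo):

import time

from botocore.exceptions import ClientError

def put_with_retry(table, item, retries=3):
    # Retry only on throttling; anything else (validation errors, bad
    # permissions) should fail fast instead of being retried blindly.
    for attempt in range(retries):
        try:
            table.put_item(Item=item)
            return
        except ClientError as err:
            if err.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
    raise RuntimeError("DynamoDB still throttling after retries")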

All events are logged to CloudWatch in structured JSON format, making it easy to:

  • Monitor pipeline performance
  • Track and debug issues
  • Ensure successful data processing
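
A small helper is enough to get structured logs; in Lambda, anything written through the standard logging module lands in CloudWatch automatically. A minimal sketch:

import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def log_event(message, **fields):
    # Emit one JSON object per log line so CloudWatch Logs Insights
    # can filter and aggregate on individual fields.
    logger.info(json.dumps({"message": message, **fields}))

log_event("pipeline_run_complete", teams_processed=30, duration_ms=1840)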

Cleanup Process

When you're done experimenting, clean up your AWS resources:

# Delete DynamoDB table
aws dynamodb delete-table --table-name nba-player-stats

# Remove Lambda function
aws lambda delete-function --function-name nba-stats-function

# Clean up CloudWatch logs
aws logs delete-log-group --log-group-name /aws/lambda/nba-stats-function

Key Learnings

Building this pipeline taught me several valuable lessons:

  1. AWS Service Integration: Understanding how different AWS services work together to create a robust data pipeline.
  2. Error Handling: The importance of comprehensive error handling in production systems.
  3. Monitoring: Setting up proper logging and monitoring is crucial for maintaining data pipelines.
  4. Cost Management: Being mindful of AWS resource usage and cleaning up unused resources.

Next Steps

Want to extend this project? Here are some ideas:

  • Add real-time game statistics
  • Implement data visualization
  • Create API endpoints for accessing the stored data
  • Add more sophisticated data analysis

Conclusion

This NBA stats pipeline project demonstrates how to combine AWS services with Python to create a functional data pipeline. It's a great starting point for anyone interested in sports analytics or learning about AWS data processing.

Have you built something similar? I'd love to hear about your experiences and any suggestions for improving this pipeline!


Follow me for more AWS and Python tutorials! If you found this helpful, don't forget to leave a ❤️ and a 🦄!