Goodluck Ekeoma Adiole
Building an NBA Data Lake with AWS: Challenges, Use Cases, and Future Enhancements

In today's data-driven world, sports analytics has taken center stage in improving team performance, fan engagement, and business decisions. As a Cloud Security Engineer passionate about cloud-native solutions, I embarked on a project to build an NBA Data Lake using AWS services. This article walks through the implementation, the challenges I faced, practical use cases, and possible future enhancements.

Project Overview

The goal of this project was to create an automated pipeline for storing and querying NBA-related data using AWS services. The setup involved:

  • Amazon S3: Storing raw and processed NBA data.
  • AWS Glue: Creating a database and tables for structured querying.
  • Amazon Athena: Querying the stored data using SQL.
  • SportsData.io API: Fetching real-time NBA data.

Step-by-Step Implementation

1. Setting Up AWS CloudShell

To ensure a smooth workflow, I used AWS CloudShell, which provides a secure and pre-configured command-line environment. After logging into AWS, I accessed CloudShell from the AWS Management Console.
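
Before running anything, a quick sanity check helps. The commands below are standard AWS CLI and pip usage: they confirm which IAM identity CloudShell is operating as and install the Python dependencies the script imports (boto3 ships with CloudShell, so only the extras are needed):

# Confirm the IAM identity CloudShell is operating as
aws sts get-caller-identity

# Install the dependencies the setup script imports
pip3 install --user requests python-dotenv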

2. Creating the Python Script

I wrote a Python script (setup_nba_data_lake.py) to automate the setup process. The script performs the following:

  • Creates an S3 bucket for storing NBA data.
  • Uploads sample data in JSON format.
  • Configures AWS Glue to define a database and table.
  • Sets up Amazon Athena for SQL queries.

Below is a sample of the script:

import boto3
import json
import time
import requests
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# AWS configurations
region = "us-east-1"  # Replace with your preferred AWS region
bucket_name = "goody-sports-analytics-data-lake"  # Change to a unique S3 bucket name
glue_database_name = "glue_nba_data_lake"
athena_output_location = f"s3://{bucket_name}/athena-results/"

# Sportsdata.io configurations (loaded from .env)
api_key = os.getenv("SPORTS_DATA_API_KEY")  # Get API key from .env
nba_endpoint = os.getenv("NBA_ENDPOINT")  # Get NBA endpoint from .env

# Create AWS clients
s3_client = boto3.client("s3", region_name=region)
glue_client = boto3.client("glue", region_name=region)
athena_client = boto3.client("athena", region_name=region)

def create_s3_bucket():
    """Create an S3 bucket for storing sports data."""
    try:
        if region == "us-east-1":
            s3_client.create_bucket(Bucket=bucket_name)
        else:
            s3_client.create_bucket(
                Bucket=bucket_name,
                CreateBucketConfiguration={"LocationConstraint": region},
            )
        print(f"S3 bucket '{bucket_name}' created successfully.")
    except Exception as e:
        print(f"Error creating S3 bucket: {e}")

def create_glue_database():
    """Create a Glue database for the data lake."""
    try:
        glue_client.create_database(
            DatabaseInput={
                "Name": glue_database_name,
                "Description": "Glue database for NBA sports analytics.",
            }
        )
        print(f"Glue database '{glue_database_name}' created successfully.")
    except Exception as e:
        print(f"Error creating Glue database: {e}")

def fetch_nba_data():
    """Fetch NBA player data from sportsdata.io."""
    try:
        headers = {"Ocp-Apim-Subscription-Key": api_key}
        response = requests.get(nba_endpoint, headers=headers)
        response.raise_for_status()  # Raise an error for bad status codes
        print("Fetched NBA data successfully.")
        return response.json()  # Return JSON response
    except Exception as e:
        print(f"Error fetching NBA data: {e}")
        return []

def convert_to_line_delimited_json(data):
    """Convert data to line-delimited JSON format."""
    print("Converting data to line-delimited JSON format...")
    return "\n".join([json.dumps(record) for record in data])

def upload_data_to_s3(data):
    """Upload NBA data to the S3 bucket."""
    try:
        # Convert data to line-delimited JSON
        line_delimited_data = convert_to_line_delimited_json(data)

        # Define S3 object key
        file_key = "raw-data/nba_player_data.jsonl"

        # Upload JSON data to S3
        s3_client.put_object(
            Bucket=bucket_name,
            Key=file_key,
            Body=line_delimited_data
        )
        print(f"Uploaded data to S3: {file_key}")
    except Exception as e:
        print(f"Error uploading data to S3: {e}")

def create_glue_table():
    """Create a Glue table for the data."""
    try:
        glue_client.create_table(
            DatabaseName=glue_database_name,
            TableInput={
                "Name": "nba_players",
                "StorageDescriptor": {
                    "Columns": [
                        {"Name": "PlayerID", "Type": "int"},
                        {"Name": "FirstName", "Type": "string"},
                        {"Name": "LastName", "Type": "string"},
                        {"Name": "Team", "Type": "string"},
                        {"Name": "Position", "Type": "string"},
                        {"Name": "Points", "Type": "int"}
                    ],
                    "Location": f"s3://{bucket_name}/raw-data/",
                    "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                    "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                    "SerdeInfo": {
                        "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
                    },
                },
                "TableType": "EXTERNAL_TABLE",
            },
        )
        print(f"Glue table 'nba_players' created successfully.")
    except Exception as e:
        print(f"Error creating Glue table: {e}")

def configure_athena():
    """Run a test statement to verify Athena can write to the S3 output location."""
    try:
        # Any statement executed with this ResultConfiguration confirms the
        # output location is usable; the CREATE DATABASE itself is idempotent.
        athena_client.start_query_execution(
            QueryString="CREATE DATABASE IF NOT EXISTS nba_analytics",
            QueryExecutionContext={"Database": glue_database_name},
            ResultConfiguration={"OutputLocation": athena_output_location},
        )
        print("Athena output location configured successfully.")
    except Exception as e:
        print(f"Error configuring Athena: {e}")

# Main workflow
def main():
    print("Setting up data lake for NBA sports analytics...")
    create_s3_bucket()
    time.sleep(5)  # Ensure bucket creation propagates
    create_glue_database()
    nba_data = fetch_nba_data()
    if nba_data:  # Only proceed if data was fetched successfully
        upload_data_to_s3(nba_data)
    create_glue_table()
    configure_athena()
    print("Data lake setup complete.")

if __name__ == "__main__":
    main()


3. Configuring API Access

Since the project relies on real-time NBA data, I registered on SportsData.io and obtained an API key. The key is stored in a .env file, which keeps it out of the source code; the .env file itself should be excluded from version control (for example, via .gitignore).

SPORTS_DATA_API_KEY=your_sportsdata_api_key
NBA_ENDPOINT=https://api.sportsdata.io/v3/nba/scores/json/Players

4. Running the Script and Verifying Resources

With everything in place, I executed the script:

python3 setup_nba_data_lake.py

I then verified that the S3 bucket, Glue database, and Athena queries worked as expected. The console makes this easy, and the same checks can also be scripted, as sketched below.
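
These checks use standard AWS CLI commands, with the bucket and database names from the script above:

# Confirm the raw data landed in S3
aws s3 ls s3://goody-sports-analytics-data-lake/raw-data/

# Confirm the Glue table was registered
aws glue get-tables --database-name glue_nba_data_lake

With the table in place, a simple Athena query confirms the end-to-end flow:

SELECT FirstName, LastName, Team, Position
FROM nba_players
LIMIT 10;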


Challenges Faced

  1. IAM Role and Permissions Issues: AWS follows a strict security model, and setting up IAM roles with the right permissions for S3, Glue, and Athena required troubleshooting; a minimal policy sketch follows this list.
  2. Data Formatting: Ensuring the JSON format from the API matched the schema required by AWS Glue took some effort.
  3. Athena Query Performance: Optimizing queries for better performance involved structuring the data properly in S3.
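
For the IAM challenge, the following is a minimal sketch of the kind of identity policy that unblocks this workflow, not a hardened production policy. The bucket ARN matches the name used in the script; the Glue and Athena resources are left broad here for brevity and should be scoped down in practice:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DataLakeS3Access",
      "Effect": "Allow",
      "Action": ["s3:CreateBucket", "s3:PutObject", "s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::goody-sports-analytics-data-lake",
        "arn:aws:s3:::goody-sports-analytics-data-lake/*"
      ]
    },
    {
      "Sid": "GlueAndAthenaAccess",
      "Effect": "Allow",
      "Action": [
        "glue:CreateDatabase",
        "glue:CreateTable",
        "glue:GetDatabase",
        "glue:GetTable",
        "athena:StartQueryExecution",
        "athena:GetQueryExecution"
      ],
      "Resource": "*"
    }
  ]
}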

Use Cases

  • Real-time NBA Analytics: Teams and analysts can extract insights about player performances, game statistics, and team trends.
  • Fantasy Sports Platforms: Sports betting and fantasy leagues can use this data to build prediction models.
  • Fan Engagement Applications: Media platforms can use this structured data to create interactive dashboards for fans.
  • Historical Data Analysis: Researchers can study long-term player trends and team performances.

Future Enhancements

  1. Automating Data Ingestion with AWS Lambda: Instead of manually running the script, I plan to use AWS Lambda to schedule API calls and keep the data current in near real time (see the sketch after this list).
  2. Implementing AWS Glue ETL: Transforming raw JSON data into columnar formats such as Parquet will improve query efficiency and reduce Athena scan costs.
  3. Advanced Analytics with Amazon QuickSight: Adding visualization capabilities will enhance insights derived from the data lake.
  4. Machine Learning for Predictive Analysis: Integrating Amazon SageMaker to predict player performance based on historical data.
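
For the Lambda enhancement, here is a minimal handler sketch. It assumes the bucket name, API key, and endpoint are supplied as Lambda environment variables (BUCKET_NAME, SPORTS_DATA_API_KEY, and NBA_ENDPOINT are names chosen here for illustration), and that the requests library is bundled with the deployment package or a layer, since it is not included in the Lambda runtime by default:

import json
import os

import boto3
import requests

s3_client = boto3.client("s3")

def lambda_handler(event, context):
    """Fetch the latest player data and refresh the raw-data object in S3."""
    # Illustrative environment variable names; set these on the function
    bucket_name = os.environ["BUCKET_NAME"]
    api_key = os.environ["SPORTS_DATA_API_KEY"]
    nba_endpoint = os.environ["NBA_ENDPOINT"]

    response = requests.get(
        nba_endpoint,
        headers={"Ocp-Apim-Subscription-Key": api_key},
        timeout=30,
    )
    response.raise_for_status()
    records = response.json()

    # Same line-delimited JSON layout the Glue table already expects
    body = "\n".join(json.dumps(record) for record in records)
    s3_client.put_object(
        Bucket=bucket_name,
        Key="raw-data/nba_player_data.jsonl",
        Body=body,
    )
    return {"statusCode": 200, "records": len(records)}

An Amazon EventBridge schedule (for example, a rate(1 hour) rule targeting this function) would then keep the data lake current without manual runs.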

Conclusion

This project provided valuable hands-on experience in integrating AWS services for sports analytics.
It highlighted the importance of security, automation, and scalability in cloud-based workflows.
As sports data continues to evolve, leveraging cloud solutions like AWS ensures efficient storage, querying, and analysis, opening up endless possibilities for innovation in sports analytics.
