DEV Community

Ameh Mathias Ejeh

Building an NBA Sport Data Lake Analytic using AWS Services

Overview

The NBA Sport Data Lake Analytic project is a cloud-native solution that builds a scalable data lake for NBA analytics. By leveraging AWS services, this project automates data ingestion, cataloging, and querying, enabling efficient storage and analysis of NBA-related data.

Architecture

The architecture of the project is designed to process and analyze NBA data efficiently. The main components are:

  • Amazon S3: Stores raw and processed data.
  • AWS Glue: Automates data cataloging and schema creation.
  • Amazon Athena: Enables SQL querying of the data stored in S3.

Architecture Diagram

[Figure: architecture diagram]

Workflow

  • Data Ingestion: Fetch data from SportsData.io's NBA API.
  • Data Storage: Store the raw data in Amazon S3.
  • Data Cataloging: Use AWS Glue to create a database and table schema.
  • Data Querying: Query the data using Amazon Athena for analytics.
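
The ingestion step above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual script: the function names are hypothetical, and passing the key in the Ocp-Apim-Subscription-Key header is an assumption about how SportsData.io authenticates requests. The records are written one JSON object per line because that layout is what JSON SerDes used with Athena typically expect.

```python
import json
import urllib.request


def build_request(api_key: str, endpoint: str) -> urllib.request.Request:
    """Build the HTTP request for the NBA endpoint (no network call yet)."""
    # Assumption: the API key travels in the Ocp-Apim-Subscription-Key header.
    return urllib.request.Request(
        endpoint, headers={"Ocp-Apim-Subscription-Key": api_key}
    )


def fetch_players(api_key: str, endpoint: str) -> list:
    """Call the API and return the decoded JSON list of player records."""
    request = build_request(api_key, endpoint)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def to_json_lines(records: list) -> str:
    """Serialize records as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(record) for record in records)


# Example usage (performs a real HTTP call, so it is left commented out):
# players = fetch_players("your_api_key",
#                         "https://api.sportsdata.io/v3/nba/scores/json/Players")
# body = to_json_lines(players)
```

The string returned by to_json_lines is what would be uploaded to S3 in the storage step.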

Prerequisites

Required Accounts and Tools

  • SportsData.io API Key: Sign up at SportsData.io to get access to the NBA API.
  • AWS Account: An active AWS account with permissions to use S3, Glue, and Athena.
  • Python Environment: Python 3 installed locally, with a virtual environment for dependency management.

Permissions

Ensure the IAM user or role has the following AWS permissions:

  • S3: s3:CreateBucket, s3:PutObject, s3:DeleteBucket, s3:ListBucket
  • Glue: glue:CreateDatabase, glue:CreateTable, glue:DeleteDatabase, glue:DeleteTable
  • Athena: athena:StartQueryExecution, athena:GetQueryResults
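
One way to turn the list above into an IAM policy document is sketched below. The action names are copied from the list; the "Resource": "*" scope and the function name are assumptions made for brevity, and a real policy would usually restrict resources to the project's bucket and Glue database.

```python
import json

# Actions copied from the permissions list above.
ACTIONS = [
    "s3:CreateBucket", "s3:PutObject", "s3:DeleteBucket", "s3:ListBucket",
    "glue:CreateDatabase", "glue:CreateTable",
    "glue:DeleteDatabase", "glue:DeleteTable",
    "athena:StartQueryExecution", "athena:GetQueryResults",
]


def build_policy(actions: list) -> dict:
    """Return an IAM policy document that allows the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions),
                # Assumption: wildcard resource for simplicity; narrow in practice.
                "Resource": "*",
            }
        ],
    }


# Print the JSON document, ready to paste into the IAM console.
print(json.dumps(build_policy(ACTIONS), indent=2))
```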

Setup Guide

Step 1: Clone the Repository

git clone https://github.com/ameh0429/ameh0429-NBA-Sport-Data-Lake-Analytic.git
cd ameh0429-NBA-Sport-Data-Lake-Analytic

Step 2: Install Dependencies

  • Install the Python dependencies, ideally inside an activated virtual environment:
pip install -r requirements.txt

Step 3: Configure Environment Variables

  • Create a .env file with your API key and endpoint:
echo "SPORTS_DATA_API_KEY=your_api_key" >> .env
echo "NBA_ENDPOINT=https://api.sportsdata.io/v3/nba/scores/json/Players" >> .env
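
For illustration, here is a standard-library sketch of how a script could read these two values back from the .env file. The project's own script may well rely on a helper library for this instead; parse_env and load_env are hypothetical names, and the parser only handles simple KEY=value lines.

```python
import os


def parse_env(text: str) -> dict:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env(path: str = ".env") -> dict:
    """Read the file and export its values without overriding existing ones."""
    with open(path, encoding="utf-8") as handle:
        values = parse_env(handle.read())
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values


# Example usage, once the .env file exists:
# load_env()
# api_key = os.environ["SPORTS_DATA_API_KEY"]
# endpoint = os.environ["NBA_ENDPOINT"]
```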

Step 4: Run the Data Lake Setup Script

  • In your terminal, open setup_nba_data_lake.py in a text editor and paste in the script.

  • Run the script
python setup_nba_data_lake.py

The script performs the following actions:

  • Creates an S3 bucket named sports-analytics-data-lake-0429.
  • Uploads NBA player data to the raw-data folder.
  • Configures a Glue database and table.
  • Sets up Athena for querying.
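
The bucket, upload, and catalog actions listed above roughly correspond to the boto3 calls sketched below. This is not the contents of setup_nba_data_lake.py: the Glue database name, the object key, and the column types are assumptions, while the bucket name, the raw-data prefix, and the nba_players table name come from this post. The functions accept already-created boto3 clients so the sketch itself makes no AWS calls.

```python
BUCKET = "sports-analytics-data-lake-0429"  # bucket name from the list above
DATABASE = "glue_nba_data_lake"             # hypothetical Glue database name
TABLE = "nba_players"                       # table queried later in Athena


def create_bucket(s3, bucket: str, region: str) -> None:
    """Create the bucket; us-east-1 must not send a LocationConstraint."""
    if region == "us-east-1":
        s3.create_bucket(Bucket=bucket)
    else:
        s3.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )


def upload_raw_data(s3, bucket: str, body: bytes,
                    key: str = "raw-data/nba_player_data.json") -> None:
    """Upload the player data under the raw-data prefix (key is assumed)."""
    s3.put_object(Bucket=bucket, Key=key, Body=body)


def create_catalog(glue, bucket: str, database: str, table: str) -> None:
    """Create a Glue database and an external JSON table over raw-data/."""
    glue.create_database(DatabaseInput={"Name": database})
    glue.create_table(
        DatabaseName=database,
        TableInput={
            "Name": table,
            "TableType": "EXTERNAL_TABLE",
            "StorageDescriptor": {
                # Column names come from the queries in this post; types are assumed.
                "Columns": [
                    {"Name": "PlayerID", "Type": "int"},
                    {"Name": "FirstName", "Type": "string"},
                    {"Name": "LastName", "Type": "string"},
                    {"Name": "Team", "Type": "string"},
                    {"Name": "Position", "Type": "string"},
                ],
                "Location": f"s3://{bucket}/raw-data/",
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "OutputFormat":
                    "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
                },
            },
        },
    )


# Example usage (requires boto3 and AWS credentials, so it is commented out):
# import boto3
# s3 = boto3.client("s3", region_name="us-east-1")
# glue = boto3.client("glue", region_name="us-east-1")
# create_bucket(s3, BUCKET, "us-east-1")
# upload_raw_data(s3, BUCKET, b'{"PlayerID": 1}')
# create_catalog(glue, BUCKET, DATABASE, TABLE)
```

Passing the clients in as arguments keeps each step easy to test with a stub object before running it against a real account.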

Step 5: Validate Setup

  • S3: Verify the bucket and data file in the AWS Management Console.

  • Athena: Run a test query:

Query 1

SELECT FirstName, LastName, Position, Team
FROM nba_players
WHERE Position = 'PG';

The output

[Figure: Query 1 results]

Query 2

SELECT PlayerID, FirstName, LastName, Team, Position
FROM nba_players
WHERE Team = 'LAL';

The output

[Figure: Query 2 results]
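
The same queries can also be run from Python rather than the console. The sketch below is one possible approach using the boto3 Athena client: it starts a query, polls until it reaches a terminal state, and returns the first page of results. The run_query name, the database name, and the output location are assumptions, and polling with get_query_execution needs the athena:GetQueryExecution permission in addition to those listed earlier.

```python
import time


def run_query(athena, sql: str, database: str, output_location: str,
              poll_seconds: float = 1.0) -> list:
    """Run a query and return its first page of rows as lists of strings."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until Athena reports a terminal state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    results = athena.get_query_results(QueryExecutionId=query_id)
    # Missing (NULL) cells have no VarCharValue key, so they become None.
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in results["ResultSet"]["Rows"]
    ]


# Example usage (requires boto3 and AWS credentials, so it is commented out):
# import boto3
# athena = boto3.client("athena", region_name="us-east-1")
# rows = run_query(
#     athena,
#     "SELECT FirstName, LastName, Position, Team "
#     "FROM nba_players WHERE Position = 'PG';",
#     database="glue_nba_data_lake",  # hypothetical database name
#     output_location="s3://sports-analytics-data-lake-0429/athena-results/",
# )
```

For SELECT statements the first returned row holds the column headers, so the data starts at the second row.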

Cleanup

To delete all the resources created by the project, run the cleanup script:

python delete_resources.py

This will:

  • Remove the S3 bucket and its contents.
  • Delete the Glue database and table.
  • Clean up Athena configurations.
