Overview
The NBA Sport Data Lake Analytic project is a cloud-native solution that builds a scalable data lake for NBA analytics. By leveraging AWS services, this project automates data ingestion, cataloging, and querying, enabling efficient storage and analysis of NBA-related data.
Architecture
The architecture of the project is designed to process and analyze NBA data efficiently. The main components are:
- Amazon S3: Stores raw and processed data.
- AWS Glue: Automates data cataloging and schema creation.
- Amazon Athena: Enables SQL querying of the data stored in S3.
Architecture Diagram
Workflow
- Data Ingestion: Fetch data from SportsData.io's NBA API.
- Data Storage: Store the raw data in Amazon S3.
- Data Cataloging: Use AWS Glue to create a database and table schema.
- Data Querying: Query the data using Amazon Athena for analytics.
Prerequisites
Required Accounts and Tools
- SportsData.io API Key: Sign up at SportsData.io to get access to the NBA API.
- AWS Account: An active AWS account with permissions to use S3, Glue, and Athena.
- Python Environment: Python 3 installed locally, with a virtual environment for dependency management.
Permissions
Ensure the IAM user or role has the following AWS permissions:
- S3: s3:CreateBucket, s3:PutObject, s3:DeleteBucket, s3:ListBucket
- Glue: glue:CreateDatabase, glue:CreateTable, glue:DeleteDatabase, glue:DeleteTable
- Athena: athena:StartQueryExecution, athena:GetQueryResults
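If you prefer to create this policy programmatically, the sketch below shows one way to do it with boto3. The policy name is hypothetical and Resource is left wide open for brevity; in practice you would scope it to your bucket and Glue database.

import json

import boto3

# Illustrative policy document covering the S3, Glue, and Athena actions listed above.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:CreateBucket", "s3:PutObject", "s3:DeleteBucket", "s3:ListBucket",
                "glue:CreateDatabase", "glue:CreateTable",
                "glue:DeleteDatabase", "glue:DeleteTable",
                "athena:StartQueryExecution", "athena:GetQueryResults",
            ],
            "Resource": "*",
        }
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="nba-data-lake-policy",  # hypothetical name
    PolicyDocument=json.dumps(policy_document),
)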
Setup Guide
Step 1: Clone the Repository
git clone https://github.com/ameh0429/ameh0429-NBA-Sport-Data-Lake-Analytic.git
cd ameh0429-NBA-Sport-Data-Lake-Analytic
Step 2: Install Dependencies
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate
- Install the project dependencies:
pip install -r requirements.txt
Step 3: Configure Environment Variables
- Create a .env file with your API key and endpoint:
echo "SPORTS_DATA_API_KEY=your_api_key" >> .env
echo "NBA_ENDPOINT=https://api.sportsdata.io/v3/nba/scores/json/Players" >> .env
Step 4: Run the Data Lake Setup Script
- In the CLI terminal, paste the setup_nba_data_lake.py script.
- Run the script:
python setup_nba_data_lake.py
The script performs the following actions:
- Creates an S3 bucket named sports-analytics-data-lake-0429.
- Uploads NBA player data to the raw-data folder.
- Configures a Glue database and table.
- Sets up Athena for querying.
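The core of the script looks roughly like the sketch below. It is a simplified outline, not the exact source: the Glue database name (nba_data_lake) and the table schema are assumptions, and error handling, region configuration, and the data upload step are omitted.

import boto3

BUCKET = "sports-analytics-data-lake-0429"
GLUE_DATABASE = "nba_data_lake"  # hypothetical database name

s3 = boto3.client("s3")
glue = boto3.client("glue")

# 1. Create the S3 bucket that backs the data lake
#    (outside us-east-1, create_bucket also needs a CreateBucketConfiguration).
s3.create_bucket(Bucket=BUCKET)

# 2. Player data is fetched from the API and uploaded to raw-data/,
#    as in the workflow sketch earlier in this post.

# 3. Create a Glue database and an external table pointing at the raw JSON data.
glue.create_database(DatabaseInput={"Name": GLUE_DATABASE})
glue.create_table(
    DatabaseName=GLUE_DATABASE,
    TableInput={
        "Name": "nba_players",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "PlayerID", "Type": "int"},
                {"Name": "FirstName", "Type": "string"},
                {"Name": "LastName", "Type": "string"},
                {"Name": "Team", "Type": "string"},
                {"Name": "Position", "Type": "string"},
            ],
            "Location": f"s3://{BUCKET}/raw-data/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
            },
        },
    },
)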
Step 5: Validate Setup
- S3: Verify the bucket and data file in the AWS Management Console.
- Athena: Run a test query:
Query 1: List all point guards
SELECT FirstName, LastName, Position, Team
FROM nba_players
WHERE Position = 'PG';
Query 2: List all Los Angeles Lakers players
SELECT PlayerID, FirstName, LastName, Team, Position
FROM nba_players
WHERE Team = 'LAL';
Each query should return the matching rows from the uploaded player data.
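You can also run the validation query from Python with boto3. This is a minimal sketch: the database name (nba_data_lake) and the query-results S3 prefix are placeholders that must match what the setup script actually created.

import time

import boto3

athena = boto3.client("athena")

# Submit the query; Athena writes results to the S3 location below (placeholder path).
execution = athena.start_query_execution(
    QueryString=(
        "SELECT FirstName, LastName, Position, Team "
        "FROM nba_players WHERE Position = 'PG';"
    ),
    QueryExecutionContext={"Database": "nba_data_lake"},  # hypothetical database name
    ResultConfiguration={
        "OutputLocation": "s3://sports-analytics-data-lake-0429/athena-results/"
    },
)

query_id = execution["QueryExecutionId"]
time.sleep(5)  # crude wait; poll get_query_execution in real code

for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
    print([col.get("VarCharValue") for col in row["Data"]])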
Cleanup
To delete all the resources created by the project, run the cleanup script:
python delete_resources.py
This will:
- Remove the S3 bucket and its contents.
- Delete the Glue database and table.
- Clean up Athena configurations.
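As a rough idea of what the cleanup does, a boto3-based teardown might look like the sketch below. The bucket and database names are the ones used earlier (the database name being a hypothetical), and the actual delete_resources.py may differ in detail.

import boto3

BUCKET = "sports-analytics-data-lake-0429"
GLUE_DATABASE = "nba_data_lake"  # hypothetical name; must match the setup script

# Empty and delete the S3 bucket (a bucket must be empty before it can be deleted).
s3 = boto3.resource("s3")
bucket = s3.Bucket(BUCKET)
bucket.objects.all().delete()
bucket.delete()

# Deleting the Glue database also removes the tables it contains.
glue = boto3.client("glue")
glue.delete_database(Name=GLUE_DATABASE)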