Introduction
This is the Week 1, Day 3 project of the 30-Day DevOps challenge I am participating in. Learn more about the challenge and its creators.
Project Overview
In today's project, we are implementing a data lake to ingest, store, and manage the high-volume, complex data we retrieve from the NBA API.
What's a Data Lake?
A data lake is kind of like a database in the sense that both store large amounts of data. The difference is that, whereas a database is structured (tables, schemas, relationships), a data lake can handle large volumes of all kinds of data regardless of format (structured, semi-structured, unstructured), and your data doesn't need to be organized first.
We can think of a database like an organized filing cabinet: when we add to it, we add the file to the correct location based on the schema of the cabinet.
A data lake, on the other hand, is more like a storage warehouse where we can add any kind of data. We don't have to organize it or put it in the correct location; we just dump it in there.
But what good is a whole bunch of unprocessed, unorganized data to us? So in our project we are gathering all this data, storing it in a data lake built on an S3 bucket, and then using other AWS services (e.g. Glue and Athena) to make the data queryable.
Key Features
- Fetches NBA player data using SportsDataIO's NBA API.
- Stores raw and processed data in AWS S3 buckets.
- Uses AWS Glue Crawlers to infer schema from raw data in S3.
- Stores metadata in AWS Glue Data Catalog for structured querying.
- Allows SQL-based querying of structured data stored in S3.
- Implements IAM roles and policies to control access to S3, Glue, and Athena.
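To give a feel for that last bullet, here is a minimal sketch of what a least-privilege inline policy for this project could look like, attached with boto3. The role name, policy name, bucket name, and exact action list below are placeholders of mine based on the permissions listed later in the setup instructions, not the repo's actual policy (that lives in policies/IAM_role.json).

import json
import boto3

iam = boto3.client("iam")

ROLE_NAME = "nba-data-lake-role"    # placeholder: an existing role in your account
BUCKET = "my-nba-data-lake-bucket"  # placeholder: your bucket name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # S3: only the bucket this project uses
            "Effect": "Allow",
            "Action": ["s3:CreateBucket", "s3:PutObject", "s3:DeleteBucket", "s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
        },
        {   # Glue: create and clean up the database and tables
            "Effect": "Allow",
            "Action": ["glue:CreateDatabase", "glue:CreateTable", "glue:DeleteDatabase", "glue:DeleteTable"],
            "Resource": "*",
        },
        {   # Athena: run queries and read results
            "Effect": "Allow",
            "Action": ["athena:StartQueryExecution", "athena:GetQueryResults"],
            "Resource": "*",
        },
    ],
}

# Attach the policy inline to the role the script (or crawler) runs as.
iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="nba-data-lake-access",
    PolicyDocument=json.dumps(policy),
)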
Technical Architecture
Technologies
- Cloud Provider: AWS
- Core Services: S3, Glue, Athena
- External API: NBA Game API (SportsData.io)
- Programming Language: Python 3.x
- IAM Security: Least privilege policies
NBA API
We are making a request to the NBA API, this time for data about NBA players. This is how I coded the request:
nba_endpoint = f"https://api.sportsdata.io/v3/nba/scores/json/Players?key={api_key}"
try:
response = requests.get(nba_endpoint)
response.raise_for_status()
print("Fetched NBA data successfully.")
return response.json()
except Exception as e:
print(f"Error fetching NBA data: {e}")
return []
The response looks something like this:
[
{
"PlayerID": 20000441,
"SportsDataID": "",
"Status": "Active",
"TeamID": 29,
"Team": "PHO",
"Jersey": 3,
"PositionCategory": "G",
"Position": "SG",
"FirstName": "Bradley",
"LastName": "Beal",
"Height": 76,
"Weight": 207,
"BirthDate": "1993-06-28T00:00:00",
"BirthCity": "St. Louis",
"BirthState": "MO",
"BirthCountry": "USA",
"HighSchool": null,
"College": "Florida",
"Salary": 50203930,
"PhotoUrl": "https://s3-us-west-2.amazonaws.com/static.fantasydata.com/headshots/nba/low-res/0.png",
"Experience": 12,
"SportRadarPlayerID": "ff461754-ad20-4eeb-af02-2b46cc980b24",
"RotoworldPlayerID": 1966,
"RotoWirePlayerID": 3303,
"FantasyAlarmPlayerID": 200464,
"StatsPlayerID": 606912,
"SportsDirectPlayerID": 750970,
"XmlTeamPlayerID": 3395,
"InjuryStatus": "Scrambled",
"InjuryBodyPart": "Scrambled",
"InjuryStartDate": "2025-02-06T00:00:00",
"InjuryNotes": "Scrambled",
"FanDuelPlayerID": 15595,
"DraftKingsPlayerID": 606912,
"YahooPlayerID": 5009,
"FanDuelName": "Bradley Beal",
"DraftKingsName": "Bradley Beal",
"YahooName": "Bradley Beal",
"DepthChartPosition": "SG",
"DepthChartOrder": 6,
"GlobalTeamID": 20000029,
"FantasyDraftName": "Bradley Beal",
"FantasyDraftPlayerID": 606912,
"UsaTodayPlayerID": 8315651,
"UsaTodayHeadshotUrl": "http://cdn.usatsimg.com/api/download/?imageID=24445236",
"UsaTodayHeadshotNoBackgroundUrl": "http://cdn.usatsimg.com/api/download/?imageID=24445234",
"UsaTodayHeadshotUpdated": "2024-10-09T14:17:10",
"UsaTodayHeadshotNoBackgroundUpdated": "2024-10-09T14:17:04",
"NbaDotComPlayerID": 203078
},
...
]
That's like 50 lines of data for one player! ..Googles the number of current NBA players, multiplies by 50.. Yeah, this is a good introductory example of a large data set.
S3
AWS S3 - Simple Storage Service - will be our storage layer. The raw data retrieved from the API call will be stored in our S3 bucket, and the processed data (after being processed by Glue) is also stored in an S3 bucket.
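Roughly, the storage part of the script looks something like this; the bucket name, region, and object key below are placeholders, not necessarily what the repo uses:

import json
import boto3

s3 = boto3.client("s3")

bucket_name = "my-nba-data-lake-bucket"  # placeholder: must be globally unique
region = "us-east-1"                     # placeholder: your region

# Create the bucket (us-east-1 takes no LocationConstraint).
if region == "us-east-1":
    s3.create_bucket(Bucket=bucket_name)
else:
    s3.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={"LocationConstraint": region},
    )

# Upload the raw API response under a raw-data prefix as line-delimited JSON,
# a format the Glue crawler and Athena can read without extra configuration.
players = fetch_nba_data()  # the request function shown earlier
raw_lines = "\n".join(json.dumps(player) for player in players)
s3.put_object(
    Bucket=bucket_name,
    Key="raw-data/nba_player_data.jsonl",
    Body=raw_lines.encode("utf-8"),
)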
Glue
AWS Glue is a serverless data integration service that includes tools for ETL (Extract, Transform, Load) jobs and a Data Catalog for metadata management. We will use Glue Crawler to process the raw data we are storing in our S3 bucket and Glue Data Catalog to store and manage metadata about the structured data. This will make the data accessible for querying using AWS Athena.
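The Glue side of the script is roughly the following; the database name, crawler name, and role ARN are placeholders I made up for the sketch:

import boto3

glue = boto3.client("glue")

database_name = "nba_data_lake"           # placeholder
crawler_name = "nba_player_data_crawler"  # placeholder
crawler_role_arn = "arn:aws:iam::123456789012:role/nba-data-lake-role"  # placeholder
bucket_name = "my-nba-data-lake-bucket"   # placeholder

# Create a Glue database to hold the table metadata.
glue.create_database(DatabaseInput={"Name": database_name})

# Point a crawler at the raw-data prefix in S3; when it runs, it infers the
# schema and registers a table in the Glue Data Catalog.
glue.create_crawler(
    Name=crawler_name,
    Role=crawler_role_arn,
    DatabaseName=database_name,
    Targets={"S3Targets": [{"Path": f"s3://{bucket_name}/raw-data/"}]},
)
glue.start_crawler(Name=crawler_name)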
Athena
AWS Athena is a serverless querying service used to analyze data. We will use it to run SQL queries on the data in our S3 bucket after it's been processed by AWS Glue.
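From Python, kicking off a query looks roughly like this; the database name and query-results location are placeholders, and I am assuming the crawler registered a table called nba_players (in the console you just paste the SQL and hit run):

import time
import boto3

athena = boto3.client("athena")

database_name = "nba_data_lake"                                   # placeholder
output_location = "s3://my-nba-data-lake-bucket/athena-results/"  # placeholder: Athena writes results here

query = "SELECT FirstName, LastName, Position, Team FROM nba_players WHERE Position = 'PG';"

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": database_name},
    ResultConfiguration={"OutputLocation": output_location},
)
execution_id = execution["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=execution_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=execution_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])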
A helpful analogy:
- S3 = The Library: where all the raw and processed data is stored.
- Glue Data Catalog = The Library's Catalog: a system that organizes and records information about the books (data) stored in S3.
- Athena = A Reader: someone who looks at the catalog (Glue Data Catalog) to find and understand the structure of the books (data) before reading them directly from the library (S3).
Python Script
The creation of these resources could be done manually through the AWS Console; however, this project automates the process with a Python script that uses the boto3 library. The script is copy-and-pasted into the AWS CloudShell terminal and executed to provision the necessary AWS resources: fetching data from the API, creating the S3 bucket, configuring the Glue Crawler, updating the Glue Data Catalog, and setting up Athena for querying the stored data.
Project Structure
nba-data-lake/
├── .env # holds environment variables
├── policies/
│ └── IAM_role.json # json for policy permissions
├── src/
│ ├── setup_nba_data_lake.py # main script
│ └── delete_nba_data_lake.py # script to delete resources
├── README.md # documentation
└── requirements.txt # dependencies
Setup Instructions
Prerequisites
- AWS Account with the following permissions:
  - S3: CreateBucket, PutObject, DeleteBucket, ListBucket
  - Glue: CreateDatabase, CreateTable, DeleteDatabase, DeleteTable
  - Athena: StartQueryExecution, GetQueryResults
- NBA API key (from SportsDataIO)
1. Clone the Repo
git clone <url>
2. Go to the AWS Console, and click the CloudShell icon to open the terminal.
3. Type nano setup_nba_data_lake.py into the shell console and press enter.
4. Copy the code from the repo file src/setup_nba_data_lake.py into the shell console.
- Update the api_key variable to your actual key
- Update the nba_endpoint variable.
- Update the bucket_name variable.
- Update the region variable (if necessary).
- Hit ctrl + x to exit, then y to save.
- Hit enter to confirm the name.
5. Type python setup_nba_data_lake.py and press enter to run the code.
6. Manually check for the resources.
- Go to S3 and you should see the new bucket with 3 objects inside of it.
- Click on raw-data and open the file inside of it.
7. Query the data with Athena.
- Go to Athena and paste the sample query
SELECT FirstName, LastName, Position, Team
FROM nba_players
WHERE Position = 'PG';
- Click run and you should see output under "Query Results"
8. Delete Resources
- Navigate to the CloudShell console.
- Type nano delete_nba_data_lake.py and press enter.
- Copy-and-paste the content from the repo file src/delete_nba_data_lake.py into the console.
- Update the bucket_name variable to match the bucket you created.
- Hit ctrl + x to exit, then y to save.
- Hit enter to confirm the name.
- Type python delete_nba_data_lake.py into the console and hit enter.
- Manually confirm the resources have been deleted.
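For reference, the cleanup the delete script performs boils down to something like this; the bucket, database, and crawler names below are placeholders and should match whatever you created:

import boto3

s3 = boto3.resource("s3")
glue = boto3.client("glue")

bucket_name = "my-nba-data-lake-bucket"   # placeholder: match the bucket you created
database_name = "nba_data_lake"           # placeholder
crawler_name = "nba_player_data_crawler"  # placeholder

# A bucket has to be empty before it can be deleted.
bucket = s3.Bucket(bucket_name)
bucket.objects.all().delete()
bucket.delete()

# Remove the Glue crawler and database (dropping the database removes its tables).
glue.delete_crawler(Name=crawler_name)
glue.delete_database(Name=database_name)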
Today I learned...
- What data lakes are and how they differ from databases
- How to create a Python script to automate the provisioning of AWS resources
- What the services AWS Glue and AWS Athena are
- How to use Glue Crawler and Glue Data Catalog to process data
- How to use Athena to query data
Future Enhancements
I am writing this after reaching the initial goal of this challenge: to automate the creation of a data lake using AWS services.
If I have the capacity I will enhance this app by integrating a data visualization dashboard that connects to AWS Athena. I could also add logic to include NFL player data.
And if you've made it this far, thanks for reading! Feel free to check out my GitHub. I'd love to connect on LinkedIn.