Introduction
This is the Week 1, Day 3 project of the 30-Day DevOps challenge I am participating in. Learn more about the challenge and its creators.
Project Overview
In today's project, we are implementing a data lake to ingest, store, and manage the high-volume, complex data we retrieve from the NBA API.
What's a Data Lake?
A data lake is kind of like a database in the sense that both store large amounts of data. The difference is that, whereas a database is structured (tables, schemas, relationships), a data lake can handle large volumes of all kinds of data regardless of format (structured, semi-structured, unstructured), and your data doesn't need to be organized first.
We can think of a database like an organized filing cabinet: when we add to it, we add the file to the correct location based on the schema of the cabinet.
A data lake, on the other hand, is more like a storage warehouse where we can add any kind of data. We don't have to organize it or put it in the correct location; we just dump it in there.
But what good is a whole bunch of unprocessed, unorganized data to us? So in our project we are gathering all this data, storing it in a data lake built on an S3 bucket, and then using other AWS services (e.g. Glue and Athena) to make the data queryable.
Key Features
- Fetches NBA player data using SportsDataIO's NBA API.
- Stores raw and processed data in AWS S3 buckets.
- Uses AWS Glue Crawlers to infer schema from raw data in S3.
- Stores metadata in AWS Glue Data Catalog for structured querying.
- Allows SQL-based querying of structured data stored in S3.
- Implements IAM roles and policies to control access to S3, Glue, and Athena.
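To give a feel for that last bullet, here is a minimal sketch of what a least-privilege inline policy for this project could look like, attached with boto3. The role name, policy name, bucket name, and exact action list below are placeholders of mine based on the permissions listed later in the setup instructions, not the repo's actual policy (that lives in policies/IAM_role.json).

import json
import boto3

iam = boto3.client("iam")

ROLE_NAME = "nba-data-lake-role"    # placeholder: an existing role in your account
BUCKET = "my-nba-data-lake-bucket"  # placeholder: your bucket name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # S3: only the bucket this project uses
            "Effect": "Allow",
            "Action": ["s3:CreateBucket", "s3:PutObject", "s3:DeleteBucket", "s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
        },
        {   # Glue: create and clean up the database and tables
            "Effect": "Allow",
            "Action": ["glue:CreateDatabase", "glue:CreateTable", "glue:DeleteDatabase", "glue:DeleteTable"],
            "Resource": "*",
        },
        {   # Athena: run queries and read results
            "Effect": "Allow",
            "Action": ["athena:StartQueryExecution", "athena:GetQueryResults"],
            "Resource": "*",
        },
    ],
}

# Attach the policy inline to the role the script (or crawler) runs as.
iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="nba-data-lake-access",
    PolicyDocument=json.dumps(policy),
)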
Technical Architecture
Technologies
- Cloud Provider: AWS
- Core Services: S3, Glue, Athena
- External API: NBA Game API (SportsData.io)
- Programming Language: Python 3.x
- IAM Security: Least privilege policies
NBA API
We are making a request to the NBA API, this time for data about NBA players. This is how I coded the request:
nba_endpoint = f"https://api.sportsdata.io/v3/nba/scores/json/Players?key={api_key}"
try:
response = requests.get(nba_endpoint)
response.raise_for_status()
print("Fetched NBA data successfully.")
return response.json()
except Exception as e:
print(f"Error fetching NBA data: {e}")
return []
The response looks something like this:
[
{
"PlayerID": 20000441,
"SportsDataID": "",
"Status": "Active",
"TeamID": 29,
"Team": "PHO",
"Jersey": 3,
"PositionCategory": "G",
"Position": "SG",
"FirstName": "Bradley",
"LastName": "Beal",
"Height": 76,
"Weight": 207,
"BirthDate": "1993-06-28T00:00:00",
"BirthCity": "St. Louis",
"BirthState": "MO",
"BirthCountry": "USA",
"HighSchool": null,
"College": "Florida",
"Salary": 50203930,
"PhotoUrl": "https://s3-us-west-2.amazonaws.com/static.fantasydata.com/headshots/nba/low-res/0.png",
"Experience": 12,
"SportRadarPlayerID": "ff461754-ad20-4eeb-af02-2b46cc980b24",
"RotoworldPlayerID": 1966,
"RotoWirePlayerID": 3303,
"FantasyAlarmPlayerID": 200464,
"StatsPlayerID": 606912,
"SportsDirectPlayerID": 750970,
"XmlTeamPlayerID": 3395,
"InjuryStatus": "Scrambled",
"InjuryBodyPart": "Scrambled",
"InjuryStartDate": "2025-02-06T00:00:00",
"InjuryNotes": "Scrambled",
"FanDuelPlayerID": 15595,
"DraftKingsPlayerID": 606912,
"YahooPlayerID": 5009,
"FanDuelName": "Bradley Beal",
"DraftKingsName": "Bradley Beal",
"YahooName": "Bradley Beal",
"DepthChartPosition": "SG",
"DepthChartOrder": 6,
"GlobalTeamID": 20000029,
"FantasyDraftName": "Bradley Beal",
"FantasyDraftPlayerID": 606912,
"UsaTodayPlayerID": 8315651,
"UsaTodayHeadshotUrl": "http://cdn.usatsimg.com/api/download/?imageID=24445236",
"UsaTodayHeadshotNoBackgroundUrl": "http://cdn.usatsimg.com/api/download/?imageID=24445234",
"UsaTodayHeadshotUpdated": "2024-10-09T14:17:10",
"UsaTodayHeadshotNoBackgroundUpdated": "2024-10-09T14:17:04",
"NbaDotComPlayerID": 203078
},
...
]
That's like 50 lines of data for one player! ..Googles the number of current NBA players, multiplies by 50.. Yeah, this is a good introductory example of a large data set.
S3
AWS S3 - Simple Storage Service - will be our storage layer. The raw data retrieved from the API call will be stored in our S3 bucket, and the processed data (after being processed by Glue) is also stored in an S3 bucket.
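Roughly, the storage part of the script looks something like this; the bucket name, region, and object key below are placeholders, not necessarily what the repo uses:

import json
import boto3

s3 = boto3.client("s3")

bucket_name = "my-nba-data-lake-bucket"  # placeholder: must be globally unique
region = "us-east-1"                     # placeholder: your region

# Create the bucket (us-east-1 takes no LocationConstraint).
if region == "us-east-1":
    s3.create_bucket(Bucket=bucket_name)
else:
    s3.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={"LocationConstraint": region},
    )

# Upload the raw API response under a raw-data prefix as line-delimited JSON,
# a format the Glue crawler and Athena can read without extra configuration.
players = fetch_nba_data()  # the request function shown earlier
raw_lines = "\n".join(json.dumps(player) for player in players)
s3.put_object(
    Bucket=bucket_name,
    Key="raw-data/nba_player_data.jsonl",
    Body=raw_lines.encode("utf-8"),
)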
Glue
AWS Glue is a serverless data integration service that includes tools for ETL (Extract, Transform, Load) jobs and a Data Catalog for metadata management. We will use Glue Crawler to process the raw data we are storing in our S3 bucket and Glue Data Catalog to store and manage metadata about the structured data. This will make the data accessible for querying using AWS Athena.
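The Glue side of the script is roughly the following; the database name, crawler name, and role ARN are placeholders I made up for the sketch:

import boto3

glue = boto3.client("glue")

database_name = "nba_data_lake"           # placeholder
crawler_name = "nba_player_data_crawler"  # placeholder
crawler_role_arn = "arn:aws:iam::123456789012:role/nba-data-lake-role"  # placeholder
bucket_name = "my-nba-data-lake-bucket"   # placeholder

# Create a Glue database to hold the table metadata.
glue.create_database(DatabaseInput={"Name": database_name})

# Point a crawler at the raw-data prefix in S3; when it runs, it infers the
# schema and registers a table in the Glue Data Catalog.
glue.create_crawler(
    Name=crawler_name,
    Role=crawler_role_arn,
    DatabaseName=database_name,
    Targets={"S3Targets": [{"Path": f"s3://{bucket_name}/raw-data/"}]},
)
glue.start_crawler(Name=crawler_name)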
Athena
AWS Athena is a serverless querying service used to analyze data. We will use it to run SQL queries on the data in our S3 bucket after it's been processed by AWS Glue.
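From Python, kicking off a query looks roughly like this; the database name and query-results location are placeholders, and I am assuming the crawler registered a table called nba_players (in the console you just paste the SQL and hit run):

import time
import boto3

athena = boto3.client("athena")

database_name = "nba_data_lake"                                   # placeholder
output_location = "s3://my-nba-data-lake-bucket/athena-results/"  # placeholder: Athena writes results here

query = "SELECT FirstName, LastName, Position, Team FROM nba_players WHERE Position = 'PG';"

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": database_name},
    ResultConfiguration={"OutputLocation": output_location},
)
execution_id = execution["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=execution_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=execution_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])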
A helpful analogy:
- S3 = The Library: where all the raw and processed data is stored.
- Glue Data Catalog = The Library's Catalog: a system that organizes and records information about the books (data) stored in S3.
- Athena = A Reader: someone who looks at the catalog (Glue Data Catalog) to find and understand the structure of the books (data) before reading them directly from the library (S3).
Python Script
The creation of these resources could be done manually through the AWS Console; however, this project automates the process with a Python script that uses the boto3 library. The script is copy-and-pasted into the AWS CloudShell terminal and executed to provision the necessary AWS resources: fetching data from the API, creating the S3 bucket, configuring the Glue Crawler, updating the Glue Data Catalog, and setting up Athena for querying the stored data.
Project Structure
nba-data-lake/
├── .env # holds environment variables
├── policies/
│ └── IAM_role.json # json for policy permissions
├── src/
│ ├── setup_nba_data_lake.py # main script
│ └── delete_nba_data_lake.py # script to delete resources
├── README.md # documentation
└── requirements.txt # dependencies
Setup Instructions
Prerequisites
- AWS Account with the following permissions:
  - S3: CreateBucket, PutObject, DeleteBucket, ListBucket
  - Glue: CreateDatabase, CreateTable, DeleteDatabase, DeleteTable
  - Athena: StartQueryExecution, GetQueryResults
- NBA API key (from SportsDataIO)
1. Clone the Repo
git clone <url>
2. Go to the AWS Console, and click the CloudShell icon to open the terminal.
3. Type nano setup_nba_data_lake.py into the shell console and press enter.
4. Copy the code from the repo file src/setup_nba_data_lake.py into the shell console.
- Update the api_key variable to your actual key
- Update the nba_endpoint variable.
- Update the bucket_name variable.
- Update the region variable (if necessary).
- Hit ctrl + x to exit, then y to save.
- Hit enter to confirm the name.
5. Type python setup_nba_data_lake.py and press enter to run the code.
6. Manually check for the resources.
- Go to S3 and you should see the new bucket with 3 objects inside of it.
- Click on raw-data and open the file inside of it.
7. Query the data with Athena.
- Go to Athena and paste the sample query
SELECT FirstName, LastName, Position, Team
FROM nba_players
WHERE Position = 'PG';
- Click run and you should see output under "Query Results"
8. Delete Resources
- Navigate to the CloudShell console.
- Type nano delete_nba_data_lake.py and press enter.
- Copy-and-paste the content from the repo file src/delete_nba_data_lake.py into the console.
- Update the bucket_name variable to match the bucket you created.
- Hit ctrl + x to exit, then y to save.
- Hit enter to confirm the name.
- Type python delete_nba_data_lake.py into the console and hit enter.
- Manually confirm the resources have been deleted.
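For reference, the cleanup the delete script performs boils down to something like this; the bucket, database, and crawler names below are placeholders and should match whatever you created:

import boto3

s3 = boto3.resource("s3")
glue = boto3.client("glue")

bucket_name = "my-nba-data-lake-bucket"   # placeholder: match the bucket you created
database_name = "nba_data_lake"           # placeholder
crawler_name = "nba_player_data_crawler"  # placeholder

# A bucket has to be empty before it can be deleted.
bucket = s3.Bucket(bucket_name)
bucket.objects.all().delete()
bucket.delete()

# Remove the Glue crawler and database (dropping the database removes its tables).
glue.delete_crawler(Name=crawler_name)
glue.delete_database(Name=database_name)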
Today I learned...
- What data lakes are and how they differ from databases
- How to create a Python script to automate the provisioning of AWS resources
- What the services AWS Glue and AWS Athena are
- How to use Glue Crawler and Glue Data Catalog to process data
- How to use Athena to query data
Future Enhancements
I am writing this after reaching the initial goal of this challenge: to automate the creation of a data lake using AWS services.
If I have the capacity I will enhance this app by integrating a data visualization dashboard that connects to AWS Athena. I could also add logic to include NFL player data.
And if you've made it this far, thanks for reading! Feel free to check out my GitHub. I'd love to connect on LinkedIn.