Wendy Wong for AWS Community Builders


How to ETL with AWS Glue and Amazon Glue Studio to transform your data - Part 2

Introduction

In the previous blog you mastered ETL using sample sales data for both Amazon Glue Studio and AWS Glue.

The third tutorial explored how you may query data from your Amazon S3 bucket using Amazon Athena and create data visualizations from an Amazon Athena table in Amazon QuickSight that references the sales database stored in the AWS Glue Data Catalog.

In this new tutorial, you will learn how to:

a) Locate open-source datasets for your passion project and build a test or production workload that you can use with AWS services.

b) Complete data transformation on your own data using both AWS Glue and AWS Glue Studio.

Where can you find interesting open source datasets for analytics and machine learning projects?

  • Kaggle
  • Government websites e.g. data.gov, data.gov.uk, data.gov.au.
  • Google Dataset Search
  • Australian Bureau of Statistics (ABS)
  • UCI Machine Learning Repository

Solution Architecture overview - AWS Glue


Tutorial 1: Data Transformation with AWS Glue

Step 1: In this tutorial, download the DOHMH New York City Restaurant Inspections data as a CSV file from this link.


A preview of the raw dataset:


Step 2: Log into your AWS account. If you do not have an AWS account, you may create one here.


Step 3: You may follow the instructions here to create your first S3 bucket and name it new-york-restaurant-inspections.


Step 4: Click the bucket and select Create Folder.

Create two folders within the Amazon S3 bucket and name them raw and processed respectively as shown below:


Step 5: Click Upload, then Add files, and upload the CSV file to the raw folder.

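If you prefer to script Steps 4 and 5, here is a minimal boto3 sketch that creates the two folder prefixes and uploads the CSV into raw. The bucket name comes from Step 3; the local file name is an assumption based on the dataset download.

```python
import boto3

s3 = boto3.client("s3")
bucket = "new-york-restaurant-inspections"  # bucket created in Step 3

# Create the raw and processed folder prefixes from Step 4
for prefix in ("raw/", "processed/"):
    s3.put_object(Bucket=bucket, Key=prefix)

# Upload the inspections CSV into the raw folder (local file name is an assumption)
s3.upload_file(
    "DOHMH_New_York_City_Restaurant_Inspection_Results.csv",
    bucket,
    "raw/DOHMH_New_York_City_Restaurant_Inspection_Results.csv",
)
```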

Step 6: Set up basic IAM permissions to use the AWS Glue console. You may follow this blog for more details.

If you need to access more Amazon resources such as Amazon SageMaker notebooks, you may follow the instructions here.
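
If you would rather create the Glue service role programmatically, a hedged boto3 sketch might look like the following. The role name matches the one reused later in Step 14; you will still need to grant the role access to your S3 bucket (for example with an inline policy), which is omitted here for brevity.

```python
import json
import boto3

iam = boto3.client("iam")
role_name = "AWSGlueServiceRole-learn-glue-role"  # role name reused in Step 14

# Trust policy so the AWS Glue service can assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(RoleName=role_name, AssumeRolePolicyDocument=json.dumps(trust_policy))

# Attach the AWS managed policy for the Glue service role
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)
```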

Step 7: Navigate to the AWS Glue console and, on the left-hand menu, click Crawlers followed by Add crawler.


Step 8: Provide a name for the crawler. For this example, we will name the crawler new-york-restaurant-inspection-crawler and click Next.

Step 9: For the crawler source type, select Data stores and then Crawl all folders.


Step 10: In this section, add the S3 bucket path, select the CSV file from the raw folder, and click Next.


Step 11: Click Yes to add another data store and click Next.


Step 12: Repeat Step 10 but select the processed folder for the S3 bucket path. Click Next.


Step 13: There are no more data stores to be added, so click No, then click Next.


Step 14: You may select an existing IAM role for AWS Glue that was created earlier for example 'AWSGlueServiceRole-learn-glue-role'. Click Next.


Step 15: Create a schedule for the crawler: select Run on demand from the drop-down menu.


Step 16: Configure the crawler output. Click Add database and provide a name for the database 'restaurant-inspections'. Add a prefix for the table name called Raw. Click Next.

Step 17: Check the input details and click Finish.


The crawler details are shown below:

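The same crawler configuration from Steps 8 to 16 can also be expressed with boto3. This is a sketch, assuming the bucket, database, role and crawler names used above; omitting the Schedule argument gives the run-on-demand behaviour from Step 15.

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="new-york-restaurant-inspection-crawler",
    Role="AWSGlueServiceRole-learn-glue-role",
    DatabaseName="restaurant-inspections",
    TablePrefix="raw",
    Targets={
        "S3Targets": [
            {"Path": "s3://new-york-restaurant-inspections/raw/"},
            {"Path": "s3://new-york-restaurant-inspections/processed/"},
        ]
    },
    # No Schedule argument: the crawler runs on demand (Step 15)
)
```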

Step 18: Click the crawler 'new-york-restaurant-inspection-crawler' and select Run crawler to execute the job.

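Equivalently, you can start the crawler and wait for it to return to the ready state from a script, for example:

```python
import time
import boto3

glue = boto3.client("glue")
crawler_name = "new-york-restaurant-inspection-crawler"

glue.start_crawler(Name=crawler_name)

# Poll until the crawler finishes and returns to the READY state
while glue.get_crawler(Name=crawler_name)["Crawler"]["State"] != "READY":
    time.sleep(30)
```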

Step 19: After the crawler job is completed with a 'ready' status, you can inspect the metadata by clicking Tables on the left-hand side.


You may inspect the following schema from the Data Catalog:

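You can also list the same metadata programmatically. A small boto3 sketch that prints each table and its column types from the 'restaurant-inspections' database:

```python
import boto3

glue = boto3.client("glue")

# Print every table the crawler registered, with its column names and types
for table in glue.get_tables(DatabaseName="restaurant-inspections")["TableList"]:
    print(table["Name"])
    for column in table["StorageDescriptor"]["Columns"]:
        print(f"  {column['Name']}: {column['Type']}")
```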

Solution Architecture overview - Amazon Glue Studio


Tutorial 2: Data Transformation with Amazon Glue Studio

Step 1: Access 'AWS Glue Studio' from the AWS Glue console.


Step 2: From the AWS Glue Studio console, click View jobs.


Step 3: Select the option Visual with a source and target and click Create.


Step 4: Under Node Properties, select S3 bucket followed by Amazon S3.


Step 5: Under the S3 bucket node, select Data catalog as the source, then choose the database name followed by the table name as shown in the image below.

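Behind the scenes, this source node becomes a Data Catalog read in the generated script. A minimal sketch, assuming a GlueContext named glueContext and the database created by the crawler (the table name below is hypothetical, so check the Data Catalog for the exact crawler-generated name):

```python
# S3 bucket source node: read the raw table registered by the crawler
source_dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database="restaurant-inspections",
    table_name="raw_inspections",  # hypothetical name; use the table shown in your Data Catalog
    transformation_ctx="S3bucket_node1",
)
```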

Choose the Apply Mapping node.

Step 6: Under the Apply Mapping node, click on the Transform tab.

Are there any variables that have been incorrectly categorized in the data catalog?

Are there any variables whose data type you need to change?

Are there any columns that you need to drop from your dataset that won't be used in your data analysis or for machine learning?


You may complete some transformations such as changing date variables from the data type 'string' to 'date'.

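In the generated script this corresponds to an ApplyMapping transform. A hedged sketch, reusing the source_dynamic_frame from the source node above; the column names are assumptions based on the DOHMH dataset, and any columns left out of the mappings list are dropped:

```python
from awsglue.transforms import ApplyMapping

# Cast the date columns from string to date and keep the listed columns
mapped = ApplyMapping.apply(
    frame=source_dynamic_frame,
    mappings=[
        ("camis", "string", "camis", "string"),
        ("dba", "string", "dba", "string"),
        ("inspection date", "string", "inspection date", "date"),
        ("grade date", "string", "grade date", "date"),
    ],
    transformation_ctx="ApplyMapping_node2",
)
```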

You may view a listing of common data types in AWS Glue in the documentation.

Step 7: Click on the Output Schema tab and check the data transformations.


Step 8: Click on the S3 bucket data target node, select the Parquet format with Snappy as the compression type, and lastly select the S3 bucket path for the processed data folder to output the transformed data.


Select the option to create a table in the Data Catalog and update it on subsequent runs, and name this table processed.

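In the generated script, this target node typically compiles to a Data Catalog-aware S3 sink. A sketch, assuming the same glueContext and the mapped frame from the Apply Mapping step:

```python
# S3 bucket target node: write Snappy-compressed Parquet to the processed folder
# and create/update the 'processed' table in the Data Catalog
sink = glueContext.getSink(
    path="s3://new-york-restaurant-inspections/processed/",
    connection_type="s3",
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=[],
    compression="snappy",
    enableUpdateCatalog=True,
    transformation_ctx="S3bucket_node3",
)
sink.setCatalogInfo(catalogDatabase="restaurant-inspections", catalogTableName="processed")
sink.setFormat("glueparquet")
sink.writeFrame(mapped)
```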

Step 9: Click on the Script tab to view the generated Python code for this ETL job.

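The generated script wraps the nodes above in standard AWS Glue job boilerplate. A trimmed-down sketch of that structure:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job setup generated by AWS Glue Studio
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# ... source, ApplyMapping and target nodes from Steps 5 to 8 appear here ...

job.commit()
```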

Step 10: Click on the 'Job Details' tab, provide a name for the job, e.g. restaurant inspection job, and a description for the data transformation. Change the number of workers to 2.


Step 11: Click Save and Run to commence the ETL job.


Step 12: The ETL job will take a few minutes to run and will change to a status of 'Succeeded'.


Step 13: Navigate back to the AWS Glue console, select Crawlers on the left-hand menu, and re-run the 'new-york-restaurant-inspection-crawler' by clicking Run crawler.


After the status changes to 'ready', click on Tables to inspect the processed table in the Data Catalog for the transformed data types.


Resources

Setting up IAM permissions for AWS Glue

Get started with AWS Glue

Data types for AWS Glue

Working with geospatial data in QuickSight

Best practice working with geospatial data in Amazon Athena and Glue

Common data types

AWS Innovate Online Conference APJ - 23 & 25 August 2022


Register at this link

🚀 Hot off the press: Introducing AWS Glue Flex jobs: Cost savings on ETL workloads

AWS Glue Flex jobs are perfect for use cases including:

  • Scheduling ETL jobs to run overnight
  • Scheduling test workloads
  • One-time bulk upload
  • Workloads that are not time-sensitive

You can read about the new feature and flex pricing from the AWS Big Data Blog here.
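
If you create the job through the API rather than AWS Glue Studio, Flex is selected with the ExecutionClass parameter. A hedged boto3 sketch, reusing the role from earlier and an assumed script location; NumberOfWorkers matches the value set in Step 10 of Tutorial 2:

```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="restaurant inspection job",
    Role="AWSGlueServiceRole-learn-glue-role",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://new-york-restaurant-inspections/scripts/restaurant-inspection-job.py",  # assumed path
        "PythonVersion": "3",
    },
    GlueVersion="3.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,        # worker count from Step 10
    ExecutionClass="FLEX",    # run on spare capacity at the Flex price
)
```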


Until the next lesson, happy learning! 😁

Top comments (4)

Pablo Ezequiel Inchausti

Congrats! Very detailed!

Wendy Wong

Thanks Pablo!

Adit Modi

Great blog, wendy 😊
Flex jobs is a great feature to the existing AWS Glue service. excited to try it out 🔥
Also AWS Glue Streaming ETL Auto Scaling is GA and it looks good 👌

Wendy Wong

Thanks Adit, I am excited to try AWS Glue Streaming ETL for a business use case :)