Notes: Read each line of problem statement
S3-Glue-Redshift
Setting Up Redshift cluster and connection in AWS Glue
create cluster → Node type ( dc2.large) → No of nodes (1)
Database Configurations- Manually add admin password → awsuser , Awsuser1
Cluster permissions → new IAM Role with Any S3 Policy, keep all default and
create cluster → see “available” → click on Actions → turn on public access.
In cluster security group, add inbound rule to allow custom TCP 5439 port ipv4 address
Creating a VPC Endpoint for s3, glue and redshift integration.
VPC → endpoint ( left side) → Name → s3-glue-redshift-endpoint
Service category → AWS services
Service Name - com.amazonaws.us-east-1.s3 - Gateway
VPC → select default, Route table → select default table, leave all as default , create endpoint.
Navigate to the AWS Data Catalog
databases → add database → name: employees_db
Connections → new connection → redshift → select redshift connection with user name and password
Connection Name: redshift_connection → save → click connections → actions → test connection
Creating Crawlers in AWS Glue
S3 : crawlers → create crawler → name → Data Store - S3 → include path (same exact path as S3) → existing IAM role → select db : employees_db → Run crawler
target Redshift Table:
Note: Create Redshift output tables before creating the crawlers for redshift tables, kindle check in the tasks for the output schema
crawlers → create crawler → name → Data Store - Jdbc , Amazon redshift → Jdbc Connection: redshift_connection → include path (dev/public/employees_cleaned_data) → existing IAM role → select db : employees_db → Run crawler
create job with s3_glue_redshift_job
Source : Input → retrieve data from the emp data in aws glue data catalog
Transform: filter → Schema verification
Output: Load data to Glue Catalog in employees_cleaned_data
use this this data for data transformation
use the cleaned data node for the next task.
Top comments (0)