DEV Community

Harsh Vardhan Singh

Create a cross-account Glue job using AWS CDK

AWS Glue is a managed service for data integration and ETL (Extract, Transform, Load) workloads that makes it easier to prepare and transform data for analytics. If you want to automate the creation of Glue jobs with Infrastructure as Code (IaC), AWS CDK (Cloud Development Kit) is a good choice. In this post, we walk through defining and deploying an AWS Glue job with AWS CDK. The job connects to an RDS cluster in another AWS account and runs an ETL script.

Note: This article does not cover AWS CLI and CDK package setup.

Step 1: Define the VPC stack for Glue Job

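import { Construct } from 'constructs';
import {
  GatewayVpcEndpointAwsService,
  InterfaceVpcEndpointAwsService,
  IpAddresses,
  Peer,
  Port,
  SecurityGroup,
  SubnetType,
  Vpc,
} from 'aws-cdk-lib/aws-ec2';
// DeploymentStack and DeploymentStackProps are not part of aws-cdk-lib;
// import them from the package that defines them in your project.

export class GlueVpcStack extends DeploymentStack {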
  public readonly vpc: Vpc;
  public readonly vpcDefaultSecurityGroupId: string;
  constructor(scope: Construct, id: string, props: DeploymentStackProps) {
    super(scope, id, props);

    const vpc = new Vpc(this, 'VPCForGlue', {
      ipAddresses: IpAddresses.cidr(Vpc.DEFAULT_CIDR_RANGE),
      subnetConfiguration: [
        {
          cidrMask: 24,
          name: 'Public',
          subnetType: SubnetType.PUBLIC,
        },
        {
          cidrMask: 24,
          name: 'Private',
          subnetType: SubnetType.PRIVATE_WITH_EGRESS,
        },
      ],
      natGateways: 1,
    });

    vpc.addGatewayEndpoint('S3GatewayEndpoint', {
      service: GatewayVpcEndpointAwsService.S3,
    });

    vpc.addInterfaceEndpoint('SecretsManagerEndpoint', {
      service: InterfaceVpcEndpointAwsService.SECRETS_MANAGER,
    });

    const vpcDefaultSecurityGroup = SecurityGroup.fromSecurityGroupId(
      this,
      'SecurityGroup',
      vpc.vpcDefaultSecurityGroup,
      {
        allowAllOutbound: false,
        mutable: true,
      },
    );
    this.vpc = vpc;
    this.vpcDefaultSecurityGroupId = vpc.vpcDefaultSecurityGroup;

    vpcDefaultSecurityGroup.addEgressRule(Peer.anyIpv4(), Port.allTraffic());
    vpcDefaultSecurityGroup.addIngressRule(Peer.securityGroupId(this.vpcDefaultSecurityGroupId), Port.allTraffic());
  }
}

Note: We must update the default security group of the VPC to include a self-referencing inbound rule and an outbound rule that allow all traffic on all ports. Later, we attach this security group to an AWS Glue connection so that the network interfaces set up by AWS Glue can communicate with each other within a private subnet.

Step 2: Define stack for Glue Job and Glue connection

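import { Construct } from 'constructs';
import { CfnConnection, CfnJob } from 'aws-cdk-lib/aws-glue';
import { ManagedPolicy, Role, ServicePrincipal } from 'aws-cdk-lib/aws-iam';
import { Bucket } from 'aws-cdk-lib/aws-s3';
import { BucketDeployment, Source } from 'aws-cdk-lib/aws-s3-deployment';

export class InfraStack extends DeploymentStack {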
  constructor(scope: Construct, id: string, props: DeploymentStackProps, stageName: string) {
    super(scope, id, props);

    const vpcStack = new GlueVpcStack(this, id + '-VPC', props);

    // Create an IAM role that lets AWS Glue access the required services
    const glueRole = new Role(this, id + '-GlueJobsRole', {
      roleName: 'GlueJobsRole-' + stageName,
      assumedBy: new ServicePrincipal('glue.amazonaws.com'),
    });
    glueRole.addManagedPolicy(ManagedPolicy.fromAwsManagedPolicyName('AmazonS3FullAccess'));
    glueRole.addManagedPolicy(ManagedPolicy.fromAwsManagedPolicyName('service-role/AWSGlueServiceRole'));
    glueRole.addManagedPolicy(ManagedPolicy.fromAwsManagedPolicyName('SecretsManagerReadWrite'));

    // Create an AWS Glue connection
    const glueConnection = new CfnConnection(this, 'GlueConnection', {
      catalogId: this.account,
      connectionInput: {
        connectionType: 'JDBC',
        connectionProperties: {
          JDBC_CONNECTION_URL: "jdbcUrl",
          SECRET_ID: "secretId", //secret that has DB credentials, we need to create this manually if 
          JDBC_ENFORCE_SSL: 'true', // Glue connection properties are string key/value pairs
        },
        physicalConnectionRequirements: {
          securityGroupIdList: [vpcStack.vpcDefaultSecurityGroupId],
          subnetId: vpcStack.vpc.privateSubnets[0].subnetId,
          availabilityZone: vpcStack.vpc.privateSubnets[0].availabilityZone,
        },
        name: 'GlueConnection',
        description: 'GlueConnection',
      },
    });

    // Create a bucket to hold the scripts run by the job
    const testBucket = this.createBucket(id.toLowerCase() + '-testgluejobscripts', id + '-testgluejobscripts');
    new BucketDeployment(this, 'DeployTestScripts', {
      sources: [Source.asset('test_glue_job_scripts')], // This folder must exist under the root of the CDK package
      destinationBucket: testBucket,
    });

    const job = new CfnJob(this, 'TestGlueJob', {
      name: 'TestGlueJob',
      role: glueRole.roleArn,
      command: {
        name: 'pythonshell',
        pythonVersion: '3.9',
        scriptLocation: testBucket.s3UrlForObject('script_name.py'), // script_name.py must exist in test_glue_job_scripts
      },
      glueVersion: '4.0',
      executionProperty: {
        maxConcurrentRuns: 1,
      },
      connections: {
        connections: ['GlueConnection'], // Must match the name set in the CfnConnection constructor
      },
    });
  }

  private createBucket(name: string, id: string) {
    return new Bucket(this, id, {
      enforceSSL: true,
      bucketName: name,
    });
  }
}
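
The JDBC_CONNECTION_URL value in the connection above is a placeholder. As a rough sketch of what a real value might look like, the helper below assembles a URL in the jdbc:<engine>://<host>:<port>/<database> form that AWS Glue accepts for MySQL and PostgreSQL. The function name, the endpoint, and the database name are hypothetical and not part of the original stack; substitute the values of your own RDS cluster.

```typescript
// Engines covered by this sketch; other engines use different URL formats.
type JdbcEngine = 'mysql' | 'postgresql';

// Hypothetical helper: builds a JDBC URL for a Glue JDBC connection.
function buildJdbcUrl(engine: JdbcEngine, host: string, port: number, database: string): string {
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid port: ${port}`);
  }
  return `jdbc:${engine}://${host}:${port}/${database}`;
}

// Example values are placeholders, not a real cluster endpoint.
const jdbcUrl = buildJdbcUrl(
  'postgresql',
  'mydb.cluster-abc123.us-east-1.rds.amazonaws.com',
  5432,
  'sales',
);
console.log(jdbcUrl);
```

The resulting string could then be used as the JDBC_CONNECTION_URL property instead of the "jdbcUrl" placeholder.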

Step 3: Allow Amazon RDS to accept network traffic from AWS Glue

For this, we update the security group attached to the Amazon RDS cluster in the other account and add an inbound rule on the database port that allows the Elastic IP address attached to the NAT gateway of the AWS Glue VPC.
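
To make the rule concrete, here is a minimal sketch that builds the inbound rule as a plain object. The field names follow the properties of the AWS::EC2::SecurityGroupIngress CloudFormation resource (CfnSecurityGroupIngress in CDK). The helper name, the security group ID, the Elastic IP, and the port are hypothetical placeholders, not values from the original setup.

```typescript
// Mirrors a subset of the AWS::EC2::SecurityGroupIngress properties.
interface IngressRuleProps {
  groupId: string;
  ipProtocol: string;
  fromPort: number;
  toPort: number;
  cidrIp: string;
  description: string;
}

// Hypothetical helper: allows a single NAT gateway Elastic IP on the database port.
function natGatewayIngressRule(
  rdsSecurityGroupId: string,
  natElasticIp: string,
  dbPort: number,
): IngressRuleProps {
  const octets = natElasticIp.split('.');
  const isIpv4 =
    octets.length === 4 && octets.every((o) => /^\d{1,3}$/.test(o) && Number(o) <= 255);
  if (!isIpv4) {
    throw new Error(`Not an IPv4 address: ${natElasticIp}`);
  }
  return {
    groupId: rdsSecurityGroupId,
    ipProtocol: 'tcp',
    fromPort: dbPort,
    toPort: dbPort,
    cidrIp: `${natElasticIp}/32`, // /32 restricts the rule to exactly one address
    description: 'Allow AWS Glue jobs through the NAT gateway',
  };
}

// Placeholder values: replace with the RDS security group ID, NAT Elastic IP, and DB port.
const rule = natGatewayIngressRule('sg-0123456789abcdef0', '203.0.113.10', 5432);
console.log(rule);
```

In a CDK stack deployed to the RDS account, an object like this could be passed as the props of CfnSecurityGroupIngress, or the same rule could be expressed with Peer.ipv4(rule.cidrIp) and Port.tcp(rule.fromPort) on an imported security group.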

Step 4: Deploy the CDK Stack

cdk deploy

Step 5: Verify the Glue Job

Once the deployment is complete, navigate to the AWS Glue console to verify that the job has been created. The job should appear with the specified configuration and script. You can run the job from the console and check that it produces the expected outcome.
