Zied Ben Tahar for AWS Community Builders

Originally published at levelup.gitconnected.com

Replicate data from DynamoDB to Apache Iceberg tables using Glue Zero-ETL integration

Photo by Alexander Hafemann on Unsplash

Analyzing data directly from Amazon DynamoDB can be tricky since it doesn’t come with built-in analytics features. One approach is to set up ETL pipelines to move the data into a data lake or a lakehouse. From there, services like Amazon Athena or EMR can take over for analysis and processing. Building and maintaining those ETL pipelines takes time and effort.

AWS Glue Zero-ETL Integration provides an easy way to replicate data from DynamoDB to Apache Iceberg tables in Amazon S3. It’s particularly useful when your DynamoDB table schema isn’t complex. In such cases, it helps reduce operational overhead.

Apache Iceberg is an open-source table format designed for high performance and large-scale analytics. It is increasingly recognized as a standard in data lake architectures, providing advanced features such as schema evolution, time travel, ACID transactions, and efficient metadata handling. These capabilities address key challenges in data lakes while offering warehouse-like functionality.

In this article, I’ll walk you through setting up Glue Zero-ETL Integration using Terraform. Along the way, I’ll share my thoughts on using this service.

TL;DR

You can find the complete code repository at this link 👇

https://github.com/ziedbentahar/glue-zero-etl-dynamodb-to-apache-iceberg-table

Solution overview

I’ll use a hypothetical Orders table to demonstrate running analytical queries with Athena across various order-related dimensions:

Using Glue Zero-ETL integration

In this example, I’m using a simplified Orders model, which has the following structure:

DynamoDB table structure
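Since the screenshot doesn’t carry over here, below is a hypothetical item in DynamoDB JSON that illustrates the shape: shippingAddress is a nested map, deliveryPreferences a string set, and items a list of maps. The exact attributes are an assumption based on the fields referenced throughout this article:

{
  "orderId": { "S": "ord-0042" },
  "orderDate": { "S": "2025-01-15T10:24:00Z" },
  "shippingAddress": {
    "M": { "city": { "S": "Paris" }, "zipCode": { "S": "75001" } }
  },
  "deliveryPreferences": { "SS": ["LeaveAtDoor"] },
  "items": {
    "L": [
      { "M": { "productId": { "S": "p-1" }, "quantity": { "N": "2" } } }
    ]
  }
}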

We’ll look at how Zero-ETL integration handles nested fields, sets, and lists of maps, but first let’s set up the configuration.

Integration configuration

Let’s walk through the steps to configure the integration.

1- Configuring the DynamoDB source table

Before getting started, point-in-time recovery (PITR) must be enabled on the source table:

Enabling DynamoDB PITR
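If the source table is itself managed in Terraform, PITR can be enabled directly on the table resource. A minimal sketch, assuming a hypothetical orders table; only the point_in_time_recovery block matters here:

resource "aws_dynamodb_table" "orders" {
  name         = "orders"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "orderId"

  attribute {
    name = "orderId"
    type = "S"
  }

  # Required by the Zero-ETL integration: exports run from a point-in-time snapshot
  point_in_time_recovery {
    enabled = true
  }
}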

We also need to configure the table’s resource policy to allow the integration to export the table’s data at a point in time:

resource "aws_dynamodb_resource_policy" "this" {
  resource_arn = data.aws_dynamodb_table.this.arn

  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Effect = "Allow",
        Principal = {
          Service = "glue.amazonaws.com"
        },
        Action = [
          "dynamodb:ExportTableToPointInTime",
          "dynamodb:DescribeTable",
          "dynamodb:DescribeExport"
        ],
        Resource = "*",
        Condition = {
          StringEquals = {
            "aws:SourceAccount" = data.aws_caller_identity.current.account_id
          },
          StringLike = {
            "aws:SourceArn" = "arn:aws:glue:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:integration:*"
          }
        }
      }
    ]
  })
}

2- Glue catalog database configuration

Next, we create the Glue catalog database, along with an IAM role that grants the Zero-ETL integration access to it:

resource "aws_glue_catalog_database" "this" {
  name         = "${var.application}${var.environment}db"
  location_uri = "s3://${aws_s3_bucket.database_bucket.bucket}/"
}

resource "aws_iam_policy" "integration_policy" {
  name = "${var.application}-${var.environment}-integration-policy"

  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Effect = "Allow",
        Action = [
          "glue:CreateTable",
          "glue:GetTable",
          "glue:UpdateTable",
          "glue:GetTableVersion",
          "glue:GetTableVersions",
          "glue:GetResourcePolicy"
        ],
        Resource = [
          "arn:aws:glue:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:catalog",
          "arn:aws:glue:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:database/${aws_glue_catalog_database.this.name}",
          "arn:aws:glue:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:table/${aws_glue_catalog_database.this.name}/*"
        ]
      },
      {
        Effect = "Allow",
        Action = [
          "cloudwatch:PutMetricData",
        ],
        Resource = "*",
        Condition = {
          StringEquals = {
            "cloudwatch:namespace" = "AWS/Glue/ZeroETL"
          }
        },

      },
      {
        Effect = "Allow",
        Action = [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ],
        Resource = "*"
      },
      {
        Effect = "Allow",
        Action = [
          "glue:GetDatabase",
        ],
        Resource = [
          "arn:aws:glue:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:catalog",
          "arn:aws:glue:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:database/${aws_glue_catalog_database.this.name}",
        ]
      },
      {
        Effect = "Allow",
        Action = [
          "s3:ListBucket"
        ],
        Resource = [
          aws_s3_bucket.database_bucket.arn,
        ]
      },
      {
        Effect = "Allow",
        Action = [
          "s3:GetObject",
          "s3:PutObject",
          "s3:DeleteObject"
        ],
        Resource = [
          "${aws_s3_bucket.database_bucket.arn}/*",
        ]
      }

    ]
  })
}

resource "aws_iam_role" "integration_role" {
  name = "${var.application}-${var.environment}-integration-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Effect = "Allow",
        Principal = {
          Service = "glue.amazonaws.com"
        },
        Action = "sts:AssumeRole"
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "integration" {
  role       = aws_iam_role.integration_role.name
  policy_arn = aws_iam_policy.integration_policy.arn
}

3- Creating the integration

Currently, neither CloudFormation nor the AWS Terraform provider supports Glue Zero-ETL. So, I’m using the AWS SDK to create the integration and configure table properties. To handle this, I rely on aws_lambda_invocation to trigger a Lambda function that creates or deletes the integration whenever the database is created or deleted, pretty much like a CloudFormation custom resource.

import { GlueClient, CreateIntegrationCommand, CreateIntegrationResourcePropertyCommand, DeleteIntegrationCommand, CreateIntegrationTablePropertiesCommand } from "@aws-sdk/client-glue";
import { SSMClient, PutParameterCommand, GetParameterCommand } from "@aws-sdk/client-ssm";

export const handler = async (event) => {

    let glueClient = new GlueClient({ region: process.env.AWS_REGION });
    let paramStore = new SSMClient({ region: process.env.AWS_REGION });

    if(event.sourceArn == null || event.targetArn == null || event.roleArn == null) {
        throw new Error("SourceArn, TargetArn and RoleArn are required");
    }

    // "create": set up the integration, attach the target processing role, and configure table properties
    if (event.tf.action === "create") {
        const integrationResult = await glueClient.send(new CreateIntegrationCommand({
            IntegrationName : event.integrationName,
            SourceArn : event.sourceArn,
            TargetArn : event.targetArn,

        }));
        const integrationResourcePropertyResult =  await glueClient.send(new CreateIntegrationResourcePropertyCommand({
            ResourceArn: event.targetArn,
            TargetProcessingProperties: {
                RoleArn: event.roleArn
            }
        }));

        await glueClient.send(new CreateIntegrationTablePropertiesCommand({
            ResourceArn: integrationResult.IntegrationArn,
            TableName: event.tableConfig.tableName,
            TargetTableConfig: {
                PartitionSpec: event.tableConfig.partitionSpec ? event.tableConfig.partitionSpec : undefined,
                UnnestSpec: event.tableConfig.unnestSpec ? event.tableConfig.unnestSpec : undefined,
                TargetTableName: event.tableConfig.tableName ? event.tableConfig.tableName : undefined
            }

        }));

        // Persist the integration ARNs in Parameter Store so the delete path can retrieve them
        await paramStore.send(new PutParameterCommand({
            Name: event.integrationName,
            Value: JSON.stringify({
                integrationArn:  integrationResult.IntegrationArn,
                resourcePropertyArn: integrationResourcePropertyResult.ResourceArn
            }),
            Type: "String",
            Overwrite: true
        }));

        return;
    }

    // "delete": look up the integration ARN saved at creation time and delete the integration
    if (event.tf.action === "delete") {
        const integrationParams = await paramStore.send(new GetParameterCommand({
            Name: event.integrationName,
        }));

        const { integrationArn } = JSON.parse(integrationParams.Parameter.Value);

        await glueClient.send(new DeleteIntegrationCommand({
            IntegrationIdentifier: integrationArn
        }));

        return;
    }

};

I’m using @aws-sdk/client-glue to set up the integration, assign the target processing role, and configure table properties such as the target table name, schema unnesting options, and data partitioning for the target Apache Iceberg table. By default, the integration partitions the data using the DynamoDB table’s primary keys.

Here’s how the Lambda invocation is wired up; I’m passing in the parameters used to configure the integration:

resource "aws_lambda_invocation" "manage_zero_etl_integration" {

  function_name = aws_lambda_function.manage_zero_etl_integration_fn.function_name
  input = jsonencode({
    integrationName = "${var.application}-${var.environment}-zero-etl-integration",
    sourceArn       = data.aws_dynamodb_table.this.arn,
    targetArn       = aws_glue_catalog_database.this.arn,
    roleArn         = aws_iam_role.integration_role.arn,
    tableConfig = {
      tableName = data.aws_dynamodb_table.this.name,
      partitionSpec = [
        {
          FieldName    = "orderDate",
          FunctionSpec = "day"
        }
      ],
      unnestSpec = "FULL"
    }

  })

  lifecycle_scope = "CRUD"

  depends_on = [aws_glue_resource_policy.this]

}

Very much a happy-path solution here — just a workaround while waiting for proper IaC support. If you’d prefer not to take this route, another option is to create the integration using the CLI.
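For reference, here’s a rough sketch of what that looks like with the CLI; the integration name and ARNs are placeholders, and aws glue create-integration supports more options than shown here:

aws glue create-integration \
  --integration-name orders-zero-etl-integration \
  --source-arn arn:aws:dynamodb:eu-west-1:123456789012:table/orders \
  --target-arn arn:aws:glue:eu-west-1:123456789012:database/ordersdb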

4- Glue resource policy

Since I’m using the Glue catalog for the integration, I made sure to include the following permissions in the Glue catalog resource policy. This allows the integration between the source DynamoDB table and the target Iceberg table:

data "aws_iam_policy_document" "glue_resource_policy" {
  statement {
    effect = "Allow"

    principals {
      type = "AWS"
      identifiers = [
        "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root",
        aws_iam_role.manage_zero_etl_integration_role.arn
      ]
    }

    actions = [
      "glue:CreateInboundIntegration",
    ]

    resources = [
      "arn:aws:glue:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:catalog",
      "arn:aws:glue:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:database/${aws_glue_catalog_database.this.name}",
    ]

    condition {
      test     = "StringLike"
      variable = "aws:SourceArn"
      values   = [data.aws_dynamodb_table.this.arn]
    }
  }

  statement {
    effect = "Allow"
    principals {
      type        = "Service"
      identifiers = ["glue.amazonaws.com"]
    }

    actions = [
      "glue:AuthorizeInboundIntegration"
    ]

    resources = [
      "arn:aws:glue:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:catalog",
      "arn:aws:glue:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:database/${aws_glue_catalog_database.this.name}",
    ]

    condition {
      test     = "StringEquals"
      variable = "aws:SourceArn"
      values   = [data.aws_dynamodb_table.this.arn]
    }
  }

  depends_on = [
    aws_iam_role.manage_zero_etl_integration_role,
    aws_lambda_function.manage_zero_etl_integration_fn
  ]
}


resource "aws_glue_resource_policy" "this" {
  policy = data.aws_iam_policy_document.glue_resource_policy.json
}

You can find this configuration in the official docs here.

Glue Zero-ETL in action

Once you’ve deployed all the components, you can go straight to the Glue Zero-ETL integrations list. Here’s what it looks like in the console:

Active Zero-ETL integrations list

You can view the integration details. By default, the refresh interval from the source DynamoDB table to the Iceberg table is set to 15 minutes; it is not editable for now:

Glue Zero-ETL integration details

You can also monitor the integration operations and track the number of items inserted, updated, or deleted directly from CloudWatch Logs. The metrics generated during each execution are covered in the Glue Zero-ETL documentation (see Resources below).

CloudWatch Logs — number of inserted items during the seed operation

Once the first insert operation succeeds, you can view the inferred Iceberg table schema on the Data Catalog page in the console:

Table schema following first seed operation

☝️ Note that the shippingAddress field was un-nested and deliveryPreferences was replicated as an array. That’s very convenient. However, the items property was inferred as a string. Since it’s a list of maps in DynamoDB, I expected it to map cleanly to a list of structs in Apache Iceberg, but the integration didn’t quite get the schema right.

The items property ends up as a plain JSON string in the DynamoDB list format. It’s not perfect, but we can work around it by using Athena’s JSON functions such as json_extract to parse the data:
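Here’s a sketch of that workaround, assuming the target table is named orders, the key column is orderid, and items holds DynamoDB-JSON entries like {"M": {"productId": {"S": "p-1"}, "quantity": {"N": "2"}}}:

-- Parse the items string into a JSON array, then unnest it into one row per order item
SELECT orderid,
       json_extract_scalar(item, '$.M.productId.S') AS product_id,
       CAST(json_extract_scalar(item, '$.M.quantity.N') AS integer) AS quantity
FROM orders
CROSS JOIN UNNEST(CAST(json_parse(items) AS ARRAY(JSON))) AS t (item)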

Querying with Athena

Here’s an example query using Athena to get the number of orders grouped by city:

Example of a query: number of orders by city
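The query in the screenshot looks roughly like the following; the flattened column name shippingaddress.city is an assumption, so check the actual column names the integration produced in the Data Catalog:

SELECT "shippingaddress.city" AS city,
       count(*) AS order_count
FROM orders
GROUP BY "shippingaddress.city"
ORDER BY order_count DESC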

Here’s another example where I want to get the number of orders by city where the delivery preferences include LeaveAtDoor. While this involves some extra steps with DynamoDB, it’s much easier to achieve with Iceberg tables:

Getting the number of orders by city where delivery preference contains LeaveAtDoor
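Under the same naming assumptions, and since deliveryPreferences was replicated as an array, this is a simple contains filter:

SELECT "shippingaddress.city" AS city,
       count(*) AS order_count
FROM orders
WHERE contains(deliverypreferences, 'LeaveAtDoor')
GROUP BY "shippingaddress.city"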

My Wishlist

After trying out Glue Zero-ETL, I came up with a wishlist of features and improvements I'd like to see. Since it's still relatively new (at the time of writing), I'm looking forward to potential updates and enhancements over time. I'll keep this blog post updated as things evolve:

IaC support

Deploying services through the console is not my preferred approach. As mentioned earlier in this post, currently, neither CloudFormation nor the AWS Terraform provider supports Glue Zero-ETL. I used the AWS SDK to create the integration and configure table properties. While this approach works for now, it’s not ideal. I expect that support for CloudFormation and Terraform will be introduced soon.

Handling DynamoDB lists of maps

Lists of Maps aren’t supported (yet?). Since Apache Iceberg tables can handle lists of structs, the lack of support for this feature could complicate more advanced use cases with complex table schemas. In such cases, running a custom ETL job remains a better solution.

Custom partitioning configuration

When setting up the integration, you can configure target table properties, such as partitioning the data using the primary key from the DynamoDB table or specifying a custom partition spec:

await glueClient.send(new CreateIntegrationTablePropertiesCommand({
    ResourceArn: integrationResult.IntegrationArn,
    TableName: event.tableConfig.tableName,
    TargetTableConfig: {
        PartitionSpec: event.tableConfig.partitionSpec ? event.tableConfig.partitionSpec : undefined,
        UnnestSpec: event.tableConfig.unnestSpec ? event.tableConfig.unnestSpec : undefined,
        TargetTableName: event.tableConfig.tableName ? event.tableConfig.tableName : undefined
    }
}));

However, while I was able to define a custom partition configuration through both the console and the AWS CLI, it didn’t seem to take effect.

I’m not sure if this is a UI issue or if Glue Zero-ETL Integration simply doesn’t support it yet. The documentation isn’t very clear on this point, but hopefully, it gets updated soon!

Support for AWS services other than DynamoDB

Zero-ETL source types

The Glue Zero-ETL integration currently supports many sources, with DynamoDB being the only AWS service available at this point. While this is a great start, I would have preferred better alignment across AWS’s data integration offerings. For example, Amazon Kinesis Data Firehose already supports native CDC integration for RDS databases. It would have been ideal to see a more unified approach, where Glue Zero-ETL could also support CDC from RDS and other AWS services.

Wrapping up

I hope you found this article helpful! I’ve found the Glue Zero-ETL integration to be an interesting tool to have in your toolkit, especially for offloading undifferentiated heavy lifting and focusing on what matters most. It’s also useful for teams that aren’t familiar with writing Glue Jobs, as it makes running ad-hoc analytics queries on data originally stored in DynamoDB much easier.

As usual, you can find the full source code, ready to be adapted and deployed, here 👇

https://github.com/ziedbentahar/glue-zero-etl-dynamodb-to-apache-iceberg-table

Thank you for reading and may your data be clean, your queries be fast, and your pipelines never break 😉

Resources

https://docs.aws.amazon.com/glue/latest/dg/zero-etl-using.html

https://docs.aws.amazon.com/cli/latest/reference/glue/create-integration.html

https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/client/glue/command/CreateIntegrationCommand/
