Sidra Saleem for SUDO Consultants

Posted on Feb 28 • Originally published at sudoconsultants.com

Data Mesh on AWS: Federated Governance with Lake Formation and Glue

#datamesh #lakeformation #aws

Organizations are increasingly adopting decentralized data architectures to enable scalability, agility, and domain-oriented ownership. Data Mesh, a paradigm introduced by Zhamak Dehghani, advocates for a decentralized approach to data management, where data is treated as a product and owned by domain-specific teams. However, this decentralization introduces challenges in governance, security, and access control. AWS provides a robust set of tools, including AWS Lake Formation and AWS Glue, to implement a federated governance model for Data Mesh. This article explores how to implement Data Mesh on AWS, focusing on federated governance, centralized access controls, and sharing data products across accounts.

Understanding Data Mesh and Federated Governance

Data Mesh is a socio-technical approach to data architecture that emphasizes domain-oriented decentralization, data as a product, self-serve data infrastructure, and federated computational governance. Federated governance ensures that while data ownership and management are decentralized, there is a centralized mechanism for access control, compliance, and security.

AWS Lake Formation and AWS Glue play pivotal roles in implementing federated governance. Lake Formation provides a centralized way to manage data lakes, including fine-grained access control, while Glue offers data cataloging, ETL (Extract, Transform, Load), and data preparation capabilities. Together, they enable organizations to share data products across AWS accounts securely and efficiently.

Key Components of Data Mesh on AWS

AWS Lake Formation

AWS Lake Formation simplifies the process of setting up, securing, and managing data lakes. It provides a centralized interface to define data permissions, manage metadata, and enable secure data sharing across accounts.

AWS Glue

AWS Glue is a fully managed ETL service that makes it easy to prepare and load data for analytics. It includes a data catalog that serves as a central metadata repository, enabling seamless data discovery and integration.

IAM (Identity and Access Management)

IAM is used to manage access to AWS services and resources securely. It allows you to create and manage users, groups, and roles, and to attach policies that allow or deny their access to AWS resources. In this article, IAM roles are the principals that receive Lake Formation permissions.

S3 (Simple Storage Service)

S3 is the primary storage service used in conjunction with Lake Formation and Glue. It provides scalable, durable, and secure object storage for data lakes.

Implementing Data Mesh on AWS: Step-by-Step Guide

Step 1: Setting Up the Central Data Lake

CLI-Based Setup

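Create an S3 Bucket: Use the AWS CLI to create an S3 bucket that will serve as the central data lake. Outside us-east-1, the bucket's Region must also be passed as a location constraint.

   aws s3api create-bucket --bucket my-data-lake --region us-west-2 --create-bucket-configuration LocationConstraint=us-west-2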

Register the S3 Bucket with Lake Formation: Register the S3 bucket with Lake Formation to enable centralized governance.

   aws lakeformation register-resource --resource-arn arn:aws:s3:::my-data-lake --use-service-linked-role
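
One caveat worth checking before moving on: in many accounts, Lake Formation's default settings grant full access on newly created databases and tables to the IAMAllowedPrincipals group, in which case IAM policies alone decide access and the Lake Formation grants in the later steps are not what is enforced. The sketch below clears those defaults. The function name and the administrator role ARN are assumptions made for illustration, and because put-data-lake-settings replaces the whole settings document, it is worth reviewing the output of aws lakeformation get-data-lake-settings first.

```shell
# Hypothetical helper (not part of the original walkthrough): clear the
# default IAMAllowedPrincipals grants so that Lake Formation permissions
# apply to databases and tables created afterwards.
# $1 = ARN of the IAM principal to keep as data lake administrator (assumed).
enforce_lf_permissions() {
  local admin_arn="$1" settings
  settings=$(printf '{"DataLakeAdmins":[{"DataLakePrincipalIdentifier":"%s"}],"CreateDatabaseDefaultPermissions":[],"CreateTableDefaultPermissions":[]}' "$admin_arn")
  # put-data-lake-settings overwrites the existing settings as a whole.
  aws lakeformation put-data-lake-settings --data-lake-settings "$settings"
}
```

It could be called as enforce_lf_permissions arn:aws:iam::123456789012:role/data-lake-admin, where the role name is a placeholder.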

AWS Console-Based Setup

Create an S3 Bucket: Navigate to the S3 console and create a new bucket.
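Register the S3 Bucket with Lake Formation: Go to the Lake Formation console, open "Data lake locations" (grouped under "Register and ingest" or "Administration," depending on the console version), choose "Register location," and follow the prompts to register your S3 bucket.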

Step 2: Setting Up the Data Catalog

CLI-Based Setup

Create a Database in the Data Catalog: Use the AWS CLI to create a database in the Glue Data Catalog.

   aws glue create-database --database-input '{"Name":"my_database"}'

Create a Table in the Database: Create a table in the database to represent your data.

   aws glue create-table --database-name my_database --table-input '{"Name":"my_table", "StorageDescriptor":{"Location":"s3://my-data-lake/path/to/data/"}}'
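
The table above has a location but no schema, which is enough to register it but not to query it. As a rough sketch, a table definition for Parquet data might look like the following. The function name and the two column names are invented for illustration; the input format, output format, and SerDe class names are the standard Hive classes for Parquet.

```shell
# Hypothetical helper: register an external Parquet table with a schema.
# $1 = database name, $2 = table name, $3 = S3 location of the data.
# The columns (customer_id, amount) are placeholders, not from the article.
create_parquet_table() {
  local db="$1" table="$2" location="$3" table_input
  table_input=$(printf '{
  "Name": "%s",
  "TableType": "EXTERNAL_TABLE",
  "Parameters": {"classification": "parquet"},
  "StorageDescriptor": {
    "Location": "%s",
    "Columns": [
      {"Name": "customer_id", "Type": "string"},
      {"Name": "amount", "Type": "double"}
    ],
    "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
    "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
    "SerdeInfo": {
      "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    }
  }
}' "$table" "$location")
  aws glue create-table --database-name "$db" --table-input "$table_input"
}
```

For example: create_parquet_table my_database my_table s3://my-data-lake/path/to/data/. If a crawler is used later (Step 5), it can infer the schema instead, so a hand-written definition like this is mainly useful when the schema is known up front.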

AWS Console-Based Setup

Create a Database in the Data Catalog: Navigate to the Glue console, select "Databases," and create a new database.
Create a Table in the Database: Select "Tables" and create a new table, specifying the S3 location of your data.

Step 3: Implementing Fine-Grained Access Control

CLI-Based Setup

Grant Permissions on the Database: Use Lake Formation to grant permissions on the database to specific IAM roles or users. SELECT applies to tables rather than databases, so the database-level grant uses DESCRIBE.

   aws lakeformation grant-permissions --principal '{"DataLakePrincipalIdentifier":"arn:aws:iam::123456789012:role/my-role"}' --resource '{"Database":{"Name":"my_database"}}' --permissions "DESCRIBE"

Grant Permissions on the Table: Grant permissions on the table to specific IAM roles or users.

   aws lakeformation grant-permissions --principal '{"DataLakePrincipalIdentifier":"arn:aws:iam::123456789012:role/my-role"}' --resource '{"Table":{"DatabaseName":"my_database","Name":"my_table"}}' --permissions "SELECT"
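
The table-level grant above exposes every column. For access control that is genuinely fine-grained, Lake Formation also accepts a TableWithColumns resource that limits SELECT to named columns. The following is a minimal sketch; the function name is hypothetical, and the column list is passed as a JSON array so the caller decides which columns to expose.

```shell
# Hypothetical helper: grant SELECT on a subset of columns only.
# $1 = principal ARN, $2 = database, $3 = table,
# $4 = JSON array of column names, e.g. '["customer_id","amount"]'.
grant_column_select() {
  local principal_arn="$1" db="$2" table="$3" columns_json="$4"
  local principal resource
  principal=$(printf '{"DataLakePrincipalIdentifier":"%s"}' "$principal_arn")
  resource=$(printf '{"TableWithColumns":{"DatabaseName":"%s","Name":"%s","ColumnNames":%s}}' \
    "$db" "$table" "$columns_json")
  aws lakeformation grant-permissions \
    --principal "$principal" \
    --resource "$resource" \
    --permissions "SELECT"
}
```

For example: grant_column_select arn:aws:iam::123456789012:role/my-role my_database my_table '["customer_id","amount"]'. The column names here are placeholders.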

AWS Console-Based Setup

Grant Permissions on the Database: Navigate to the Lake Formation console, select "Databases," and grant permissions to the desired IAM roles or users.
Grant Permissions on the Table: Select "Tables" and grant permissions to the desired IAM roles or users.

Step 4: Sharing Data Products Across Accounts

CLI-Based Setup

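Create a Resource Link: Once the producer account (123456789012) has granted the consumer account access to the table, create a resource link in the consumer account that points to the shared table. Lake Formation has no dedicated create-resource-link command; a resource link is a Data Catalog table created with a TargetTable. The local database name consumer_database below is a placeholder for an existing database in the consumer account.

   aws glue create-table --database-name consumer_database --table-input '{"Name":"my_resource_link","TargetTable":{"CatalogId":"123456789012","DatabaseName":"my_database","Name":"my_table"}}'

Grant Permissions on the Resource Link: In the consumer account, grant DESCRIBE on the resource link so the role can see it, and SELECT on the shared target table so the role can query it. Permissions granted on a resource link do not carry over to its target.

   aws lakeformation grant-permissions --principal '{"DataLakePrincipalIdentifier":"arn:aws:iam::987654321012:role/consumer-role"}' --resource '{"Table":{"DatabaseName":"consumer_database","Name":"my_resource_link"}}' --permissions "DESCRIBE"

   aws lakeformation grant-permissions --principal '{"DataLakePrincipalIdentifier":"arn:aws:iam::987654321012:role/consumer-role"}' --resource '{"Table":{"CatalogId":"123456789012","DatabaseName":"my_database","Name":"my_table"}}' --permissions "SELECT"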
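
The consumer-side commands assume the producer account has already shared the table with the consumer account. A minimal sketch of that producer-side grant follows, run with producer credentials. The function name is hypothetical. The principal is the consumer's 12-digit account ID rather than a role ARN, and the grant option is included so that the consumer's data lake administrator can pass the permissions on to roles in its own account.

```shell
# Hypothetical helper, run in the PRODUCER account: share a table with a
# whole consumer account.
# $1 = consumer account ID, $2 = database, $3 = table.
share_table_with_account() {
  local consumer_account_id="$1" db="$2" table="$3"
  local principal resource
  principal=$(printf '{"DataLakePrincipalIdentifier":"%s"}' "$consumer_account_id")
  resource=$(printf '{"Table":{"DatabaseName":"%s","Name":"%s"}}' "$db" "$table")
  aws lakeformation grant-permissions \
    --principal "$principal" \
    --resource "$resource" \
    --permissions "SELECT" "DESCRIBE" \
    --permissions-with-grant-option "SELECT" "DESCRIBE"
}
```

For example: share_table_with_account 987654321012 my_database my_table. Unless both accounts belong to the same AWS Organization with resource sharing enabled, the consumer account typically also has to accept the resulting AWS RAM resource share invitation before the table becomes visible.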

AWS Console-Based Setup

Create a Resource Link: Navigate to the Lake Formation console in the consumer account, select "Tables," choose "Create" and then "Resource link," and point the link at the table shared by the producer account.
Grant Permissions on the Resource Link: Still in the consumer account, grant DESCRIBE on the resource link, then use "Grant on target" to grant SELECT on the shared table.

Step 5: Enabling Data Discovery and Cataloging

CLI-Based Setup

Crawl Data with Glue Crawler: Create a Glue crawler to automatically discover and catalog data in the S3 bucket.

   aws glue create-crawler --name my_crawler --role arn:aws:iam::123456789012:role/AWSGlueServiceRole --database-name my_database --targets '{"S3Targets":[{"Path":"s3://my-data-lake/path/to/data/"}]}'

Run the Crawler: Start the crawler to populate the Data Catalog.

   aws glue start-crawler --name my_crawler
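
start-crawler returns immediately, so the Data Catalog is not populated yet when the command finishes. A small polling loop, sketched here with a hypothetical function name and an arbitrary default interval, waits for the crawler to return to the READY state and then prints the status of the last crawl.

```shell
# Hypothetical helper: block until a crawler is idle again, then print the
# outcome of its last crawl (for example SUCCEEDED or FAILED).
# $1 = crawler name, $2 = polling interval in seconds (default 15, assumed).
wait_for_crawler() {
  local name="$1" interval="${2:-15}" state
  while :; do
    state=$(aws glue get-crawler --name "$name" \
      --query 'Crawler.State' --output text) || return 1
    [ "$state" = "READY" ] && break
    sleep "$interval"
  done
  aws glue get-crawler --name "$name" \
    --query 'Crawler.LastCrawl.Status' --output text
}
```

For example: aws glue start-crawler --name my_crawler && wait_for_crawler my_crawler.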

AWS Console-Based Setup

Create a Glue Crawler: Navigate to the Glue console, select "Crawlers," and create a new crawler.
Run the Crawler: Start the crawler to populate the Data Catalog.

Step 6: Implementing Data Quality and Validation

CLI-Based Setup

Create a Glue Job for Data Validation: Create a Glue job to validate data quality. Setting the Glue and Python versions explicitly avoids falling back to a legacy default runtime.

   aws glue create-job --name my_validation_job --role arn:aws:iam::123456789012:role/AWSGlueServiceRole --glue-version "4.0" --command '{"Name":"glueetl","ScriptLocation":"s3://my-data-lake/scripts/validate_data.py","PythonVersion":"3"}'

Run the Glue Job: Start the Glue job to validate data.

   aws glue start-job-run --job-name my_validation_job
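
start-job-run only queues the run and returns a run ID. To use validation as a gate (for example, before publishing a data product), the run's final state has to be checked. The following sketch, with a hypothetical function name and an arbitrary default polling interval, starts the job and returns a non-zero exit code unless the run succeeds.

```shell
# Hypothetical helper: start a Glue job and wait for its final state.
# Returns 0 only if the run ends in SUCCEEDED.
# $1 = job name, $2 = polling interval in seconds (default 30, assumed).
run_validation_job() {
  local job="$1" interval="${2:-30}" run_id state
  run_id=$(aws glue start-job-run --job-name "$job" \
    --query 'JobRunId' --output text) || return 1
  while :; do
    state=$(aws glue get-job-run --job-name "$job" --run-id "$run_id" \
      --query 'JobRun.JobRunState' --output text) || return 1
    case "$state" in
      SUCCEEDED) return 0 ;;
      FAILED|STOPPED|TIMEOUT|ERROR|EXPIRED)
        echo "job run $run_id ended in state $state" >&2
        return 1 ;;
    esac
    sleep "$interval"
  done
}
```

For example: run_validation_job my_validation_job || echo "validation failed".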

AWS Console-Based Setup

Create a Glue Job: Navigate to the Glue console, select "Jobs," and create a new job for data validation.
Run the Glue Job: Start the job to validate data.

Real-Life Implementation: Case Study

Scenario: Multi-Account Data Sharing in a Financial Institution

A financial institution with multiple business units (e.g., retail banking, investment banking, and insurance) wants to implement a Data Mesh architecture to enable domain-oriented data ownership while maintaining centralized governance.

Implementation Steps

Central Data Lake Setup: The institution sets up a central data lake using S3 and registers it with Lake Formation.
Domain-Specific Data Products: Each business unit creates and manages its own data products, stored in the central data lake.
Federated Governance: Lake Formation is used to implement fine-grained access control, ensuring that only authorized users and roles can access specific data products.
Cross-Account Data Sharing: Data products are shared across accounts through Lake Formation cross-account grants, and consumer accounts create resource links to the shared tables, enabling seamless data access while maintaining security and compliance.
Data Discovery and Cataloging: Glue crawlers are used to automatically discover and catalog data, making it easy for users to find and use data products.
Data Quality and Validation: Glue jobs are implemented to validate data quality, ensuring that only high-quality data is used for analytics and decision-making.
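
To make the domain-ownership part of these steps concrete, onboarding a business unit could be scripted roughly as follows: one database per domain, with the domain's own role allowed to create and alter tables in it. The function name, database names, and role ARNs are all hypothetical.

```shell
# Hypothetical helper: create a database for a domain and delegate table
# management in it to that domain's owner role.
# $1 = domain (used as the database name), $2 = ARN of the domain's role.
onboard_domain() {
  local domain="$1" owner_role_arn="$2"
  aws glue create-database \
    --database-input "$(printf '{"Name":"%s"}' "$domain")" || return 1
  aws lakeformation grant-permissions \
    --principal "$(printf '{"DataLakePrincipalIdentifier":"%s"}' "$owner_role_arn")" \
    --resource "$(printf '{"Database":{"Name":"%s"}}' "$domain")" \
    --permissions "CREATE_TABLE" "ALTER" "DESCRIBE"
}
```

For example: onboard_domain retail_banking arn:aws:iam::123456789012:role/retail-banking-owner. A domain role that creates tables over S3 data will likely also need a data location permission on its prefix in the registered bucket, which is left out here to keep the sketch short.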

Results

The financial institution successfully implements a Data Mesh architecture on AWS, enabling domain-oriented data ownership, centralized governance, and secure data sharing across accounts. This approach improves data accessibility, enhances data quality, and ensures compliance with regulatory requirements.

Conclusion

Implementing Data Mesh on AWS with federated governance using Lake Formation and Glue provides a scalable, secure, and efficient way to manage decentralized data architectures. By following the steps outlined in this article, organizations can enable domain-oriented data ownership, implement fine-grained access control, and share data products across accounts securely. Real-life implementations, such as the case study of a financial institution, demonstrate the practical benefits of this approach in improving data accessibility, quality, and compliance.

AWS Lake Formation and Glue, combined with IAM and S3, offer a comprehensive solution for building and managing Data Mesh architectures. Whether through CLI-based commands or the AWS Management Console, organizations can leverage these tools to implement a robust Data Mesh that meets their unique needs and challenges.
