DEV Community

Arpad Toth for AWS Community Builders

Posted on • Originally published at arpadt.com

Masking sensitive data in real-time with AWS serverless services

If quickly checking for sensitive information in new data uploaded to an S3 bucket is critical, we can create a real-time, serverless data masking workflow using Amazon Macie, EventBridge and Lambda functions.

1. Problem statement

Consider the following scenario.

We (human users, applications, it doesn't matter) upload files to an S3 bucket. We want to detect Personally Identifiable Information (PII) in the new files, and if we find any, we want to immediately trigger a workflow that masks the sensitive information. Since we don't want to expose PII to our users working with the data, we need a dedicated storage (a data lake) that doesn't contain sensitive information. Analysts and engineers can then use the masked data for analytics and further processing.

AWS offers multiple services that can detect and mask or redact sensitive data, including Glue, Macie, Glue DataBrew and Comprehend. They all have their best-fit use cases.

2. Solution

This post will describe a serverless workflow solution containing Macie for sensitive data discovery, Lambda functions for creating jobs and masking data, and EventBridge for the real-time, event-based architecture.

3. Architecture

The starting point of the solution is the source bucket containing the raw (i.e., sensitive) data. It's where we upload the files we want to scan for PII. The workflow detects sensitive data in the newly uploaded files and runs a custom logic to mask those pieces of information. Finally, the masked data is saved to a destination bucket containing the masked data.

Real-time sensitive data masking workflow

Let's see the main building blocks in more detail.

3.1. S3 event notification

We can configure S3 to send an event to a target service every time a new file is uploaded to the bucket. This feature is called S3 event notifications, and here we subscribe to the s3:ObjectCreated:* event type. In this case, the event destination is a standard SQS queue.

FIFO queues are not supported as direct event destinations.
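For reference, each message the queue receives wraps an S3 event record similar to the following (abbreviated; the bucket and key values are illustrative):

```json
{
  "Records": [
    {
      "eventSource": "aws:s3",
      "eventName": "ObjectCreated:Put",
      "s3": {
        "bucket": { "name": "raw-data-bucket" },
        "object": { "key": "uploads/customers.csv", "size": 1024 }
      }
    }
  ]
}
```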

3.2. Message processing

S3 Event Notifications can directly send events to Lambda functions. So why have a queue then?

S3 generates an event for every object uploaded to the bucket. For example, if we uploaded 10 files to the bucket, this would result in 10 different S3 events sent out to Lambda, one for each object. The 10 events would generally mean 10 Lambda invocations.

To reduce the number of Lambda invocations and jobs created in Macie (see next section), we can configure an SQS queue as the S3 event destination. The function then retrieves the messages from the queue in batches, which results in fewer invocations. In this case, I have configured a Batch size of 10 and a Maximum batch window of 5 seconds, so one Lambda invocation can process multiple messages.

Having a queue is desirable for another reason, too. The CreateMacieJob function, which polls the queue for messages, does, well, create a job in Macie. Since the function consumes messages in batches, fewer invocations are needed, which in turn means fewer jobs created in Macie.
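As a sketch of what the batch processing could look like, the handler can collect the unique bucket/key pairs from all records in a batch before creating a single job. This is a minimal illustration, not the exact code from the repository; `extractS3Objects` is a hypothetical helper, and the SQS message bodies are assumed to carry the raw S3 event JSON:

```typescript
// Minimal shapes for the parts of the SQS/S3 payloads we use.
interface S3EventRecord {
  s3: { bucket: { name: string }; object: { key: string } };
}
interface SqsRecord {
  body: string; // the S3 event JSON, as delivered by the event notification
}

// Collect the unique bucket/key pairs from a batch of SQS records,
// so one Lambda invocation can feed a single Macie job.
function extractS3Objects(records: SqsRecord[]): { bucket: string; key: string }[] {
  const seen = new Set<string>();
  const objects: { bucket: string; key: string }[] = [];
  for (const record of records) {
    const s3Event: { Records: S3EventRecord[] } = JSON.parse(record.body);
    for (const rec of s3Event.Records ?? []) {
      const bucket = rec.s3.bucket.name;
      // S3 URL-encodes object keys in event payloads ('+' stands for a space).
      const key = decodeURIComponent(rec.s3.object.key.replace(/\+/g, ' '));
      const id = `${bucket}/${key}`;
      if (!seen.has(id)) {
        seen.add(id);
        objects.push({ bucket, key });
      }
    }
  }
  return objects;
}
```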

3.3. The Macie job

The SQS messages encapsulate the original S3 events. The job created by the CreateMacieJob function will scan the bucket and objects specified in the events. We can add multiple buckets and objects for a specific Macie job.
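One way to express "multiple buckets and objects" in the job input is via the `s3JobDefinition` field of CreateClassificationJob, which takes bucket definitions plus a scoping rule. The sketch below builds that structure from a list of bucket/key pairs; the field names follow the Macie API, but the exact scoping you need may differ:

```typescript
interface S3Object { bucket: string; key: string }

// Group the objects reported by the S3 events into the s3JobDefinition
// shape expected by CreateClassificationJob: one bucketDefinition per
// bucket, plus a scoping rule that limits the scan to the uploaded keys.
function buildS3JobDefinition(accountId: string, objects: S3Object[]) {
  const byBucket = new Map<string, string[]>();
  for (const { bucket, key } of objects) {
    const keys = byBucket.get(bucket) ?? [];
    keys.push(key);
    byBucket.set(bucket, keys);
  }
  return {
    bucketDefinitions: Array.from(byBucket.keys()).map((name) => ({
      accountId,
      buckets: [name],
    })),
    scoping: {
      includes: {
        and: [
          {
            simpleScopeTerm: {
              comparator: 'STARTS_WITH',
              key: 'OBJECT_KEY',
              values: objects.map((o) => o.key),
            },
          },
        ],
      },
    },
  };
}
```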

Each job the function creates is a one-time job. Macie jobs can also be scheduled, but if we want to process data as soon as it is uploaded, we are better off creating one-time jobs.

Jobs will look for Hungarian ID card, passport and driver's license numbers. It's a shame, but out of the box, Macie only supports the driver's license pattern as a managed data identifier. Since it's a managed identifier, we can simply refer to its ID in the job input. Please see the documentation for the list of supported managed data identifiers and their respective IDs.

The related code in the CreateMacieJob function can look like this:

const createJobCommand = new CreateClassificationJobCommand({
  // ... more properties here
  jobType: 'ONE_TIME',
  customDataIdentifierIds: [
    HUNGARIAN_ID_CARD_IDENTIFIER,
    HUNGARIAN_PASSPORT_IDENTIFIER,
  ],
  managedDataIdentifierSelector: ManagedDataIdentifierSelector.INCLUDE,
  managedDataIdentifierIds: ['HUNGARY_DRIVERS_LICENSE'],
});

try {
  const response = await macieClient.send(createJobCommand);
  return response.jobId;
} catch (error) {
  console.error('Error while creating Macie job');
  throw error;
}

As stated earlier, the ID card and passport numbers are not supported by default. If we want Macie to discover these data types, we must create two custom data identifiers, one for each.

Hungarian ID card numbers have a format of 123456AB while passport numbers follow the AB1234567 pattern. We must specify matching regular expressions in the custom data identifier input.
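The regular expressions for the two custom data identifiers could look like this. This is a sketch based on the formats above; the exact patterns should be validated against real document numbers before use:

```typescript
// Hungarian ID card: six digits followed by two uppercase letters (123456AB).
const ID_CARD_REGEX = /\b\d{6}[A-Z]{2}\b/;
// Hungarian passport: two uppercase letters followed by seven digits (AB1234567).
const PASSPORT_REGEX = /\b[A-Z]{2}\d{7}\b/;
```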

Here, the IDs of our custom identifiers are saved to the HUNGARIAN_ID_CARD_IDENTIFIER and HUNGARIAN_PASSPORT_IDENTIFIER environment variables so that the CreateMacieJob function can access and add them to the customDataIdentifierIds job input field.

Macie will scan the specified objects for sensitive data matched by managed and custom identifiers.

The CreateMacieJob function will also need some permissions in its execution role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "macie2:CreateClassificationJob",
        "macie2:ListClassificationJobs"
      ],
      "Resource": "arn:aws:macie2:eu-central-1:123456789012:classification-job/*"
    }
  ]
}

3.4. Macie findings

After the one-time jobs have been created, they immediately start running. These jobs can run for 10-15 minutes, depending on the number of objects to scan.

If Macie finds sensitive data matching the input patterns in the newly uploaded files, it will create a finding for each file containing PII.

If everything goes well and no sensitive data has been found across all uploaded files, the workflow finishes running since there's nothing to fix!

3.5. EventBridge

How do we know when Macie has found sensitive data and the masking logic can start?

We create an EventBridge rule that matches the event Macie emits when it creates findings:

{
  "detail-type": ["Macie Finding"],
  "source": ["aws.macie"]
}

We can dig deeper and specify Severity (low, medium or high) if the business logic requires it. For the sake of simplicity, I go with the basic matching event pattern here.
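For example, if we only wanted to react to high-severity findings, the pattern could be extended with a detail filter. This assumes the `severity.description` field of the Macie finding event; check the finding schema before relying on it:

```json
{
  "detail-type": ["Macie Finding"],
  "source": ["aws.macie"],
  "detail": {
    "severity": {
      "description": ["High"]
    }
  }
}
```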

We should create the rule in the default event bus since AWS services send their events there.

When an incoming event matches the rule, i.e., Macie has discovered sensitive data in the scanned files, EventBridge will invoke the configured targets. This solution has a Lambda function target called MaskSensitiveData. Optionally, other targets, including API destinations, can also be added. For example, we can make EventBridge send notifications to Slack if Macie creates a finding.

3.6. Masking and saving sensitive data

EventBridge triggers the MaskSensitiveData Lambda function when Macie finds PII in the uploaded files.

The function downloads the objects where PII has been found from the S3 bucket, applies some obfuscation logic to the sensitive data, and saves the modified data to a new bucket. If we saved the modified objects back to the original bucket, the uploads would generate new S3 events, which would trigger the workflow again from the beginning and result in duplicate runs.
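The obfuscation logic itself can be as simple as replacing every match of the PII patterns with asterisks. This is a minimal sketch using the ID card and passport formats described earlier, not the exact logic from the repository:

```typescript
// Patterns for the Hungarian ID card (123456AB) and passport (AB1234567) formats.
const PII_PATTERNS = [/\b\d{6}[A-Z]{2}\b/g, /\b[A-Z]{2}\d{7}\b/g];

// Replace every occurrence of the PII patterns with asterisks of the same length,
// so the masked file keeps its original layout.
function maskSensitiveData(text: string): string {
  return PII_PATTERNS.reduce(
    (masked, pattern) =>
      masked.replace(pattern, (match) => '*'.repeat(match.length)),
    text,
  );
}
```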

The destination bucket's content can now be used for analytics. We have removed all sensitive information!

4. Considerations

Let's discuss some considerations and trade-offs.

4.1. Not the only one

The workflow takes approximately 10-15 minutes to complete after we have uploaded the files to the bucket. The processing pauses while Macie runs the jobs. Once Macie has found sensitive data and created the findings, masking the data and storing it in the destination bucket only takes seconds.

As always, this is a solution for the presented scenario but not the only one. It works well when real-time or near real-time processing is required on occasional uploads, and we want flexibility in input buckets.

4.2. Macie rate limits

The CreateClassificationJob API endpoint in Macie is rate-limited. By default, we can call the endpoint once every 10 seconds. If multiple Lambda instances run simultaneously, for example, because we upload dozens of objects to S3, these instances will try creating multiple jobs in a few seconds. Macie will throw a TooManyRequestsException error in this case.

We can mitigate this issue in a couple of different ways. One solution is to configure VisibilityTimeout on the SQS queue. AWS recommends making it at least 6 times as long as the Lambda function's timeout. If Macie rejects the request, the CreateMacieJob function will throw an error. Lambda won't remove the corresponding messages from the queue in this case. After the visibility timeout has expired, the function will try to create the job again.
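A related refinement is to report partial batch failures, so a throttled job-creation attempt only makes the affected messages visible again instead of the whole batch. The sketch below assumes ReportBatchItemFailures is enabled on the event source mapping; `processRecord` is a hypothetical per-message processor standing in for the job-creation logic:

```typescript
interface SqsRecord { messageId: string; body: string }

// With ReportBatchItemFailures enabled, Lambda deletes only the successfully
// processed messages; the reported failures become visible again after the
// visibility timeout and are retried.
async function handler(
  event: { Records: SqsRecord[] },
  processRecord: (r: SqsRecord) => Promise<void>,
) {
  const batchItemFailures: { itemIdentifier: string }[] = [];
  for (const record of event.Records) {
    try {
      await processRecord(record); // e.g. create the Macie job
    } catch {
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }
  return { batchItemFailures };
}
```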

Another (or rather supplementary) method is configuring the Batch size and Maximum batch window parameters on the event source mapping. The more messages the CreateMacieJob instance can process, the less concurrency is needed, decreasing the probability of getting throttled by Macie.

A third option is to use the reserved concurrency setting on the function. In this case, only the configured number of instances will run simultaneously. When many messages are waiting to be processed, this option might create a bottleneck by throttling excess Lambda invocations. Since the number of concurrent invocations is capped, excess messages will stay in the queue until they are all processed.

4.3. Flexibility

The workflow can receive objects from multiple source buckets as long as the event notification is configured on all of them.

We can also add more managed and custom data identifiers as needed.

5. Code

The infrastructure code written in CDK TypeScript and the function application code are available on my GitHub page.

The stack contains two S3 buckets, the custom data identifiers in Macie, the SQS queue with a connected dead-letter queue, the EventBridge rule and all roles and permissions.

6. Summary

We can use Macie to run sensitive data discovery jobs in S3 buckets. Using Macie, we can implement near-real-time scans for PII in our uploaded files.

We can quickly react to findings and mask sensitive information using Lambda functions and EventBridge.

7. Further reading

Detect and process sensitive data - Handling PII in Glue

Detecting PII entities - Handling PII in Comprehend

Automatically detect Personally Identifiable Information in Amazon Redshift using AWS Glue - Using Redshift and Glue together

Amazon S3 Event Notifications - How event notification works in S3

Understanding how AWS Lambda scales with Amazon SQS standard queues - A good blog post on the topic
