Serverless PDF Processing with AWS Lambda and Textract

#cloudcomputing #serverless #lambda #dataengineering

Serverless computing has transformed the way we build applications by eliminating the need to manage servers. In data engineering, this flexibility is especially useful for document processing, where workloads can be unpredictable because files can arrive at any time. While it's relatively straightforward to process flat and structured files, this isn't always the case with PDFs, particularly if they are created from scanned documents.

Overview

Amazon Textract is a powerful service that automates the extraction of text and data from documents like PDFs and images. You can read more about it in the official AWS documentation. What's important for our use case, though, is that it's serverless, fully managed, and does exactly what we need, when we need it. Plus, it's typically far more cost-effective than training and hosting your own machine learning model.

When combined with AWS Lambda and S3, Textract can be triggered automatically whenever a document is uploaded, enabling near real-time processing without the hassle of managing infrastructure. In this blog, I'll demonstrate configuration options using a CloudFormation template and Python code, so you can recreate a basic version and then customise it for your own project.

Synchronous Implementation

The first and easiest option to implement is to use a single Lambda for everything - reading the file, sending it to Textract, waiting for the response, and processing the results.

The steps are the following:

Upload: Users upload documents, such as PDFs or scanned images, to the incoming/ folder of an Amazon S3 bucket. S3 acts as a secure and scalable storage service that can handle large volumes of data and traffic.
Lambda Trigger via S3 Notification: When a document is uploaded to the S3 bucket under the incoming/ prefix, an S3 event notification triggers an AWS Lambda function.
Textract Processing: The triggered Lambda function calls the Amazon Textract API, which processes the document and returns the extracted data.
Lambda: The same Lambda function processes the JSON response and writes the extracted text to a flat file in S3 under the processed/ prefix.
Notification/Logging: Optionally, we can log processing details to Amazon CloudWatch, which helps with monitoring the application's performance and with debugging.
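
To make the response-processing step concrete, here is a minimal sketch of pulling line-level text out of a Textract response. The sample dictionary is a hand-written, heavily trimmed stand-in for a real response (real blocks carry more fields, such as geometry and confidence), and lines_from_response is a hypothetical helper name used only for illustration.

```python
def lines_from_response(response):
    """Return the text of every LINE block, in the order they appear."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Hand-written sample, trimmed to the fields used here (illustrative only).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Invoice 2024"},
        {"BlockType": "WORD", "Text": "Invoice"},
        {"BlockType": "WORD", "Text": "2024"},
        {"BlockType": "LINE", "Text": "Total due"},
    ]
}

# One line of the document per line of the flat file.
text_output = "\n".join(lines_from_response(sample_response))
```

WORD blocks repeat the content of their parent LINE blocks, which is why only LINE blocks are kept when the goal is a plain-text file.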

Configuration and setup

The setup is quite straightforward. You need to configure the Lambda function and its associated execution role. I recommend using parameters for resource names. Be sure to add a policy to your Lambda role that allows the textract:DetectDocumentText operation. Amazon Textract doesn't require a resource to be provisioned, as it's a fully managed API-based service. As long as your Lambda has the necessary permissions to call it, you are good to go.

AWSTemplateFormatVersion: '2010-09-09'
Description: 'V1 - S3, Lambda and associated resources.'

Parameters:
  BucketName:
    Type: String
    Description: The name of the S3 bucket.
  LambdaFunctionName:
    Type: String
    Description: The name of the Lambda function.

Resources:
  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: logs
          PolicyDocument:
            Statement:
            - Effect: Allow
              Action:
                - logs:CreateLogGroup
                - logs:CreateLogStream
                - logs:PutLogEvents
              Resource: '*'
        - PolicyName: s3
          PolicyDocument:
            Statement:
            - Effect: Allow
              Action:
                - s3:Get*
                - s3:PutObject
              Resource:
                - !Sub arn:aws:s3:::${BucketName}
                - !Sub arn:aws:s3:::${BucketName}/*
        - PolicyName: textract
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - textract:DetectDocumentText
                Resource: '*'

  LambdaFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: !Ref LambdaFunctionName
      Handler: index.handler
      Runtime: python3.12
      Code: ../src
      Role: !GetAtt LambdaExecutionRole.Arn
      Timeout: 10

  LogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub '/aws/lambda/${LambdaFunction}'
      RetentionInDays: 7

The Lambda function is triggered through an S3 event notification every time a file is uploaded to the bucket. Because the function writes its output back to the same bucket, you want to avoid an infinite loop in which each output file invokes the Lambda again. To prevent this, limit the event to a specific prefix - in this case, incoming/ - and write the results under a different one. Additionally, S3 needs permission to invoke the Lambda function, so make sure the correct permissions are configured.

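  S3Bucket:
    Type: AWS::S3::Bucket
    # Create the invoke permission first so S3 can validate the notification target
    DependsOn: S3InvokeLambdaPermission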
    Properties:
      BucketName: !Ref BucketName
      NotificationConfiguration:
        LambdaConfigurations:
          - Event: s3:ObjectCreated:*
            Filter:
              S3Key:
                Rules:
                  - Name: prefix
                    Value: incoming/
            Function: !GetAtt LambdaFunction.Arn

  S3InvokeLambdaPermission:
    Type: AWS::Lambda::Permission
    Properties:
      Action: lambda:InvokeFunction
      FunctionName: !Ref LambdaFunction
      Principal: s3.amazonaws.com
      SourceArn: !Sub arn:aws:s3:::${BucketName}
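
As a second line of defence against the loop described above, the handler itself can refuse keys it should not touch. This is only a sketch: should_process and the two constants are hypothetical names, and the prefixes are assumed to match the ones used in the template.

```python
INCOMING_PREFIX = "incoming/"    # assumed to match the notification filter
PROCESSED_PREFIX = "processed/"  # where the handler writes its output


def should_process(key: str) -> bool:
    """Accept only real objects under incoming/.

    Keys ending in "/" are zero-byte folder placeholders (for example, the
    object created by the console's "Create folder" button), not documents.
    """
    return key.startswith(INCOMING_PREFIX) and not key.endswith("/")
```

Inside the handler this could be used as `if not should_process(key): continue` at the top of the loop, so a misconfigured notification filter results in skipped events rather than repeated invocations.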

When it comes to the code, here's an example of how you can structure your Lambda function in Python.

import json
import boto3
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    s3 = boto3.client('s3')
    textract = boto3.client('textract')

    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']

        try:
            # Call Textract
            response = textract.detect_document_text(
                Document={
                    'S3Object': {
                        'Bucket': bucket,
                        'Name': key
                    }
                }
            )

            # Extract detected text
            detected_text = []
            for item in response.get('Blocks', []):
                if item['BlockType'] == 'LINE':
                    detected_text.append(item['Text'])

            # Join detected text into a single string
            text_output = '\n'.join(detected_text)

            # Create a new key for the output text file
            file_name = key.split('/')[1].split('.')[0]
            output_key = f'processed/{file_name}.txt'

            # Write the detected text to the S3 bucket
            s3.put_object(
                Bucket=bucket,
                Key=output_key,
                Body=text_output
            )

            logger.info(f'Detected Text is written to: {output_key}')

        except Exception as e:
            # Adjacent f-strings are concatenated, keeping the message on one log line
            logger.error(f'Error processing file {key} '
                         f'from bucket {bucket}: {str(e)}')
            continue

    return {
        'statusCode': 200,
        'body': json.dumps('Textract processing complete!')
    }
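
One detail worth hedging against: object keys in S3 event notifications are URL-encoded (a space arrives as "+"), and splitting on "/" and "." assumes exactly one folder level and no extra dots in the file name. The sketch below shows one possible way to derive the output key more defensively; output_key_for is a hypothetical helper, not part of the function above.

```python
import posixpath
from urllib.parse import unquote_plus


def output_key_for(raw_key: str, prefix: str = "processed/") -> str:
    """Map an event key such as 'incoming/my+scan.pdf' to 'processed/my scan.txt'."""
    key = unquote_plus(raw_key)            # undo the URL encoding applied by S3 events
    base = posixpath.basename(key)         # drop any number of folder levels
    stem, _ext = posixpath.splitext(base)  # strip only the final extension
    return f"{prefix}{stem}.txt"
```

Note that this keeps inner dots ("report.v2.pdf" becomes "report.v2.txt"), which differs from the split-based version. The decoded key is also the one that would need to be passed to Textract and S3.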

This architecture is simple and well-suited for small workloads with single-page files. However, keep in mind that a Lambda function can run for at most 15 minutes (the template above sets a 10-second timeout, which you may need to raise for slower documents). Additionally, Textract imposes limits on synchronous operations.

For example:

JPEG, PNG, PDF, and TIFF files are limited to 10 MB in memory.
PDF and TIFF files are restricted to a maximum of 1 page.

For more details, refer to the quotas in Amazon Textract. These limitations lead us to the second option, which extends both the Lambda execution time and Textract capabilities.

Asynchronous Implementation

For asynchronous operations, while JPEG and PNG files still have a 10 MB limit in Textract memory, PDF and TIFF files benefit from significantly higher limits. PDF and TIFF files can now handle up to 500 MB and a maximum of 3,000 pages - a huge improvement compared to synchronous operations.
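
The two sets of limits can be turned into a small routing check. This is a sketch under stated assumptions: choose_operation is a hypothetical helper, "MB" is taken to mean 1024 * 1024 bytes, and the page count has to be supplied by the caller because it is not part of the S3 event (the object size is). Check the current Textract quotas before relying on the numbers.

```python
MB = 1024 * 1024  # assumption: binary megabytes

IMAGE_MAX_BYTES = 10 * MB   # JPEG/PNG, both modes
SYNC_MAX_BYTES = 10 * MB    # PDF/TIFF, synchronous
SYNC_MAX_PAGES = 1
ASYNC_MAX_BYTES = 500 * MB  # PDF/TIFF, asynchronous
ASYNC_MAX_PAGES = 3000


def choose_operation(extension: str, size_bytes: int, pages: int = 1) -> str:
    """Return 'sync', 'async' or 'unsupported' for a document."""
    ext = extension.lower().lstrip(".")
    if ext in ("jpg", "jpeg", "png"):
        return "sync" if size_bytes <= IMAGE_MAX_BYTES else "unsupported"
    if ext in ("pdf", "tif", "tiff"):
        if size_bytes <= SYNC_MAX_BYTES and pages <= SYNC_MAX_PAGES:
            return "sync"
        if size_bytes <= ASYNC_MAX_BYTES and pages <= ASYNC_MAX_PAGES:
            return "async"
    return "unsupported"
```

For example, a 2 MB single-page PDF fits the synchronous path, while the same file with 12 pages needs the asynchronous one.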

The steps are the following:

Upload: Users upload documents, such as PDFs or scanned images, to an Amazon S3 bucket in the incoming folder.
Lambda Trigger via S3 Notification: When a document is uploaded to the S3 bucket, it triggers an AWS Lambda function via an S3 notification.
Textract Processing: The triggered Lambda function starts an asynchronous Amazon Textract job for the document and returns without waiting for the result.
Lambda Trigger via SNS: Once Textract completes the document processing, it publishes a message to an Amazon SNS topic, which triggers another Lambda function.
Post-Processing: The second Lambda function retrieves the extracted data and can process it further, for example by formatting it into a structured format (e.g., JSON, CSV) and storing it in an S3 bucket or a database like Amazon RDS or DynamoDB for easy retrieval and analysis.
Notification/Logging: Optionally, processing details can be logged to Amazon CloudWatch to monitor the application's performance and assist with debugging.

Configuration and setup

Let's begin by defining parameters and setting up the S3 bucket with notifications, similar to the previous solution. This will include configuring the S3 bucket to trigger Lambda functions when files are uploaded, using S3 event notifications.

AWSTemplateFormatVersion: '2010-09-09'
Description: 'V2 - S3, Lambdas, SNS and associated resources.'

Parameters:
  BucketName:
    Type: String
    Description: The name of the S3 bucket.
  DocProcessingLambdaFunctionName:
    Type: String
    Description: The name of the Lambda function.
  PostProcessingLambdaFunctionName:
    Type: String
    Description: The name of the Lambda function for post-processing.
  TextractSNSTriggerRoleName:
    Type: String
    Description: The name of the IAM role for Textract SNS trigger.

Resources:
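  S3Bucket:
    Type: AWS::S3::Bucket
    # Create the invoke permission first so S3 can validate the notification target
    DependsOn: S3InvokeDocLambdaPermission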
    Properties:
      BucketName: !Ref BucketName
      NotificationConfiguration:
        LambdaConfigurations:
          - Event: s3:ObjectCreated:*
            Filter:
              S3Key:
                Rules:
                  - Name: prefix
                    Value: incoming/
            Function: !GetAtt DocProcessingLambdaFunction.Arn

Now, we define the first Lambda function and its associated resources. This Lambda will send the file to Textract for processing without waiting for a callback. The key difference here is that the Lambda execution role no longer requires the textract:DetectDocumentText permission. Instead, it will need the permission to perform textract:StartDocumentTextDetection.

  DocProcessingLambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: logs
          PolicyDocument:
            Statement:
            - Effect: Allow
              Action:
                - logs:CreateLogGroup
                - logs:CreateLogStream
                - logs:PutLogEvents
              Resource: '*'
        - PolicyName: s3
          PolicyDocument:
            Statement:
            - Effect: Allow
              Action:
                - s3:Get*
              Resource:
                - !Sub arn:aws:s3:::${BucketName}
                - !Sub arn:aws:s3:::${BucketName}/*
        - PolicyName: textract
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - textract:StartDocumentTextDetection
                Resource: '*'

  DocProcessingLambdaFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: !Ref DocProcessingLambdaFunctionName
      Handler: index.handler
      Runtime: python3.12
      Code: ../src
      Role: !GetAtt DocProcessingLambdaExecutionRole.Arn
      Timeout: 10
      Environment:
        Variables:
          TEXTRACT_NOTIFICATION_TOPIC: !Ref TextractNotificationTopic
          TEXTRACT_ROLE_ARN: !GetAtt TextractSNSTriggerRole.Arn

  S3InvokeDocLambdaPermission:
    Type: AWS::Lambda::Permission
    Properties:
      Action: lambda:InvokeFunction
      FunctionName: !Ref DocProcessingLambdaFunction
      Principal: s3.amazonaws.com
      SourceArn: !Sub arn:aws:s3:::${BucketName}

  DocProcessingLambdaLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub '/aws/lambda/${DocProcessingLambdaFunctionName}'
      RetentionInDays: 7

Once the job is sent to Textract, the Lambda function is no longer responsible for managing it. This means Textract will need its own role to send a notification to SNS when the job is completed. The role and SNS topic ARN will be passed as environment variables to the Lambda function, allowing it to pass them to Textract along with the job.

  TextractNotificationTopic:
    Type: AWS::SNS::Topic
    Properties:
      DisplayName: Textract Notification Topic

  TextractSNSTriggerRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Ref TextractSNSTriggerRoleName
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: textract.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: TextractSNSPublishPolicy
          PolicyDocument:
            Statement:
              - Effect: Allow
                Action:
                  - sns:Publish
                Resource: !Ref TextractNotificationTopic

In the code, the start_document_text_detection function initiates a Textract job to process the document stored in our S3 bucket. The DocumentLocation section specifies the S3 bucket and file to be analysed, while the NotificationChannel defines the SNS topic ARN and the IAM role that Textract will use to send notifications. These values are coming from the environment variables that we passed through the CloudFormation template earlier.

import boto3
import logging
import os

logger = logging.getLogger()
logger.setLevel(logging.INFO)

textract = boto3.client('textract')

def handler(event, context):
    topic_arn = os.environ['TEXTRACT_NOTIFICATION_TOPIC']
    textract_role = os.environ['TEXTRACT_ROLE_ARN']

    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']

        try:
            # Start Textract asynchronous processing, use env vars
            response = textract.start_document_text_detection(
                DocumentLocation={
                    'S3Object': {
                        'Bucket': bucket,
                        'Name': key
                    }
                },
                NotificationChannel={
                    'SNSTopicArn': topic_arn,
                    'RoleArn': textract_role
                }
            )

            logger.info(f"File {key} is sent to Textract.")

        except Exception as e:
            # Adjacent f-strings are concatenated, keeping the message on one log line
            logger.error(f"Error processing file {key} "
                         f"from bucket {bucket}: {str(e)}")
            continue

    return {
        'statusCode': 200,
        'body': 'Textract processing initiation is complete!'
    }
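
S3 event notifications are delivered at least once, so the same upload can occasionally invoke this Lambda twice. One way to avoid starting duplicate jobs is the optional ClientRequestToken parameter of start_document_text_detection. The sketch below builds the request arguments with a token derived from the object; build_start_request is a hypothetical helper, and the use of the object's eTag (available as record['s3']['object']['eTag']) is an assumption about what identifies a unique upload.

```python
import hashlib


def build_start_request(bucket, key, etag, topic_arn, role_arn):
    """Build keyword arguments for textract.start_document_text_detection."""
    # A SHA-256 hex digest is 64 characters of [0-9a-f], which fits the
    # token's length and character constraints.
    token = hashlib.sha256(f"{bucket}/{key}/{etag}".encode("utf-8")).hexdigest()
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "NotificationChannel": {"SNSTopicArn": topic_arn, "RoleArn": role_arn},
        "ClientRequestToken": token,
    }
```

In the handler this would be called as `textract.start_document_text_detection(**build_start_request(...))`. Re-uploading a changed file produces a new eTag and therefore a new token.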

The second Lambda function will need the textract:GetDocumentTextDetection permission to retrieve the results from Textract once it's invoked by the SNS topic. This allows the Lambda to access the output of the Textract job and process the extracted text accordingly.

  PostProcessingLambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: logs
          PolicyDocument:
            Statement:
            - Effect: Allow
              Action:
                - logs:CreateLogGroup
                - logs:CreateLogStream
                - logs:PutLogEvents
              Resource: '*'
        - PolicyName: s3
          PolicyDocument:
            Statement:
            - Effect: Allow
              Action:
                - s3:PutObject
              Resource:
                - !Sub arn:aws:s3:::${BucketName}
                - !Sub arn:aws:s3:::${BucketName}/*
        - PolicyName: textract
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - textract:GetDocumentTextDetection
                Resource: '*'
        - PolicyName: sns
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - sns:subscribe
                Resource: !Ref TextractNotificationTopic

  PostProcessingLambdaFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: !Ref PostProcessingLambdaFunctionName
      Handler: index.handler
      Runtime: python3.12
      Code: ../src
      Role: !GetAtt PostProcessingLambdaExecutionRole.Arn
      Timeout: 10

  PostProcessingLambdaLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub '/aws/lambda/${PostProcessingLambdaFunctionName}'
      RetentionInDays: 7

  S3InvokePostLambdaPermission:
    Type: AWS::Lambda::Permission
    Properties:
      Action: lambda:InvokeFunction
      FunctionName: !Ref PostProcessingLambdaFunction
      Principal: sns.amazonaws.com
      SourceArn: !Ref TextractNotificationTopic

  PostProcessingLambdaSubscription:
    Type: AWS::SNS::Subscription
    Properties:
      Protocol: lambda
      TopicArn: !Ref TextractNotificationTopic
      Endpoint: !GetAtt PostProcessingLambdaFunction.Arn

This Lambda function is also responsible for processing the results. You can implement more complex transformations depending on your specific use case, but here's an example of appending the extracted text and saving it to a text file in an S3 bucket.

import json
import boto3
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

textract = boto3.client('textract')
s3 = boto3.client('s3')

def handler(event, context):
    for record in event['Records']:
        # Defined up front so the except blocks can always reference them
        sns_message = None
        job_id = None
        try:
            # The SNS message with job information
            sns_message = json.loads(record['Sns']['Message'])

            # Accessing the keys for getting Textract results
            job_id = sns_message['JobId']
            status = sns_message['Status']

            # Accessing the keys for destination
            bucket = sns_message['DocumentLocation']['S3Bucket']
            s3_object_key = sns_message['DocumentLocation']['S3ObjectName']
            file_name = s3_object_key.split('/')[1].split('.')[0]

            if status == 'SUCCEEDED':
                # Collect extracted text. Results are paginated, so keep
                # requesting pages until no NextToken is returned.
                detected_text = []
                request = {'JobId': job_id}
                while True:
                    response = textract.get_document_text_detection(**request)
                    for item in response.get('Blocks', []):
                        if item['BlockType'] == 'LINE':
                            detected_text.append(item['Text'])
                    next_token = response.get('NextToken')
                    if not next_token:
                        break
                    request['NextToken'] = next_token

                # Save collected text to S3
                output_key = f"processed/{file_name}.txt"
                s3.put_object(
                    Bucket=bucket,
                    Key=output_key,
                    Body="\n".join(detected_text)
                )
                logger.info(f"Detected text is written to S3/{output_key}")

            elif status == 'FAILED':
                logger.error(f"Job {job_id} failed.")

        except KeyError as e:
            logger.error(f"KeyError: Missing expected key {str(e)} "
                         f"in the message: {sns_message}")
        except Exception as e:
            logger.error(f"Error processing job {job_id}: {str(e)}")
    return {
        'statusCode': 200,
        'body': 'Notification processed successfully!'
    }
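
To show what the handler above is unpacking, here is a small sketch with a hand-built event. The values (job ID, bucket, file name) are made up for illustration, the sample is trimmed to the keys the handler reads, and parse_textract_notification is a hypothetical helper name.

```python
import json


def parse_textract_notification(record):
    """Extract the fields the post-processing Lambda needs from one SNS record."""
    message = json.loads(record["Sns"]["Message"])  # the message body is a JSON string
    location = message["DocumentLocation"]
    return {
        "job_id": message["JobId"],
        "status": message["Status"],
        "bucket": location["S3Bucket"],
        "key": location["S3ObjectName"],
    }


# Hand-built sample event with made-up values (illustrative only).
sample_event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps({
                    "JobId": "example-job-id",
                    "Status": "SUCCEEDED",
                    "API": "StartDocumentTextDetection",
                    "DocumentLocation": {
                        "S3ObjectName": "incoming/scan.pdf",
                        "S3Bucket": "example-bucket",
                    },
                })
            }
        }
    ]
}

parsed = parse_textract_notification(sample_event["Records"][0])
```

A sample event like this can also be passed to the handler in a local test, with the Textract and S3 clients replaced by stubs.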

This asynchronous architecture is a robust solution for automating document processing tasks, offering greater flexibility in handling larger documents, particularly PDFs and TIFF files. It ensures scalability while overcoming the size and page limitations of synchronous processing.

Summary

In this article, we explored how to build a serverless document processing solution using AWS Lambda and Textract, offering two distinct approaches depending on your workload. The first approach uses a simple synchronous setup, ideal for small workloads with single-page documents. It’s easy to implement and manage, making it perfect for scenarios where the document size is small, and quick processing is needed.

However, for larger workloads - particularly when dealing with PDFs and TIFF files that may contain multiple pages or large file sizes - the second approach, an asynchronous architecture, is essential. This more advanced setup offers greater flexibility, allowing for the processing of documents up to 500 MB and 3,000 pages. It uses a two-step Lambda process along with S3 and SNS to ensure scalability without running into the limitations of synchronous execution.

By choosing the appropriate architecture for your needs, you can balance ease of setup with the ability to handle larger, more complex document processing tasks.