Mohana Priya

Streamlining Data Conversion: XML to JSON with AWS Lambda and S3

Introduction:
In today's data-driven world, managing and transforming data formats efficiently is crucial for seamless integration and processing. XML (eXtensible Markup Language) and JSON (JavaScript Object Notation) are two widely used data formats, each with its own set of advantages. However, there are scenarios where converting data from XML to JSON becomes necessary to leverage JSON's simplicity and compatibility with modern web applications and APIs.

AWS (Amazon Web Services) provides a robust and scalable solution for such data transformation tasks using S3 buckets and Lambda functions. In this blog, we will walk you through the process of setting up an automated workflow that converts XML files stored in an S3 bucket to JSON format and uploads them to another S3 bucket using AWS Lambda. By the end of this tutorial, you'll have a clear understanding of how to harness the power of AWS to simplify your data processing pipelines.

Step-1: Setting Up S3 Buckets
First, create two S3 buckets: one for storing the input XML files and another for storing the converted JSON files. For this tutorial, we'll refer to them as input-bucket and output-bucket.

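If you prefer to script this step instead of using the console, a minimal boto3 sketch is shown below. The bucket names are placeholders (S3 bucket names must be globally unique), and outside us-east-1 you also need to pass a LocationConstraint:

import boto3

s3 = boto3.client('s3')

# Placeholder names -- S3 bucket names are global, so choose your own.
for bucket in ('input-bucket', 'output-bucket'):
    s3.create_bucket(Bucket=bucket)
    # Outside us-east-1, also pass:
    # CreateBucketConfiguration={'LocationConstraint': '<your-region>'}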

Step-2: Creating the Lambda Function
Navigate to the AWS Lambda console and create a new Lambda function using the Python 3.9 runtime.

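The console is the quickest way to do this, but for reference, a rough boto3 equivalent looks like the following. The function name, role ARN, and function.zip deployment package are placeholders for your own values:

import boto3

lambda_client = boto3.client('lambda')

# Placeholder role ARN and deployment package containing lambda_function.py.
with open('function.zip', 'rb') as f:
    lambda_client.create_function(
        FunctionName='xml-to-json-converter',
        Runtime='python3.9',
        Role='arn:aws:iam::123456789012:role/lambda-s3-role',
        Handler='lambda_function.lambda_handler',
        Code={'ZipFile': f.read()},
        Timeout=30,
    )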

Step-3: Configuring the Lambda Function
Permissions: Ensure your Lambda function has the necessary permissions to read from the input-bucket and write to the output-bucket. Attach an IAM role to the function with the AmazonS3FullAccess managed policy, or, better, a scoped-down policy that allows only s3:GetObject on the input bucket and s3:PutObject on the output bucket, as sketched below.

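For reference, here is a rough sketch of attaching such a scoped-down inline policy with boto3; the role name and bucket names are placeholders:

import json
import boto3

iam = boto3.client('iam')

# Least-privilege policy: read from the input bucket, write to the output bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::input-bucket/*"},
        {"Effect": "Allow", "Action": "s3:PutObject",
         "Resource": "arn:aws:s3:::output-bucket/*"},
    ],
}

iam.put_role_policy(
    RoleName='lambda-s3-role',           # placeholder execution role name
    PolicyName='xml-to-json-s3-access',
    PolicyDocument=json.dumps(policy),
)
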
Code: Use the following code to handle the XML to JSON conversion and upload the converted file to the output bucket:

import json
import urllib.parse
import xml.etree.ElementTree as ET
import boto3

s3 = boto3.client('s3')

def parse_element(element):
    """Recursively convert an ElementTree element into a dictionary."""
    parsed_data = {}
    # Attributes are stored under '@'-prefixed keys.
    if element.attrib:
        parsed_data.update(('@' + k, v) for k, v in element.attrib.items())
    # Non-empty text content is stored under '#text'.
    if element.text and element.text.strip():
        parsed_data['#text'] = element.text.strip()
    children = list(element)
    if children:
        child_data = {}
        for child in children:
            child_dict = parse_element(child)
            child_tag = child.tag
            if child_tag not in child_data:
                child_data[child_tag] = []
            child_data[child_tag].append(child_dict)
        # A tag that appears once maps to a dict; repeated tags map to a list.
        for k, v in child_data.items():
            if len(v) == 1:
                parsed_data[k] = v[0]
            else:
                parsed_data[k] = v
    return parsed_data

def xml_to_json(xml_content):
    """Parse an XML string and return a pretty-printed JSON string."""
    root = ET.fromstring(xml_content)
    return json.dumps(parse_element(root), indent=4)

def lambda_handler(event, context):
    destination_bucket = 'output-bucket'  # Replace with your destination bucket name

    try:
        records = event['Records']
    except KeyError:
        return {
            'statusCode': 400,
            'body': json.dumps('Event does not contain a Records key.')
        }

    results = []
    for record in records:
        try:
            source_bucket = record['s3']['bucket']['name']
            # Keys in S3 event notifications are URL-encoded (e.g. spaces become '+').
            key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        except KeyError:
            return {
                'statusCode': 400,
                'body': json.dumps('Record does not contain the S3 bucket name or object key.')
            }

        if not key.endswith('.xml'):
            results.append(f'Skipped {key}: not an XML file.')
            continue

        xml_obj = s3.get_object(Bucket=source_bucket, Key=key)
        xml_content = xml_obj['Body'].read().decode('utf-8')

        json_content = xml_to_json(xml_content)
        json_key = key[:-len('.xml')] + '.json'
        s3.put_object(
            Bucket=destination_bucket,
            Key=json_key,
            Body=json_content,
            ContentType='application/json'
        )
        results.append(f'Converted {key} to {json_key}.')

    return {
        'statusCode': 200,
        'body': json.dumps(results)
    }

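Before deploying, you can exercise the conversion logic locally to see how the parser represents attributes, text, and repeated elements; the snippet below is just a quick sanity check that uses the functions above and makes no AWS calls:

# Quick local sanity check of the conversion logic.
sample = '<book id="1"><title>AWS</title><title>Lambda</title></book>'
print(xml_to_json(sample))
# Attributes appear as '@'-prefixed keys ("@id": "1"), element text is stored
# under '#text', and the repeated <title> tag is collected into a list.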

Step-4: Setting Up S3 Trigger

Configure an S3 event notification on the input-bucket that invokes the Lambda function whenever a new object is created. With the trigger in place, each XML file is processed automatically as soon as it is uploaded. You can add the trigger in the console, or programmatically as sketched below.

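If you would rather wire up the trigger programmatically, here is a rough boto3 sketch; the account ID and function ARN are placeholders, and the console performs both of these steps for you:

import boto3

lambda_client = boto3.client('lambda')
s3 = boto3.client('s3')

function_arn = 'arn:aws:lambda:us-east-1:123456789012:function:xml-to-json-converter'

# Allow S3 to invoke the function.
lambda_client.add_permission(
    FunctionName='xml-to-json-converter',
    StatementId='s3-invoke',
    Action='lambda:InvokeFunction',
    Principal='s3.amazonaws.com',
    SourceArn='arn:aws:s3:::input-bucket',
)

# Send ObjectCreated events for .xml keys to the function.
# Note: this call replaces any existing notification configuration on the bucket.
s3.put_bucket_notification_configuration(
    Bucket='input-bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': function_arn,
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {'Key': {'FilterRules': [{'Name': 'suffix', 'Value': '.xml'}]}},
        }]
    },
)
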
Step-5: Testing the Workflow

Upload a sample XML file to the input-bucket and verify that the Lambda function is triggered. Check the output-bucket for the converted JSON file. For example, you can use the following XML content for testing:


<?xml version="1.0" encoding="UTF-8"?>
<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>

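If you save the snippet above as note.xml, you can upload it and check the result from Python as well; the sketch below assumes the bucket names used in this tutorial:

import boto3

s3 = boto3.client('s3')
s3.upload_file('note.xml', 'input-bucket', 'note.xml')

# After the Lambda function has run, note.json should appear here.
response = s3.list_objects_v2(Bucket='output-bucket')
for obj in response.get('Contents', []):
    print(obj['Key'])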

Output:
For the sample note.xml above, the converted note.json written to the output-bucket should look like this:

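{
    "to": {
        "#text": "Tove"
    },
    "from": {
        "#text": "Jani"
    },
    "heading": {
        "#text": "Reminder"
    },
    "body": {
        "#text": "Don't forget me this weekend!"
    }
}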

Conclusion:
By following the steps outlined in this blog, you have successfully set up an automated system for converting XML files to JSON using AWS Lambda and S3. This approach leverages the scalability and reliability of AWS services to streamline your data processing tasks, making your workflows more efficient and easier to manage.

Whether you are handling large datasets or integrating with systems that require JSON format, this solution provides a robust and scalable way to manage your data transformations. Start leveraging AWS Lambda and S3 today to simplify your data processing pipelines and enhance your application's performance.
