Batch Processing Large Datasets in Node.js Without Running Out of Memory

If you've ever tried handling massive datasets in Node.js, you know how quickly memory issues can become a nightmare. One minute, your script is running smoothly; the next, it's crashing with an out-of-memory (OOM) error. I recently faced this issue when extracting millions of log entries from OpenSearch and uploading them to S3, and I want to share what I learned.

Let’s walk through some practical ways to process large datasets without bringing your server to its knees.


The Challenge: Why Large Data Processing Can Be a Problem

So, what causes these memory issues in the first place? When I first attempted to fetch and process 3 million logs, I was running everything in memory, thinking, "How bad could it be?" Turns out, pretty bad.

Common Pitfalls:

  • Holding too much data at once – If you fetch millions of records in one go, you're asking for trouble.
  • Inefficient batch processing – If you don’t clean up memory between batches, things get messy fast.
  • Not using streams – Handling everything as large arrays instead of streaming the data keeps memory consumption unnecessarily high.

The Fix: Smarter Ways to Process Large Data

1. Process Data in Small Chunks

Rather than fetching everything at once, grab smaller chunks and process them incrementally. This way, you never hold more data in memory than necessary.

async function fetchAndProcessLogs() {
    let hasMoreData = true;
    let nextToken = null;

    while (hasMoreData) {
        // Fetch one page of logs, process it, then advance the pagination cursor
        const { logs, newNextToken } = await fetchLogsFromOpenSearch(nextToken);
        await processAndUpload(logs);
        nextToken = newNextToken;
        hasMoreData = !!nextToken; // stop once there is no cursor for a next page
    }
}

This method ensures that we’re only working with manageable amounts of data at a time.
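
For reference, here is one way the fetchLogsFromOpenSearch helper used above could be written. This is a minimal sketch, assuming the official @opensearch-project/opensearch client, an index named logs with a sortable @timestamp field, and a page size of 1,000 – adjust all of those to match your cluster and data.

const { Client } = require('@opensearch-project/opensearch');

// Hypothetical endpoint – point this at your own cluster
const client = new Client({ node: 'https://your-opensearch-endpoint:9200' });

async function fetchLogsFromOpenSearch(nextToken) {
    const body = {
        size: 1000, // small pages keep each batch memory-friendly
        sort: [{ '@timestamp': 'asc' }], // add a unique tiebreaker field in production
        query: { match_all: {} },
    };
    if (nextToken) {
        body.search_after = nextToken; // resume from where the previous page ended
    }

    const response = await client.search({ index: 'logs', body });
    const hits = response.body.hits.hits;

    return {
        logs: hits.map(hit => hit._source),
        // the sort values of the last hit act as the cursor for the next page
        newNextToken: hits.length < body.size ? null : hits[hits.length - 1].sort,
    };
}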

2. Use Streams Instead of Keeping Data in Memory

Streams allow you to process data as it arrives, rather than waiting for everything to load. When compressing logs before uploading to S3, streams + zlib are your best friends:

const { createGzip } = require('zlib');
const { PassThrough } = require('stream');
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function uploadToS3(logs) {
    const passThrough = new PassThrough();
    const gzipStream = createGzip();
    passThrough.pipe(gzipStream);

    const uploadPromise = s3.upload({
        Bucket: 'your-bucket-name',
        Key: `logs-${Date.now()}.gz`,
        Body: gzipStream,
        ContentEncoding: 'gzip',
        ContentType: 'application/json',
    }).promise();

    for (const log of logs) {
        if (!passThrough.write(JSON.stringify(log) + '\n')) {
            await new Promise(resolve => passThrough.once('drain', resolve)); // Handle backpressure while streaming logs
        }
    }

    passThrough.end();

    // Wait for the remaining gzipped data to flush and the S3 upload to complete
    await uploadPromise;
}

With this approach, logs are streamed directly to S3, avoiding unnecessary memory overhead.
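
If you are on AWS SDK v3 (v2 is now in maintenance mode), the same streaming idea works with the Upload helper from @aws-sdk/lib-storage, which handles the multipart upload from a stream for you. Here is a minimal sketch under the same assumptions as above (hypothetical bucket name, newline-delimited JSON logs):

const { S3Client } = require('@aws-sdk/client-s3');
const { Upload } = require('@aws-sdk/lib-storage');
const { createGzip } = require('zlib');

async function uploadToS3WithV3(logs) {
    const gzipStream = createGzip();

    const upload = new Upload({
        client: new S3Client({}),
        params: {
            Bucket: 'your-bucket-name',
            Key: `logs-${Date.now()}.gz`,
            Body: gzipStream, // the helper reads from the stream as it fills
            ContentEncoding: 'gzip',
            ContentType: 'application/json',
        },
    });

    for (const log of logs) {
        if (!gzipStream.write(JSON.stringify(log) + '\n')) {
            await new Promise(resolve => gzipStream.once('drain', resolve)); // respect backpressure
        }
    }
    gzipStream.end();

    await upload.done(); // resolves once the multipart upload finishes
}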

3. Allow Garbage Collection to Do Its Job

Garbage collection in Node.js works best when we don’t hold onto unnecessary references. Here’s how you can help it along:

  • Yield to the event loop between batches – awaiting setImmediate() (or a zero-delay setTimeout) gives the garbage collector a chance to run; see the sketch after the next snippet.
  • Make sure large arrays or objects are dereferenced once they’re processed, so they can be collected.

async function processAndUpload(logs) {
    await uploadToS3(logs);
    global.gc && global.gc(); // Manually trigger GC (only exposed when Node is started with --expose-gc)
}
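
Here is one way to fold that yield into the batch loop from step 1 – a minimal sketch using the promisified setImmediate from Node's built-in timers/promises module (Node 15+). The helper names match the earlier snippets.

const { setImmediate: yieldToEventLoop } = require('timers/promises');

async function fetchAndProcessLogs() {
    let nextToken = null;
    do {
        const { logs, newNextToken } = await fetchLogsFromOpenSearch(nextToken);
        await processAndUpload(logs);
        nextToken = newNextToken;
        // Hand control back to the event loop so garbage collection can run before the next batch
        await yieldToEventLoop();
    } while (nextToken);
}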

4. Increase Node.js Memory Allocation (If Needed)

If you’re working with extremely large datasets and still hitting memory issues, you can raise Node’s old-space heap limit (the value is in megabytes, so 8192 gives roughly 8 GB):

node --max-old-space-size=8192 yourScript.js

That said, fixing memory management should be your first priority before resorting to increasing memory limits.


Final Thoughts

Handling large datasets in Node.js doesn't have to be a struggle. By batch processing, using streams, and optimizing memory usage, you can process millions of records smoothly without running out of RAM.

Have you faced similar challenges? Let’s discuss in the comments!
