Lucas Weis Polesello

Posted on Mar 17

The Zalgo Effect and Resource Leakage - A Case

First of all, what's the `Zalgo Effect`?

Zalgo Effect is a term used to describe unexpected outcomes of mixing sync and async JavaScript code - meaning if you mix these two approaches ~SOMETHING~ weird will happen.
It's one of those things you kinda don't understand until you see it in real production systems.

Well, in the internet and Pop culture - it's more than that. Think of it like a monster.

So what it has to do with Resource Leakage?

Occasionally, our SRE team received a couple PagerDuty alerts claiming our services were restarting and not able to work properly due to Error: No channels left to allocate - ie RabbitMQ connections were maxing out in channel allocation. (For RabbitMQ reference into Channels and Connections)

(Ofc the screenshot is just an old print)

It was clear some code was leaking channel creations. No one knew what could potentially be - but God I listened somewhere about this Zalgo Effect

This "somewhere" is called Node.js Design Patterns: Design and Implement Production-grade Node.js Applications Using Proven Patterns and Techniques - Packet book.

How was I so sure this "Zalgo" was the culprit?

The service throwing that error was only responsible to fan out a couple messages for a lot of other services - so it was easy as creating a our internal Queue object and running N promises concurrently to publish some message - as the first thing the script run.
The RabbitMQ Management UI showed me that we created one channel for each promise - roughly.

But why it only happened in some scenarios?

That's where the Zalgo Effect pops in.

The PubSub code was built back in ~2015 - Node 4 - where the callback style was the go-to. Our Engineers created the abstraction Queue which dealt with almost 50% of our Event-Driven Architecture by itself. It was a very naive implementation - pointing to a single RabbitMQ Node.

Our legacy code was 100% usage of async.series - so when publishing messages it would be always sequentially. Later then parallel processing was added, issues started to appear more frequently. It was always just an silent issue.

So the code assumed the following:

Assert the exchange, queues and necessary resources - using one channel - which we could call consumeChannel.
1. The consume channel is created whenever the connection is made - and it is always created.
Our confirmChannel - ie the channel we used to publish events was lazily created - only being created when we need to publish some message via that connection.
1. This code mixed async and sync code heavily

Imagine the following:
We call assertConfirmChannel - which calls the AmqpLibConfirmChannelSingleton and then it creates the channel. If the Singleton, still has no inner-instance to refer, it calls a promise to create the confirmChannel and attaches the error listeners. Code example below:

What happens if two concurrent promises reaches the `AmqpConfirmChannelSingleton.getInstance` at the same time - in the same event-loop tick?

Code would reach this line from amqpLibChannelDecorator.createConfirmChannel - N times where the Promise is not resolved yet.
It means, we would have N promises created - mistakenly creating useless channels and de-referencing them.

The NodeJS GC won't kill this - since the channel is just an abstraction and still has "reference" inside the connection.

This is where the code was leaking channels.

Fixing the problem

The hotfix

Due to production hours - our hot-fix was to simply await the first promise and then fan out the other promises.
The real fix - below - was shipped with the PubSub refactoring.

How did we fix this in the long term?

If you want a real solution, here's what the V2 looked like - the idea is to create Promises and assign variables with them, instead of doing await on it. Example as below:

Notice how this.connection has an unawaited promise.

This easily fixes the problem - by setting a variable as promise and checking its existence.

A more robust style example - where you actually need to initialize a couple of resources, you could do something like below - which is the exact same idea.

Create a function to execute the entire Promise.
Set up some reference to it
If requested the same, just use the same Promise.

Ok - but why it fixes the problem?

NodeJS runs upon the famous Event-loop.

TL;DR: It is an ever-running while(true) with some well-defined and ordered steps from a queue of functions to execute.

Basically, what happens in this example is - too much synchronous code executing without ever reaching the Poll phase - where the IO callbacks are executed.

DEV Community

The Zalgo Effect and Resource Leakage - A Case

First of all, what's the `Zalgo Effect`?

So what it has to do with Resource Leakage?

How was I so sure this "Zalgo" was the culprit?

But why it only happened in some scenarios?

What happens if two concurrent promises reaches the `AmqpConfirmChannelSingleton.getInstance` at the same time - in the same event-loop tick?

Fixing the problem

The hotfix

How did we fix this in the long term?

Ok - but why it fixes the problem?

Thanks for reading it and leave a comment!

Top comments (0)

Read next

GNS3 Üzerine Router İmajı Eklenmesi

TDoC '24 Day 6: Building a Web Interface for Vocalshift with Flask

Week 3 Done

Claude.ai Voice Input & Text-to-Speech (OSS, Chrome Extension)

First of all, what's the Zalgo Effect?

So what it has to do with Resource Leakage?

How was I so sure this "Zalgo" was the culprit?

But why it only happened in some scenarios?

What happens if two concurrent promises reaches the AmqpConfirmChannelSingleton.getInstance at the same time - in the same event-loop tick?

Fixing the problem

The hotfix

How did we fix this in the long term?

Ok - but why it fixes the problem?

Thanks for reading it and leave a comment!

Read next

GNS3 Üzerine Router İmajı Eklenmesi

TDoC '24 Day 6: Building a Web Interface for Vocalshift with Flask

Week 3 Done

Claude.ai Voice Input & Text-to-Speech (OSS, Chrome Extension)

First of all, what's the `Zalgo Effect`?

What happens if two concurrent promises reaches the `AmqpConfirmChannelSingleton.getInstance` at the same time - in the same event-loop tick?