Sebastian Zakłada 🧛

Posted on Dec 12 • Edited on Dec 19

Avoiding the SQS Event Mapping Trap 💥 in Lambda

#aws #sqs #lambda #learning

tl;dr

Triggering Lambda directly from SQS through an event source mapping can cause messages to be silently dropped from the queue when event filtering is used.

This is something that I kept forgetting so I thought I would put it in writing for my future self to finally remember. You guys can reap all the benefits!

When building event-driven architectures using Lambda and SQS, event filtering can seem like an elegant solution for message routing. If a message matches the criteria, trigger the Lambda; otherwise, ignore it. Nice! What can go wrong?

When a message is sent to an SQS queue, messages that match the filter are processed by Lambda. If processing fails, the message is retried, and once it reaches the maximum retry count it's sent to a Dead Letter Queue (DLQ).
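The retry-then-DLQ flow described above can be sketched as a toy model. This is a simplification for illustration only: the `drain` function and its in-memory queue are hypothetical, and `max_receive_count` stands in for the `maxReceiveCount` setting of a real redrive policy.

```python
from collections import deque

def drain(queue, handler, max_receive_count):
    """Toy model of SQS redrive: retry a failing message, then dead-letter it."""
    dlq = []
    receive_counts = {}
    pending = deque(queue)
    while pending:
        message = pending.popleft()
        receive_counts[message] = receive_counts.get(message, 0) + 1
        try:
            handler(message)                 # success: the message is deleted
        except Exception:
            if receive_counts[message] >= max_receive_count:
                dlq.append(message)          # retries exhausted: move to the DLQ
            else:
                pending.append(message)      # becomes visible again for a retry
    return dlq, receive_counts

def flaky(message):
    """Hypothetical handler that can never process the 'bad' message."""
    if message == "bad":
        raise ValueError("cannot process")

dlq, counts = drain(["ok", "bad"], flaky, max_receive_count=3)
```

In this model the failing message is received three times before it lands in the DLQ, while the healthy one is processed once and removed.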

One might think such an approach would give the best of all worlds without too much complexity:

Resilient - resilience guaranteed by SQS and reliability with DLQ preventing data loss
Elegant - filtering rules provide a clean way to route messages to dedicated lambdas for different message types
Minimalistic - simple infrastructure requirements

You need to be careful though

There's a critical behavior that can lead to unexpected data loss if not properly understood.

Many developers will assume that when a message doesn't match the event mapping filter criteria, it will be put back in the queue for other consumers or will not be picked up from the queue by Lambda at all.

This is not the case

Once a message batch is picked up by Lambda's polling mechanism, messages that don't match the filter criteria are deleted from the queue without being processed by any function.

Yes, you read it right. Messages are PERMANENTLY DROPPED.

Real-World Impact

Consider this scenario:

// Message in SQS queue
{
    "type": "foo",
    "action": "login",
    "userId": "12345"
}

// Filter criteria - Lambda A 
{
    "type": ["foo"],
    "action": ["signup"]
}

// Filter criteria - Lambda B
{
    "type": ["foo"],
    "action": ["login"]
}

If Lambda A polls this message first:

The message is picked up by the Lambda A trigger and made invisible to other consumers
It doesn't match filter criteria
Message is dropped and never put back in the queue
Lambda B never gets a chance to process it, even though it matches its criteria
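The scenario above can be reproduced with a small simulation. It is only a sketch: `matches` implements plain exact-value matching on the simplified filters from this example, and `poll` is a hypothetical stand-in for what Lambda's poller does with a received message.

```python
def matches(filter_criteria, message):
    """True when every filtered field holds one of the allowed values."""
    return all(message.get(key) in allowed
               for key, allowed in filter_criteria.items())

def poll(queue, filter_criteria):
    """Toy event source mapping: take one message, then invoke or discard."""
    message = queue.pop(0)             # received: hidden from other consumers
    if matches(filter_criteria, message):
        return ("invoked", message)    # handed to the function
    return ("discarded", message)      # deleted, never returned to the queue

FILTER_A = {"type": ["foo"], "action": ["signup"]}
FILTER_B = {"type": ["foo"], "action": ["login"]}

queue = [{"type": "foo", "action": "login", "userId": "12345"}]
outcome, _ = poll(queue, FILTER_A)     # Lambda A's poller gets there first
remaining_for_b = len(queue)           # nothing left for Lambda B to receive
```

Here `outcome` is `"discarded"` and `remaining_for_b` is `0`, even though the message would have satisfied `FILTER_B`.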

This behavior can manifest gradually. During testing with low message volumes, messages might get processed correctly by chance. As production traffic increases, concurrent Lambda polling increases too, and the likelihood of messages being dropped grows with scale. On top of that, there's no built-in monitoring or alerting for filtered-out messages.

This behavior creates a well-concealed pitfall in your event-driven architecture. While event filtering appears to offer a clean solution for message routing, it is not the case with SQS.

The issue becomes particularly treacherous in systems with multiple consumers where the order of message polling can determine whether a message gets processed or discarded. Without careful consideration, you might end up losing important data without any indication of failure or warning. Even more deceptively, this issue might not surface during initial testing, only to emerge as a serious problem later when message volumes and concurrent processing increase.

Why this is happening

It's important to understand the difference between SQS and other messaging services in AWS

| Characteristic | SQS | SNS | EventBridge | DynamoDB Streams / Kinesis Data Streams |
| --- | --- | --- | --- | --- |
| Type | One-to-one queue | Pure fanout | Event bus | Streaming |
| Message Processing | Exactly one Lambda processes each message with visibility timeout | Each subscribed Lambda gets its own message copy | Each rule's target gets its own event copy | Each Lambda reads independently from its checkpoint |
| Filtering Behavior | Filtered messages permanently deleted from queue | Filtered messages silently dropped but still reach other subscribers | Filtered events dropped for that target only | Filtered records stay in stream |
| Error Handling | Failed messages return to queue until `maxReceiveCount`, then go to DLQ if configured | Failed Lambda invocations go to DLQ if configured, not affecting other subscribers | Failed deliveries retry based on target policy then go to target's DLQ if configured | Failed processing retries until record expires (24h DDB, configurable in Kinesis) |

Knowing that, the 🎲 dropped messages behavior becomes clearer when we understand how SQS handles message visibility and processing.

Source: Amazon Simple Queue Service documentation

When a message enters an SQS queue, its lifecycle follows a pattern:

Publisher sends the message to the queue where it becomes available for processing. When a consumer (in this case, Lambda) retrieves the message, it triggers a visibility timeout period during which the message is invisible to other consumers.
During this period, Lambda applies its filter criteria, and if the message doesn't match, it's deleted from the queue - just as if it had been successfully processed.

The key insight is that the message is deleted from the queue by Lambda's polling layer (the event source mapping), before your Lambda function even sees the message. This means:

Once a Lambda polls a message batch, the filtering happens during the visibility timeout
Messages failing the filter criteria are deleted immediately
There's no opportunity for other consumers to process these messages

This is particularly important because once the deletion occurs, there's no mechanism to recover or reprocess the message, even if other consumers might have been interested in it.

In other words - the visibility timeout, which normally serves as a safety mechanism to prevent duplicate processing, becomes the window during which messages can be permanently lost if filter criteria don't match.
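To make the contrast concrete, here is a minimal in-memory model of a visibility timeout. `ToyQueue` is purely hypothetical and ignores almost everything a real queue does; it only shows the difference between a message that is never deleted (it reappears) and one that is deleted inside the window (it is gone).

```python
class ToyQueue:
    """Minimal in-memory stand-in for a queue with a visibility timeout."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.visible_at = {}               # message id -> time it is visible

    def send(self, message_id, now=0):
        self.visible_at[message_id] = now

    def receive(self, now):
        for message_id, visible_at in self.visible_at.items():
            if visible_at <= now:
                # Hide the message from other consumers for the timeout window.
                self.visible_at[message_id] = now + self.visibility_timeout
                return message_id
        return None

    def delete(self, message_id):
        self.visible_at.pop(message_id, None)


queue = ToyQueue(visibility_timeout=30)
queue.send("m1")
queue.send("m2")

# m1: the consumer crashes and never deletes it, so it reappears after 30s.
first = queue.receive(now=0)
# m2: the filter rejects it, so the poller deletes it inside the window.
second = queue.receive(now=0)
queue.delete(second)

hidden = queue.receive(now=10)         # both messages unavailable mid-window
redelivered = queue.receive(now=30)    # only m1 comes back
gone = queue.receive(now=30)           # m2 never does
```

A failed consumer gives the message a second chance once the timeout expires. A filter mismatch does not, because the delete has already happened.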

Unnecessary ambiguity

While not explicitly documented, I've consistently observed this behavior across multiple implementations and test scenarios. Each attempt to work around this limitation led to the same conclusion.

When Lambda receives messages from SQS with an event filter configured, any message that doesn't match the filter criteria is permanently dropped.
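For reference, the filters shown earlier in this article are simplified. As far as I understand the API, an SQS event source mapping expects each filter as a JSON string under `Filters[].Pattern`, with the message payload addressed through a `body` key. The sketch below builds that structure; `build_filter_criteria` is a hypothetical helper, and the ARN and function name in the comment are placeholders.

```python
import json

def build_filter_criteria(body_pattern):
    """Wrap a body pattern in the FilterCriteria shape Lambda expects."""
    # Each filter's Pattern is a JSON *string*; for SQS, the "body" key
    # targets the (JSON) message body.
    return {"Filters": [{"Pattern": json.dumps({"body": body_pattern})}]}

criteria = build_filter_criteria({"type": ["foo"], "action": ["signup"]})

# With boto3 installed and credentials configured, this could be passed as:
#   boto3.client("lambda").create_event_source_mapping(
#       EventSourceArn=QUEUE_ARN,        # hypothetical placeholder
#       FunctionName="lambda-a",         # hypothetical placeholder
#       FilterCriteria=criteria,
#   )
```

Every message that does not satisfy a pattern configured this way is removed from the queue by the mapping.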

Call to Action for AWS

Please, pretty please with sugar on top - make it less confusing. The current behavior should be more explicit in the documentation. While the docs state that non-matching records are "discarded," they should clearly emphasize that messages are permanently deleted from the queue and cannot be processed by other consumers.

Moreover, the current documentation suggests using event filtering to "reduce unnecessary Lambda invocations" by filtering messages with certain data parameters. This guidance is problematic because:

It actively encourages an architectural pattern that can lead to data loss
The suggested use case could be better served by SNS fanout, EventBridge rules, or separate SQS queues

AWS should consider improving the implementation of event filtering for SQS to Lambda integrations to prevent silent data loss while maintaining simplicity. Here are two potential high-level solutions:

1. Make the current filtering behavior configurable through a simple parameter in the event source mapping:

{
  "FilterCriteria": { ... },
  "FilteredMessageBehavior": "DELETE | RETURN_TO_QUEUE"
}

This would allow teams to choose whether non-matching messages should be deleted or remain available for other consumers.

2. Or even better, add support for automatic SQS queue creation when setting up EventBridge rules or SNS topic subscriptions. Instead of manually creating queues for each filter pattern, allow developers to define their routing logic and have AWS automatically manage the underlying queues:

# SNS example
  OrdersTopic:
    Type: AWS::SNS::Topic
    Properties:
      Subscriptions:
        - Protocol: sqs-managed  # New protocol type
          FilterPolicy: 
            type: ["order_created"]
          EndpointProperties:    # Queue will be created automatically
            FunctionArn: !GetAtt OrderProcessor.Arn
            DLQEnabled: true

This would automatically create the necessary SQS queues behind the scenes, while presenting a simple interface to developers.

Currently, developers must choose between using Lambda's problematic event filtering or managing complex queue infrastructure. By automating queue creation while preserving control over event routing, AWS could provide a better developer experience without sacrificing reliability or flexibility.

Until these improvements are made, users should be strongly cautioned against using event filtering with SQS to Lambda integrations.

I am struggling to understand the use case for SQS event filtering

With the current behavior of silent message dropping, SQS event filtering appears to be a solution in search of a problem. The most obvious use cases where filtering might seem appealing actually turn out to be anti-patterns that could be better handled through other AWS services.

For example, trying to route different message types to specific Lambda functions would be better handled by SNS fanout or EventBridge rules, as these services are designed for message routing. Using SQS filtering for this purpose introduces the risk of silent message loss due to the polling order of Lambda functions. Improved resilience in this pattern comes at a cost of having to maintain dedicated SQS queues for each of the filters, which is exactly what an SQS filter misleadingly promises to avoid.

Optimization through reduced Lambda invocations might seem like another potential use case, but it comes with the dangerous trade-off of potentially losing business-critical messages. The minimal cost savings from fewer Lambda invocations rarely justify the operational risk of missing important events. This would be better handled either through proper capacity planning or, if using EventBridge or SNS is not an option, by implementing filtering logic within the Lambda function itself.

Another apparent use case might be implementing priority queues by filtering for high-priority messages first. However, this approach is fundamentally flawed due to the message deletion behavior - lower priority messages could be permanently lost if they're picked up by the high-priority consumer first. Better solutions exist for this scenario.

The current implementation seems to fall into an awkward place: it creates a significant risk of data loss while not providing unique benefits. What's more, the event filtering implementation breaks SQS's core guarantee of reliable message delivery. It introduces silent failure modes that contradict the reliability patterns developers typically expect.

What's the alternative?

Consider these approaches. Just remember, the best solution will always depend on what you are solving for.

Separate Queues

Instead of using filter criteria, create dedicated queues for different message types. This ensures reliable message delivery at the cost of additional infrastructure.
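One way this could look on the producer side is a small router that picks the queue by message type. This is a sketch under assumptions: the queue URLs are made up, and `send` is injected so the example runs without AWS access. With boto3, `boto3.client("sqs").send_message` accepts the same `QueueUrl` and `MessageBody` keyword arguments.

```python
import json

# Hypothetical queue URLs, one per message type.
QUEUE_URLS = {
    "signup": "https://sqs.example.invalid/123456789012/signup-events",
    "login": "https://sqs.example.invalid/123456789012/login-events",
}

def route(message, send):
    """Send the message to the queue dedicated to its action."""
    action = message.get("action")
    if action not in QUEUE_URLS:
        # Fail loudly instead of dropping unknown message types.
        raise ValueError(f"no queue configured for action {action!r}")
    send(QueueUrl=QUEUE_URLS[action], MessageBody=json.dumps(message))

sent = []
route({"type": "foo", "action": "login", "userId": "12345"},
      send=lambda **kwargs: sent.append(kwargs))
```

Each Lambda then consumes its own queue without any filter, so nothing depends on polling order.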

Message Pre-filtering

Implement filtering logic before messages enter SQS, such as using SNS with multiple SQS subscriptions or EventBridge with dedicated pipes or rules for message routing.

While adding infrastructure complexity, it provides important benefits:

Explicit message routing with reliable delivery
Clear visibility into message flow
Independent scaling for each consumer
No risk of accidental message loss

The required SQS infrastructure provisioning part can be simplified by templating and scripting, creating reusable patterns that make queue management more maintainable.
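As an illustration of the SNS variant, the sketch below builds the arguments for an SNS-to-SQS subscription that filters on the message payload. The ARNs are hypothetical and `sqs_subscription` is a helper invented for this example. It assumes the queue's access policy already allows the topic to send messages to it.

```python
import json

def sqs_subscription(topic_arn, queue_arn, body_policy):
    """Build arguments for an SNS -> SQS subscription filtering on the body."""
    return {
        "TopicArn": topic_arn,
        "Protocol": "sqs",
        "Endpoint": queue_arn,
        "Attributes": {
            "FilterPolicy": json.dumps(body_policy),
            # Default scope is message attributes; filter on the payload.
            "FilterPolicyScope": "MessageBody",
        },
    }

params = sqs_subscription(
    "arn:aws:sns:us-east-1:123456789012:events",         # hypothetical ARN
    "arn:aws:sqs:us-east-1:123456789012:login-events",   # hypothetical ARN
    {"type": ["foo"], "action": ["login"]},
)
# With boto3: boto3.client("sns").subscribe(**params)
```

Because every subscription receives its own copy, a message rejected by one filter policy is still delivered to any other subscription it matches.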

In-Function Filtering

Process all messages in a single Lambda function and implement filtering logic within your code. While this may result in more Lambda invocations, it guarantees reliability and is a dead simple solution that's easy to understand even for junior developers.
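A minimal sketch of such a handler is shown below. The `wanted` rule and the `process` placeholder are hypothetical, and the return value assumes partial batch responses (`ReportBatchItemFailures`) are enabled on the event source mapping, so only failed messages are retried.

```python
import json

def wanted(payload):
    """Hypothetical routing rule: only logins need real processing."""
    return payload.get("type") == "foo" and payload.get("action") == "login"

def process(payload):
    """Placeholder for the real business logic."""
    print(f"processing login for user {payload['userId']}")

def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            payload = json.loads(record["body"])
            if wanted(payload):
                process(payload)
            else:
                # Deliberate, visible skip: log it so nothing vanishes silently.
                print(f"skipping message {record['messageId']}")
        except Exception:
            # Report only this message as failed so SQS retries it.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

The difference from event filtering is that every skip is an explicit decision in your code, which you can log, count, or alert on.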

Don't use Queues

For simple workflows, triggering Lambda directly from SNS can work well. This approach is suitable when message ordering isn't critical and event volume is predictable and within Lambda's concurrency limits. Keep in mind that without a queue there is no buffering or batching, and less flexibility in handling traffic spikes.
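For completeness, a handler subscribed directly to an SNS topic reads the payload from `Records[].Sns.Message` rather than from `body`. The sketch below assumes the published message is a JSON document with a `userId` field, as in the earlier example; `sns_handler` is a name chosen for illustration.

```python
import json

def sns_handler(event, context):
    """Handle messages delivered straight from an SNS topic."""
    processed = []
    for record in event["Records"]:
        # The published message arrives as a string under Sns.Message.
        payload = json.loads(record["Sns"]["Message"])
        processed.append(payload["userId"])
    return processed
```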

Wrapping up

After digging into SQS event filtering, it's clear there's more to this feature than meets the eye.

Here's what we've learned:

The Unexpected Behavior - messages that don't match your filters are silently deleted without notification - a concerning end result
Alternative Approaches - several more reliable options exist:
- Separate queues for different message types
- SNS or EventBridge for message routing
- Filtering logic within Lambda functions
- Direct SNS-to-Lambda for simpler workflows

The reality is that sometimes seemingly elegant solutions come with hidden complexities. Understanding these nuances helps us make better architectural decisions, even if it means taking a slightly longer path to get there.

I'm still searching for compelling use cases where SQS event filtering provides clear advantages over traditional approaches. If you've successfully implemented this feature in production or found specific scenarios where it works well, I'd be interested in hearing about your experience in the comments.

Disclaimer

This article is an independent developer guide. All views, experiences and recommendations are my own. Your mileage may vary.
Always refer to the latest AWS documentation for the most current information about service behaviors and limitations. Confirm any assumptions by way of test implementations.
