Ably Blog for Ably

Posted on Oct 15 • Originally published at ably.com

Realtime reliability: How to ensure exactly-once delivery in pub/sub systems

#pubsub #webdev #architecture

In pub/sub systems, the question isn't just, “Will my data reach its destination?”, it's also, “How many times will it get there?”

Exactly-once delivery is the gold standard of distributed systems, but achieving it with pub/sub requires significant engineering effort. In fact, for some systems, the challenge is so great that they settle for at-least-once delivery, meaning your subscribers might receive individual messages more than once.

So, why is exactly-once delivery so hard to achieve? And how do the architectural challenges of achieving exactly-once semantics change as your system scales?

To get the answers, first we’ll need to remind ourselves of how the pub/sub pattern is architected.

The publish/subscribe architecture pattern

The name pub/sub, or publish/subscribe, is a particularly useful description of how the pattern works.

Publishers publish messages to topics by sending them to a broker. The broker then forwards those messages to subscribers that are subscribed to the topic.

Unlike in a message queue, where there’s usually a one-to-one relationship between message producers and consumers, there can be any number of publishers sending messages to a topic and any number of subscribers receiving messages from that topic. Including zero.

That’s because pub/sub is loosely coupled. Publishers have no knowledge of what happens to their messages after they enter the broker. And subscribers don’t need to know anything about the origins of the messages they receive.

That’s useful because both publishers and subscribers need to integrate only with the broker and it allows them to operate independently from one another. But the trade-off is that decoupling in this way means pub/sub systems need extra guardrails to track and ensure message delivery. And that’s where pub/sub delivery guarantees come into play.
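To make the pattern concrete, here is a minimal in-memory sketch of a broker. The `Broker` class, its method names, and the topic strings are illustrative assumptions rather than any real product's API, and the sketch deliberately has no delivery guarantees at all.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str], None]


class Broker:
    """Toy in-memory broker: routes each published message to a topic's subscribers."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> int:
        """Forward the message to every subscriber; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


broker = Broker()
inbox_a: List[str] = []
inbox_b: List[str] = []
broker.subscribe("match:42", inbox_a.append)
broker.subscribe("match:42", inbox_b.append)

delivered = broker.publish("match:42", "GOAL!")     # reaches both subscribers
unheard = broker.publish("match:99", "Kick-off")    # zero subscribers on this topic
```

Notice that the publisher never learns who received the message, and a message published to a topic with zero subscribers simply goes nowhere. That is the decoupling described above, and it is also why delivery tracking has to be added deliberately.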

Figure: Tightly coupled vs. decoupled architectures in authentication systems, and how pub/sub systems enable asynchronous communication flows.

Pub/sub delivery guarantees

Pub/sub systems offer three primary delivery guarantees, each with its own benefits and trade-offs. Let’s look at how each guarantee works and the potential downsides.

At-most-once delivery

The simplest approach. If everything works, your message will arrive. But there’s no tracking or retries. If a failure occurs, the message is lost.

How it can go wrong

Publisher failures: If the publisher fails to send a message or crashes, the message is lost forever. There are no retries to recover it.
Broker failures: If the broker crashes or loses the message in transit, there’s no mechanism to resend it. The message is gone.
Subscriber failures: If the subscriber disconnects or fails before receiving a message, it won’t be retried and the message will be lost.

At-most-once delivery trades reliability for simplicity, making it a poor choice for systems where message loss is unacceptable but potentially suitable for ephemeral data.

At-least-once delivery

Messages are guaranteed to arrive, but they may arrive more than once. If a failure occurs, the system retries delivery until it receives an acknowledgement (ACK).

How it can go wrong

Publisher failures: If the publisher doesn’t receive an ACK, it will retry sending the message, which can result in duplicates if the original message did go through but wasn’t acknowledged.
Broker failures: Brokers store messages and attempt retries if they don’t receive an ACK from the subscriber. However, a broker failure or restart can cause it to resend messages already delivered, leading to duplicates.
Subscriber failures: When subscribers disconnect, the broker retries sending the message once they reconnect. If no state tracking exists, the subscriber may receive the same message multiple times.

At-least-once delivery ensures that messages arrive, but any message may arrive more than once, so subscribers must deduplicate what they receive.
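The lost-ACK case is easy to reproduce in a small simulation. In this sketch, `FlakyBroker` and `publish_at_least_once` are hypothetical names, and the "network" is simply a broker that stores the message but fails to return its first acknowledgement.

```python
from typing import List


class FlakyBroker:
    """Toy broker that stores every message it receives but loses its first ACK."""

    def __init__(self, acks_to_drop: int = 1) -> None:
        self.received: List[str] = []
        self._acks_to_drop = acks_to_drop

    def send(self, message: str) -> bool:
        self.received.append(message)   # the message itself always arrives
        if self._acks_to_drop > 0:
            self._acks_to_drop -= 1
            return False                # but the ACK is lost in transit
        return True


def publish_at_least_once(broker: FlakyBroker, message: str, max_attempts: int = 5) -> int:
    """Retry until an ACK arrives; return the number of attempts made."""
    for attempt in range(1, max_attempts + 1):
        if broker.send(message):
            return attempt
    raise RuntimeError("no ACK received after retries")


broker = FlakyBroker(acks_to_drop=1)
attempts = publish_at_least_once(broker, "GOAL!")
```

After this runs, `attempts` is 2 and `broker.received` holds two copies of "GOAL!". The publisher behaved correctly, and the message was never lost, yet the broker now has a duplicate.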

Exactly-once delivery

Every message is delivered exactly once, without duplicates or losses.

How it prevents message loss and duplication

Publisher failures: Publishers must implement idempotent publishing to ensure that retries after a failure don’t result in duplicates. Each message is assigned a unique identifier so the broker can detect and discard duplicates.
Broker failures: Brokers must use persistent message tracking to store message states, ensuring they can recover from failures without redelivering messages. They also need to send acknowledgments (ACKs) only after confirming that a message has been processed exactly once. In a distributed system with multiple brokers, the brokers must coordinate so that only one of them delivers and acknowledges each message.
Subscriber failures: Subscribers must track their last successfully received message. If they disconnect and reconnect, they need to inform the broker where to resume sending, ensuring no duplicates or message loss.

So, why don’t all pub/sub systems offer exactly-once delivery? It comes down to the complexity of engineering it. For many use cases, the effort may not be worth it. But if you’re building a chat app, distributing financial data, or enabling realtime collaboration, you need to be certain that your data arrives once and only once.

In those cases, you face a choice: either take on the challenge of building exactly-once delivery yourself or rely on a pub/sub provider that can handle it for you.

Pub/sub exactly-once delivery: why is it so hard?

As we saw earlier, pub/sub systems separate the various components involved in publishing, distributing, and receiving data. This decoupling makes it easier to scale and work around problems because you can add and remove components while the system is running. But it also means that keeping track of communication between those components takes more effort. Let’s look at why.

Pub/sub is stateless (mostly): The basic pub/sub architecture pattern decouples the publisher, broker, and subscriber, with the broker acting as a middle layer that doesn’t track the state of messages across the system. Messages pass from publisher to broker to subscriber, but the broker doesn't have to know whether a message was successfully processed by all subscribers. Tracking that state is an intentional engineering choice, and it lays the groundwork for exactly-once delivery.
Networks can be unreliable: The first lesson of distributed systems is that things will go wrong. Messages can be delayed, lost, or duplicated, meaning we need to add state to pub/sub to enable delivery guarantees, such as exactly-once semantics.
Computers and software can be fragile: Even if the network plays its part, publishers, brokers, and subscribers can all fail at different stages of the message flow, leading to retries, message duplication, or loss. Handling these failures gracefully without duplication requires verification at each step of the pub/sub system.
Concurrency: Running multiple systems concurrently and sharing the workload between them is the fundamental strength of decoupled patterns like pub/sub. But it does mean that messages in pub/sub systems are often processed concurrently, which can lead to race conditions or duplicate processing without proper synchronization.

When you’re operating at smaller scales, these issues are less likely to manifest because there are fewer moving parts.

But what happens when you need to handle more traffic?

As your pub/sub system grows, complexity increases quadratically due to the n-squared problem: each additional component, whether a publisher, broker, or subscriber, results in an outsized increase in interactions. In a system with n components, the number of potential interactions grows as n².
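To put rough numbers on that growth, assume the worst case where any component can interact with any other. Then n components form n(n-1)/2 distinct pairs, which grows in proportion to n². The helper name below is made up for this illustration.

```python
def pairwise_links(n: int) -> int:
    """Number of distinct pairs among n components: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for n in (10, 100, 1000):
    print(f"{n} components -> {pairwise_links(n)} potential interactions")
# 10 components -> 45 potential interactions
# 100 components -> 4950 potential interactions
# 1000 components -> 499500 potential interactions
```

Growing the system 100 times over, from 10 components to 1,000, multiplies the potential interactions by more than 11,000.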

As the system scales, the chances of missed acknowledgments and duplicate messages rise due to two main factors:

Increased opportunity for failure: Every additional publisher, broker, or subscriber raises the chances that something will go wrong.
Greater interaction complexity: Each new node introduces more dependencies, more retries, and more potential for failures to cascade through the system.

So, what might those failures look like?

Race conditions: Redundancy, where multiple components handle copies of the same message concurrently, helps protect against failures but also raises the risk of multiple brokers processing the same message in parallel. Without proper synchronization, two brokers might each deliver the message to subscribers, leading to duplicate processing.
Duplicate deliveries: When the status of a message is unclear due to lost or delayed acknowledgments, the system may retry sending it. For example, a publisher may resend a message if it doesn’t receive a timely acknowledgment from the broker, leading to duplicate messages for the subscriber.
Missing acknowledgments: In larger systems, acknowledgments may be delayed or lost at various points. A broker might successfully deliver a message to a subscriber, but if the acknowledgment doesn’t reach the broker due to network issues, the message could be resent, creating duplicates. Alternatively, if retries aren’t in place, the message might be lost entirely.

The n-squared problem amplifies these failure modes significantly. As interactions multiply, the opportunities for race conditions, duplicate deliveries, and missing acknowledgments grow much faster than the number of components, requiring stronger mechanisms to maintain exactly-once message delivery.

Designing for exactly-once delivery in pub/sub systems

So, what practical engineering steps can we take to avoid these issues and achieve exactly-once delivery?

Think about the challenge. Our message must pass through multiple decoupled components, each with its own potential failure points. Along the way, we need to avoid introducing new problems, like duplicates, and that becomes an even greater risk as the pub/sub system scales out to multiple nodes or cloud regions.

To explore some of the engineering techniques that help us achieve exactly-once delivery, let’s imagine a fan engagement app that allows people to follow live sports events and interact in realtime.

During critical moments—like a last-minute goal in football or a buzzer-beater in basketball—thousands of fans might send messages simultaneously. Ensuring these messages are delivered correctly, and only once, is a great example of the engineering challenge we’ve been looking at. That’s especially true at scale, where the n-squared problem means that each new fan interaction could generate thousands, millions, or even billions of message deliveries.

Thankfully, there are a few ways to make exactly-once delivery possible.

Idempotent publishing and deduplication

Imagine a fan sends a "GOAL!" message, but their app (the publisher) loses connection before the broker's acknowledgment arrives, so the app sends the message again when it reconnects. By assigning each message a unique identifier, such as a serial number or some form of timestamp, the brokers can recognize and discard attempted duplicates. That way, rather than overwhelming other fans with multiple messages, the “GOAL!” message reaches each user only once, no matter how many times the publisher retries.

Throughout the system, if a message does arrive more than once then deduplication logic can identify it as a duplicate and discard it before it is processed.

This becomes more important at scale, where improper retries could overwhelm the system.
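As a minimal sketch of the idea, the broker below remembers the IDs it has accepted and drops anything it has seen before. `DedupBroker` is a hypothetical class, and using a UUID as the message ID is an assumption made for the example.

```python
import uuid
from typing import List, Set


class DedupBroker:
    """Toy broker that drops any message whose ID it has already accepted."""

    def __init__(self) -> None:
        self._seen_ids: Set[str] = set()
        self.delivered: List[str] = []

    def publish(self, message_id: str, body: str) -> bool:
        """Return True if accepted, False if discarded as a duplicate."""
        if message_id in self._seen_ids:
            return False
        self._seen_ids.add(message_id)
        self.delivered.append(body)
        return True


broker = DedupBroker()
message_id = str(uuid.uuid4())   # generated once by the publisher, reused on every retry
first = broker.publish(message_id, "GOAL!")
retry = broker.publish(message_id, "GOAL!")   # same ID, so the retry is discarded
```

The important detail is that the ID is created once, by the publisher, and reused for every retry. An ID generated per attempt would make each retry look like a new message. The set here grows without bound, so a real system would likely need to limit how long it remembers IDs.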

Coordinated acknowledgments (ACKs)

In some systems, brokers send an acknowledgment (ACK) as soon as a message arrives, informing the publisher that it has reached the broker. However, this doesn’t confirm that the message has been processed and delivered to subscribers. If a failure occurs while the broker is processing the message or before it reaches subscribers, the initial ACK becomes meaningless, potentially leading to lost messages.

To ensure reliability, there needs to be a chain of acknowledgments. The broker should only send an ACK back to the publisher after receiving confirmation that the message has not only been processed but also successfully delivered to the subscriber(s). This way, each component in the system—publisher, broker, and subscriber—confirms message receipt and processing, ensuring exactly-once delivery even in the face of failures.

As the number of brokers and subscribers grows—particularly in systems with redundancy across multiple availability zones—tracking and coordinating ACKs becomes more complex. If multiple brokers handle the same message, the system must ensure only one ACK is sent after the message is fully processed.

For example, after receiving a message, the broker must confirm that it has been successfully processed and delivered to subscribers before issuing an ACK. If brokers are working across different zones, coordination is essential to prevent duplicate ACKs or the premature acknowledgment of a message that has not been fully processed. This coordination ensures that messages are reliably delivered exactly once, even as the system scales and failures occur.
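One way to picture the chain is a broker that records which subscribers have confirmed each message and only acknowledges the publisher once all of them have. This is a simplified, single-process sketch: `AckingBroker`, the subscriber names, and the one-off failure are all invented for illustration.

```python
from typing import Callable, Dict, List, Set

# A subscriber is modeled as a callable that returns True once it has processed the message.
Subscriber = Callable[[str], bool]


class AckingBroker:
    """Toy broker that ACKs the publisher only after every subscriber has confirmed."""

    def __init__(self, subscribers: Dict[str, Subscriber]) -> None:
        self._subscribers = subscribers
        self._confirmed: Dict[str, Set[str]] = {}   # message ID -> names that confirmed

    def publish(self, message_id: str, body: str) -> bool:
        confirmed = self._confirmed.setdefault(message_id, set())
        for name, deliver in self._subscribers.items():
            if name in confirmed:
                continue                 # already confirmed: never redeliver on a retry
            if deliver(body):
                confirmed.add(name)
        return len(confirmed) == len(self._subscribers)


alice_inbox: List[str] = []
bob_inbox: List[str] = []
bob_pending_failures = [True]            # bob fails exactly once


def alice(body: str) -> bool:
    alice_inbox.append(body)
    return True


def bob(body: str) -> bool:
    if bob_pending_failures:
        bob_pending_failures.pop()
        return False
    bob_inbox.append(body)
    return True


broker = AckingBroker({"alice": alice, "bob": bob})
first_ack = broker.publish("msg-1", "GOAL!")    # False: bob has not confirmed yet
second_ack = broker.publish("msg-1", "GOAL!")   # True: bob confirms, alice is skipped
```

Because the broker tracks confirmations per subscriber, the publisher's retry reaches only the subscriber that missed the message. Each inbox ends up with exactly one copy.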

Message persistence for brokers

Enabling brokers to withstand failures is an important part of guaranteeing exactly-once delivery, even without multiple availability zones. Message persistence enables brokers to recover undelivered messages and continue processing after a failure.
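A common way to get this kind of persistence is an append-only log that the broker writes to before it does anything else. The sketch below assumes a single local file and a made-up `PersistentBroker` class; a production broker would replicate the log rather than rely on one disk.

```python
import json
import os
import tempfile
from typing import Dict


class PersistentBroker:
    """Toy broker that journals every state change to disk before moving on."""

    def __init__(self, log_path: str) -> None:
        self._log_path = log_path

    def _append(self, record: dict) -> None:
        with open(self._log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())   # push the record to disk so it survives a crash

    def accept(self, message_id: str, body: str) -> None:
        self._append({"event": "accepted", "id": message_id, "body": body})

    def mark_delivered(self, message_id: str) -> None:
        self._append({"event": "delivered", "id": message_id})

    def undelivered(self) -> Dict[str, str]:
        """Replay the log and return messages accepted but never delivered."""
        pending: Dict[str, str] = {}
        if not os.path.exists(self._log_path):
            return pending
        with open(self._log_path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                if record["event"] == "accepted":
                    pending[record["id"]] = record["body"]
                else:
                    pending.pop(record["id"], None)
        return pending


log_path = os.path.join(tempfile.mkdtemp(), "broker.log")
broker = PersistentBroker(log_path)
broker.accept("msg-1", "GOAL!")
broker.accept("msg-2", "What a save!")
broker.mark_delivered("msg-1")

# Simulate a crash: a fresh broker instance rebuilds its state from the same log.
recovered = PersistentBroker(log_path).undelivered()
```

After the simulated restart, `recovered` contains only `msg-2`. The broker knows it still owes that message to subscribers and, just as importantly, knows not to resend `msg-1`.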

State management in subscribers

Similarly, subscribers must keep track of the last message they successfully processed. If a subscriber disconnects and later reconnects, it needs to inform the broker of where it left off. This allows the broker to resume sending new messages without re-sending any previously processed data. Subscriber state tracking guarantees that during disconnections, the message stream resumes correctly without loss or duplication.
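Here is a small sketch of that resume handshake. It assumes the broker numbers messages with increasing serials and keeps a replayable history; `ReplayBroker` and `Subscriber` are hypothetical names for this example.

```python
from typing import List, Tuple


class ReplayBroker:
    """Toy broker that numbers messages and can replay everything after a given serial."""

    def __init__(self) -> None:
        self._history: List[str] = []

    def publish(self, body: str) -> int:
        self._history.append(body)
        return len(self._history)        # serials start at 1

    def messages_after(self, last_serial: int) -> List[Tuple[int, str]]:
        numbered = list(enumerate(self._history, start=1))
        return numbered[last_serial:]


class Subscriber:
    """Toy subscriber that remembers the serial of the last message it processed."""

    def __init__(self) -> None:
        self.last_serial = 0
        self.processed: List[str] = []

    def resume(self, broker: ReplayBroker) -> None:
        # Tell the broker where we left off and process only what we missed.
        for serial, body in broker.messages_after(self.last_serial):
            self.processed.append(body)
            self.last_serial = serial


broker = ReplayBroker()
fan = Subscriber()

broker.publish("Kick-off")
fan.resume(broker)            # connected and up to date

broker.publish("GOAL!")       # published while the fan is offline
broker.publish("Full time")
fan.resume(broker)            # reconnect: only the two missed messages arrive
```

The fan ends up with all three messages in order and no repeats, because the second `resume` asks only for messages after serial 1.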

Batching

If a goal occurs in a sports match, it’s likely that more than one fan will want to share their joy. But sending hundreds of identical messages is highly wasteful, especially as each new publisher and subscriber multiplies the number of message movements.

Batching is a simple way to reduce the overall number of deliveries. For example, instead of processing each fan reaction individually, the system could group them into batches, minimizing delivery overhead. That way, tens, hundreds, or even thousands of messages could be sent as one message.
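A size-based batcher can be sketched in a few lines. The function name and the `max_size` parameter are assumptions for this example; real systems often flush on a timer as well, so that a quiet period doesn't hold messages back.

```python
from typing import Iterable, Iterator, List


def batch(messages: Iterable[str], max_size: int) -> Iterator[List[str]]:
    """Group messages into lists of at most max_size, preserving order."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    current: List[str] = []
    for message in messages:
        current.append(message)
        if len(current) == max_size:
            yield current
            current = []
    if current:
        yield current                    # flush the final, partially filled batch


reactions = ["GOAL!", "Yes!", "What a strike", "GOAL!", "Unreal"]
batches = list(batch(reactions, max_size=2))
```

Five reactions become three deliveries: two full batches and one partial. Every individual message is still present, which is the difference between batching and the aggregation approach described next.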

Aggregation

Aggregation addresses the problem in a slightly different way by consolidating multiple identical or similar messages into one. For example, if 100 fans send "GOAL!" messages, the system can aggregate them into a single message, reducing the number of messages handled by the broker.

{
  "message": "GOAL!",
  "aggregation_count": 100,
  "timestamp": "2024-09-27T15:32:20.123Z"
}
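A payload of that shape could be produced with a simple counter. This is a sketch only: the `aggregate` function is hypothetical, it treats messages as identical only when their text matches exactly, and it takes the timestamp as an argument rather than deciding how a real system should choose one.

```python
import json
from collections import Counter
from typing import Dict, Iterable, List


def aggregate(messages: Iterable[str], timestamp: str) -> List[Dict[str, object]]:
    """Collapse identical messages into one record each, with a count."""
    counts = Counter(messages)
    return [
        {"message": text, "aggregation_count": count, "timestamp": timestamp}
        for text, count in counts.items()
    ]


reactions = ["GOAL!"] * 100
summary = aggregate(reactions, timestamp="2024-09-27T15:32:20.123Z")
print(json.dumps(summary[0], indent=2))
```

One hundred "GOAL!" messages collapse into a single record with an `aggregation_count` of 100, so the broker fans out one message instead of a hundred.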

If you need exactly-once delivery, should you build or buy?

Managing your own pub/sub system can quickly consume your team’s time and resources. What starts as a small distraction can soon become a major obstacle, slowing your ability to deliver core features. When that happens, switching to a managed pub/sub platform might be the more efficient choice.

But which platforms offer exactly-once delivery?

When it comes to managed pub/sub platforms, your options are relatively limited if you need exactly-once delivery. For example, Redis Pub/Sub promises at-most-once delivery and, similarly, PubNub’s support materials show that their platform does not guarantee delivery.

But there are some platforms that do promise exactly-once delivery, such as:

Google Pub/Sub: While Google Pub/Sub offers an option for exactly-once delivery, its documentation warns that the “guarantee only applies when subscribers connect to the service in the same region”. In other words, if a subscriber connects to a region other than the one hosting the publisher, the guarantee no longer applies.
Confluent Kafka: By default, standard Kafka guarantees at-least-once delivery. Exactly-once delivery can be achieved with Kafka Streams using transactional producers and consumers or when connecting external systems using Kafka Connect’s offset management. However, exactly-once delivery isn’t the default path for Kafka and requires additional configuration.
Amazon Kinesis: Instead of offering exactly-once delivery, Kinesis provides deduplication, ensuring that data is processed only once, even if it is delivered multiple times. This approach handles duplicates transparently, achieving similar results to exactly-once delivery without the same guarantees.
Apache Pulsar: Pulsar offers what it calls effectively-once delivery.
Amazon SQS: FIFO queues in SQS use hash-based and identity-based deduplication to effect exactly-once delivery.
Ably: Ably’s global realtime platform guarantees exactly-once delivery.

At Ably, exactly-once delivery isn’t just a nice-to-have. We believe that it’s a fundamental part of the promise that any realtime platform should offer. But, as we’ve seen, achieving that type of guarantee requires substantial investment in engineering and infrastructure reliability. So, how do we make it happen at Ably?

How Ably provides pub/sub messaging with exactly-once delivery guarantees

Guaranteeing exactly-once delivery in a pub/sub system is a significant engineering challenge. That might be why our realtime platform at Ably is one of the few to guarantee exactly-once delivery of pub/sub messages across a global network of edge locations and cloud regions.

Idempotent publishing

Ably implements idempotent publishing by assigning a unique identifier to each message. If the publisher doesn't receive an acknowledgment due to a network issue and retries sending the message, the Ably platform uses this identifier to throw out duplicates. That way, even sending the same message multiple times results in it being processed and delivered just once.

But how does that work across multiple brokers, which could each be processing the same message concurrently? When a message first reaches Ably, one of the first steps the system takes is to make sure that every broker is aware of that message’s unique ID. That way, even if the publisher tries to resend the message and it reaches a different node within the Ably system, it can be safely discarded.
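The general idea of sharing message IDs between nodes can be illustrated with a toy model. To be clear, this is not Ably's implementation: `SharedIdRegistry` and `BrokerNode` are invented for the example, and a plain in-process set stands in for what would have to be a replicated store with an atomic claim operation.

```python
from typing import List, Set


class SharedIdRegistry:
    """Stand-in for a replicated store of message IDs that every broker consults."""

    def __init__(self) -> None:
        self._ids: Set[str] = set()

    def claim(self, message_id: str) -> bool:
        """Return True for the first claim of an ID, False for any later one."""
        if message_id in self._ids:
            return False
        self._ids.add(message_id)
        return True


class BrokerNode:
    """Toy broker node that accepts a message only if it wins the claim on its ID."""

    def __init__(self, name: str, registry: SharedIdRegistry) -> None:
        self.name = name
        self._registry = registry
        self.accepted: List[str] = []

    def publish(self, message_id: str, body: str) -> bool:
        if not self._registry.claim(message_id):
            return False                 # another node already accepted this ID
        self.accepted.append(body)
        return True


registry = SharedIdRegistry()
node_a = BrokerNode("node-a", registry)
node_b = BrokerNode("node-b", registry)

original = node_a.publish("msg-1", "GOAL!")
retry = node_b.publish("msg-1", "GOAL!")   # the retry lands on a different node
```

The retry is rejected even though `node_b` has never seen the message itself, because both nodes consult the same registry. In a real distributed system, making that claim atomic across nodes and regions is the hard part.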

Eight nines message survivability

Ably brokers ensure fault tolerance and high availability to provide eight nines message survivability: 99.999999%. They do this by using multiple availability zones.

Ably acknowledges receipt of a message only when it is sure it can guarantee its processing. That involves immediate replication across multiple availability zones, so that even if an instance or cloud region fails, the broader system can still complete delivery.

Subscriber state management

If a subscriber disconnects, the client SDK keeps track of the last message received. When it reconnects, the SDK provides the broker with the ID of that message. In the meantime, Ably has been storing the messages that it couldn’t deliver to the disconnected subscriber, meaning it can send everything since the last message the subscriber received.

Persistent data

Even if a major outage were to occur, disconnecting multiple subscribers for an extended period, Ably offers 99.99999999% data survivability.

Resilient SDKs

Ably provides SDKs for all popular languages and frameworks, with each one designed to abstract away the complexities of distributed systems through:

Automatic retries and state synchronization: The SDKs handle message IDs, retries, and state synchronization without asking developers to get into the details.
Fault detection: Ably’s SDKs monitor the connection, allowing them to detect and connect to healthy datacenters in the event of an issue, minimizing downtime.

Ensure your data’s integrity with Ably’s pub/sub platform
We’ve built the Ably platform to provide exactly-once delivery guarantees across a global network that brings your data closer to your end users, ensures low latencies, and allows you to focus on delivering value to them.

Try Ably and start building with our scalable, fault tolerant, realtime platform today.
