Google Cloud Pub/Sub is a serverless implementation of a publisher-subscriber messaging service. It's built around the concepts of topics (where messages are published to) and subscriptions (where messages are consumed from).
There are 3 types of subscriptions:
- Push: message consumption is initiated by Pub/Sub
- Pull: message consumption is initiated by the subscriber
- BigQuery: special mode where the subscriber is a BigQuery agent which stores messages in a BigQuery table.
Pub/Sub now offers an exactly-once delivery option for Pull subscriptions; this option has been generally available (GA) since December 2022 (doc).
Let's see what that means in detail.
What is a delivery?
In the Pub/Sub context, a delivery is a process that consists of:
- sending a message to a consumer
- receiving an acknowledgement (ack) from the consumer before the ack delay of the sent message (its acknowledgement deadline) times out.
- Alternatively, the consumer can send a "nack" (negative acknowledgement) instead of an ack. It tells the sender that the message could not be processed and must be sent again.
When the ack of the message is received in time, Pub/Sub considers the message delivered.
The delivery process in Pub/Sub goes as follows:
- The publisher publishes the message to the topic
- The subscriber pulls the message from the subscription. The message ack delay starts here
- When the processing is done, the subscriber acknowledges the message
- The message is marked as delivered.
Any failure occurring during this flow (a networking issue, a VM crash) can lead to new delivery attempts, and to duplicate outputs if several attempts eventually succeed for the same message.
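To make this flow concrete, here is a minimal in-memory sketch of a pull subscription with an ack delay. It is a toy model, not the Pub/Sub API: the `ToySubscription` class, its method names and the hand-driven logical clock `now` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToySubscription:
    """In-memory stand-in for a pull subscription (not the real Pub/Sub API)."""
    ack_deadline: int = 10      # seconds a consumer has to ack a pulled message
    now: int = 0                # logical clock, advanced by hand
    pending: list = field(default_factory=list)      # waiting to be pulled
    outstanding: dict = field(default_factory=dict)  # message -> expiry time
    delivered: list = field(default_factory=list)    # acked in time

    def publish(self, message):
        self.pending.append(message)

    def pull(self):
        # Outstanding messages whose deadline expired become eligible again.
        for message, expiry in list(self.outstanding.items()):
            if self.now >= expiry:
                del self.outstanding[message]
                self.pending.append(message)
        if not self.pending:
            return None
        message = self.pending.pop(0)
        self.outstanding[message] = self.now + self.ack_deadline  # ack delay starts
        return message

    def ack(self, message):
        # An ack only counts if it arrives before the deadline.
        expiry = self.outstanding.get(message)
        if expiry is None or self.now >= expiry:
            return False
        del self.outstanding[message]
        self.delivered.append(message)
        return True

    def nack(self, message):
        # A nack asks for the message to be sent again.
        if self.outstanding.pop(message, None) is not None:
            self.pending.append(message)


sub = ToySubscription(ack_deadline=10)
sub.publish("order-42")

first = sub.pull()           # delivery attempt 1
sub.now += 15                # processing takes longer than the ack delay
late_ack = sub.ack(first)    # too late: the ack is rejected
second = sub.pull()          # delivery attempt 2 of the same message
on_time_ack = sub.ack(second)
```

In this run the consumer processes "order-42" twice even though the message ends up delivered only once, which is exactly the kind of duplicate output described above.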
Exactly-once delivery
The usual guarantee offered by Pub/Sub is at-least-once delivery. It means that in case of such a failure, Pub/Sub will attempt to deliver the message again, until it's successfully acked (or the subscription retention limit is reached).
The exactly-once delivery option ensures that Pub/Sub will not resend a message:
- while its ack (or nack) has not been received and the ack delay has not expired
- once its ack has been received
This guarantee is made possible by Pub/Sub agents using persistent storage: contrary to the default mode, where the status of each message is kept in transient memory, exactly-once relies on a regional persistent storage service. Hence, the guarantee is enforced at the regional level.
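As an illustration, here is a hedged sketch of how this option can be used with the `google-cloud-pubsub` Python client: the subscription is created with `enable_exactly_once_delivery`, and the consumer waits for the ack to be confirmed with `ack_with_response()`. The function names and the project, topic and subscription identifiers are placeholders chosen for this example.

```python
from concurrent.futures import TimeoutError


def ack_exactly_once(message):
    """Ack and wait for the confirmation; return True only if the ack was accepted."""
    try:
        message.ack_with_response().result()
        return True
    except Exception as error:
        # The real client raises AcknowledgeError here (for instance when the
        # ack delay has expired). The broad catch is a simplification that
        # keeps this helper usable without importing the client library.
        print(f"Ack failed for message {message.message_id}: {error}")
        return False


def create_exactly_once_subscription(project_id, topic_id, subscription_id):
    from google.cloud import pubsub_v1  # requires the google-cloud-pubsub package

    publisher = pubsub_v1.PublisherClient()
    subscriber = pubsub_v1.SubscriberClient()
    with subscriber:
        # Fails if the subscription already exists.
        return subscriber.create_subscription(
            request={
                "name": subscriber.subscription_path(project_id, subscription_id),
                "topic": publisher.topic_path(project_id, topic_id),
                "enable_exactly_once_delivery": True,
            }
        )


def listen(project_id, subscription_id, timeout=30.0):
    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(project_id, subscription_id)

    def callback(message):
        print(f"Received message {message.message_id}")
        # Business processing would go here, before the ack.
        ack_exactly_once(message)

    streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
    with subscriber:
        try:
            streaming_pull_future.result(timeout=timeout)
        except TimeoutError:
            streaming_pull_future.cancel()
            streaming_pull_future.result()
```

A typical run would call `create_exactly_once_subscription("my-project", "my-topic", "my-subscription")` once, then `listen("my-project", "my-subscription")`. When the helper returns False, the ack was not accepted and the message may be sent again, so any downstream side effect should tolerate a second attempt.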
Consequences
The exactly-once mode guarantees that, under the conditions detailed above, a message will only be delivered once. However, this doesn't mean that no message will ever be sent multiple times. How come?
One has to distinguish between duplicates and legitimate redeliveries. For example, if the consumer takes so long to process a message that the ack delay expires, or if the process crashes and no ack is sent back to Pub/Sub, then Pub/Sub has no way to know that the message was already processed. Thus the message will be sent again in response to a subsequent Pull request.
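The rule can be summarized as a small predicate. This is only a restatement of the conditions described in this article, not Pub/Sub code; the function name and its parameters are made up for the example.

```python
def may_redeliver(acked, nacked, now, ack_deadline_at):
    """Can a message be sent again under exactly-once, per the rules above?"""
    if acked:
        return False    # ack received in time: never resent
    if nacked:
        return True     # the consumer asked for a redelivery
    # Still unacked: resent only once the ack delay has expired.
    return now >= ack_deadline_at


# Processing is still running and the ack delay has not expired: no resend.
in_flight = may_redeliver(acked=False, nacked=False, now=5, ack_deadline_at=10)
# The consumer crashed or was too slow: a legitimate redelivery.
expired = may_redeliver(acked=False, nacked=False, now=12, ack_deadline_at=10)
```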
As a consequence, if "exactly-once" message processing is of primary importance for you and you absolutely want to avoid any duplicate, you have to take extra care at every layer of your system. You need to pick ack delays according to the maximum time the processing can take (including retries and such). You also need to mitigate all risks of duplicates at each stage of your workflow. Exactly-once solves the central step of the pipeline, the delivery from the subscription to the subscriber, not the other steps, which you still have to take care of.
Pub/Sub's exactly-once option is definitely a good step forward in helping to reach this goal, but it doesn't solve everything. End-to-end exactly-once delivery remains a challenge for any event-based system!
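Here is a rough sketch of two such mitigations: sizing the ack delay from the worst-case processing time, and making the final write idempotent so that a redelivered message has no visible effect. The helper names, the 1.5 safety margin and the dictionary standing in for a table with a unique key are assumptions for illustration; the 10 to 600 second bounds are the limits Pub/Sub accepts for a subscription's ack deadline.

```python
import math

MIN_ACK_DEADLINE = 10    # seconds, lower bound accepted by Pub/Sub
MAX_ACK_DEADLINE = 600   # seconds, upper bound accepted by Pub/Sub


def pick_ack_deadline(attempt_seconds, max_attempts, backoff_seconds=0.0, margin=1.5):
    """Size the ack delay from the worst-case processing time, retries included."""
    worst_case = max_attempts * attempt_seconds + (max_attempts - 1) * backoff_seconds
    return min(MAX_ACK_DEADLINE, max(MIN_ACK_DEADLINE, math.ceil(worst_case * margin)))


class IdempotentSink:
    """Writes each business key at most once, whatever the number of deliveries."""

    def __init__(self):
        self.rows = {}   # stands in for a table with a unique key constraint

    def write(self, key, payload):
        if key in self.rows:
            return False     # duplicate: already written, skip the side effect
        self.rows[key] = payload
        return True


deadline = pick_ack_deadline(attempt_seconds=20, max_attempts=3, backoff_seconds=5)

sink = IdempotentSink()
first_write = sink.write("order-42", {"amount": 10})
second_write = sink.write("order-42", {"amount": 10})   # redelivered message
```

The deduplication key is a business identifier rather than the Pub/Sub message ID, because a publisher that retries can publish the same event twice under two different message IDs. If the computed delay hits the 600 second cap, the processing does not fit in a single ack delay and the deadline has to be extended while the work is in progress.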
Thanks for reading! I'm Matthieu, data engineer at Stack Labs.
If you want to discover the Stack Labs Data Platform or join an enthusiastic Data Engineering team, please contact us.
Cover picture by Brett Jordan on Unsplash