Martin Simon

Building Real-Time Data Pipelines with Kafka Streams

Microservice architecture offers numerous benefits, but it also comes with its fair share of trade-offs. One of them is the challenge of managing data that is spread across multiple databases. Providing even basic reporting can become a daunting task when data from various sources must be combined. For such scenarios, building a reporting service on top of a relational database can make a lot of sense. But how do you transfer the data efficiently without the extra operational overhead of tools like Apache NiFi or Flink? In this article, I want to take a closer look at data streaming with Kafka and discuss strategies to achieve this goal. I’d love to hear your thoughts and experiences as well.


Stream the Data, Not Just Events

Consider a simple microservice for processing orders. An order can transition through the following states:

  • CREATED
  • PAID
  • DELIVERING
  • DELIVERED
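
In code, this lifecycle could be captured as a simple enum. This is just a sketch based on the states listed above, not the order service’s actual domain model:

public enum OrderStatus {
    CREATED,
    PAID,
    DELIVERING,
    DELIVERED
}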

If we want to update the order data in the reporting system every time there’s a change, we need to listen to those changes. However, the messages propagated through the stream should carry the complete data, not just a notification that something changed. Avoid designing events that carry only partial state information about your domain.

Here’s an example of a poorly designed event:

{
  "specversion": "1.0",
  "type": "OrderStatusUpdated",
  "source": "/orders-service",
  "id": "1234-5678-9012",
  "time": "2025-01-14T12:00:00Z",
  "datacontenttype": "application/json",
  "data": {
    "orderId": "order-001",
    "status": "DELIVERING",
    "timestamp": "2025-01-14T12:00:00Z"
  }
}

The complexity of handling such events is significantly higher than handling streams of complete data objects. Imagine a scenario where you want to restore the data warehouse by replaying all events from Kafka. With an event-per-status approach, the system would have to process every individual status change of every order, drastically increasing the load on the consumer. In contrast, handling complete data objects reduces the workload and keeps the pipeline efficient.

A simple solution is to use event structures that encapsulate the full state of the entity. For example:

{
  "specversion": "1.0",
  "type": "OrderCreateUpdate",
  "source": "/orders-service",
  "id": "event-5678",
  "time": "2025-01-14T13:05:00Z",
  "datacontenttype": "application/json",
  "data": {
    "orderId": "order-001",
    "customer": {
      "id": "customer-123"
      "name": "John Doe",
      "email": "john.doe@example.com"
    },
    "items": [
      {
        "productId": "prod-123",
        "quantity": 2,
        "price": 20.0
      }
    ],
    "status": "SHIPPED",
    "statusHistory": [
      {
        "status": "PLACED",
        "timestamp": "2025-01-14T10:00:00Z"
      },
      {
        "status": "PROCESSING",
        "timestamp": "2025-01-14T11:00:00Z"
      },
      {
        "status": "SHIPPED",
        "timestamp": "2025-01-14T12:00:00Z"
      }
    ],
    "shippingDetails": {
      "address": "123 Main St, Springfield",
      "carrier": "DHL"
    },
    "totalAmount": 40.0,
    "lastUpdated": "2025-01-14T12:00:00Z"
  }
}
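
To make this concrete, here is a minimal sketch of a producer that publishes such a full-state record whenever an order changes. It assumes the plain Kafka Java client, a topic named order-events, and that the complete order state has already been serialized to JSON elsewhere; none of these names come from the article itself.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderEventPublisher {

    private final KafkaProducer<String, String> producer;

    public OrderEventPublisher(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // Guard against broker-level duplicates caused by producer retries.
        props.put("enable.idempotence", "true");
        this.producer = new KafkaProducer<>(props);
    }

    // Publishes the complete, current state of an order as a single record.
    public void publishOrderState(String orderId, String fullStateJson) {
        // Keying by orderId keeps all versions of the same order on the same
        // partition, so consumers see them in the order they were produced.
        producer.send(new ProducerRecord<>("order-events", orderId, fullStateJson));
    }
}

Keying by the order ID also leaves the door open for enabling log compaction on the topic later, so the latest state of every order can be retained even with a finite retention time.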

Idempotency: The Cornerstone of Data Consistency

Let’s revisit the example of a poorly designed event that contains only a partial status update. Beyond the increased load on the consumer when restoring data, the critical flaw is the lack of idempotency in the event design.

While the event includes an id, the status update itself lacks a unique identifier. This creates challenges during consumption. Although consumers can verify whether they’ve already processed an event based on its id and skip it if necessary, what happens if there’s an issue with the producer? Imagine the producer unintentionally generates multiple events for the same status change. Without a unique identifier for the status update, it becomes nearly impossible for consumers to determine whether the status has already been processed.

This flaw can lead to data duplication during event processing, where the same change is applied multiple times, resulting in inconsistencies in the downstream system. For instance, in the case of a reporting database, duplicated events could inflate metrics, misrepresent totals, or even corrupt the overall dataset.

To ensure idempotency, each event should carry a unique identifier not only for the event itself but also for the specific change it represents. This allows consumers to reliably deduplicate messages and maintain consistent data, even in the face of producer-side issues or message re-delivery.
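
As an illustration, here is a minimal, in-memory sketch of that deduplication on the consumer side. The changeId field and the reporting-database write are hypothetical; in a real service the set of processed IDs would live in the reporting database and be updated in the same transaction as the data itself.

import java.util.HashSet;
import java.util.Set;

public class IdempotentOrderConsumer {

    // Hypothetical store of change identifiers that have already been applied.
    private final Set<String> processedChangeIds = new HashSet<>();

    public void handle(String changeId, String orderId, String fullStateJson) {
        // Skip anything already applied, no matter how often the producer
        // republishes it or Kafka re-delivers it.
        if (!processedChangeIds.add(changeId)) {
            return;
        }
        upsertIntoReportingDatabase(orderId, fullStateJson);
    }

    private void upsertIntoReportingDatabase(String orderId, String fullStateJson) {
        // Write the complete order state keyed by orderId; applying the same
        // state twice produces the same row, keeping the operation idempotent.
    }
}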

Recreating Data in Your Data Store

When it comes to initially loading data into a reporting system or restoring data for any reason, it’s essential to consider various strategies. Let’s explore some techniques to handle these scenarios effectively.

Replaying the Topic

Resetting the offset of the Kafka consumer is the simplest way to retrieve all events from the beginning of the topic and restore data in your reporting system. However, this approach comes with several prerequisites and limitations:

  • Retention Time: The topic retention must be set to -1 (infinite retention) so that no message has ever been deleted since the topic was created. This is often impractical due to legal regulations or storage limitations.

  • Topic Existence: The topic for exposing data must have existed since the inception of the order service. However, as systems evolve and functionalities are added iteratively, this is rarely the case in reality.

These constraints necessitate alternative solutions to ensure reliable data restoration.
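
For reference, when those prerequisites are met, the reset itself needs nothing beyond Kafka’s own tooling. A minimal sketch using the stock CLI follows; the group and topic names are placeholders, and the consumer group must be stopped while the offsets are reset:

kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --group reporting-service \
  --topic order-events \
  --reset-offsets --to-earliest \
  --execute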

Republishing All Events

One effective solution is to provide a mechanism to republish all events from scratch. This involves reading all data from the source database, transforming it into events, and publishing those events to the topic. Here’s a simple way to implement this:

Providing an Admin Endpoint

Expose an admin endpoint that can trigger the republishing process. For example:

POST /admin/expose-order-events

This endpoint should have limited access, such as being protected behind authentication and authorization layers, to prevent misuse. When called, it starts reading data from the source database, transforms the data into events, and publishes them to Kafka.
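
Here is a minimal sketch of such an endpoint, assuming Spring Boot with Spring Kafka. The OrderReader and OrderSnapshot types are hypothetical stand-ins for whatever read access and serialization your order service already has:

import java.util.List;

import org.springframework.http.ResponseEntity;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical read-only access to the source database.
interface OrderReader {
    List<OrderSnapshot> findAll();
}

// Hypothetical full-state snapshot of an order, already serialized to JSON.
record OrderSnapshot(String orderId, String fullStateJson) {}

@RestController
class OrderAdminController {

    private final OrderReader orderReader;
    private final KafkaTemplate<String, String> kafkaTemplate;

    OrderAdminController(OrderReader orderReader, KafkaTemplate<String, String> kafkaTemplate) {
        this.orderReader = orderReader;
        this.kafkaTemplate = kafkaTemplate;
    }

    @PostMapping("/admin/expose-order-events")
    ResponseEntity<Void> exposeOrderEvents() {
        // Rebuild and publish one full-state event per order, keyed by orderId,
        // so the whole run can safely be repeated.
        for (OrderSnapshot order : orderReader.findAll()) {
            kafkaTemplate.send("order-events", order.orderId(), order.fullStateJson());
        }
        return ResponseEntity.accepted().build();
    }
}

For large datasets you would page or stream through the table and run the job asynchronously rather than loading everything into memory, but the overall shape stays the same.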

Leveraging Idempotent Events

If the events are designed to be idempotent, you can safely call this endpoint multiple times without causing data duplication or side effects. Idempotency ensures that the same event can be processed multiple times with consistent results, making this approach both reliable and robust.
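
On the consuming side, an idempotent write path is what makes repeated republishing harmless. Below is a minimal sketch assuming PostgreSQL, plain JDBC, and a hypothetical report_orders(order_id, payload, last_updated) table; the WHERE guard also prevents an older replayed event from overwriting newer state:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

public class ReportingOrderWriter {

    // Upsert keyed by order_id: replaying the same event rewrites the same row,
    // and the WHERE clause ignores events older than what is already stored.
    private static final String UPSERT = """
        INSERT INTO report_orders (order_id, payload, last_updated)
        VALUES (?, ?::jsonb, ?)
        ON CONFLICT (order_id) DO UPDATE
        SET payload = EXCLUDED.payload,
            last_updated = EXCLUDED.last_updated
        WHERE report_orders.last_updated <= EXCLUDED.last_updated
        """;

    public void upsert(Connection connection, String orderId, String fullStateJson,
                       Timestamp lastUpdated) throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(UPSERT)) {
            statement.setString(1, orderId);
            statement.setString(2, fullStateJson);
            statement.setTimestamp(3, lastUpdated);
            statement.executeUpdate();
        }
    }
}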

By combining these techniques, you can effectively manage data restoration scenarios while maintaining data consistency and reducing the complexity of your architecture.

Summary

This article focused on key recommendations for implementing data streams using Kafka, intentionally avoiding the inclusion of additional tools to keep the architecture as straightforward as possible.

By adhering to a few fundamental principles, you can significantly reduce future effort, minimize unnecessary load on consumers, and prevent potential data inconsistencies in your data warehouse.

To achieve these goals, I recommend focusing on:

  • Ensuring idempotency: Design events with unique identifiers to avoid data duplication and maintain consistency.
  • Streaming full data objects, not just events: Provide the complete state of domain entities to simplify downstream processing.
  • Enabling easy data restoration: Ensure data can be easily restored from the source database to handle failures or rebuild the warehouse efficiently.

By following these principles, you can create a robust and maintainable data pipeline that scales well and integrates seamlessly with your reporting systems.

Let me know your thoughts and opinions!
