Eric Katumo for LuxDevHQ

The Ultimate Guide to Apache Kafka: Basics, Architecture, and Core Concepts

Introduction

Apache Kafka is a widely used open-source platform for distributed event streaming, supporting high-performance data pipelines, streaming analytics, data integration, and mission-critical applications at thousands of companies (https://kafka.apache.org/).

Originally developed at LinkedIn, Kafka is renowned for its high throughput, scalability, and durability. It enables real-time data processing and is a key component in modern event-driven architectures.

Kafka Architecture

Brokers

A single Kafka server is called a Kafka Broker. Each broker operates as an independent process on a distinct machine, communicating with other brokers via a reliable and high-speed network.

A cluster can contain any number of brokers, but three is the typical minimum: with three, one broker can be taken down for maintenance while another fails unexpectedly, and the cluster can still serve requests.
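
As a minimal sketch, the Java AdminClient can list the brokers that make up a cluster; the localhost:9092 bootstrap address is an assumption for a local setup:

```java
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;

public class ListBrokers {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Any reachable broker works as a bootstrap address; localhost:9092 is assumed here.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // describeCluster() returns metadata about every broker currently in the cluster.
            for (Node node : admin.describeCluster().nodes().get()) {
                System.out.printf("Broker id=%d at %s:%d%n", node.id(), node.host(), node.port());
            }
        }
    }
}
```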

Producers

A Producer is a client application that publishes (writes) events to a Kafka cluster. The Kafka producer is responsible for creating messages of the appropriate structure and sending them using the Kafka protocol. It has several configuration options to control message creation and delivery.
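
A minimal producer sketch in Java; the topic name "orders", the key, and the bootstrap address are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("user-123") determines which partition the event lands on.
            producer.send(new ProducerRecord<>("orders", "user-123", "order created"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace();
                        } else {
                            System.out.printf("Sent to partition %d at offset %d%n",
                                    metadata.partition(), metadata.offset());
                        }
                    });
        } // closing the producer flushes any buffered records
    }
}
```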

Consumers

The Kafka consumer works by issuing “fetch” requests to the brokers leading the partitions it wants to consume. The consumer receives back a chunk of the log containing the messages in that partition, beginning from the requested offset position.

A consumer group is a set of consumers that cooperate to consume data from one or more topics.
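
A minimal consumer sketch in Java; the topic name "orders", the group id "order-processors", and the bootstrap address are assumptions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");        // assumed group name
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                // poll() issues fetch requests to the partition leaders and returns batches of records.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```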

Topic

Kafka topics are the categories used to organize messages. Kafka handles each topic as an independent log, so a consumer can subscribe to a specific topic and receive only the messages published to it.

In Kafka, topics are partitioned and replicated across brokers, the individual nodes of a Kafka cluster.
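
Topics are usually created with an explicit partition count and replication factor; a hedged sketch using the Java AdminClient (the topic name "orders" and both counts are assumptions):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

        try (Admin admin = Admin.create(props)) {
            // 3 partitions spread the topic across brokers; replication factor 3 keeps
            // a copy of each partition on three different brokers.
            NewTopic topic = new NewTopic("orders", 3, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```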

Cluster

A Kafka cluster is a group of interconnected Kafka brokers that work together to manage the data streams entering and leaving the system. As user activity increases, so does the need for additional brokers to cope with the volume and velocity of the incoming data streams.

Kafka clusters enable the replication of data partitions across multiple brokers, ensuring high availability even in the case of node failures. Your data pipeline remains robust and responsive to fluctuating demand.

Partitions

Partitions are essential components within Kafka's distributed architecture that enable Kafka to scale horizontally, allowing for efficient parallel data processing. They are the building blocks for organizing and distributing data across the Kafka cluster.

Each partition can have multiple replicas spread across different brokers, guaranteeing fault tolerance and data redundancy.
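
To see how a topic's partitions, leaders, and replicas are laid out across brokers, the AdminClient can describe a topic; a sketch reusing the assumed "orders" topic:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class DescribePartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

        try (Admin admin = Admin.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singletonList("orders"))
                    .allTopicNames().get().get("orders");
            for (TopicPartitionInfo p : desc.partitions()) {
                // Each partition has one leader broker and a set of replica brokers.
                System.out.printf("partition=%d leader=%d replicas=%s%n",
                        p.partition(), p.leader().id(), p.replicas());
            }
        }
    }
}
```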

KRaft

KRaft is the consensus protocol introduced in KIP-500 to remove Apache Kafka’s dependency on ZooKeeper for metadata management. It leverages the Raft consensus algorithm to manage metadata and handle leader election natively within Kafka.

This eliminates the dependency on an external coordination system, allowing Kafka to function as a self-contained system.
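
A hedged sketch of the KRaft-related settings in a broker's server.properties; the values are illustrative for a single-node local setup, not a complete configuration:

```properties
# This node acts as both a broker and a KRaft controller (combined mode).
process.roles=broker,controller
node.id=1
# The voters in the Raft quorum, as id@host:port of each controller.
controller.quorum.voters=1@localhost:9093
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
controller.listener.names=CONTROLLER
```

Before the first start, the storage directory also has to be formatted with the kafka-storage.sh tool, as shown in the Kafka quickstart.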

Core Concepts

Events

Imagine a digital logbook that keeps track of everything important happening in your system. This logbook is filled with "events": a record of each significant action or change. In Kafka, these events are the core of how data is stored and shared.

Think of an event as a single entry in this logbook, with a few key pieces of information (illustrated in the sketch after this list):

  • Key: A unique identifier for the event. This helps you categorize or group related events.
  • Value: The actual details of what happened.
  • Timestamp: When the event occurred.
  • Headers (optional): Extra bits of information about the event.
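
These fields map directly onto the Java client's ProducerRecord; a minimal sketch where the topic, key, value, and header contents are all assumptions:

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventFields {
    public static void main(String[] args) {
        ProducerRecord<String, String> event = new ProducerRecord<>(
                "orders",                    // topic (assumed)
                null,                        // partition: left null so the key decides
                System.currentTimeMillis(),  // timestamp: when the event occurred
                "order-42",                  // key: groups related events together
                "{\"status\": \"shipped\"}"  // value: the actual details of what happened
        );
        // Optional headers carry extra bits of metadata about the event.
        event.headers().add("source", "warehouse-7".getBytes(StandardCharsets.UTF_8));
        System.out.println(event);
    }
}
```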

Replication

Replication is the process of keeping multiple copies of the data so that it remains available if one of the brokers goes down and can no longer serve requests. Copies of a partition are maintained on multiple broker instances using the partition’s write-ahead log.

The write-ahead log is where all the messages for that partition are stored in order, and each message is identified by its unique offset.
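
On the producer side, how writes interact with replication is commonly tuned through the acks setting; a hedged sketch of the relevant client properties (not a complete producer):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ReplicationSettings {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        // acks=all: the partition leader only acknowledges a write after all
        // in-sync replicas have appended it to their logs.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Idempotence avoids duplicate messages when a send is retried after a failure.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        System.out.println(props);
    }
}
```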

Offsets

The consumer offset tracks a consumer's position within a partition, i.e. the sequential order in which messages are read. Keeping track of the offset, or position, is important for nearly all Kafka use cases and can be an absolute necessity in certain instances, such as financial services.

The Kafka consumer offset allows processing to continue from where it last left off if the stream application is turned off or if there is an unexpected failure. In other words, by having the offsets persist in a data store, data continuity is retained even when the stream application shuts down or fails.
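
A sketch of committing offsets manually, so that processing resumes from the last committed position after a restart; the topic and group names are the same assumptions as above:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualOffsets {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Disable auto-commit: offsets only advance after records are actually processed.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // placeholder for the application's own logic
                }
                // Persist the new position; on restart, consumption resumes from here.
                consumer.commitSync();
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        System.out.printf("processing offset %d%n", record.offset());
    }
}
```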

Consumer Groups

Consumer groups allow Kafka consumers to work together and process events from a topic in parallel. Each topic consists of one or more partitions. When a new consumer is started it will join a consumer group (this happens under the hood) and Kafka will then ensure that each partition is consumed by only one consumer from that group.

So, if you have a topic with two partitions and only one consumer in a group, that consumer consumes records from both partitions. After a second consumer joins the same group, each consumer consumes records from only one of the partitions.
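
These assignments change hands automatically during a group rebalance. As a minimal sketch, the Java client's ConsumerRebalanceListener can log what each member is given (the topic name is an assumption):

```java
import java.util.Collection;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RebalanceLogger {
    // Pass this listener when subscribing; Kafka invokes it whenever the group rebalances.
    static void subscribeWithListener(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("orders"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                System.out.println("Revoked: " + partitions); // taken away before reassignment
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                System.out.println("Assigned: " + partitions); // this member's share of the topic
            }
        });
    }
}
```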

Retention

Kafka retention provides the ability to control the size of topic logs and avoid outgrowing the available disk. Retention can be configured based on the size of the logs (size-based retention) or based on a configured duration (time-based retention).

Retention can also be set as a cluster-wide default for all topics or configured per topic, so each topic's retention can be matched to its nature.
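
Per-topic retention is just topic configuration; a sketch that creates a topic with both time- and size-based limits (the topic name "clickstream" and the limit values are illustrative assumptions):

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class RetentionTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

        try (Admin admin = Admin.create(props)) {
            NewTopic topic = new NewTopic("clickstream", 3, (short) 3).configs(Map.of(
                    "retention.ms", "604800000",     // time-based: keep messages for 7 days
                    "retention.bytes", "1073741824"  // size-based: cap each partition's log at 1 GiB
            ));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```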

Conclusion

Apache Kafka is a powerful distributed streaming platform that combines messaging and storage capabilities. Its architecture, featuring brokers, topics, and partitions, delivers scalability and fault tolerance. Kafka's core concepts, such as consumer groups and offsets, enable efficient and reliable stream processing.

With its ability to handle high-volume data streams and support real-time applications, Kafka is a crucial component of modern data architectures. It empowers developers to build robust, data-driven applications that address the challenges of today's data-intensive world. To get started with Apache Kafka, visit https://kafka.apache.org/quickstart
