What is Data Streaming?
Data streaming is the practice of continuously capturing data in real time from sources such as databases, cloud services, sensors, and software applications, then processing and reacting to it instantly to enable real-time decision-making and insights.
What can data streaming be used for?
Some of its many uses include:
- To process payments and financial transactions in real time, such as in banks.
- To monitor patients in hospital care and predict changes in their condition to ensure timely treatment in emergencies.
- To continuously capture and analyze sensor data from IoT devices and other equipment, such as in factories and wind farms.
What is Apache Kafka?
Apache Kafka is a distributed, highly scalable streaming platform that manages and processes large amounts of data in real time.
Main Concepts and Terminology
Servers: Kafka runs as a cluster of one or more servers that can span multiple datacenters or cloud regions. Some of these servers form the storage layer, called the brokers. Other servers run Kafka Connect to continuously import and export data as event streams, integrating Kafka with your existing systems such as relational databases, as well as with other Kafka clusters.
Event: Also called a record or a message. Events typically contain information about what happened, when it happened, and relevant details.
Topics: Where events are organized and stored. A topic is like a table in a relational database.
Producers: Client applications that publish (write) events to Kafka topics.
Consumers: Client applications that subscribe to (read and process) events from topics.
Partitions: Divisions of a topic for scalability and parallelism. A topic is spread over a number of "buckets" located on different Kafka brokers, which allows client applications to both read and write data from/to many brokers at the same time (see the example after this list).
Replication: The process of duplicating topic partitions across multiple brokers to ensure fault tolerance and high availability.
Connector: A component of Kafka Connect that allows seamless integration between Kafka and external data systems (such as databases, cloud storage, and software applications).
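As a quick preview of how topics, partitions, and replication come together in practice (the individual commands are explained step by step later in this guide), here is a sketch of creating a topic with three partitions, assuming a broker is already running on localhost:9092. The topic name demo-topic is just a placeholder, and with a single broker the replication factor cannot be higher than 1:
$ kafka/bin/kafka-topics.sh --create --topic demo-topic --partitions 3 --replication-factor 1 --bootstrap-server 127.0.0.1:9092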
Here is a step-by-step guide to getting started with Apache Kafka:
Installation
Kafka works best on Linux. If you are on Windows, you can use the Windows Subsystem for Linux (WSL).
Before you start the installation, make sure you have Java (version 11 or 17) installed on your system.
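You can quickly check which Java version (if any) is installed with:
$ java -version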
Download your preferred version of the Kafka binaries and extract the archive:
$ wget https://archive.apache.org/dist/kafka/3.6.0/kafka_2.12-3.6.0.tgz
$ tar -xzf kafka_2.12-3.6.0.tgz
You can rename the directory to your preferred name:
$ mv kafka_2.12-3.6.0 kafka
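If you want to see the command-line tools used in the rest of this guide (such as kafka-topics.sh and kafka-console-producer.sh), they all live in the bin directory:
$ ls kafka/bin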
Start the Kafka environment
Apache Kafka can be started using either KRaft or ZooKeeper. In this guide we will use ZooKeeper.
To start a ZooKeeper server, run the following command:
$ kafka/bin/zookeeper-server-start.sh kafka/config/zookeeper.properties
Open another terminal and run the following command to start the Kafka broker service:
$ kafka/bin/kafka-server-start.sh kafka/config/server.properties
You are now running a Kafka environment that is ready to use!
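If you prefer not to keep two terminals open, both start scripts also accept a -daemon flag to run the process in the background, for example:
$ kafka/bin/kafka-server-start.sh -daemon kafka/config/server.properties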
Create a Topic to store your Events
A topic is like a table in relational databases while events are like the records in the table.
So, before writing events, you first need to create a topic.
Open another terminal and run the following command:
$ kafka/bin/kafka-topics.sh --create --topic topic-name --bootstrap-server 127.0.0.1:9092
By default, Kafka listens on port 9092, and 127.0.0.1 is the IP address of localhost. Replace topic-name with the name you want to give your topic.
You can list the topics created using the following command:
$ kafka/bin/kafka-topics.sh --list --bootstrap-server 127.0.0.1:9092
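You can also inspect a topic's partitions, leader, and replicas with the --describe option:
$ kafka/bin/kafka-topics.sh --describe --topic topic-name --bootstrap-server 127.0.0.1:9092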
Write Events to the Kafka topic
A Kafka client communicates with the Kafka brokers via the network for writing (or reading) events.
Once the brokers receive the events, they will store them in the specified topic for as long as you need.
Run the console producer client to write some events into your topic:
$ kafka/bin/kafka-console-producer.sh --topic topic-name --bootstrap-server 127.0.0.1:9092
>My first event in topic-name
>My second event in topic-name
You can stop the producer client at any time with Ctrl+C.
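By default the console producer sends events without a key. If you want to experiment with keyed events (the key determines which partition an event lands in), the console producer can parse a key and value separated by a character of your choice; the ':' separator and the key user1 below are just examples:
$ kafka/bin/kafka-console-producer.sh --topic topic-name --bootstrap-server 127.0.0.1:9092 --property parse.key=true --property key.separator=:
>user1:My first keyed event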
Consume (Read) the Events from the topic
Run the console consumer client to read the events you just created:
$ kafka/bin/kafka-console-consumer.sh --topic topic-name --from-beginning --bootstrap-server 127.0.0.1:9092
My first event in topic-name
My second event in topic-name
Perfect, both records were successfully sent from the producer to the consumer!
You can stop the consumer client with Ctrl+C.
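If you want to see consumers sharing the work, you can start the console consumer as part of a consumer group (the group name my-group below is just an example); consumers in the same group split the topic's partitions among themselves:
$ kafka/bin/kafka-console-consumer.sh --topic topic-name --group my-group --bootstrap-server 127.0.0.1:9092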