Gursimar Singh

Apache Kafka — The Big Data Messaging tool

Apache Kafka was built for real-time. In programming, we consider everything to be an event. Events carry some state as well, but the primary idea is that an event marks a point in time at which something happened.

Storing events in databases is a little cumbersome. Instead, we use a structure called a log, which is simply an ordered sequence of these events. An event happens, and we append it to the log with a little bit of state and a little bit of description of what happened.

Logs are really easy to think about, and they’re also easy to build at scale. Historically, this has not quite been true of databases.

Apache Kafka is the system responsible for maintaining these logs. It calls them topics, which is a fairly traditional term: a topic is nothing more than an ordered sequence of events.

What precisely is Kafka?

Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of data and lets you pass messages from one end-point to another. Kafka messages can be consumed either offline or online. To prevent data loss, messages are persisted on disk and replicated across the cluster. Kafka is built on top of the ZooKeeper synchronization service, and it integrates very well with Apache Storm and Spark for real-time streaming data analysis.

Okay. Now, what is a Messaging system?

A messaging system handles the sharing and transmission of data between applications, so that the applications can focus on the data itself rather than on how it gets moved around. Distributed messaging is built on reliable message queuing: messages are queued asynchronously between client applications and the messaging system. There are two distinct messaging patterns: point-to-point messaging and publish-subscribe, also known as pub-sub, messaging. The vast majority of messaging systems today use the pub-sub pattern.
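In Kafka terms, the difference between the two patterns mostly comes down to consumer groups: consumers that share a group split a topic's messages between them (point-to-point), while consumers in different groups each receive every message (pub-sub). Here is a minimal sketch assuming the third-party kafka-python client is installed, a broker is running on localhost:9092, and a topic named orders exists; all of that is my own example setup, not something from this article.

from kafka import KafkaConsumer

# Point-to-point style: workers that share a group_id divide the topic's
# partitions between them, so each message is handled by only one worker.
queue_worker = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="billing",            # same group_id on every worker
)

# Pub-sub style: a consumer with a different group_id gets its own copy
# of every message published to the topic.
audit_reader = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="audit",              # distinct group_id = independent subscriber
)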

What are the benefits?

[Image: benefits of Apache Kafka]

Before delving further into Kafka, we need a solid understanding of the primary terms, such as topics, brokers, producers, and consumers. The important components are broken down and illustrated in the following picture.

[Image: Kafka components: topics, brokers, producers, and consumers]

Definitions

Topics: A stream of messages belonging to a particular category is called a topic. Topics are used to organize and store the data. Topics are split into partitions, and Kafka maintains at least one partition for each topic. Each partition contains messages in an immutable, ordered sequence. A partition is implemented as a set of segment files of equal size.

Partition: A topic may have any number of partitions, so it can store and process an arbitrary amount of data. Within a partition, each message is assigned a unique sequential id called the offset.
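For illustration, here is a small sketch that creates a topic with three partitions. It assumes the third-party kafka-python package is installed and a broker is listening on localhost:9092; the topic name demo-events is the same one used in the hands-on section later.

from kafka.admin import KafkaAdminClient, NewTopic

# Create a topic whose data will be spread over three partitions.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(name="demo-events", num_partitions=3, replication_factor=1)
])
admin.close()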

Replicas: Replicas are copies of a partition, not backups. Clients never read from or write to replicas directly; they exist to ensure that data is not lost.

Brokers: Brokers are simple systems responsible for storing the published data. Each broker may hold zero, one, or several partitions per topic. If a topic has N partitions and the cluster has N brokers, each broker will hold one of the partitions.

Producers: Producers are the client applications that publish messages to one or more Kafka topics. Producers send data to Kafka brokers.

Consumers: Consumers read the data served by brokers. A consumer subscribes to one or more topics and consumes the published messages by fetching data from the brokers.
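To make these two roles concrete, here is a minimal sketch, again assuming the kafka-python client and a broker on localhost:9092:

from kafka import KafkaProducer, KafkaConsumer

# Producer: publish one message to the topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("demo-events", b"hello kafka")
producer.flush()                               # block until the message is sent

# Consumer: subscribe to the topic and read from the beginning of the log.
consumer = KafkaConsumer(
    "demo-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for record in consumer:
    print(record.offset, record.value)         # offset and raw message bytes
    break                                      # stop after the first message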

Kafka Clusters: A Kafka deployment with more than one broker is called a Kafka cluster. Nodes can be added to a Kafka cluster without any downtime. These clusters handle the durability and replication of message data.

History of Apache Kafka

LinkedIn created Kafka as a result of the difficult architectural decisions the company had to make over the course of its growth. To turn the project into something capable of supporting the more than 1.4 trillion messages a day that flow through LinkedIn's Kafka infrastructure, they also had to solve several fundamental problems. The engineering team at LinkedIn needed to completely redesign their infrastructure. To support their growing user base and the increasing complexity of the site, they had already migrated from a monolithic application to a microservices architecture. This change allowed the search, profile, communication, and other platforms to scale more effectively. It also led to the introduction of a second set of mid-tier services to provide API access to data models, and back-end services to provide consistent access to the databases.

In the beginning, they built a number of proprietary data pipelines to handle the many streams and queues of data they had. Use cases ranged from simple things like tracking site events such as page views to more complex tasks like compiling aggregated logs from a variety of services. Other pipelines provided queuing for the InMail messaging system and other systems. All of these needed to scale along with the site as it grew. Rather than managing and scaling each pipeline individually, they decided to invest in building a single, distributed pub-sub platform.

Thus, Kafka was born.

Why Kafka?

Let’s look at an illustration to understand this in a simple manner.

Before:

[Image: architecture before Kafka]

After:

[Image: architecture after Kafka]

The figures are largely self-explanatory, so let's not spend time spelling out the obvious.

Use cases

Some of the popular use cases for Apache Kafka include,

  • Messaging

  • Website Activity Tracking

  • Metrics

  • Log Aggregation

  • Stream Processing

  • Event Sourcing

  • Commit Log

Now let’s have a look at some of the companies that make use of Apache Kafka. The list contains some recognisable names, such as the following:

  • Uber

  • LinkedIn

  • Twitter

  • Netflix

  • Pinterest

  • Airbnb

Features

  • Quick: A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.

  • Scalable: Data streams are partitioned and spread over a cluster of machines, which allows for data streams larger than any single machine can handle.

  • Durable: Messages are persisted on disk and replicated across the cluster so that data is never lost. Each broker can handle terabytes of messages without a noticeable impact on performance. (A producer can also ask for stronger delivery guarantees; see the sketch after this list.)

  • Distributed: Kafka has a modern, cluster-centric design that provides high fault tolerance and strong durability.
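As a sketch of the durability trade-off (kafka-python assumed, as in the earlier examples), a producer can ask the brokers to acknowledge a write only after all in-sync replicas have persisted it, trading a little latency for safety:

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",      # wait until all in-sync replicas have the message
    retries=5,       # retry transient failures instead of dropping data
)
producer.send("demo-events", b"durable message")
producer.flush()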

Architectural Flow for Pub-Sub Messaging

Kafka provides a single consumer abstraction that generalises both queuing and publish-subscribe. The pub-sub messaging flow can be broken down into the following steps:

  • Producers send messages to a topic at regular intervals.

  • The Kafka broker stores all messages in the partitions configured for that particular topic and ensures the messages are shared equally between the partitions. If a producer sends two messages and there are two partitions, Kafka stores one message in the first partition and the other in the second.

  • The consumer subscribes to a specific topic.

  • Once the consumer has subscribed to a topic, Kafka provides the consumer with the current offset of the topic and also saves that offset in the ZooKeeper ensemble.

  • The consumer requests new messages from Kafka at regular intervals (for example, every 100 milliseconds).

  • As soon as Kafka receives messages from producers, it forwards them to the subscribed consumers.

  • The consumer receives the message and processes it. Once the messages have been fully processed, the consumer sends an acknowledgement back to the Kafka broker.

  • When Kafka receives the acknowledgement, it advances the offset to the new value and updates it in ZooKeeper. Because offsets are maintained in ZooKeeper, the consumer can read the next message correctly even after a server outage. (A sketch of this acknowledge-and-commit loop follows this list.)

  • This cycle repeats until the consumer stops the request.

  • The consumer can rewind or skip to any offset of a topic at any time and read all subsequent messages.
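The acknowledge-and-advance-the-offset steps above correspond to committing offsets on the consumer side. Here is a rough sketch with kafka-python (assumed, as before); the handle() function is a hypothetical stand-in for your own processing logic:

from kafka import KafkaConsumer

def handle(record):
    # Hypothetical processing step; replace with real business logic.
    print(record.offset, record.value)

consumer = KafkaConsumer(
    "demo-events",
    bootstrap_servers="localhost:9092",
    group_id="demo-group",
    enable_auto_commit=False,      # we acknowledge (commit) manually
)
for record in consumer:
    handle(record)                 # process the message first
    consumer.commit()              # then commit the new offset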

Installation


Step 1: Install Java/Verify Java Installation

Hopefully you already have Java installed on your machine; if so, all you need to do is run the following command to confirm it.

$ java --version

If you have not already installed Java, download the most recent JDK from the following URL: http://www.oracle.com/technetwork/java/javase/downloads/index.html

Step 2: Install Zookeeper Framework

The Kafka brokers and the consumers communicate with one another using ZooKeeper, which acts as the coordination interface.

To install the ZooKeeper framework on your system, download the most recent version of ZooKeeper from the following link: http://zookeeper.apache.org/releases.html

Extract the tar file using the following commands,

$ cd opt/

$ tar -zxf zookeeper-3.x.x.tar.gz

$ cd zookeeper-3.x.x

$ mkdir data

Open the configuration file named conf/zoo.cfg and set the following parameters as a starting point.

$ vi conf/zoo.cfg

# Basic time unit (in milliseconds) used for heartbeats and timeouts
tickTime=3000
# Directory where ZooKeeper stores snapshots and transaction logs
dataDir=/path/to/zookeeper/data
# Port that clients (including the Kafka broker) connect to
clientPort=2181
# Ticks a follower may take to connect and sync with the leader
initLimit=6
# Ticks a follower may lag behind the leader
syncLimit=3

Once the configuration file has been saved and you are back in the terminal, you can start the ZooKeeper server.

$ bin/zkServer.sh start

After executing the above command, you will get a response like this:

JMX enabled by default
Using config: /Users/../zookeeper-3.x.x/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED

Now, let’s run the CLI

$ bin/zkCli.sh

After executing the command, you will be connected to the ZooKeeper server and see a response like this:

Connecting to localhost:2181 
................ 
................ 
................ 
Welcome to ZooKeeper! 
................ 
................ 
WATCHER:: 
WatchedEvent state:SyncConnected type: None path:null 
[zk: localhost:2181(CONNECTED) 0]

After connecting to the server and performing the desired operations, you can stop the ZooKeeper server with the following command,

$ bin/zkServer.sh stop

Now that you have successfully installed Java and ZooKeeper on the machine, let us look at the steps to install Apache Kafka.

Step 3: Install Kafka

To install Apache Kafka on the machine, download the tar file from the official Apache Kafka downloads page.

Extract the tar file,

$ tar -zxf kafka_x.x.x.x.x.x.tar.gz

$ cd kafka_x.x.x.x.x.x

Step 4: Run the server

Now that you have downloaded Apache Kafka on the machine, you can start the server,

$ bin/kafka-server-start.sh config/server.properties

After the server starts, you will see a response like this on the screen:

 $  bin/kafka-server-start.sh  config/server.properties 
  INFO KafkaConfig values: 
  request.timeout.ms = xxxxx 
  log.roll.hours = xxx 
  inter.broker.protocol.version = x.x.x.x 
  log.preallocate = false 
  security.inter.broker.protocol = PLAINTEXT 
  ……………………………………………. 
  …………………………………………….

Produce and consume a few messages

Kafka is a platform for distributed event streaming that enables users to read, write, store, and process events (sometimes referred to as records or messages in documentation) over a large number of machines.

Payment transactions, geolocation updates from mobile phones, shipment orders, sensor measurements from internet-of-things (IoT) devices or medical equipment, and many more are all examples of events. These events are organised and stored under their respective topics. At its most basic level, a topic can be compared to a folder in a file system, and the events are the files inside that folder.

Launch a new session of the terminal and type in:

$ bin/kafka-topics.sh --create --topic demo-events --bootstrap-server localhost:9092

All of Kafka’s command line tools have additional options: run the kafka-topics.sh command without any arguments to display usage information.
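For example, one handy option is --describe, which prints the partitions, leaders, and replicas of the topic you just created:

$ bin/kafka-topics.sh --describe --topic demo-events --bootstrap-server localhost:9092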

Open two terminal windows,

Producer: $ bin/kafka-console-producer.sh --topic demo-events --bootstrap-server localhost:9092

Consumer: $ bin/kafka-console-consumer.sh --topic demo-events --from-beginning --bootstrap-server localhost:9092

Position the producer terminal and the consumer terminal side by side. Then type a few messages into the producer terminal and watch them appear in the consumer terminal.
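For example (the messages here are just made-up samples), the console producer shows a > prompt for each line you type, and each message shows up in the consumer terminal almost immediately:

Producer terminal:

> This is my first event
> This is my second event

Consumer terminal:

This is my first event
This is my second event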

How to stop and exit the Apache Kafka environment?

When you are finished experimenting with Kafka, follow these steps to shut down the Apache Kafka environment:

  1. Stop the consumer and producer clients using Ctrl+C

  2. Stop the Kafka broker using Ctrl+C

  3. Stop the ZooKeeper server using Ctrl+C

  4. Run the following command for cleaning up:

rm -rf /tmp/kafka-logs /tmp/zookeeper

Conclusion

We have covered the basics including a hands-on demo for getting started with Kafka. There is a lot more that goes into Apache Kafka. I hope you’ve enjoyed it and have learned something new.

I’m always open to suggestions and discussions on LinkedIn. Hit me up with direct messages.

Till the next one, stay safe and keep learning.

Top comments (1)

Hadi Samadzad

Can you mention some UI tools to work with like a kafka dashboard?