Apache Kafka Tutorial

IONOS editorial team11/15/202210 mins

The Apache Kafka open source software is one of the best solutions for storing and processing data streams. This messaging and streaming platform, which is licensed under Apache 2.0, features fault tolerance, excellent scalability, and a high read and write speed. These factors, which are highly appealing for Big Data applications, are based on a cluster which allows distributed data storage and replication. There are four different interfaces for communicating with the cluster with a simple TCP protocol serving as the basis for communication.

This Kafka tutorial explains how to get started with the Scala-based application, beginning with installing Kafka and the Apache ZooKeeper software required to use it.

Requirements for using Apache Kafka

To run a powerful Kafka cluster, you will need the right hardware. The development team recommends using quad-core Intel Xeon machines with 24 gigabytes of memory. It is essential that you have enough memory to always cache the read and write accesses for all applications that actively access the cluster. Since Apache Kafka’s high data throughput is one of its draws, it is crucial to choose a suitable hard drive. The Apache Software Foundation recommends a SATA hard drive (8 x 7200 RPM). When it comes to avoiding performance bottlenecks, the following principle applies: the more hard drives, the better.

In terms of software, there are also some requirements that must be met in order to use Apache Kafka for managing incoming and outgoing data streams. When choosing your operating system, you should opt for a Unix operating system, such as Solaris, or a Linux distribution. This is because Windows platforms receive limited support. Apache Kafka is written in Scala, which compiles to Java bytecode, so you will need the latest version of the Java SE Development Kits (JDK) installed on your system. This also includes the Java Runtime Environment, which is required for running Java applications. You will also need the Apache ZooKeeper service, which synchronizes distributed processes.

To display this video, third-party cookies are required. You can access and change your cookie settings here.

Apache Kafka tutorial: how to install Kafka, ZooKeeper, and Java

For an explanation of what software is required, see the previous part of this Kafka tutorial. If it is not already installed on your system, we recommend installing the Java Runtime Environment first. Many newer versions of Linux distributions, such as Ubuntu, which is used as an example operating system in this Apache Kafka tutorial (Version 17.10), already have OpenJDK, a free implementation of JDK, in their official package repository. This means that you can easily install the Java Development Kit via this repository by typing the following command into the terminal:

sudo apt-get install openjdk-8-jdk

Immediately after installing Java, install the Apache ZooKeeper process synchronization service. The Ubuntu package repository also provides a ready-to-use package for this service, which can be executed using the following command line:

sudo apt-get install zookeeperd

You can then use an additional command to check whether the ZooKeeper service is active:

sudo systemctl status zookeeper

If Apache ZooKeeper is running, the output will look like this:

Under the “Active” entry, you can find out whether ZooKeeper is active and how long it has been active.

If the synchronization service is not running, you can start it at any time using this command:

sudo systemctl start zookeeper

To ensure that ZooKeeper will always launch automatically at startup, add an autostart entry at the end:

sudo systemctl enable zookeeper

Finally, create a user profile for Kafka, which you will need to use the server later. To do so, open the terminal again and enter the following command:

sudo useradd kafka -m

Using the passwd password manager, you can then add a password to the user by typing the following command followed by the desired password:

sudo passwd kafka

Next, grant the user “kafka” sudo rights:

sudo adduser kafka sudo

You can now log in at any time with the newly-created user profile:

su – kafka

We have arrived at the point in this tutorial where it is time to download and install Kafka. There are a number of trusted sources where you can download both older and current versions of the data stream processing software. For example, you can obtain the installation files directly from the Apache Software Foundation’s download directory. It is highly recommended that you work with a current version of Kafka, so you may need to adjust the following download command before entering it into the terminal:

wget http://www.apache.org/dist/kafka/2.1.0/kafka_2.12-2.1.0.tgz

Since the downloaded file will be compressed, you will then need to unpack it:

sudo tar xvzf kafka_2.12-2.1.0.tgz --strip 1

Use the --strip 1 flag to ensure that the extracted files are saved directly to the ~/kafka directory. Otherwise, based on the version used in this Kafka tutorial, Ubuntu would store all files in the ~/kafka/kafka_2.12-2.1.0 directory. To do this, you must have previously created a directory named “kafka” using mkdir and switched to it (via “cd kafka”).

Kafka: how to set up the streaming and messaging system

Now that you have installed Apache Kafka, the Java Runtime Environment and ZooKeeper, you can run the Kafka service at any time. Before you do this, however, you should make a few small adjustments to its configurations so that the software is optimally configured for its upcoming tasks.

Enable deleting topics

Kafka does not allow you to delete topics (i.e. the storage and categorization components in a Kafka cluster) in its default set-up. However, you can easily change this by using the server.properties Kafka configuration file. To open this file, which is located in the config directory, use the following terminal command in the default text editor nano:

sudo nano ~/kafka/config/server.properties

At the end of this configuration file, add a new entry, which enables you to delete topics:

delete.topic.enable=true

In the configuration file, you can also change some other settings such as which TCP port (default: 2181) ZooKeeper uses.

Tip

Remember to save the new entry in the Kafka configuration file before closing the nano text editor again

Creating .service files for ZooKeeper and Kafka

The next step in the Kafka tutorial is to create unit files for ZooKeeper and Kafka that allow you to perform common actions such as starting, stopping and restarting the two services in a manner consistent with other Linux services. To do so, you need to create and set up .service files for the systemd session manager for both applications.

How to create the appropriate ZooKeeper file for the Ubuntu systemd session manager

First, create the file for the ZooKeeper synchronization service by entering the following command in the terminal:

sudo nano /etc/systemd/system/zookeeper.service

This will not only create the file but also open it in the nano text editor. Now, enter the following lines and then save the file.

[Unit]
Requires=network.target remote-fs.target
After=network.target remote-fs.target
[Service]
Type=simple
User=kafka
ExecStart=/home/kafka/kafka/bin/zookeeper-server-start.sh /home/kafka/kafka/config/zookeeper.properties
ExecStop=/home/kafka/kafka/bin/zookeeper-server-stop.sh
Restart=on-abnormal
[Install]
WantedBy=multi-user.target

As a result, systemd will understand that ZooKeeper requires the network and the file system to be ready before it can start. This is defined in the [Unit] section. The [Service] section specifies that the session manager should use the files zookeeper-server-start.sh and zookeeper-server-stop.sh to start and stop ZooKeeper. It also specifies that ZooKeeper should be restarted automatically if it stops unexpectedly. The [Install] entry controls when the file is started using ".multi-user.target” as the default value for a multi-user system (e.g. a server).

How creating a Kafka file for the Ubuntu systemd session manager works

To create the .service file for Apache Kafka, use the following terminal command:

sudo nano /etc/systemd/system/kafka.service

Then, copy the following content into the new file that has already been opened in the nano text editor:

[Unit]
Requires=zookeeper.service
After=zookeeper.service
[Service]
Type=simple
User=kafka
ExecStart=/bin/sh -c '/home/kafka/kafka/bin/kafka-server-start.sh /home/kafka/kafka/config/server.properties > /home/kafka/kafka/kafka.log 2>&1'
ExecStop=/home/kafka/kafka/bin/kafka-server-stop.sh
Restart=on-abnormal
[Install]
WantedBy=multi-user.target

The [Unit] section in this file specifies that the Kafka service depends on ZooKeeper. This ensures that the synchronization service is automatically started whenever the kafka.service file is run. The [Service] section specifies that the kafka-server-start.sh and kafka-server-stop.sh shell files should be used for starting and stopping the Kafka server. You can also find the specification for an automatic restart after an unexpected disconnection as well as the multi-user entry in this file.

Lines such as ExecStart contain the commands for their respective service. Within the .service files, these are identified through their capitalization (at the start of the word and in the middle of it).

Kafka: launching for the first time and creating an autostart entry

Once you have successfully created the session manager entries for Kafka and ZooKeeper, you can start Kafka with the following command:

sudo systemctl start kafka

By default, the systemd program uses a central protocol or journal in which all log messages are automatically written. As a result, you can easily check whether the Kafka server has been started as desired:

sudo journalctl -u kafka

The output should look something like this:

The systemd session manager has been an integral part of the Linux operating system since Ubuntu 15.04, which is why all required components are installed by default in current versions.

If you have successfully started Apache Kafka manually, enable finish by activating automatic start during system boot:

sudo systemctl enable kafka

Apache Kafka tutorial: getting started with Apache Kafka

This part of the Kafka tutorial involves testing Apache Kafka by processing an initial message using the messaging platform. To do so, you need a producer and a consumer (i.e. an instance which enables you to write and publish data to topics and an instance which can read data from a topic). First of all, you need to create a topic, which in this case should be called TutorialTopic. Since this is a simple test topic, it should only contain a single partition and a single replica:

> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic TutorialTopic

Next, you need to create a producer that adds the first example message "Hello, World!" to the newly created topic. To do so, use the kafka-console-producer.sh shell script, which needs the Kafka server’s host name and port (in this example, Kafka’s default path) as well as the topic name as arguments:

echo "Hello, World!" | ~/kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic TutorialTopic > /dev/null

Next, using the kafka-console-consumer.sh script, create a Kafka consumer that processes and displays messages from TutorialTopic. You will need the Kafka server’s host name and port as well as the topic name as arguments. In addition, the “--from-beginning” argument is attached so that the consumer can actually process the “Hello, World!” message, which was published before the consumer was created:

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic TutorialTopic --from-beginning

As a result, the terminal presents the “Hello, World!" message with the script running continuously and waiting for more messages to be published to the test topic. So, if the producer is used in another terminal window for additional data input, you should also see this reflected in the window in which the consumer script is running.

You can stop running consumer scripts at any time by pressing [Ctrl] + [C].

Apache Hadoop

Are you looking to execute complicated computing processes with large amounts of data? This is exactly what the Big Data framework, Hadoop, focuses on. The open source Apache software offers a Java based frame with which various Big Data applications on computer clusters can be…

ToolsLinuxStorage SystemsApache

Apache Lucene Tutorial

Who wouldn’t want to build their own search engine that’s adapted precisely to their requirements? Apache Lucene makes this possible. The open source project can be precisely adjusted and also works extremely quickly, which is why even large companies such as Twitter rely on…

BrowserToolsOpen SourceApache

How to install and run an Apache web server

The Apache http server is the standard web server for providing HTTP documents on the internet. But did you know it’s also possible to install an apache web server locally on your Windows PC? This is a great method of testing out web pages and checking scripts. To get started,…

ToolsTutorialsNetworkApache

Kubernetes Tutorial for Beginners

Kubernetes helps you manage containers – if you know how it works. But the first steps in particular can often be difficult. We explain the installation and most important features in our Kubernetes tutorial simply and concisely. Learn step by step how you can create a cluster…

Kubernetes

Redis Tutorial

Working with this in-memory database is not that difficult, but without an easily understandable Redis tutorial, one can quickly become discouraged. Here you will learn how to install the database, configure it, and enter data. We will also show you how to add, get, and delete…

Tutorials