In this article series, we will learn Kafka basics, Kafka delivery semantics, and configuration to achieve different semantics, Spark Kafka integration, and optimization. In Part 1 of this series we’ll look at Kafka basics.
Some of the problems Kafka addresses include:
- Integrating many source and target systems. Integration at this scale generally involves complexities like dealing with many protocols, message formats, etc.
- Handling high-volume message streams.
Some of the use cases include:
- Stream processing
- Tracking user activity, log aggregation, etc.
- De-coupling systems
What Is Kafka?
Kafka is a horizontally scalable, fault-tolerant, and fast messaging system. It follows a pub-sub model in which various producers write messages and various consumers read them. It decouples source and target systems. Some of the key features are:
- Scales to hundreds of nodes.
- Can handle millions of messages per second.
- Real-time processing (~10ms).
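To make the pub-sub and decoupling ideas concrete, here is a minimal in-memory sketch (not real Kafka code; class and method names are invented for illustration) showing how producers append to a topic log while independent consumer groups each read the full stream at their own pace:

```python
from collections import defaultdict

class MiniBroker:
    """Toy in-memory stand-in for a Kafka broker: producers append to a
    topic's log, and each consumer group tracks its own read position."""
    def __init__(self):
        self.topics = defaultdict(list)           # topic -> append-only log
        self.consumer_offsets = defaultdict(int)  # (group, topic) -> next offset

    def produce(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, group, topic):
        # Each group reads from where it left off; groups do not affect
        # each other, which is what decouples the systems involved.
        offset = self.consumer_offsets[(group, topic)]
        log = self.topics[topic]
        self.consumer_offsets[(group, topic)] = len(log)
        return log[offset:]

broker = MiniBroker()
broker.produce("user-activity", "page_view")
broker.produce("user-activity", "click")

# Two independent consumer groups each see the full stream:
print(broker.consume("analytics", "user-activity"))  # ['page_view', 'click']
print(broker.consume("audit", "user-activity"))      # ['page_view', 'click']
```

The producer never needs to know how many downstream systems exist, and a new consumer group can be added at any time without touching the producer.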
Topic, Partitions, and Offsets
A topic is a specific stream of data. It is very similar to a table in a NoSQL database. Like tables in a NoSQL database, a topic is split into partitions that enable it to be distributed across various nodes. Like a primary key in a table, each message within a partition has an offset. You can uniquely identify a message by its topic, partition, and offset.
Partitions enable topics to be distributed across the cluster. Partitions are a unit of parallelism for horizontal scalability. One topic can have more than one partition scaling across nodes.
Messages are assigned to partitions based on partition keys; if a message has no key, its partition is assigned round-robin. It's important to choose keys carefully to avoid hotspots (partitions that receive a disproportionate share of traffic).
Each message in a partition is assigned an incremental id called an offset. Offsets are unique per partition and messages are ordered only within a partition. Messages written to partitions are immutable.
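The key-to-partition mapping and per-partition offsets can be sketched as follows. This is not Kafka's actual client code: Kafka's default partitioner hashes keys with murmur2, while this sketch uses `crc32` purely to keep the example deterministic.

```python
import zlib

NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]  # one append-only log each

def send(key, value):
    # The same key always hashes to the same partition, so ordering
    # is preserved per key (but only within that partition).
    p = zlib.crc32(key.encode()) % NUM_PARTITIONS
    partitions[p].append(value)
    offset = len(partitions[p]) - 1  # incremental id within the partition
    return p, offset

p1, o1 = send("user-42", "login")
p2, o2 = send("user-42", "logout")
assert p1 == p2              # same key, same partition
assert (o1, o2) == (0, 1)    # offsets grow monotonically per partition
```

Note that nothing orders messages across partitions; if a consumer needs all events for one entity in order, those events must share a key.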
The diagram below shows the architecture of Kafka.
ZooKeeper is a centralized service for managing distributed systems. It offers a hierarchical key-value store, configuration, synchronization, and naming registry services to the distributed systems it manages. ZooKeeper acts as the ensemble layer (it ties things together) and ensures high availability of the Kafka cluster. Kafka nodes are also called brokers. It's important to understand that Kafka cannot work without ZooKeeper.
From the list of ZooKeeper nodes, one node is elected as the leader and the rest of the nodes follow it. If the leader node fails, one of the followers is elected as the new leader. An ensemble of more than one node is strongly recommended for high availability, and more than seven is not recommended.
ZooKeeper stores metadata and the current state of the Kafka cluster. For example, details like the topic name, the number of partitions, the replication factor, partition leaders, and consumer group details are stored in ZooKeeper. You can think of ZooKeeper as a project manager who manages resources in the project and remembers the state of the project.
Key things to remember:
- Manages list of brokers.
- Elects broker leaders when a broker goes down.
- Sends notifications on a new broker, new topic, deleted topic, lost brokers, etc.
- From Kafka 0.10 on, consumer offsets are not stored in ZooKeeper; only the metadata of the cluster is stored in ZooKeeper.
- The leader node in ZooKeeper handles all writes, and the follower nodes handle only reads.
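For intuition, the cluster state Kafka keeps in ZooKeeper lives under a handful of znode paths, roughly like the following (an illustrative subset, not a complete listing; the topic name is a placeholder):

```
/brokers/ids/0            # one ephemeral znode per live broker (host, port)
/brokers/topics/topic-B   # partition-to-replica assignments for the topic
/controller               # which broker is currently the active controller
/config/topics/topic-B    # per-topic configuration overrides
```

Because broker znodes are ephemeral, they disappear automatically when a broker's session ends, which is how ZooKeeper detects lost brokers and triggers the notifications listed above.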
A broker is a single Kafka node, managed by ZooKeeper. A set of brokers forms a Kafka cluster. Topics created in Kafka are distributed across brokers based on the partitioning, replication, and other factors. When a broker node fails, the cluster automatically rebalances based on the state stored in ZooKeeper, and if a leader partition is lost, one of the follower partitions is elected as the new leader.
You can think of a broker as a team lead who takes care of the assigned tasks; if a team lead isn't available, the manager assigns those tasks to other team members.
Replication makes a copy of a partition available on another broker, which enables Kafka to be fault tolerant. When a partition of a topic is available on multiple brokers, the copy on one broker is elected as the leader and the remaining replicas are followers.
Replication enables Kafka to stay available even when a broker is down. For example, if Topic B partition 0 is stored on both broker 0 and broker 1, producers and consumers are served only by the leader. In case of a broker failure, the replica on another broker is elected as the new leader and starts serving the producers and consumer groups. Replica partitions that are in sync with the leader are flagged as ISR (In-Sync Replica).
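The failover described above can be sketched as a toy simulation. This is not Kafka's controller logic (which is richer, e.g. unclean leader election is configurable); it only illustrates the rule that a new leader is chosen from the ISR so no committed messages are lost:

```python
# State for a single partition: broker-0 leads, broker-2 lags behind
# and is therefore not in the ISR.
replicas = ["broker-0", "broker-1", "broker-2"]
isr = ["broker-0", "broker-1"]

def elect_leader(failed_broker):
    """Pick a new leader for the partition after a broker failure."""
    if failed_broker in isr:
        isr.remove(failed_broker)
    # Only an in-sync replica may become leader, since it has every
    # message the old leader had committed.
    for replica in replicas:
        if replica in isr:
            return replica
    raise RuntimeError("no in-sync replica available")

print(elect_leader("broker-0"))  # broker-1 takes over; broker-2 stays a follower
```

If broker-1 were also lost, the election would fail rather than promote the out-of-sync broker-2, which is exactly the trade-off the ISR flag exists to enforce.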
IT Team and Kafka Cluster Analogy
The diagram below depicts an analogy of an IT team and Kafka cluster.
Below is the summary of core components in Kafka.
- ZooKeeper manages Kafka brokers and their metadata.
- Brokers are horizontally scalable Kafka nodes that contain topics and their replicas.
- Topics are message streams with one or more partitions.
- Partitions contain messages with unique offsets per partition.
- Replication enables Kafka to be fault tolerant using follower partitions.
Refer to the Kafka quickstart guide for Kafka setup.