Key role of key: How multiple partitions are used to distribute data records in Kafka
Apache Kafka is a distributed streaming platform that:
- Publishes and subscribes to streams of records, similar to a message queue or enterprise messaging system.
- Stores streams of records in a fault-tolerant durable way.
- Processes streams of records as they occur.
Every Kafka data record has two main parts: a key and a value.
The key is optional in any data record; if it is not provided, Kafka treats it as null.
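As a minimal sketch of this (the topic name "orders" and the broker address are assumptions, not something fixed by Kafka), here is how a record is created with and without a key using the Java producer client:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class KeyedRecordExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Record with a key: the key is hashed to choose the partition.
            producer.send(new ProducerRecord<>("orders", "Kellogs", "corn flakes, qty 12"));

            // Record without a key: Kafka treats the key as null and
            // chooses the partition itself.
            producer.send(new ProducerRecord<>("orders", "espresso beans, qty 3"));
        }
    }
}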
Before we go further into the crucial role the key plays in Kafka, let's understand a few basic concepts about how Kafka stores data published by a producer and how it gets consumed by a consumer group.
Topic
Data records are organized and durably stored in topics. Very simplified, a topic is similar to a folder in a filesystem, and the data records are the files in that folder.
Partition
A topic is divided into partitions, which are distributed across different Kafka brokers, potentially across multiple data centres. This distribution of data across partitions is required for scalability because it allows client applications to both read and write data from/to many brokers at the same time. When a new data record is published to a topic, it is actually appended to one of the topic's partitions. Each topic can have one or more partitions.
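For illustration, a topic with more than one partition can be created with the Java AdminClient; this is only a sketch, and the topic name, partition count and replication factor are assumptions:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Properties;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 1 (a single-broker development setup)
            NewTopic topic = new NewTopic("orders", 3, (short) 1);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}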
Offset
Each partition stores the messages it has received in sequential order, and each message is identified by an offset, also known as a position. This allows a consumer to read the messages in a sequential manner.
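One way to see offsets is to look at the RecordMetadata the producer gets back for each send: it reports the partition the record was appended to and the offset it was given. The topic name and broker address below are again assumptions:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class OffsetExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 3; i++) {
                RecordMetadata meta = producer
                        .send(new ProducerRecord<>("orders", "Kellogs", "order-" + i))
                        .get();
                // Offsets grow sequentially within a partition: 0, 1, 2, ...
                System.out.printf("partition=%d offset=%d%n", meta.partition(), meta.offset());
            }
        }
    }
}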
Consumer group
A consumer group is a set of consumers which cooperate to consume data from some topics. The partitions of all the topics are divided among the consumers in the group. As new group members arrive and old members leave, the partitions are re-assigned so that each member receives a proportional share of the partitions. This is known as rebalancing the group.
As a consumer in the group reads messages from the partitions assigned by the coordinator, it must commit the offsets corresponding to the messages it has read. If the consumer crashes or is shut down, its partitions will be re-assigned to another member, which will begin consumption from the last committed offset of each partition.
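A minimal consumer-group sketch, assuming a topic "orders" and a group "order-processors": the group.id ties the members into one group, poll() returns records from the partitions assigned to this member, and commitSync() saves the offsets so that another member can resume from there after a rebalance.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ConsumerGroupExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "order-processors");   // members sharing this id split the partitions
        props.put("enable.auto.commit", "false");    // we commit offsets explicitly below
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
                // Commit the offsets of the records processed so far.
                consumer.commitSync();
            }
        }
    }
}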
Distribution of data records among partitions in a topic
To understand how data is distributed among partitions using keys, assume we have one topic with 2 partitions and the following data records:
Kafka will divide the records equally between the two partitions:
Now if we get another 2 records as below:
Because the hash of the key decides the partition, the Kellogs records will go to partition 1, while the Starbucks record might go to P1 or P2. Assume it goes to P2, as follows:
Now suppose we add another partition to the topic. Adding a partition does not move the existing records, but from this point on the hash of a key is taken modulo 3 instead of 2, so records for the same key may start landing on a different partition than before.
If we get another 3 records like below:
Kafka will distribute these records based on the hash of their keys, as below:
Now if one more partition is added, the mapping of keys to partitions changes again, which can lead to something like the below:
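Conceptually the partition is chosen as hash(key) modulo the number of partitions. The sketch below uses a plain Java String hash only to illustrate the idea (Kafka's default partitioner actually uses a murmur2 hash, so the real partition numbers will differ); note how the same key can map to a different partition once the partition count changes:

public class KeyPartitionSketch {
    // Simplified illustration of key-based partitioning: hash(key) % numPartitions.
    // Kafka's default partitioner uses a murmur2 hash, not String.hashCode().
    static int choosePartition(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        String[] keys = {"Kellogs", "Starbucks"};
        for (int numPartitions : new int[] {2, 3}) {
            for (String key : keys) {
                System.out.printf("key=%s partitions=%d -> partition %d%n",
                        key, numPartitions, choosePartition(key, numPartitions));
            }
        }
        // The same key can land on a different partition once the
        // number of partitions changes.
    }
}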
This shows why it is very important to choose the right key for data records, both for distributing the records well among partitions and for achieving a higher level of scalability.
Having a sufficient number of partitions and consumers is not, by itself, enough to achieve a high level of parallelism with Kafka.
If a key is not provided with a data record, Kafka distributes the records among the partitions itself; older clients do this in a round-robin fashion, while newer clients use a "sticky" partitioner that fills a batch for one partition before moving to the next.
How the data might be used on the consumer side should also be considered while deciding the key. For example, if there are lots of aggregate operations on certain kinds of records, choosing a key that keeps the records required for an aggregation in a single partition might improve performance.
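As a sketch of that idea, assume orders are produced with the brand as the key: all orders for one brand then sit in one partition, so the consumer that owns that partition can keep a running count per brand in local memory without coordinating with other consumers. The topic, group name and broker address are assumptions:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class PerKeyAggregation {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "brand-order-counter");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        // Because records are keyed by brand, every record for a given brand
        // is in the same partition, so this consumer sees all of them and the
        // counts below are complete for the brands it has been assigned.
        Map<String, Long> ordersPerBrand = new HashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    ordersPerBrand.merge(record.key(), 1L, Long::sum);
                }
                consumer.commitSync();
                System.out.println(ordersPerBrand);
            }
        }
    }
}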