Kafka stream Semantics

Hard to solve system failures

In a distributed publish-subscribe messaging system, services are running in distributed fashion which can fail for different reasons. This failure can lead to incorrect processing and results and duplication as well. In the case of Kafka, following could be the reason which can lead to failures:

1. Broker can fail just before sending acknowledgement to producer.

2. Producer instance might go down before receiving acknowledgment.

3. Consumer fails before sending acknowledgment.

Depending on the action the producer takes to handle such a failure, you can get different semantics:

At least once semantics: Producer will make sure that message is delivered at least once and it might lead to duplicate messages in broker. If producer does not get acknowledgement before time out or producer failure it will send the same message again.

At most once semantics: Producer will send message only once and will not wait for any acknowledgement. This might lead to message not written to kafka topic at all, hence not delivered to consumer.

Exactly once semantics: Messages will be written only once to Kafka topic and will be consumed by consumer only once. In case of any failure if producer try to send messages again Kafka will take care of discarding them. Kafka has added support for exactly once semantic(EOS) from version 0.11.0.

How Kafka achieves EOS

Kafka introduced the concept of transactions to achieve EOS

Every Kafka producer will have transactional id registered with kafka broker before transaction starts. This helps in identifying same producer across multiple system restart or failure.

The API requires that the first operation of a transactional producer should be to explicitly register its transactional.id with the Kafka cluster. When it does so, the Kafka broker checks for open transactions with the given transactional.id and completes them. It also increments an epoch associated with the transactional.id. The epoch is an internal piece of metadata stored for every transactional.id.

Once the epoch is bumped, any producers with same transactional.id and an older epoch are considered zombies(stale instances) and any future transactional writes from those producers are rejected.

On the consumer side, consumer will not read message if its a part of transaction and transaction is not committed yet.

How exactly once is achieved in kafka stream

exactly-once for stream processing guarantees that for each received record, its processed results will be reflected once, even under failures.

In any stream application is like read-process-write application and below are the steps which happens to achieve that:

  1. Application reads from input topic.
  2. Application processes input data and stores in local state store
  3. Application write to output topic
  4. Application receives acknowledgement from output topic for successful write.
  5. Application sends acknowledgement to input topic and commits offset at input topic about success.

Is step 3, 4 and 5 fails for any reason it can result in duplication in messaging processing and producing output message to topic.

So Step3, 4 and 5 needs to be atomic to achieve exactly-once processing.

Kafka stream achieves this by maintaining different state in internal topics.

Once application reads from input topic it will commit the offset in a internal offset topic so avoiding the duplication to read same message again.

Any changes occurring in state store will be written in changelog topic which are backed up by replication strategy. In case of one instance of application going down entire local state store can be replicated in another instance using changelog topic.

Upon starting the application embedded producer in stream application will use transaction APIs and transactions IDs to achieve EOS the way kafka achieves it.

Stream processing applications written in the Kafka Streams library can turn on exactly-once semantics by simply making a single config change, to set the config named “processing.guarantee” to “exactly_once” (default value is “at_least_once”), with no code change required.

This configuration is more like magic pill for Kafka stream application to handle one of the most complex problem of distributed systems.



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store