Kafka Exceptions: Building resilience into the system

Deepti Mittal
4 min read · Dec 13, 2021

Exceptions are no longer exceptional in distributed applications. Whatever can go wrong will go wrong, all the time. If you think things are going well, you're missing something.

In distributed applications, exceptions should be treated in two categories.

Recoverable exceptions: These can be recovered from after one or more retries. Systems can be designed to recover from them on their own, so no manual intervention is required. IOExceptions and network connection exceptions fall into this category.

Non-recoverable exceptions: These might be caused by bad data or a validation failure, and even multiple retries will result in failure. The record cannot be processed successfully, so it needs to be pushed to a dead letter queue or logged, and then skipped so it does not block other records from being processed. There also has to be some mechanism to notify about the failure. Examples of these kinds of exceptions would be NullPointerException, IllegalStateException and data validation exceptions.
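As a sketch, this split can be encoded in a small classifier. The helper below is hypothetical; the non-recoverable list mirrors the exception classes discussed later in this post, and everything else is assumed to be worth retrying:

```java
import java.util.Set;

// Hypothetical helper that decides whether an exception is worth retrying.
// The non-recoverable list mirrors the examples discussed in this post.
public class ExceptionClassifier {
    private static final Set<Class<? extends Exception>> NON_RECOVERABLE = Set.of(
            NullPointerException.class,
            IllegalStateException.class,
            ClassCastException.class,
            ArithmeticException.class,
            IndexOutOfBoundsException.class);

    public static boolean isRecoverable(Exception e) {
        // Anything not on the non-recoverable list is treated as transient,
        // e.g. IOException and other network-related failures.
        return NON_RECOVERABLE.stream().noneMatch(c -> c.isInstance(e));
    }

    public static void main(String[] args) {
        System.out.println(isRecoverable(new java.io.IOException("connection reset"))); // true
        System.out.println(isRecoverable(new NullPointerException("missing field")));   // false
    }
}
```

In a consumer loop, a recoverable result would trigger a retry, while a non-recoverable one would route the record to a dead letter topic and move on.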

Transactions in Kafka

Kafka is a highly distributed system. Kafka also has the concept of transactions, to avoid duplicate events in the system.

My previous blog explains exactly-once semantics in Kafka Streams, and the concept applies to Kafka as well.

One of the core architecture principles in our application while following exactly-once was to make sure that, in case of failure, the whole transaction can be retried without any negative side effect. If your transactions involve a database, then either those DB operations need to be idempotent, or they also need to be rolled back when the Kafka transaction fails.

This is very important; otherwise it will result in duplicates in the system.

Coming back to errors: in case of recoverable exceptions, transactions should be retried, and the application should decide how frequently and how many times to retry. In case of non-recoverable exceptions, there is no point in retrying the transaction.
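The "how frequently and how many times" decision can be sketched as a small bounded-retry helper. This is a minimal illustration using only the standard library, not Kafka's own retry machinery; the class name, attempt count and delay are all hypothetical choices the application would make:

```java
import java.util.concurrent.Callable;

// Hypothetical bounded-retry helper: runs a transactional unit of work up to
// maxAttempts times, sleeping delayMs between attempts. The caller is
// responsible for only passing work whose failures are recoverable.
public class TransactionRetrier {
    public static <T> T runWithRetries(Callable<T> work, int maxAttempts, long delayMs)
            throws Exception {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return work.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) Thread.sleep(delayMs);
            }
        }
        throw last; // all retries exhausted; surface the final failure
    }
}
```

A fixed delay keeps the sketch short; in practice an exponential backoff is a common refinement so repeated failures do not hammer a struggling downstream system.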

Recoverable exceptions in Kafka

Here are some recoverable exceptions which occur in Kafka applications:

ProducerFencedException: This fatal exception indicates that another producer with the same transactional.id has been started. Only one producer instance with a given transactional.id can exist at any time, and the latest one to start "fences" the previous instances so that they can no longer make transactional requests. When you encounter this exception, you must close the producer instance; one solution is to restart that instance of the application so that the producer is also closed.

InvalidPidMappingException: This is related to transactional.id.expiration.ms, the maximum amount of time in milliseconds that the transaction coordinator will wait before proactively expiring a producer's transactional ID without receiving any transaction status updates from it. The default value of this configuration is 7 days. So if a producer has produced no transaction for more than 7 days and then tries to commit one, this exception will occur. As with ProducerFencedException, you must close the producer instance; one solution is to restart that instance of the application so that the producer is also closed.

Commit timeout exceptions: Kafka Streams has a configuration known as commit.interval.ms, the frequency with which to save the position of the processor (note: if processing.guarantee is set to exactly_once, the default value is 100 ms, otherwise it is 30000 ms). Timeouts can occur if you have transactions which run for a long period of time. Based on the workflow, either the commit timeout needs to be increased, or we need to check whether some step in the transaction is taking longer and needs to be optimised. Sometimes retrying the transaction can make it succeed in case of timeout exceptions.
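For reference, the broker-side and Streams-side settings mentioned above look like this in configuration (the values shown are the defaults discussed, expressed in milliseconds):

```
# Broker side: how long the transaction coordinator waits before expiring an
# idle producer's transactional ID (default 7 days)
transactional.id.expiration.ms=604800000

# Kafka Streams: how often to save the processor position; defaults to 100 ms
# when processing.guarantee=exactly_once, 30000 ms otherwise
processing.guarantee=exactly_once
commit.interval.ms=100
```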

Network exceptions: These are very common in any distributed system. When you access other systems such as a database, or call a REST endpoint, network lag or a connection dropping for a few milliseconds can lead to failure in Kafka transactions. Retrying the transaction after a certain time interval will lead to success most of the time.

Custom retry errors: When an application needs certain conditions to be met before it can process an event successfully, the event can be pushed to a retry topic and retried at a later point in time, or kept retrying until the condition is met.
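The retry-topic pattern can be sketched with an in-memory delay queue: an event is re-enqueued with a delay and only becomes visible for reprocessing once that delay has elapsed. In production the event would instead be produced to a dedicated retry topic and consumed later; everything below, including the class names, is an illustrative stand-in:

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// In-memory stand-in for a retry topic: retryLater() parks an event,
// take() blocks until the event's delay has expired and it may be retried.
public class RetryLaterQueue {
    static class RetryEvent implements Delayed {
        final String payload;
        final long readyAt; // epoch millis when the event may be retried

        RetryEvent(String payload, long delayMs) {
            this.payload = payload;
            this.readyAt = System.currentTimeMillis() + delayMs;
        }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(readyAt - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.MILLISECONDS),
                                other.getDelay(TimeUnit.MILLISECONDS));
        }
    }

    private final DelayQueue<RetryEvent> queue = new DelayQueue<>();

    public void retryLater(String payload, long delayMs) {
        queue.put(new RetryEvent(payload, delayMs));
    }

    public String take() throws InterruptedException {
        return queue.take().payload;
    }
}
```

The real retry topic gives the same "not before" semantics with durability: if the application crashes, parked events survive in Kafka rather than in process memory.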

Non recoverable exceptions in Kafka

Below are a few examples of non-recoverable exceptions:

NullPointerException: This could be caused by a missing field or missing data, so there is very little chance that retrying the event will fix it. The solution is to skip the event, and either log it to a dead letter topic or raise error logs.

Serialization and deserialization exceptions: These errors occur when a record does not satisfy the defined schema, so retrying won't fix it.

Here is the list of exceptions we identified during our application development as non-recoverable:

ArithmeticException, NullPointerException, ClassCastException, IndexOutOfBoundsException, IllegalStateException
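The skip-and-dead-letter behaviour for these exceptions can be sketched as follows. The handler is hypothetical and captures failed records in an in-memory list; production code would instead produce them to a dead letter topic and emit an alert:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical skip-and-dead-letter handler: a record that fails with a
// non-recoverable exception is captured (here, in a list standing in for a
// dead letter topic) and processing continues with the next record instead
// of blocking the rest of the batch.
public class DeadLetterHandler {
    private final List<String> deadLetters = new ArrayList<>();

    public void process(List<String> records, Consumer<String> processor) {
        for (String record : records) {
            try {
                processor.accept(record);
            } catch (RuntimeException e) {
                // Non-recoverable: dead-letter the record and move on
                deadLetters.add(record + " | " + e.getClass().getSimpleName());
            }
        }
    }

    public List<String> deadLetters() { return deadLetters; }
}
```

Keeping the original payload alongside the failure reason makes it possible to inspect, fix and replay dead-lettered records later.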

To summarise: which exceptions are recoverable and which are not is very specific to the application context and the latency an application can afford. While designing an application, it's very important to identify them and decide the expected behaviour for each.

If we don't do that before deploying the application to production, it might lead to very stressful days in production.

Other blogs cover implementation details of Kafka error handling through Spring.
