Kafka Streams: Would I ever use it again?

Deepti Mittal
4 min read · Oct 20, 2021
Kafka Streams: a black box

Kafka Streams is a client library for building applications and microservices, where the input and output data are stored in an Apache Kafka cluster. It combines the simplicity of writing and deploying standard Java and Scala applications on the client side with the benefits of Kafka’s server-side cluster technology.

We used Kafka Streams for one of our data-heavy projects. Later we figured out that Kafka Streams was not a good choice for our use case.

Use case details

Collect records published to multiple Kafka topics and, when the schedule kicks in or a particular trigger is received, run algorithms on those records and publish the results to other Kafka topics for downstream systems to consume.

This sounds like a simple use case, and no one thought it would cause any issues until we started running it with actual data. Our individual record size was around 5 KB. To run the algorithms we needed to aggregate records at certain levels, which resulted in single Kafka records of 5 GB or more.

So we started digging into whether Kafka Streams could support this use case. This blog is about our practical experience with Kafka Streams: where it can be used and where it should be avoided.

Good fit for Kafka Streams

  • Quick prototyping: If you have an idea you want to prototype quickly, Kafka Streams can be a good choice. It provides many high-level APIs that help you test algorithms and concepts very fast; the filtering, aggregation and transform APIs make development quick (see the sketch after this list).
  • Small processing or data validation: Use cases where we consume data from one topic and do a small amount of processing and validation before making it available to downstream systems are a good fit for Kafka Streams. Remember that the processing has to be very small; otherwise you will start seeing errors such as ProducerFencedException and failed transaction commits.
  • Resilience and fault tolerance: Kafka Streams uses local state stores and changelog topics where data is saved during aggregation, so after a failure the entire processing does not need to be started again. Transactions and error handling are taken care of by the library and work out of the box.
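To give a sense of how quickly a prototype comes together with the high-level DSL, here is a minimal sketch of the kind of pipeline described above: consume a topic, filter out invalid records, count per key and write the result to an output topic. The topic names, serdes and counting logic are illustrative assumptions, not our actual application.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class PrototypeCountApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "records-prototype");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Read raw records, drop empty ones, count per key, publish the counts.
        KStream<String, String> raw = builder.stream("records-in");
        raw.filter((key, value) -> value != null && !value.isEmpty())   // simple validation
           .groupByKey()
           .count(Materialized.as("records-count-store"))               // backed by a hidden changelog topic
           .toStream()
           .mapValues(Object::toString)
           .to("records-out", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```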

Issues with Kafka Streams

  • Need to have small transactions: Kafka Streams might not be a good fit if you have to run a complex algorithm that takes more than a few milliseconds on incoming data, as it results in timeouts. One can always argue that you can increase the timeouts, but Kafka has multiple timeouts at the consumer, broker, producer and stream level, and many of them are interrelated. Getting that whole combination correct is not an easy task, and you can never be 100% sure the chosen timeouts will hold up for all kinds of data and scenarios in production.
  • Big payloads: This was the main reason we moved away from Kafka Streams. To execute one of the algorithms we needed to aggregate all records at a higher level, which resulted in single records of 5 GB or more in the changelog topics. Kafka Streams was not able to save such huge messages to the topics, resulting in multiple failures and huge performance degradation. The Kafka Streams recommendation is that record sizes should not exceed 10 MB.
  • Batch processing: Kafka Streams is a streaming solution and should be used as one. If you try to perform operations where multiple records need to be collected, you might run into issues like the one mentioned above. This also results in bigger transactions, and hence a lot of errors.
  • Debuggability: Many of the ProducerFencedException and transaction-commit-failed errors do not have good descriptions in Kafka, and they are even harder to analyse and debug in Kafka Streams because the topology is built from high-level DSL functions and becomes a black box when something goes wrong. The part of our solution implemented with Spring Kafka was easy to analyse and fix, but for Kafka Streams issues we often could not find the root cause despite spending days on them.
  • Hidden stream topics: Looking at the stream code you can never tell how many internal topics have been created until you use an external tool to visualize the topology. When we started using AKHQ to view topics and consumer groups, we realised that Kafka Streams creates many repartition and changelog topics. All the DSL functions that are easy to use carry a cost in the form of these extra topics, which can increase the data stored on the Kafka cluster to a great extent, and a lot of network I/O is involved in writing to and reading from the changelog topics.
  • Stream configurations: Kafka already has too many configurations for brokers, consumers and producers, and many of them are interrelated in ways that can hurt performance. Kafka Streams adds yet another set of configurations on top, which adds further complexity (a sketch of how these settings interact follows this list).
  • Unbalanced load distribution: One way to achieve parallelism in Kafka is to increase partitions and consumers. In plain Kafka solutions that calculation is easy, but in Kafka Streams, because of sub-topologies and internal topics, you may not be able to work out the exact number of consumers needed, which caused unequal load distribution across the machines running our consumers. This can be solved by observing the partition assignments, but it is not easy to get right the first time.
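To illustrate how the settings pile up, here is a rough sketch (the values and topology are placeholders, not our production configuration) of a few of the interrelated knobs you end up tuning for larger records and longer transactions, and of how Topology#describe() reveals the repartition and changelog topics that are invisible in the DSL code:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;

import java.util.Properties;

public class StreamsTuningSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "aggregation-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // Exactly-once processing ties producer transactions to the commit interval.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 100);

        // Interrelated timeouts spread across producer, consumer and broker:
        // the producer's transaction.timeout.ms must stay below the broker's
        // transaction.max.timeout.ms, and slow processing also has to fit
        // inside the consumer's max.poll.interval.ms.
        props.put(StreamsConfig.producerPrefix("transaction.timeout.ms"), 60_000);
        props.put(StreamsConfig.consumerPrefix("max.poll.interval.ms"), 300_000);

        // Large aggregated records must also fit the producer's max.request.size
        // and the broker's message.max.bytes / topic's max.message.bytes;
        // raising one limit without the others just moves the failure elsewhere.
        props.put(StreamsConfig.producerPrefix("max.request.size"), 10 * 1024 * 1024);

        // describe() prints the sub-topologies plus the auto-created
        // *-repartition and *-changelog topics hidden behind the DSL.
        Topology topology = new StreamsBuilder().build(); // plug in your real topology here
        System.out.println(topology.describe());
    }
}
```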

All these issues forced us to move away from Kafka Streams, and we decided to use an external data store to achieve fault tolerance and resilience.

We had to handle transactions and error handling ourselves in the new solution, but it was worth it, because in return we got debuggability, which is the most important property for systems running in production.


Deepti Mittal

I have been working as a software engineer in Bangalore for over a decade. I love to solve core technical problems and then blog about them.