Schema Registry: decreasing the payload size of Kafka records
Schema Registry is a service sitting outside the Kafka brokers that keeps the schemas required to talk to Kafka. Producers and consumers still talk to the Kafka brokers to publish and consume data, but Schema Registry holds the mapping between schemas and the schema IDs used to publish and consume that data. This reduces the payload size of every record, because the producer only needs to pass the schema ID rather than the entire schema, which improves performance.
As depicted in the diagram above, the following steps show where Schema Registry plays a vital role when working with Kafka.
- Whenever there is a new schema, it first gets registered with Schema Registry.
- After the schema is registered, Schema Registry (SR) generates a schema ID and maintains the mapping between the two.
- Schema Registry stores its state in an internal Kafka topic, _schemas, where every schema registration is captured as one event.
- Before publishing a record to Kafka, the producer reaches out to SR to get the ID for the record's schema. This is an HTTP call from the SR client to the SR server.
- Instead of sending the entire schema along with the record, the producer now sends only the schema ID. The producer also caches this schema ID locally so it can be reused next time, saving a network trip to SR (see the wire-format sketch after this list).
- When the consumer consumes a record, it reads the schema ID from the record and reaches out to SR to fetch the schema. This is an HTTP call from the SR client to the SR server. The consumer also caches the schema ID to schema mapping for later use, again saving a network trip to SR.
- From then on, the producer and consumer first check whether the schema and schema ID mapping already exists in their local cache.
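With the Confluent serializers, the payload saving is concrete: each serialized key or value starts with a 1-byte magic byte (0) and the 4-byte schema ID, followed by the binary-encoded data, while the full schema stays in Schema Registry. Below is a minimal sketch of reading that ID back from a serialized value (the class and method names are mine, purely illustrative):

```java
import java.nio.ByteBuffer;

public class SchemaIdReader {

    // Extracts the schema ID from bytes produced by a Confluent Avro serializer.
    // Wire format: 1 magic byte (0) + 4-byte big-endian schema ID + Avro binary payload.
    public static int schemaIdOf(byte[] serializedValue) {
        ByteBuffer buffer = ByteBuffer.wrap(serializedValue);
        byte magicByte = buffer.get();          // always 0 in the current wire format
        if (magicByte != 0) {
            throw new IllegalArgumentException("Not in Confluent wire format");
        }
        return buffer.getInt();                 // schema ID assigned by Schema Registry
    }
}
```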
Registering schemas with Schema Registry
There are two ways schemas can be registered with Schema Registry.
- Auto schema registration: The producer sets "auto.register.schemas" to true, and if the schema is not already registered, the producer registers it. This approach is fine while learning about schema registration, but it is not a good approach to follow in production: the producer might change the schema in a way that is incompatible with consumers and break the entire integration.
- Separately registering schemas: A separate module maintains and registers schemas in a controlled, secure way. Before registering a changed schema, it can also run tests that check the compatibility of the change. If the same Kafka cluster is used for multiple system integrations, this approach provides control over schema changes. There are Gradle and Maven plugins to achieve this (a sketch of the underlying REST calls follows this list).
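Under the hood, these plugins talk to Schema Registry's REST API. Here is a minimal sketch of the two calls involved, using Java's built-in HttpClient; the registry URL, subject name and schema are illustrative assumptions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SchemaRegistration {

    // Assumed values, purely illustrative.
    private static final String REGISTRY = "http://localhost:8081";
    private static final String SUBJECT = "orders-value";
    // The Avro schema is sent JSON-escaped inside a {"schema": "..."} envelope.
    private static final String BODY =
        "{\"schema\": \"{\\\"type\\\":\\\"record\\\",\\\"name\\\":\\\"Order\\\"," +
        "\\\"fields\\\":[{\\\"name\\\":\\\"id\\\",\\\"type\\\":\\\"string\\\"}]}\"}";

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // 1. Ask Schema Registry whether the new schema is compatible with the latest version.
        HttpRequest compatibility = HttpRequest.newBuilder()
            .uri(URI.create(REGISTRY + "/compatibility/subjects/" + SUBJECT + "/versions/latest"))
            .header("Content-Type", "application/vnd.schemaregistry.v1+json")
            .POST(HttpRequest.BodyPublishers.ofString(BODY))
            .build();
        System.out.println(client.send(compatibility, HttpResponse.BodyHandlers.ofString()).body());

        // 2. Register the new schema version under the subject.
        HttpRequest register = HttpRequest.newBuilder()
            .uri(URI.create(REGISTRY + "/subjects/" + SUBJECT + "/versions"))
            .header("Content-Type", "application/vnd.schemaregistry.v1+json")
            .POST(HttpRequest.BodyPublishers.ofString(BODY))
            .build();
        System.out.println(client.send(register, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```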
How Schema Registry is used during record serialisation and deserialisation
The Schema Registry server exposes multiple REST endpoints to perform CRUD operations on schemas.
On the producer side
- The Kafka producer class (org.apache.kafka.clients.producer.KafkaProducer) is responsible for serializing the key and value before publishing the data. If you are using Schema Registry, the serializer comes from Confluent rather than Apache Kafka, for example io.confluent.kafka.serializers.KafkaAvroSerializer (or io.confluent.kafka.streams.serdes.avro.SpecificAvroSerde when using Kafka Streams), since Schema Registry is a Confluent component (see the producer sketch after this list).
- The Confluent serializer internally initializes and delegates to the CachedSchemaRegistryClient class.
- CachedSchemaRegistryClient is responsible for making the HTTP call to Schema Registry and caching the schema ID it receives, so the ID can be reused during subsequent publishes with the same schema.
- If auto schema registration is enabled, io.confluent.kafka.serializers.AbstractKafkaAvroSerializer takes the registration path during serialization to register the schema, rather than a plain ID lookup, which adds performance overhead.
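Putting the producer side together, here is a minimal sketch of a producer wired to Schema Registry; the broker address, registry URL, topic name and schema are illustrative assumptions:

```java
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerExample {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                      // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "io.confluent.kafka.serializers.KafkaAvroSerializer");       // Confluent serializer
        props.put("schema.registry.url", "http://localhost:8081");             // assumed registry address
        props.put("auto.register.schemas", false);                             // schemas registered separately

        // Illustrative Avro schema; in practice it is usually loaded from an .avsc file.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Order\"," +
            "\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}");

        GenericRecord order = new GenericData.Record(schema);
        order.put("id", "order-1");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // The serializer looks up (and caches) the schema ID; only the ID travels with the record.
            producer.send(new ProducerRecord<>("orders", "order-1", order));
        }
    }
}
```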
On the consumer side
- The Kafka consumer class (org.apache.kafka.clients.consumer.KafkaConsumer) is responsible for deserializing the key and value when consuming data. If you are using Schema Registry, the deserializer comes from Confluent rather than Apache Kafka, for example io.confluent.kafka.serializers.KafkaAvroDeserializer (or io.confluent.kafka.streams.serdes.avro.SpecificAvroSerde when using Kafka Streams), since Schema Registry is a Confluent component (see the consumer sketch after this list).
- The Confluent deserializer internally initializes and delegates to the CachedSchemaRegistryClient class.
- CachedSchemaRegistryClient is responsible for making the HTTP call to Schema Registry and caching the schema it fetches for a given schema ID, so the schema can be reused while consuming subsequent records with the same schema.
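The matching consumer side, as a sketch under the same assumptions (broker address, group ID, topic and registry URL are made up):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AvroConsumerExample {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                        // assumed broker address
        props.put("group.id", "orders-consumer");                                // assumed consumer group
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "io.confluent.kafka.serializers.KafkaAvroDeserializer");       // Confluent deserializer
        props.put("schema.registry.url", "http://localhost:8081");               // assumed registry address

        try (KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            ConsumerRecords<String, GenericRecord> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, GenericRecord> record : records) {
                // The deserializer read the schema ID from the record, fetched (and cached)
                // the schema from Schema Registry, and decoded the Avro payload.
                System.out.println(record.value());
            }
        }
    }
}
```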
Schema formats supported by Schema Registry
- Avro
- JSON
- Protobuf
Of these, Avro is the most popular because it converts data to a binary format during serialization, decreasing the payload size by a considerable amount.
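A quick way to see this is to encode the same record with Avro's binary encoder and compare it with its JSON representation. The sketch below uses plain Avro APIs with an illustrative Order schema (not the Confluent serializer, so no schema ID prefix is involved):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroVsJsonSize {

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Order\"," +
            "\"fields\":[{\"name\":\"id\",\"type\":\"string\"}," +
            "{\"name\":\"amount\",\"type\":\"double\"}]}");

        GenericRecord order = new GenericData.Record(schema);
        order.put("id", "order-1");
        order.put("amount", 42.5);

        // Avro binary encoding: field names and types live in the schema, not in the payload.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(order, encoder);
        encoder.flush();

        // JSON carries the field names with every record.
        byte[] json = order.toString().getBytes(StandardCharsets.UTF_8);

        System.out.println("Avro binary bytes: " + out.size());
        System.out.println("JSON bytes:        " + json.length);
    }
}
```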
References:
https://docs.confluent.io/platform/current/schema-registry/index.html