Avro in its simplicity
Apache Avro is a language-neutral data serialization system. It was developed by Doug Cutting, the father of Hadoop. Avro is schema-based: a language-independent schema is associated with its read and write operations. Avro serializes data into a compact binary format, with the schema built in, that can be deserialized by any application.
Avro is used to transfer data over a network or to persist data.
Because Avro converts the data to a binary format before transferring it, it reduces the payload size to a great extent.
Avro is also used in Remote Procedure Calls (RPCs). During RPC, client and server exchange schemas in the connection handshake.
Avro schema
Avro depends heavily on its schema. It serializes quickly, and the resulting serialized data is smaller in size. The schema is stored along with the Avro data in a file for any further processing.
In RPC, the client and the server exchange schemas during the connection handshake. This exchange helps resolve fields with the same name, missing fields, extra fields, etc.
In other forms of communication, the schema is transferred along with the data, and the deserializer reads the data according to that schema.
Avro schemas are defined in JSON, which simplifies their implementation in languages that already have JSON libraries.
Here is a sample Avro schema:
{
  "namespace": "com.deepti.kafka.sample.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}
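To make the "compact binary" claim concrete, here is a small hand-rolled sketch, in plain Python with no Avro library, that encodes one record against the schema above, following the encoding rules from the Avro specification: zigzag varints for ints and longs, a length prefix for strings, and a branch index for unions. It is illustrative only, not a replacement for a real Avro library, and the sample values are made up.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode an int/long as a zigzag variable-length integer (Avro spec)."""
    n = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes -> small codes
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # set the continuation bit
        else:
            out.append(byte)
            return bytes(out)

def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data  # length prefix, then UTF-8 bytes

def encode_user(name, favorite_number, favorite_color) -> bytes:
    buf = encode_string(name)
    # Union ["int", "null"]: write the branch index, then the value (if any).
    if favorite_number is None:
        buf += zigzag_varint(1)  # branch 1 = null, no value bytes follow
    else:
        buf += zigzag_varint(0) + zigzag_varint(favorite_number)
    # Union ["string", "null"]: same pattern.
    if favorite_color is None:
        buf += zigzag_varint(1)
    else:
        buf += zigzag_varint(0) + encode_string(favorite_color)
    return buf

encoded = encode_user("Alyssa", 256, None)
print(len(encoded), encoded.hex())  # -> 11 0c416c7973736100800402
```

The entire record fits in 11 bytes: no field names, no type tags, no delimiters, because both sides already know the schema.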
A few recommendations for Avro schemas
Here are some recommendations specific to Avro:
- Use enumerated values whenever possible instead of magic strings. Avro allows the schema to specify the set of values a field can take as an enumeration. This prevents typos in data-producer code from making their way into the production data set, where they would be recorded for all time.
Example:
{
  "type": "enum",
  "name": "Numbers",
  "namespace": "data",
  "symbols": ["ONE", "TWO", "THREE", "FOUR"]
}
- Require documentation for all fields. Even seemingly obvious fields often have non-obvious details. Try to get them all written down in the schema so that anyone who needs to really understand the meaning of the field need not go any further.
Example:
{
  "type": "enum",
  "name": "Colors",
  "namespace": "palette",
  "doc": "Colors supported by the palette.",
  "symbols": ["WHITE", "BLUE", "GREEN", "RED", "BLACK"]
}
- Avoid non-trivial union types and recursive types. These are Avro features that map poorly to most other systems. Since our goal is an intermediate format that maps well to other systems, we want to avoid any overly advanced features.
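For instance, the only union most downstream systems handle well is the trivial two-branch union with null used for optional fields, as in the User schema above. A hypothetical field such as the following, mixing several concrete types, is the kind to avoid:

```json
{"name": "payload", "type": ["int", "string", "bytes"]}
```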
How to use it with Kafka
I was introduced to Avro for publishing to and consuming from Kafka topics. The steps for any other use of Avro are much the same. Here are the steps to integrate Avro for data transfer:
- Define schema
- Generate classes
- Define the serializer and deserializer
- Define config properties for keys and values
- Send/Consume record
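As a rough sketch of the config-properties step, assuming Confluent's Avro serializer and a Schema Registry (an assumption on my part, since other Avro serializers exist), the producer- and consumer-side properties might look like this; the host names are placeholders:

```
# Producer
bootstrap.servers=localhost:9092
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=io.confluent.kafka.serializers.KafkaAvroSerializer
schema.registry.url=http://localhost:8081

# Consumer
bootstrap.servers=localhost:9092
group.id=sample-consumer-group
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=io.confluent.kafka.serializers.KafkaAvroDeserializer
schema.registry.url=http://localhost:8081
specific.avro.reader=true
```

With `specific.avro.reader=true`, the consumer deserializes into the generated classes (such as `User`) rather than generic records.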
Here is the sample code, which implements all the steps: