Apache Avro Data serialization Framework
04 May 2017Introduction
Data serialization is a mechanism to translate or serialize data into binary or textual form that can be transported over the network or store on some persisten storage .Serialization operation is performed by the sender/transmitter of the data
Deserialization is the process of converting serialized binary data which is received over the network or from persistant storage to its original form . Deserialization of the data is performed by he Receiver of the data.
Apache Avro is a language agnostic remote procedure call and data serialization framework.Avro is a preferred tool to serialize data in Hadoop.
Download
Apache Avro jar file can be downloaded from the following link https://avro.apache.org/releases.html
You can select and download the library for any of the languages provided. In this tutorial, we use Java. Hence download the jar files avro-1.8.1.jar
and avro-tools-1.8.1.jar
To work with Avro in Linux environment, download the following jar files −
avro-1.8.1.jar
avro-tools-1.8.1.jar
log4j-api-2.0-beta9.jar
og4j-core-2.0.beta9.jar.
You can also get the Avro library into your project using Maven
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>1.8.1</version>
</dependency>
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro-tools</artifactId>
<version>1.8.1</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-api</artifactId>
<version>2.7</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<version>2.7</version>
</dependency>
Avro Schema
Avro accepts schemas as input. Using these schemas, you can store serialized values in binary format using less space.
Avro schema supports the following primitive data types
Data type | Description |
---|---|
null | Null is a type having no value |
int | 32-bit signed integer |
long | 64-bit signed integer. |
float | single precision (32-bit) IEEE 754 floating-point number. |
double | double precision (64-bit) IEEE 754 floating-point number. |
bytes | sequence of 8-bit unsigned bytes. |
string | Unicode character sequence. |
It also supports six complex data types like Records, Enums, Arrays, Maps, Unions, and Fixed
**Record **
The Record datatype contains 4 attributes
type - Describes document type
namespace − Describes the name of the namespace in which the object resides
name − Describes the schema name
fields − This is an attribute array which contains the following
name − Describes the name of field
type − Describes data type of field
Example of a schema for type record
is as follows
{
"type" : "record",
"namespace" : "Testing",
"name" : "Employee",
"fields" : [
{ "name" : "Name" , "type" : "string" },
{ "name" : "Age" , "type" : "int" }
]
}
Enum
An enumeration is a list of items in a collection
name − The value of this field holds the name of the enumeration.
namespace − The value of this field contains the string that qualifies the name of the Enumeration.
symbols − The value of this field holds the enum’s symbols as an array of names.
Example schema for enum type
is as follows
{
"type" : "enum",
"name" : "Numbers", "namespace": "data", "symbols" : [ "ONE", "TWO", "THREE", "FOUR" ]
}
- Arrays
This data type defines an array field having a single attribute items. This items attribute specifies the type of items in the array.
{ " type " : " array ", " items " : " int " }
Maps
The map data type is an array of key-value pairs. The values attribute holds the data type of the content of map. Avro map values are implicitly taken as strings
{"type" : "map", "values" : "int"}
Unions
A union datatype is used whenever the field has one or more datatypes.
They are represented as JSON arrays. For example, if a field that could be either an int or null, then the union is represented as [“int”, “null”].
{
"type" : "record",
"namespace" : "tutorialspoint",
"name" : "empdetails ",
"fields" :
[
{ "name" : "experience", "type": ["int", "null"] }, { "name" : "age", "type": "int" }
]
}
Generating Java Classes
One can read an Avro schema into the program either by generating a class corresponding to a schema or by using the parsers library
For example let us consider the RPC for login function
Defining the Schema
The login data will require username,password and devicetoken to be passed to the server by the client.
Let us define the schema file called “Login.json”
{
"namespace": "com.pyvision",
"type": "record",
"name": "userLogin",
"fields": [
{"name": "name", "type": "string"},
{"name": "password", "type": "string"},
{"name": "devicetoken", "type": "string"}
]
}
Compiling the Schema
After creating the Avro schema, we need to compile it using Avro tools.
java -jar <path/to/avro-tools-x.x.x.jar> compile schema <path/to/schema-file> <destination-folder>
java -jar avro-tools-1.8.1.jar compile schema Login.json .
This will generate the java file com/pyvision/userLogin.java
The namespace is the packagename and the java file generated is the name of schema record type
Use Generated Java file
copy the directory com/pyvsion
to your java maven project.
userLogin user=new userLogin();
user.setDevicetoken("device1");
user.setName("user1");
user.setPassword("1234");
To converts Java objects into in-memory binary serialized format.
DatumWriter<userLogin> empDatumWriter = new SpecificDatumWriter<userLogin>(userLogin.class);
DataFileWriter<userLogin> empFileWriter = new DataFileWriter<userLogin>(empDatumWriter);
empFileWriter.create(user.getSchema(),new File("userLogin.avro"));
empFileWriter.append(user);
empFileWriter.close();
To deserialize a class
DatumReader<userLogin> empDatumReader = new SpecificDatumReader<userLogin>(userLogin.class);
DataFileReader<userLogin> dataFileReader = new DataFileReader(new File("/tmp/userLogin.avro"), empDatumReader);
userLogin em=null;
while(dataFileReader.hasNext()){
em=dataFileReader.next(em);
System.out.println(em);
}
The deserialized data is as follows
{"name": "user1", "password": "1234", "devicetoken": "device1"}
One can read an Avro schema into a program either by generating a class corresponding to a schema or by using the parsers library.In Avro, data is always stored with its corresponding schema. Therefore, we can always read a schema without code generation.
Avro has support for following languages C,C++,java,php,perl,python,ruby,scala,Go,haskell
Thus we can serialize and deserialize any data across applications developed in any of the above languages.
As far as serialization goes it is similar to Thrift and Protocol Buffer.
References
- https://www.tutorialspoint.com/avro/avro_schemas.htm
- https://www.tutorialspoint.com/avro/avro_schemas.htm