Apache Avro Data serialization Framework

04 May 2017

Introduction

Data serialization is a mechanism to translate or serialize data into binary or textual form that can be transported over the network or store on some persisten storage .Serialization operation is performed by the sender/transmitter of the data

Deserialization is the process of converting serialized binary data which is received over the network or from persistant storage to its original form . Deserialization of the data is performed by he Receiver of the data.

Apache Avro is a language agnostic remote procedure call and data serialization framework.Avro is a preferred tool to serialize data in Hadoop.

Download

Apache Avro jar file can be downloaded from the following link https://avro.apache.org/releases.html

You can select and download the library for any of the languages provided. In this tutorial, we use Java. Hence download the jar files avro-1.8.1.jar and avro-tools-1.8.1.jar

To work with Avro in Linux environment, download the following jar files −

avro-1.8.1.jar
avro-tools-1.8.1.jar
log4j-api-2.0-beta9.jar
og4j-core-2.0.beta9.jar.

You can also get the Avro library into your project using Maven

      <dependency>
         <groupId>org.apache.avro</groupId>
         <artifactId>avro</artifactId>
         <version>1.8.1</version>
      </dependency>
	
      <dependency>
         <groupId>org.apache.avro</groupId>
         <artifactId>avro-tools</artifactId>
         <version>1.8.1</version>
      </dependency>
			
      <dependency>
         <groupId>org.apache.logging.log4j</groupId>
         <artifactId>log4j-api</artifactId>
         <version>2.7</version>
      </dependency>
	
      <dependency>
         <groupId>org.apache.logging.log4j</groupId>
         <artifactId>log4j-core</artifactId>
         <version>2.7</version>
      </dependency>
			

Avro Schema

Avro accepts schemas as input. Using these schemas, you can store serialized values in binary format using less space.

Avro schema supports the following primitive data types

Data type	Description
null	Null is a type having no value
int	32-bit signed integer
long	64-bit signed integer.
float	single precision (32-bit) IEEE 754 floating-point number.
double	double precision (64-bit) IEEE 754 floating-point number.
bytes	sequence of 8-bit unsigned bytes.
string	Unicode character sequence.

It also supports six complex data types like Records, Enums, Arrays, Maps, Unions, and Fixed

Record

The Record datatype contains 4 attributes

type - Describes document type
namespace − Describes the name of the namespace in which the object resides
name − Describes the schema name
fields − This is an attribute array which contains the following
         name − Describes the name of field
         type − Describes data type of field

Example of a schema for type record is as follows

{
   "type" : "record",
   "namespace" : "Testing",
   "name" : "Employee",
   "fields" : [
      { "name" : "Name" , "type" : "string" },
      { "name" : "Age" , "type" : "int" }
   ]
}

Enum

An enumeration is a list of items in a collection

name − The value of this field holds the name of the enumeration.

namespace − The value of this field contains the string that qualifies the name of the Enumeration.

symbols − The value of this field holds the enum’s symbols as an array of names.

Example schema for enum type is as follows

{
   "type" : "enum",
   "name" : "Numbers", "namespace": "data", "symbols" : [ "ONE", "TWO", "THREE", "FOUR" ]
}

Arrays

This data type defines an array field having a single attribute items. This items attribute specifies the type of items in the array.

{ " type " : " array ", " items " : " int " }

Maps

The map data type is an array of key-value pairs. The values attribute holds the data type of the content of map. Avro map values are implicitly taken as strings

{"type" : "map", "values" : "int"}

Unions

A union datatype is used whenever the field has one or more datatypes.

They are represented as JSON arrays. For example, if a field that could be either an int or null, then the union is represented as [“int”, “null”].

{ 
   "type" : "record", 
   "namespace" : "tutorialspoint", 
   "name" : "empdetails ", 
   "fields" : 
   [ 
      { "name" : "experience", "type": ["int", "null"] }, { "name" : "age", "type": "int" } 
   ] 
}

Generating Java Classes

One can read an Avro schema into the program either by generating a class corresponding to a schema or by using the parsers library

For example let us consider the RPC for login function

Defining the Schema

The login data will require username,password and devicetoken to be passed to the server by the client.

Let us define the schema file called “Login.json”

{
   "namespace": "com.pyvision",
   "type": "record",
   "name": "userLogin",
   "fields": [
      {"name": "name", "type": "string"},
      {"name": "password", "type": "string"},
      {"name": "devicetoken", "type": "string"}
   ]
}

Compiling the Schema

After creating the Avro schema, we need to compile it using Avro tools.

java -jar <path/to/avro-tools-x.x.x.jar> compile schema <path/to/schema-file> <destination-folder>

java -jar avro-tools-1.8.1.jar compile schema Login.json .

This will generate the java file com/pyvision/userLogin.java

The namespace is the packagename and the java file generated is the name of schema record type

Use Generated Java file

copy the directory com/pyvsion to your java maven project.

    userLogin user=new userLogin();
     user.setDevicetoken("device1");
     user.setName("user1");
     user.setPassword("1234");
				

To converts Java objects into in-memory binary serialized format.

DatumWriter<userLogin> empDatumWriter = new SpecificDatumWriter<userLogin>(userLogin.class);
DataFileWriter<userLogin> empFileWriter = new DataFileWriter<userLogin>(empDatumWriter);
empFileWriter.create(user.getSchema(),new File("userLogin.avro"));
empFileWriter.append(user);
empFileWriter.close();

To deserialize a class

DatumReader<userLogin> empDatumReader = new SpecificDatumReader<userLogin>(userLogin.class);
DataFileReader<userLogin> dataFileReader = new DataFileReader(new File("/tmp/userLogin.avro"), empDatumReader);

userLogin em=null;
while(dataFileReader.hasNext()){

em=dataFileReader.next(em);
System.out.println(em);

}

The deserialized data is as follows

{"name": "user1", "password": "1234", "devicetoken": "device1"}

One can read an Avro schema into a program either by generating a class corresponding to a schema or by using the parsers library.In Avro, data is always stored with its corresponding schema. Therefore, we can always read a schema without code generation.

Avro has support for following languages C,C++,java,php,perl,python,ruby,scala,Go,haskell

Thus we can serialize and deserialize any data across applications developed in any of the above languages.

As far as serialization goes it is similar to Thrift and Protocol Buffer.

References

https://www.tutorialspoint.com/avro/avro_schemas.htm
https://www.tutorialspoint.com/avro/avro_schemas.htm

pyVision A Machine Learning and Signal Processing toolbox