Hello, in this video,
we will continue our exploration of the formats used in the Hadoop world.
As you learned in the previous video,
text formats are designed with human readability in mind.
Binary formats are designed for machines and
they trade readability for efficiency and correctness.
Consider our number-as-string example.
To parse a string into an integer,
you have to loop over the characters,
convert each character into a digit, and do
a round of additions and multiplications to compute the final value.
Alternatively, you can store integers as-is, just by copying the appropriate bytes.
In that case, every integer occupies exactly 8 bytes on a 64-bit platform,
and serialization and deserialization become as efficient as a plain memory copy.
That is an example of the inefficiencies that binary formats aim to eliminate.
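To make the contrast concrete, here is a minimal sketch in Java (a plain illustration, not any particular library's API):

    // Text format: parse "123456" digit by digit --
    // one multiply and one add per character.
    static int parseInt(String s) {
        int value = 0;
        for (int i = 0; i < s.length(); i++) {
            value = value * 10 + (s.charAt(i) - '0');
        }
        return value;
    }

    // Binary format: copy the 8 bytes of a long as-is.
    static byte[] encodeLong(long v) {
        return java.nio.ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    static long decodeLong(byte[] bytes) {
        return java.nio.ByteBuffer.wrap(bytes).getLong();
    }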
Let's see which binary formats are popular in the Hadoop world.
The first binary format implemented in Hadoop was SequenceFile.
The primary design goals for SequenceFile were simplicity and efficiency,
and the primary use case was storing intermediate data in MapReduce computations.
Essentially, SequenceFile stores a sequence of key-value pairs,
where the key and value can be of arbitrary types
with user-defined serialization and deserialization code.
In the MapReduce lesson,
you will learn why key-value pairs are so special.
The format was not designed for interoperability with other languages.
Serialization and deserialization code is
provided by implementing the Writable (and, for keys, WritableComparable) interfaces in Java,
which is often done in an ad hoc, Java-specific manner.
This is why you will not find many uses of SequenceFiles outside the Java world.
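For example, writing a SequenceFile of int-to-string pairs with the standard Hadoop classes looks roughly like this (the output path and record contents are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileWriteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("pairs.seq"); // hypothetical output path

            // Writer options name the key and value classes;
            // their Writable implementations define the serialization.
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(IntWritable.class),
                    SequenceFile.Writer.valueClass(Text.class))) {
                for (int i = 0; i < 100; i++) {
                    writer.append(new IntWritable(i), new Text("record-" + i));
                }
            }
        }
    }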
Technically, a SequenceFile starts with a header, which includes the format version,
class names for the key and value types,
flags indicating compression, metadata, and a sync marker.
Depending on the compression flags,
the data layout falls into one of three cases.
In the uncompressed case, every record has
a fixed-size header with the key length and the value length,
followed by the serialized key and the serialized value,
and occasionally followed by a sync marker.
Thus, to decode the data,
you can read the file linearly and use the lengths to read each key and value.
A sync marker is similar to the newline character in text formats.
It is used to skip to record boundaries, thus allowing efficient splitting.
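Schematically, a reader for the uncompressed case could look like the sketch below. It is a simplification: header parsing is omitted, and I assume the convention that a record length of -1 escapes a 16-byte sync marker.

    import java.io.DataInputStream;
    import java.io.EOFException;
    import java.io.FileInputStream;

    public class RecordScanSketch {
        static final int SYNC_SIZE = 16; // assumed sync marker size

        public static void main(String[] args) throws Exception {
            try (DataInputStream in =
                     new DataInputStream(new FileInputStream("pairs.seq"))) {
                skipHeader(in); // header parsing omitted in this sketch
                try {
                    while (true) {
                        int recordLength = in.readInt();
                        if (recordLength == -1) { // escape: a sync marker follows
                            in.skipBytes(SYNC_SIZE);
                            continue;
                        }
                        int keyLength = in.readInt();
                        byte[] key = new byte[keyLength];
                        in.readFully(key);
                        byte[] value = new byte[recordLength - keyLength];
                        in.readFully(value);
                        // hand the key/value bytes to the deserializers
                        // named in the header
                    }
                } catch (EOFException endOfFile) {
                    // reached the end of the file
                }
            }
        }

        private static void skipHeader(DataInputStream in) { /* omitted */ }
    }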
For the record-compressed case,
the layout is the same,
but the value is compressed with the codec specified in the header.
For the block-compressed case,
the layout is slightly different.
Key-value pairs are grouped into blocks, and every block starts with
the number of pairs, followed by the key lengths and the compressed keys,
then by the value lengths, and finally by the compressed values.
The difference between the record-compressed case and
the block-compressed case is that in the former,
every value is compressed individually, while in the latter,
sets of keys and values are compressed together, resulting in better compression.
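With the Hadoop API, the compression mode is chosen when the writer is created. Reusing conf and path from the earlier sketch, block compression with the default codec would look roughly like:

    import org.apache.hadoop.io.compress.DefaultCodec;

    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(IntWritable.class),
        SequenceFile.Writer.valueClass(Text.class),
        // NONE, RECORD, or BLOCK -- the three layout cases above
        SequenceFile.Writer.compression(
            SequenceFile.CompressionType.BLOCK, new DefaultCodec()));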
Let's see how SequenceFile compares to other formats on our criteria.
Space efficiency: moderate.
The on-disk format closely matches
the in-memory format to enable fast encoding and decoding.
Strictly speaking, this statement may not hold for arbitrary user-defined code,
but for primitive types, it is true.
Using block compression could further improve space efficiency.
Encoding and decoding speed.
Primitive values are copied as-is, so there is nothing tricky.
Supported data types.
Any type implementing the appropriate interfaces can be used with the format.
For a developer, this is a huge advantage:
you can work with custom data types,
implement arbitrary logic, and use
exactly the same types when interoperating with the Hadoop framework.
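For example, a custom value type only needs to implement the Writable interface (PointWritable here is a made-up type):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // A hypothetical 2D point, usable as a SequenceFile value.
    public class PointWritable implements Writable {
        private double x;
        private double y;

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeDouble(x); // serialization: dump the fields as raw bytes
            out.writeDouble(y);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            x = in.readDouble(); // deserialization mirrors write()
            y = in.readDouble();
        }
    }

Keys additionally have to be comparable, which is what the WritableComparable interface adds on top of Writable.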
Splittable or monolithic.
SequenceFiles are splittable via sync markers.
Sync markers are unique with high probability,
so you can seek to an arbitrary point in the file and
scan for the next occurrence of a sync marker to get to the next record.
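The Hadoop reader exposes exactly this operation. A sketch of processing one split, reusing conf and path from the writer sketch, where splitStart and splitEnd are hypothetical long offsets delimiting the split:

    SequenceFile.Reader reader = new SequenceFile.Reader(conf,
        SequenceFile.Reader.file(path));
    reader.sync(splitStart); // seek to the first sync marker after splitStart

    IntWritable key = new IntWritable();
    Text value = new Text();
    while (reader.getPosition() < splitEnd && reader.next(key, value)) {
        // process only the records that belong to this split
    }
    reader.close();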
Extensibility.
Not out of the box.
You may include a version when serializing data and later
use this version to choose among different revisions of the deserialization code,
but this is up to you.
Okay, let's move on. Next on the list, Avro.
Avro is a file format and a supporting library.
Avro's design goal was to create
an efficient and flexible format
which could be used with different programming languages.
To store your data in Avro,
you need to provide a schema,
that is, a description of the fields in your data items and their types.
The schema defines data encoding for every item.
When storing data, the schema is included in
the file, thus allowing future readers to decode it correctly.
The schema is also used when reading data.
If the read schema does not match the data schema,
Avro tries to resolve the inconsistencies, thus enabling smooth schema migrations.
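A minimal sketch with the Avro Java library (the schema, field names, and file name are made up for illustration):

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroWriteSketch {
        public static void main(String[] args) throws Exception {
            // A schema describing one data item: its fields and their types.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
              + "{\"name\":\"name\",\"type\":\"string\"},"
              + "{\"name\":\"age\",\"type\":\"int\"}]}");

            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "alice");
            user.put("age", 30);

            // The schema is embedded in the file header, so any
            // future reader can decode the records correctly.
            try (DataFileWriter<GenericRecord> writer = new DataFileWriter<>(
                    new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, new File("users.avro"));
                writer.append(user);
            }
        }
    }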
Technically, the Avro data layout is similar to that of SequenceFile.
Every Avro file starts with a header followed by
a sequence of blocks containing the number of encoded objects,
their sizes, and the actual payload.
Sync markers are used to delimit consecutive blocks of records.
What is different in Avro is that the serialization code is
defined by the schema and not by the user-provided code.
Let's check our criteria list to see how Avro compares.
Space efficiency is similar to SequenceFile's.
The encoding format mostly follows the in-memory format.
Space savings could be obtained by using compression.
Encoding and decoding speed.
Avro can generate serialization and deserialization code from a schema.
In this case, its performance closely matches SequenceFile.
Without code generation, however,
the speed is rather limited.
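Without generated classes, you read into generic records and access fields by name, which is flexible but slower. A sketch reading the hypothetical users.avro file from above:

    import java.io.File;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class AvroReadSketch {
        public static void main(String[] args) throws Exception {
            try (DataFileReader<GenericRecord> reader = new DataFileReader<>(
                    new File("users.avro"), new GenericDatumReader<GenericRecord>())) {
                for (GenericRecord record : reader) { // decode block by block
                    System.out.println(record.get("name") + ": " + record.get("age"));
                }
            }
        }
    }

With code generation, you would instead compile the schema into a Java class (for example, with the avro-tools "compile schema" command) and read through a SpecificDatumReader.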
Supported data types.
Avro provides the same types as JSON,
plus a few more complex types, like enumerations and records.
Compared to SequenceFile,
Avro forces you to express your data in a restricted type system.
This is the price you pay for cross-language interoperability.
Splittability is achieved using the same sync-marker technique as in SequenceFile.
Extensibility and maintainability are design goals for Avro.
So many simple operations,
such as field addition,
removal, or renaming,
are handled transparently by the framework.
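For instance, adding a field boils down to giving it a default value in the reader schema; old files that lack the field are then decoded as if it had always been there. The schemas below are illustrative (imports as in the earlier Avro sketches):

    // Writer schema: what the old files were written with.
    Schema writerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"}]}");

    // Reader schema: adds an "age" field with a default.
    Schema readerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"age\",\"type\":\"int\",\"default\":0}]}");

    // Avro resolves the two schemas when decoding.
    org.apache.avro.io.DatumReader<GenericRecord> datumReader =
        new GenericDatumReader<>(writerSchema, readerSchema);

Renaming is handled in a similar spirit, through field aliases in the reader schema.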
Avro is a popular format nowadays,
striking a balance between efficiency and flexibility.
For many applications, I would say it's a good choice.
Let's take a break here.
You have learned about the SequenceFile and Avro formats.
They are record-oriented formats.
In the next video, you will learn about columnar formats.