In Hadoop, HDFS replicates each block three times by default. Replication is a simple and robust form of redundancy that shields against DataNode failure, but it is expensive: the 3x replication scheme carries a 200% overhead in storage space and other resources.
For this reason, Hadoop 3.x introduced Erasure Coding (EC) as a new feature that can be used in place of replication. It provides the same level of fault tolerance while using much less storage space, with a storage overhead of no more than 50% in typical setups.
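The overhead figures above can be checked with a little arithmetic. The following is a minimal sketch, not part of Hadoop: the helper name `storage_overhead` is hypothetical, and the 6 data + 3 parity layout is assumed here as an illustrative scheme that matches the 50% figure.

```python
def storage_overhead(data_units: int, redundant_units: int) -> float:
    """Return extra storage as a percentage of the original data size."""
    return redundant_units / data_units * 100


# 3x replication: 1 original copy plus 2 extra copies.
replication_3x = storage_overhead(data_units=1, redundant_units=2)

# Assumed erasure coding layout: 6 data cells plus 3 parity cells.
ec_6_3 = storage_overhead(data_units=6, redundant_units=3)

print(f"3x replication overhead: {replication_3x:.0f}%")
print(f"EC (6 data, 3 parity) overhead: {ec_6_3:.0f}%")
```

With these assumptions, replication works out to 200% and the erasure coded layout to 50%.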
Erasure coding is best known from Redundant Array of Inexpensive Disks (RAID) storage. RAID implements EC through striping, which divides logically sequential data (such as a file) into smaller units (such as bits, bytes, or blocks) and stores consecutive units on different disks.
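Striping can be illustrated with a short sketch. This is a simplified model, not HDFS or RAID code: the function name `stripe`, the round-robin placement, and the tiny cell size are assumptions chosen for readability.

```python
def stripe(data: bytes, cell_size: int, num_disks: int) -> list[list[bytes]]:
    """Split data into fixed-size cells and place them round-robin on disks."""
    disks: list[list[bytes]] = [[] for _ in range(num_disks)]
    # Cut the sequential data into consecutive cells.
    cells = [data[i:i + cell_size] for i in range(0, len(data), cell_size)]
    # Consecutive cells go to consecutive disks.
    for index, cell in enumerate(cells):
        disks[index % num_disks].append(cell)
    return disks


layout = stripe(b"abcdefgh", cell_size=2, num_disks=3)
for disk_number, cells in enumerate(layout):
    print(f"disk {disk_number}: {cells}")
```

Here the four cells `ab`, `cd`, `ef`, `gh` are spread over three disks, so the fourth cell wraps around to the first disk.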
Encoding: In this process, RAID calculates and stores parity cells for each stripe of data cells. An error in any cell of the stripe can then be recovered from the surviving data cells and the parity cells. In other words, erasure coding extends a message with redundant data for fault tolerance. An EC codec operates on uniformly sized data cells: it takes a number of data cells as input and produces parity cells as output. The data cells and parity cells together are called an erasure coding group.
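The encoding step can be sketched as follows. This is an illustrative toy codec using a single XOR parity cell, not the codec that HDFS ships; the names `xor_cells` and `encode_group` and the dictionary layout of the group are assumptions.

```python
from functools import reduce


def xor_cells(cells: list[bytes]) -> bytes:
    """XOR equally sized cells together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*cells))


def encode_group(data_cells: list[bytes]) -> dict[str, list[bytes]]:
    """Build an erasure coding group: the data cells plus one parity cell."""
    # The codec operates on uniformly sized cells, so reject anything else.
    if len({len(cell) for cell in data_cells}) != 1:
        raise ValueError("data cells must be non-empty and uniformly sized")
    parity = xor_cells(data_cells)
    return {"data": list(data_cells), "parity": [parity]}


group = encode_group([b"\x01\x02", b"\x04\x08", b"\x10\x20"])
print(group["parity"][0].hex())
```

Three data cells go in and one parity cell comes out; together they form the group. A real codec may produce several parity cells per group.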
There are two algorithms available for Erasure Coding:
- XOR Algorithm
- Reed-Solomon Algorithm
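To show how the simpler of the two algorithms recovers data, here is a minimal sketch of XOR-based recovery. The helper names and sample cells are hypothetical. XOR produces a single parity cell and can therefore rebuild only one lost cell per group; Reed-Solomon generates multiple parity cells and can tolerate multiple failures, but its arithmetic is more involved and is not shown here.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equally sized cells."""
    return bytes(x ^ y for x, y in zip(a, b))


def xor_all(cells: list[bytes]) -> bytes:
    """XOR a list of equally sized cells into one cell."""
    result = cells[0]
    for cell in cells[1:]:
        result = xor_bytes(result, cell)
    return result


# Encode: one parity cell for three data cells.
d0, d1, d2 = b"HD", b"FS", b"EC"
parity = xor_all([d0, d1, d2])

# Simulate losing d1, then rebuild it from the surviving cells and the parity.
rebuilt = xor_all([d0, d2, parity])
print(rebuilt)
```

Recovery works because XOR-ing a value with itself cancels out, so combining the parity with every surviving data cell leaves only the missing cell.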