Why Hadoop PaaS Users Need To Upgrade To Version 3

Hadoop is a framework for developing and running distributed programs, and Hadoop 3, released in 2017, is its third major version. If you use a cloud Hadoop service based on version 1 or 2, this article covers the new features of version 3 and explains why it is worth switching to it.

Erasure Coding: Reducing Storage Redundancy

Hadoop 2 uses replication to provide fault tolerance, which means that all data is stored redundantly. By default, the replication factor is three: every block is held in three replicas, the original plus two copies.

For example, a 1 GB file is stored in four 256 MB blocks. For each of these blocks, two additional copies are created, resulting in 3 GB. The storage redundancy is 200%.

Hadoop 3 takes a different approach: Erasure Coding. It is a data fragmentation technique that creates only a few extra parity blocks. The parity blocks hold enough information to rebuild data blocks that become unavailable.
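To make the idea of a parity block concrete, here is a minimal sketch that uses plain XOR parity. It is a simplified stand-in for illustration only: XOR with one parity block can rebuild a single lost block, whereas production erasure codes such as Reed-Solomon use several parity blocks and tolerate more failures. The function names and the tiny 4-byte "blocks" are assumptions made for this example and are not part of Hadoop.

```python
from functools import reduce


def xor_parity(blocks):
    """Return a parity block: the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_missing(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors and the parity."""
    # XOR-ing the parity with every surviving block cancels them out,
    # leaving only the bytes of the block that was lost.
    return xor_parity(surviving_blocks + [parity])


# Four tiny "data blocks" standing in for real HDFS blocks.
data_blocks = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_parity(data_blocks)

# Simulate losing the third block and rebuilding it.
lost_index = 2
survivors = [b for i, b in enumerate(data_blocks) if i != lost_index]
restored = recover_missing(survivors, parity)
```

Here `restored` equals the lost block, even though only one extra block was stored instead of a full copy of every block.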

For the same 1 GB file, the Erasure Coding approach adds only two parity blocks to the four data blocks, so the data takes up 1.5 GB in total. The storage redundancy is 50%.
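The arithmetic behind the two approaches can be checked with a short sketch. The helper names are hypothetical, and the block counts (four 256 MB data blocks, two parity blocks, replication factor three) are taken from the example in this article rather than from any particular Hadoop policy.

```python
def replication_footprint(block_mb, data_blocks, replication_factor=3):
    """Raw storage used when every block is stored replication_factor times."""
    return block_mb * data_blocks * replication_factor


def erasure_coding_footprint(block_mb, data_blocks, parity_blocks):
    """Raw storage used by the data blocks plus the extra parity blocks."""
    return block_mb * (data_blocks + parity_blocks)


def overhead_percent(raw_mb, logical_mb):
    """Extra storage as a percentage of the logical file size."""
    return (raw_mb - logical_mb) / logical_mb * 100


BLOCK_MB = 256
DATA_BLOCKS = 4
logical_mb = BLOCK_MB * DATA_BLOCKS  # the 1 GB file

replicated_mb = replication_footprint(BLOCK_MB, DATA_BLOCKS)
erasure_mb = erasure_coding_footprint(BLOCK_MB, DATA_BLOCKS, parity_blocks=2)

print(f"replication:    {replicated_mb} MB, overhead {overhead_percent(replicated_mb, logical_mb):.0f}%")
print(f"erasure coding: {erasure_mb} MB, overhead {overhead_percent(erasure_mb, logical_mb):.0f}%")
```

Replication comes out at 3072 MB (3 GB) and erasure coding at 1536 MB (1.5 GB) for the same 1024 MB of data.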

Support For More Than Two NameNodes: Improving Fault Tolerance

In Hadoop 2, there can be one active and one standby management node (NameNode). It means that the cluster can survive the failure of only one NameNode. If both fail, the cluster becomes unavailable.

Hadoop 3 provides better fault tolerance because it supports more than one standby NameNode. With several standby NameNodes, the cluster can withstand the failure of two or more management nodes.

YARN Federation: Scale Up To 100,000 Nodes

YARN was designed to scale up to 10,000 nodes. Hadoop 3 introduces YARN Federation, which allows YARN to scale to 40,000 or even 100,000 nodes. It is useful for high-load applications.

New YARN Resource Types: Custom Quantifiable Resources

The YARN resource model has been generalized to support additional quantifiable resource types in addition to CPU and RAM. For example, a cluster administrator can define resources such as GPUs, software licenses, or locally attached storage.

It allows you to schedule tasks in YARN based on the availability of these resources. For example, the scheduler can take into account the number of available software licenses or the amount of free disk space when placing tasks.
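The following toy model illustrates what scheduling on custom resource types means in practice. It is not YARN's scheduler or API: the node names, the resource names (`gpu`, `license`) and both functions are hypothetical and exist only to show the matching logic.

```python
def fits(request, available):
    """True if every requested resource is available in sufficient quantity."""
    # A resource the node does not advertise counts as zero.
    return all(available.get(name, 0) >= amount for name, amount in request.items())


def pick_node(request, nodes):
    """Return the name of the first node that can satisfy the request, or None."""
    for node_name, available in nodes.items():
        if fits(request, available):
            return node_name
    return None


# Resources each node currently has free; "gpu" and "license" are custom types.
nodes = {
    "node-1": {"memory_mb": 8192, "vcores": 4},
    "node-2": {"memory_mb": 16384, "vcores": 8, "gpu": 2, "license": 1},
}

# A task that needs a GPU in addition to CPU and RAM.
request = {"memory_mb": 4096, "vcores": 2, "gpu": 1}
chosen = pick_node(request, nodes)
```

In this sketch the task lands on `node-2`, the only node that advertises a free GPU, while a request for more licenses than any node has would get `None`.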

YARN Timeline Service 2: Scalable Backend Storage

The Timeline Server is responsible for collecting and searching information about applications running in the cluster. Timeline Service v.2 is the next major iteration of the Timeline Server after v.1. It solves two main problems of v.1:

Scalability. Version v.1 is limited to a single instance of the reader/writer and storage and does not scale well beyond small clusters. Version v.2 uses a more scalable distributed writer architecture and scalable backend storage. YARN Timeline Service v.2 also separates the collection of data (writes) from the serving of data (reads).
Ease of use. YARN Timeline Service v.2 supports aggregation of information at the level of flows, that is, logical groups of YARN applications.

Task Heap Management: Auto-Tuning The Heap Size

Hadoop 3 introduces new methods for tuning heap sizes. Auto-tuning is now possible based on host memory size, and the HADOOP_HEAPSIZE variable is deprecated. The desired heap size no longer needs to be specified twice, both in the task configuration and as a Java option.
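As a rough illustration of deriving the heap automatically instead of configuring it by hand, the sketch below sizes a task's JVM heap as a fixed fraction of the memory given to its container. The function names and the 0.8 ratio are assumptions chosen for this example; the defaults in your Hadoop distribution may differ.

```python
def derive_heap_mb(container_memory_mb, heap_ratio=0.8):
    """Derive a JVM heap size (MB) as a fraction of the container memory."""
    if container_memory_mb <= 0:
        raise ValueError("container memory must be positive")
    if not 0 < heap_ratio <= 1:
        raise ValueError("heap ratio must be in the range (0, 1]")
    # Leave the remainder for non-heap JVM memory (stacks, metaspace, buffers).
    return int(container_memory_mb * heap_ratio)


def java_heap_option(container_memory_mb, heap_ratio=0.8):
    """Format the derived heap size as a -Xmx JVM option."""
    return f"-Xmx{derive_heap_mb(container_memory_mb, heap_ratio)}m"


# Only the container size is specified; the heap follows from it.
option = java_heap_option(2048)
```

With a single setting for the container size, the Java option is computed rather than maintained separately, which is the kind of duplication this feature is meant to remove.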

Java 8

Hadoop 2 runs on Java Development Kit (JDK) 7, while Apache Hadoop 3 has migrated to JDK 8. Besides Hadoop 3 itself, JDK 8 is supported by the HBase 2.0, Hive 3.0, and Phoenix 3.0 components.
