Big Data

Introduction to Hadoop

Pinterest LinkedIn Tumblr

Hadoop can effectively manage large data that is both structured and unstructured in a variety of formats. Get an introduction it here.

According to Cloudera, Hadoop is an open-source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.

Hadoop was created by computer scientists Doug Cutting and Mike Cafarella in 2006 to support distribution for the Nutch search engine. It was inspired by Google’s MapReduce.

Why Hadoop?

The problem with RDBMS is that it can not process semi-structured and unstructured data (i.e. text, videos, audios, Facebook posts, clickstream data, etc.). It can only work with structured data (i.e. banking transaction, location information, etc.). Both are also different in term of processing data.

RDBMS architecture with the ER model is unable to deliver fast results with vertical scalability by adding CPU or more storages. It becomes unreliable if the main server is down.

On the other hand, the Hadoop system manages effectively with large-sized structured and unstructured data in different formats such as XML, JSON, and text at high fault-error tolerance. With clusters of many servers in horizontal scalability, Hadoop’s performance is superior. It provides faster results from Big Data and unstructured data because its Hadoop architecture is based on the flat open source.

You can download it from here.

The basic points are:

  • It is open source
  • It is powered by Java
  • It is part of the Apache group
  • It is supported by big web giants.

Key Technologies

Following are the key technologies used in Hadoop.

Hadoop MapReduce

MapReduce is a computational model and software framework for writing applications that run on Hadoop. These MapReduce programs are capable of processing enormous data in parallel on large clusters of computation nodes.

HDFS (Inspired by GFS)

HDFS takes care of the storage part of Hadoop applications. MapReduce applications consume data from HDFS. HDFS creates multiple replicas of data blocks and distributes them on compute nodes in the cluster. This distribution enables the reliable and extremely rapid computations.

HDFS Architecture


Each Datanode sends a Heartbeat message to the Namenode periodically. If any of the Datanodes fail, Namenode detects this condition from the absence of the Heartbeat message so that the Namenode does not forward any new IO requests to that Datanode.

What Is Replication Factor?

The replication factor is a property that can be set in the HDFS configuration file that will allow you to adjust the global replication factor for the entire cluster. For each block stored in HDFS, there will be n-1 duplicated blocks distributed across the cluster. For example, if the replication factor was set to 3 (default value in HDFS), there would be one original block and two replicas.


The Namenode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where the cluster the file data is kept. It does not store the data of these files itself. Namenode holds the metadata HDFS like Namespace information, block information, etc. When in use, all this information is stored in main memory. But this information is also stored in disk for persistence storage.


A Datanode stores data in the HDFS. A functional filesystem has more than one Datanode, with data replicated across them. On startup, a Datanode connects to the Namenode, spinning until that service comes up. It then responds to requests from the Namenode for filesystem operations.

Secondary Namenode

Many of us get confused by its name >it gives a sense that it’s a backup for the Namenode. But in reality, it’s not.


The above image shows how Namenode stores information in disk. The two different files are:

  1. fsimage, the snapshot of the filesystem when the namenode is started.
  2. edit logs, the sequence of changes made to the filesystem after the namenode has started.

In production clusters, restarts of Namenode are rare, which means edit logs can grow very large. This can result in the following situations:

  1. Editlog becomes very large, making it challenging to manage.
  2. Restarting Namenode takes a long time because a lot of changes have to be merged.
  3. If there’s a crash, we will lose a huge amount of metadata since fsimage is very old.

To overcome these issues, we need a mechanism that will help us reduce the edit log size so that it’s manageable and that will give us an up-to-date fsimage, so that the load on Namenode reduces.

Secondary Namenode helps us overcome the above issues by taking over the responsibility of merging edit logs with fsimage from the Namenode.


The above figure shows the workings of Secondary Namenode.

  • It gets the edit logs from Namenode in regular intervals and applies them to fsimage.
  • Once it has a new fsimage, it copies it back to Namenode.
  • Namenode will use this fsimage for the next restart, which will reduce the startup time.

Resouce Manager and Node Manager

The Resource Manager and the Node Manager form the data computation framework. The Resource Manager is the ultimate authority that arbitrates resources among all the applications in the system. The Node Manager is the per-machine framework agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network), and reporting the this data to the Resource Manager/Scheduler.yarn_architecture


original source: