Hadoop lets you store files bigger than a single node can hold

With the increasing penetration and usage of the internet, the data captured by Google grew
exponentially year on year. To give a sense of the numbers, in 2007 Google collected on average
270 PB of data every month; by 2009 that figure had grown to 20,000 PB every day. Google clearly
needed a better platform to process such enormous volumes of data. It implemented a programming
model called MapReduce, which could process these 20,000 PB per day, and ran these MapReduce
operations on a special file system called the Google File System (GFS). Unfortunately,
GFS is not open source.

Doug Cutting and
Yahoo! reverse engineered the GFS model and built a parallel file system, the Hadoop Distributed
File System (HDFS). Thus came Hadoop, a framework and open-source Apache
project that can be used to perform operations on data in a distributed
environment (using HDFS) with a simple programming model called MapReduce.

In other words, Hadoop
can be thought of as a set of open-source programs and procedures that anyone
can use as the “backbone” of their big data operations. Hadoop
lets you store files bigger than what can be stored on any one node or
server, so you can store very large files, and very many of them.

It is also a scalable and fault-tolerant system. In
the realm of big data, Hadoop falls primarily into the distributed processing
category, but it also has powerful storage capabilities.

The core components of Hadoop are:

1.     Hadoop YARN – A resource manager and scheduling
system that allocates resources across a cluster of machines. It manages the resources of the
systems that store the data and run the analysis (a small client sketch appears after this list).

2.     Hadoop
MapReduce – MapReduce is named after the two basic operations this module
carries out: reading data and putting it into a format suitable
for analysis (map), and performing mathematical operations on it (reduce). MapReduce
provides a programming model that makes combining the data from various hard
drives a much easier task. There are two parts to the programming model, the
map phase and the reduce phase, and it is at the interface between the two that the
“combining” of data occurs. Hadoop distributes the data across multiple
servers, and each server can store and analyze its share of the data
locally. When you run a query on a large dataset, every server in this network
executes the query against its own local dataset, and the
results from all the servers are then consolidated. The consolidation is
handled by MapReduce (see the word-count sketch after this list).

3.     Hadoop
Distributed File System (HDFS)

This is a self-healing, high-bandwidth,
clustered file store that is optimized for high-throughput access
to data. It can store any type of data, structured or complex, from any number
of sources in its original format. It is a file system
designed for storing very large files with streaming data access patterns,
running on clusters of commodity hardware. By default, Hadoop
stores three copies of each data block on different nodes of the
cluster. Any time a node or machine holding a certain block of data fails,
another copy of that block is created on another node in the cluster, making the system
resilient to failure. In simpler terms, Hadoop distributes and replicates the dataset
across multiple nodes, so that if any node in the
Hadoop cluster fails, the data can still be served correctly (a client sketch appears after this list).
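
To make item 1 concrete, here is a minimal sketch, in Java, of a client talking to YARN's resource manager: it asks for a report of every running node and prints the capacity and container count that YARN tracks for each. It assumes a Hadoop client classpath with a yarn-site.xml pointing at the cluster's resource manager.

import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListClusterNodes {
    public static void main(String[] args) throws Exception {
        // Picks up the resource manager address from yarn-site.xml on the classpath.
        YarnConfiguration conf = new YarnConfiguration();

        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(conf);
        yarn.start();

        // Ask the resource manager for every node that is currently running.
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + "  capacity=" + node.getCapability()
                    + "  running containers=" + node.getNumContainers());
        }
        yarn.stop();
    }
}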
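For item 2, the sketch below is the classic word-count job written against Hadoop's Java MapReduce API. The map phase emits a (word, 1) pair for every word in the input, the reduce phase sums the counts for each word, and a combiner runs the same summing logic locally on each node before the results are consolidated, mirroring the "local analysis, then consolidation" flow described above. The input and output HDFS paths are taken from the command line and are placeholders.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: turn each line of input into (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts that arrive for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a job would typically be packaged into a jar and submitted with a command along the lines of: hadoop jar wordcount.jar WordCount /input /output (the jar name and paths are illustrative).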
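For item 3, here is a minimal sketch of a client working with HDFS through Hadoop's FileSystem API: it writes a small file, reads back the replication factor HDFS recorded for it, and then reads the file contents. The namenode URI hdfs://namenode:8020 and the path /data/example.txt are placeholder assumptions.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; in practice this usually comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/example.txt");

        // Write a file; HDFS splits it into blocks and replicates each block
        // across nodes (3 copies by default, or as set by dfs.replication).
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
        }

        // The replication factor is recorded per file and can be inspected.
        short replication = fs.getFileStatus(file).getReplication();
        System.out.println("replication factor = " + replication);

        // Read the file back; the client is served from whichever replica is available.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}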

 

KEY CHARACTERISTICS

 

–       High Availability

MapReduce,
as a YARN-based system, has efficient load balancing. It ensures that jobs run and
fail independently of one another, and it restarts failed
jobs automatically.

–       Scalability of Storage/Compute

Using the MapReduce model, applications can
scale from a single node to hundreds of nodes without having to re-architect
the system. Scalability is built into the model because data is chunked and
distributed as independent units of computation.

–       Controlling Cost  

Adding
or retiring nodes based on the evolution of storage and analytic requirements
is easy with Hadoop. You don’t have to commit to more storage or processing
power ahead of time and can scale only when required, thus controlling your
costs.

–       Agility and Innovation

Since
data is stored in its original format and there is no predefined schema, it is
easy to apply new and evolving analytic techniques to this data using
MapReduce.