What is Hadoop?

              Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer.

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.

Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes. Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to continue operating uninterrupted in case of a node failure. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative.

Hadoop was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts. Any of these parts (also called fragments or blocks) can be run on any node in the cluster. Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant. The current Apache Hadoop ecosystem consists of the Hadoop kernel, MapReduce, the Hadoop Distributed File System (HDFS) and a number of related projects such as Apache Hive, HBase and ZooKeeper.

The Hadoop framework is used by major players including Google, Yahoo and IBM, largely for applications involving search engines and advertising. The preferred operating systems are Windows and Linux but Hadoop can also work with BSD and OS X.

•Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
•Large datasets → terabytes or petabytes of data.
•Large clusters → hundreds or thousands of nodes.
•Hadoop is an open-source implementation of Google's MapReduce.
•Hadoop is based on a simple programming model called MapReduce.
•Hadoop is based on a simple data model; any data will fit.
What is Hadoop (Cont’d)?
•The Hadoop framework consists of two main layers:
•Distributed file system (HDFS)
•Execution engine (MapReduce)
Hadoop Master/Slave Architecture:
•Hadoop is designed as a master-slave shared-nothing architecture
Design Principles of Hadoop:
•Need to process big data.
•Need to parallelize computation across thousands of nodes.
•Commodity hardware: a large number of low-end, cheap machines working in parallel to solve a computing problem.
•This is in contrast to parallel DBs, which use a small number of high-end, expensive machines.
Design Principles of Hadoop (Cont'd):
•Automatic parallelization & distribution
•Hidden from the end-user
•Fault tolerance and automatic recovery
•Nodes/tasks will fail and will recover automatically
•Clean and simple programming abstraction
•Users only provide two functions “map” and “reduce”
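As a sketch of this abstraction (not Hadoop's actual Java API; the function names and generator style here are illustrative), the two user-provided functions for a classic word-count job might look like this in Python:

```python
# The only two functions a MapReduce user must supply, shown for
# word count. Everything else (distribution, scheduling, recovery)
# is handled by the framework.

def map_fn(key, value):
    """Map: consume one input record, emit intermediate <key, value> pairs."""
    # key: byte offset of the line (ignored here); value: one line of text
    for word in value.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    """Reduce: consume <key, list-of-values>, emit final <key, value>."""
    yield (key, sum(values))
```

In real Hadoop these would be a Mapper and Reducer class in Java (or scripts run via Hadoop Streaming), but the division of labor is the same: the user writes only the two functions.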
Who Uses MapReduce/Hadoop?
•Google: Inventors of the MapReduce computing paradigm.
•Yahoo: Developed Hadoop, the open-source implementation of MapReduce.
•IBM, Microsoft, Oracle.
•Facebook, Amazon, AOL, Netflix.
•Many others + universities and research labs.
Hadoop Distributed File System (HDFS):
Main Properties of HDFS :
Large: An HDFS instance may consist of thousands of server machines, each storing part of the file system's data.
Replication: Each data block is replicated many times (the default is 3).
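The default replication factor can be changed cluster-wide via the standard `dfs.replication` property in `hdfs-site.xml` (the value 3 shown here is simply the default restated):

```xml
<!-- hdfs-site.xml: default block replication factor -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```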
•Failure: Failure is the norm rather than the exception.
Fault Tolerance: Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS. The Namenode constantly checks the Datanodes (via periodic heartbeats).
Map-Reduce Execution Engine:
(Example: Color Count)
Properties of MapReduce Engine:
•Job Tracker is the master node (runs with the namenode).
•Receives the user’s job.
•Decides on how many tasks will run (number of mappers).
•Decides on where to run each mapper (concept of locality).
Properties of MapReduce Engine (Cont’d):
•Task Tracker is the slave node (runs on each datanode).
•Receives the task from Job Tracker.
•Runs the task until completion (either map or reduce task).
•Always in communication with the Job Tracker reporting progress.
Key-Value Pairs:
•Mappers and Reducers are users' code (provided functions)
•They just need to obey the key-value pair interface:
•Mappers: consume <key, value> pairs; produce <key, value> pairs.
•Reducers: consume <key, <list of values>>; produce <key, value> pairs.
•Shuffling and Sorting:
     A hidden phase between mappers and reducers.
     Groups all pairs with the same key from all mappers, sorts them by key, and passes them to a particular reducer in the form <key, <list of values>>.
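The whole pipeline, including the hidden shuffle-and-sort phase, can be simulated in a few lines of Python using the Color Count example. This is a single-process sketch of the dataflow, not how Hadoop actually distributes work; all names here are illustrative:

```python
from itertools import groupby

def map_fn(_, value):
    # Emit (color, 1) for each color token in one input line.
    for color in value.split():
        yield (color, 1)

def reduce_fn(key, values):
    # Sum the counts collected for one color.
    yield (key, sum(values))

def run_job(lines):
    # Map phase: apply map_fn to every input record.
    intermediate = [pair for i, line in enumerate(lines)
                    for pair in map_fn(i, line)]
    # Shuffle & sort (the hidden phase): sort by key, then group all
    # values for the same key together as <key, list-of-values>.
    intermediate.sort(key=lambda kv: kv[0])
    grouped = ((k, [v for _, v in g])
               for k, g in groupby(intermediate, key=lambda kv: kv[0]))
    # Reduce phase: one reduce_fn call per distinct key.
    return dict(pair for k, vs in grouped for pair in reduce_fn(k, vs))

# run_job(["red green red", "blue red"]) → {'blue': 1, 'green': 1, 'red': 3}
```

In a real cluster the map calls run on many Task Trackers in parallel, and the shuffle moves each key group over the network to the reducer responsible for it; the dataflow, however, is exactly this.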
MapReduce Phases:

Large-Scale Data Analytics:

•MapReduce computing paradigm (E.g., Hadoop) vs. Traditional database systems.

Many enterprises are turning to Hadoop, especially for applications generating big data: web applications, social networks, scientific applications.

Hadoop changes the economics and the dynamics of large scale computing. Its impact can be boiled down to four salient characteristics.

Hadoop enables a computing solution that is:

  • Scalable– New nodes can be added as needed, without changing data formats, how data is loaded, how jobs are written, or the applications on top.
  • Cost effective– Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
  • Flexible– Hadoop is schema-less, and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways enabling deeper analyses than any one system can provide.
  • Fault tolerant– When you lose a node, the system redirects work to another copy of the data and continues processing without missing a beat.






                 Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications. The challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy violations.
Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is "big data."

There are huge volumes of data in the world:

From the beginning of recorded time until 2003, we created 5 exabytes (5 billion gigabytes) of data.
In 2011, the same amount was created every two days.
In 2013, the same amount of data was created every 10 minutes.
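Taking the figures above at face value, a quick back-of-the-envelope conversion (using decimal units, 1 exabyte = 10^18 bytes) shows how these rates relate:

```python
EXABYTE = 10**18  # decimal exabyte, in bytes

# "2.5 quintillion bytes per day" is 2.5 exabytes per day.
daily_bytes = 2.5 * 10**18

# 2011: 5 EB every two days -> 2.5 EB/day, the same daily rate.
rate_2011 = 5 * EXABYTE / 2           # bytes per day
# 2013: 5 EB every 10 minutes -> 720 EB/day, 288x the 2011 rate.
rate_2013 = 5 * EXABYTE * (60 * 24) // 10  # bytes per day

print(rate_2013 / rate_2011)  # → 288.0
```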

Definition of Big Data:

Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time.
Big data “size” is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data.
Big data is a set of techniques and technologies that require new forms of integration to uncover large hidden values from large datasets that are diverse, complex, and of a massive scale.

Big data can be described by the following characteristics:

Volume – The quantity of data generated. The size of the data determines its value and potential, and whether it can actually be considered Big Data at all; the name 'Big Data' itself refers to size.

Variety – The range of categories the data belongs to. Knowing which category data falls into helps the analysts working closely with it to use it effectively.

Velocity – The speed at which data is generated and processed to meet demand.

Variability – The inconsistency the data can show at times, which hampers analysts' ability to handle and manage it effectively.

Veracity – The quality of the data being captured can vary greatly. Accuracy of analysis depends on the veracity of the source data.

Complexity – Data management can become very complex, especially when large volumes of data come from multiple sources. The data must be linked, connected and correlated in order to grasp the information it is supposed to convey.


Big data has increased the demand for information-management specialists, so much so that Software AG, Oracle Corporation, IBM, FICO, Microsoft, SAP, EMC, HP and Dell have spent more than $15 billion on software firms specializing in data management and analytics.
In 2010, this industry was worth more than $100 billion and was growing at almost 10 percent a year: about twice as fast as the software business as a whole.

The 5C level architecture can be described as follows. Smart Connection – Acquiring accurate and reliable data from machines and their components is the first step in developing a cyber-physical system application. The data might be measured directly by sensors or obtained from controllers or enterprise manufacturing systems such as ERP, MES, SCM and CMM. Two important factors at this level have to be considered.
Data-to-Information Conversion – Meaningful information has to be inferred from the data. Several tools and methodologies are currently available for this conversion level, and in recent years extensive focus has been applied to developing these algorithms specifically for prognostics and health-management applications. By calculating health value, estimated remaining useful life, etc., the second level of the CPS architecture brings self-awareness to machines.

Types of tools typically used in a Big Data scenario:

Where is the processing hosted?
Distributed servers/cloud
Where is the data stored?
Distributed storage (e.g., Amazon S3)
What is the programming model?
Distributed processing (MapReduce)
How is the data stored and indexed?
High-performance schema-free databases
What operations are performed on the data?
Analytic/semantic processing (e.g., RDF/OWL)
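The "schema-free database" idea can be illustrated with a toy Python sketch: records of completely different shapes live side by side under keys, with no predeclared table schema. This mimics the data model of document stores in general, not any particular product; all names here are made up:

```python
# Toy schema-free key-value store: heterogeneous records coexist
# without a predeclared schema (document-store style).
store = {}

def put(key, record):
    store[key] = record

def get(key):
    return store.get(key)

# Records with entirely different fields fit in the same store:
put("user:1", {"name": "Ada", "email": "ada@example.com"})
put("sensor:42", {"temp_c": 21.5, "ts": "2014-01-01T00:00:00Z"})
put("tweet:7", {"text": "hello", "tags": ["bigdata", "hadoop"]})
```

A relational database would require a fixed schema (and migrations) before any of these rows could be inserted; a schema-free store simply absorbs whatever shape arrives.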

When is dealing with Big Data hard?
When the operations on the data are complex:
E.g., simple counting is not a complex problem.
Modeling and reasoning with data of different kinds can get extremely complex.

Good news with big-data:   Often, because of the vast amount of data, modeling techniques can get simpler (e.g., smart counting can replace complex model-based analytics)…
…as long as we deal with the scale.

Time for thinking:
What do you do with the data?

Let's take an example:
“From application developers to video streamers, organizations of all sizes face the challenge of capturing, searching, analyzing, and leveraging as much as terabytes of data per second—too much for the constraints of traditional system capabilities and database management tools.”

Why Big-Data?
Key enablers for the appearance and growth of ‘Big-Data’ are:
Increase in storage capabilities.
Increase in processing power.
Availability of data.
'Big Data' is similar to 'small data', but bigger.

But bigger data requires different approaches:
Techniques, tools, architecture
… with an aim to solve new problems,
or old problems in a better way.



              KESHAVA TECH TEAM.