Sunday, July 31, 2016

Introduction to the Hadoop Ecosystem

Introduction : 
Hadoop is an Apache project designed for Online Analytical Processing (OLAP) of Big Data. Classical relational databases (Oracle, MySQL, MSSQL) suffer a performance drawback when they analyze large amounts of data, because they load the data into memory before processing it, which requires high-end hardware platforms. 

Problem : 

Relational databases struggle to analyze data with high velocity, high volume and high variety, although they are very good at Online Transaction Processing (OLTP). OLTP is based on inserts and updates (changes to the database), whereas OLAP is based on querying, filtering and processing large amounts of data.

The popularity of social media generates terabytes of unstructured data per second. Analyzing delta changes from massive data dumps (for example, the Like count of a recent Facebook photo) in real time is a challenging task, and it cannot be achieved with relational database models.

Solution : 

Massively Parallel Distributed Processing Frameworks. 

Example : Hadoop Ecosystem
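
To make "massively parallel distributed processing" a little more concrete, below is a minimal word-count sketch written against Hadoop's MapReduce Java API. The class name and the input/output paths are placeholders chosen only for illustration; the point is that the map tasks run in parallel on separate blocks of the input across the cluster, and the reduce tasks aggregate the partial counts.

// Minimal MapReduce word-count sketch (illustrative only, paths are hypothetical).
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs in parallel on each input split and emits (word, 1) per word.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/demo/input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/demo/output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}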



Figure 1 - Hadoop Ecosystem


Hadoop Ecosystem
Hadoop is the core engine for OLAP, and the ecosystem comes with the features and frameworks below:

1. Data Acquisition

1. Apache Flume : A tool that reads live streaming data and pushes it into Hadoop
2. Apache Sqoop : A tool that reads data from relational databases and pushes it into Hadoop

2. Arrangement of Data

1. Hadoop Distributed File System (HDFS) : The file system of the Hadoop framework (a short Java sketch follows this list)
2. NoSQL Databases

3. Analyzing Data

1. Apache Pig  : A tool with a scripting language (Pig Latin) to analyze data in Hadoop.
                           It is well suited to unstructured data, e.g. log analysis
2. Apache Hive : A tool with SQL-like features to analyze data in Hadoop (a short JDBC sketch follows this list)

4. Intelligence

    It is all about getting visualizations to support decision making
   
1. Apache Hue : A web interface for visualization
2. Cloudera Manager : A web interface for setting up and managing a Hadoop cluster
3. Tableau, QlikView, SAS : Proprietary products for visualization
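
As a quick illustration of the HDFS entry above, the following is a minimal sketch of writing and then reading a file through the Hadoop FileSystem Java API. The path /user/demo/hello.txt is a hypothetical example, and the cluster settings (fs.defaultFS and so on) are assumed to come from the usual core-site.xml/hdfs-site.xml on the classpath.

// Minimal HDFS read/write sketch using the Hadoop FileSystem API.
// The file path below is a hypothetical example.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from the cluster's configuration files on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt"); // hypothetical path

    // Write a small file; HDFS splits large files into blocks and replicates them.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back.
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(in, StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }

    fs.close();
  }
}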

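Similarly, for the Apache Hive entry above, here is a rough sketch of running a HiveQL query from Java over JDBC (HiveServer2). The host name, port, table and credentials are all made-up values for illustration, not part of any particular cluster.

// Minimal Hive-over-JDBC sketch (HiveServer2). Host, table and credentials are hypothetical.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Requires the hive-jdbc driver on the classpath.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    String url = "jdbc:hive2://hive-server.example.com:10000/default"; // hypothetical host
    try (Connection conn = DriverManager.getConnection(url, "demo", "");
         Statement stmt = conn.createStatement();
         // Hypothetical table: page_views(page STRING, views INT)
         ResultSet rs = stmt.executeQuery(
             "SELECT page, SUM(views) AS total FROM page_views GROUP BY page")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("total"));
      }
    }
  }
}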


Usage of the Hadoop Ecosystem

1. Sentiment Analysis 

This is to identify customer satisfaction and the actual expectations of customers

2. Forecasting Trends and Risk Modeling

This is to identify new market trends, make forecasts, and identify risks in sectors such as banking 


3. Analyzing Network Data, Firewall Logs

Analyzing massive numbers of data packets and lengthy firewall logs for security threats

4. Intelligent Search

Providing the best-fitting search results in an efficient manner.


How Hadoop Really Works

still writing......