Sunday, July 31, 2016

Introduction to the Hadoop Ecosystem

Introduction : 
Hadoop is an Apache project aimed at Online Analytical Processing (OLAP) of Big Data. Classical relational databases (Oracle, MySQL, MSSQL) show performance drawbacks when they start analyzing large amounts of data, because they load the data into memory before processing it. This requires high-end hardware platforms.

Problem : 

Relational databases are not good at analyzing data with high velocity, high volume and high variety. But they are really good at Online Transaction Processing (OLTP). OLTP is based on inserts and updates (changes to the database), whereas OLAP is based on querying, filtering and processing large amounts of data.

The popularity of social media generates unstructured data at terabytes per second. Analyzing delta changes in massive data dumps (for example, Facebook's Like count for a recent photo) in real time is a challenging task. This cannot be achieved with relational database models.

Solution : 

Massively Parallel Distributed Processing Frameworks. 

Example : Hadoop Ecosystem



Figure 1 - Hadoop Ecosystem
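
The idea behind a massively parallel distributed processing framework can be sketched on a single machine: cut the data into pieces, process the pieces at the same time, then combine the partial results. The snippet below is only a toy analogy using standard Unix tools, not how Hadoop itself is implemented. The function name parallel_count and the sample log lines are hypothetical.

```shell
# Toy, single-machine analogy: split a file, count matches per chunk in
# parallel background jobs, then add the partial counts together.
parallel_count() {
    pattern="$1"; input="$2"; lines_per_chunk="$3"
    workdir=$(mktemp -d) || return 1

    # "Distribute": cut the input into fixed-size chunks.
    split -l "$lines_per_chunk" "$input" "$workdir/chunk."

    # "Process in parallel": one background grep per chunk.
    for chunk in "$workdir"/chunk.*; do
        grep -c -- "$pattern" "$chunk" > "$chunk.count" &
    done
    wait

    # "Combine": sum the partial counts into one total.
    cat "$workdir"/chunk.*.count | awk '{ total += $1 } END { print total + 0 }'
    rm -rf "$workdir"
}

# Hypothetical sample data: five log lines, three of them ERROR entries.
sample=$(mktemp)
printf '%s\n' 'INFO start' 'ERROR disk' 'INFO ok' 'ERROR net' 'ERROR auth' > "$sample"
parallel_count ERROR "$sample" 2
rm -f "$sample"
```

In a real cluster the chunks live on different machines, so the work also scales past the memory and disk of any single server.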


Hadoop Ecosystem
Hadoop is the core engine for OLAP, and the ecosystem comes with the features and frameworks below.

1. Data Acquisition

1. Apache Flume : A tool that reads live stream data and pushes it to Hadoop
2. Apache Sqoop : A tool that reads data from relational databases and pushes it to Hadoop

2. Arrangement of Data

1. Hadoop Distributed File System (HDFS) : The file system of the Hadoop framework
2. NoSQL Databases

3. Analyzing Data

1. Apache Pig  : A tool with a scripting language to analyze data in Hadoop.
   It is suited to unstructured data, e.g. log analysis
2. Apache Hive : A tool with SQL features to analyze data in Hadoop

4. Intelligence

    It is all about visualizing data to support decisions
   
1. Hue : A web interface for visualization
2. Cloudera Manager : A web interface to set up a Hadoop cluster
3. Tableau, QlikView, SAS : Proprietary products for visualization



Usage of the Hadoop Ecosystem

1. Sentiment Analysis

This is to identify customer satisfaction and the actual expectations of customers

2. Forecasting Trends and Risk Modeling

This is to identify new market trends, make forecasts, and identify risks in the banking sector


3. Analyzing Network Data, Firewall Logs

Analyzing massive numbers of data packets and lengthy firewall logs for security threats

4. Intelligent Search

Providing the best-fitting search results in an efficient manner.


How Hadoop Really Works

still writing......

Thursday, July 28, 2016

Format Pen Drive

Environment :  Linux Mint

Step 0 : Plug the pen drive into your Linux machine

Step 1 : Identify the device name of the pen drive
Command 1 : sudo fdisk -l


WARNING: GPT (GUID Partition Table) detected on '/dev/sda'! The util fdisk doesn't support GPT. Use GNU Parted.


Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1   976773167   488386583+  ee  GPT
Partition 1 does not start on physical sector boundary.

Disk /dev/sdb: 15.7 GB, 15733161984 bytes
64 heads, 32 sectors/track, 15004 cylinders, total 30728832 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x51585bc4

Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *          64     2054143     1027040    6  FAT16
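
The listing is long, and the lines that matter for picking the right device are the ones starting with "Disk /dev/". A small filter, sketched below, keeps only those summary lines. The helper name list_disks is hypothetical, and the sample input is copied from the output above. In this listing the 15.7 GB disk /dev/sdb is the pen drive, but the name can differ on your machine.

```shell
# Keep only the per-disk summary lines of `fdisk -l` style output read from stdin.
list_disks() {
    grep '^Disk /dev/'
}

# On a real machine: sudo fdisk -l 2>/dev/null | list_disks
# Here a few lines from the listing above are fed in directly.
printf '%s\n' \
    'Disk /dev/sda: 500.1 GB, 500107862016 bytes' \
    'Units = sectors of 1 * 512 = 512 bytes' \
    'Disk identifier: 0x00000000' \
    'Disk /dev/sdb: 15.7 GB, 15733161984 bytes' | list_disks
```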


Step 2 : Un-mount the pen drive partition
Command 2 : sudo umount /dev/sdb1


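Step 3 : Format the pen drive with the preferred file system. The -I option is needed because /dev/sdb is the whole device rather than a partition. This erases everything on the drive.
Command 3 : sudo mkfs.vfat -I /dev/sdb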
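
Formatting the wrong disk is the main risk in these steps, so it can help to build the command through a small guard first. The sketch below only prints the command and does not format anything. The function name mkfs_command is hypothetical, and it assumes /dev/sda is the internal disk, as in the fdisk output above.

```shell
# Print the format command for a device, refusing obvious mistakes.
# Assumption: /dev/sda is the internal disk, as in the fdisk listing above.
mkfs_command() {
    device="$1"
    case "$device" in
        /dev/sda*)
            echo "refusing to format $device: it looks like the internal disk" >&2
            return 1 ;;
        /dev/sd[b-z])
            # Whole-device name such as /dev/sdb: print, do not run.
            echo "sudo mkfs.vfat -I $device" ;;
        *)
            echo "unexpected device name: $device" >&2
            return 1 ;;
    esac
}

mkfs_command /dev/sdb
```

Once the printed command has been checked against the fdisk output, it can be copied and run by hand.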






Monday, July 18, 2016

ORA-01722: invalid number - Filtering out non-numeric values from a numeric column

Issue :


I found that the column CG_AMOUNT contains non-numeric values when running the query below.

SELECT COUNT(*) AS CHARGE_CUSTOMER_COUNT FROM CM_CHRG_CX;

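Oracle returns the error below : ORA-01722: invalid number
Oracle raises this error when it tries to convert a character string that is not a valid number into a number.


Solution : keep only the all-digit rows with REGEXP_LIKE(CG_AMOUNT, '^[[:digit:]]+$')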

SELECT COUNT(*) AS CHARGE_CUSTOMER_COUNT FROM CM_CHRG_CX WHERE REGEXP_LIKE(CG_AMOUNT, '^[[:digit:]]+$');
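
To see which values the pattern keeps, the same POSIX expression can be tried outside the database with grep. This is only an illustration. The sample values are made up and are not taken from CM_CHRG_CX, and the helper name is_all_digits is hypothetical. Note that the pattern accepts unsigned whole numbers only, so decimals and negative amounts are filtered out as well.

```shell
# Return success only when the argument consists of one or more digits.
# Uses the same POSIX pattern as the REGEXP_LIKE filter above.
is_all_digits() {
    printf '%s\n' "$1" | grep -Eq '^[[:digit:]]+$'
}

# Hypothetical sample values standing in for CG_AMOUNT.
for value in 100 2500 12.50 -7 N/A '1 000'; do
    if is_all_digits "$value"; then
        echo "$value: kept"
    else
        echo "$value: filtered out"
    fi
done
```

If decimal or signed amounts are valid in the column, the pattern would have to be widened before using it as a filter.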