Wednesday, November 11, 2015

UNDERSTANDING MAP-REDUCE PROGRAMMING

MapReduce is a method for distributing a task across multiple nodes, where each node processes the data stored locally on that node.
It consists of two phases:
  • Map
  • Reduce

The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.

Hadoop divides the input into fixed-size pieces called input splits, or just splits. The split size is usually the same as the HDFS block size, for optimization purposes. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. Each record is produced by Hadoop's record reader, which presents the data to the map function as a key-value pair.
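To make this concrete, below is a minimal sketch of a word-count mapper in Java, assuming the standard org.apache.hadoop.mapreduce API (the class name WordCountMapper is just illustrative). The key passed to map() is the byte offset of the line within the split, and the value is the line of text produced by the record reader.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each call to map() receives one record from the record reader:
// key = byte offset of the line within the split, value = the line itself.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (word, 1) as intermediate output
            }
        }
    }
}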

Hadoop always tries its best to run a map task on a node where the input data (input split) resides. This is called the data locality optimization, and it improves performance. If all three nodes hosting the replicas of a map task's input split are busy with other tasks, the job scheduler will look for a free map slot on a node in the same rack as one of the blocks. Very occasionally even this is not possible, so an off-rack node is used, which results in an inter-rack network transfer. This is possible because of the rack awareness feature in Hadoop.

Map tasks write their output to the local disk of the node where the mapper runs, not to HDFS. Map output is intermediate output, and once the job is complete it can be thrown away.
If the node running the map task fails before the map output has been consumed by the reduce task, Hadoop (via the jobtracker) will automatically rerun the map task on another node to re-create the map output.
The input to a reduce task is normally the output from multiple mappers. Reduce tasks don't have the advantage of data locality. Therefore, the sorted and shuffled map outputs have to be transferred across the network to the node where the reduce task is running, where they are merged and then passed to the user-defined reduce function. The output of the reducer is stored in HDFS.
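Continuing the word-count sketch from the mapper above (again a hedged illustration against the standard org.apache.hadoop.mapreduce API), the reducer below receives each word together with all of its counts, already shuffled, sorted, and grouped by the framework, and writes the final totals to HDFS.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// The shuffle/sort phase groups all (word, 1) pairs by word before reduce() runs.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));   // reducer output is written to HDFS
    }
}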

http://hadooptutorials.co.in/tutorials/mapreduce/advanced-map-reduce-examples-1.html
http://hadooptutorials.co.in/tutorials/mapreduce/advanced-map-reduce-examples-2.html

Understanding HADOOP Ecosystem

We all know Hadoop is a framework that deals with Big Data, but unlike other frameworks it is not a single, simple framework: it has its own family of projects for processing different kinds of work, tied together under one umbrella called the Hadoop Ecosystem. Before jumping directly to the members of the ecosystem, let's understand how data is classified. Under the Big Data platform, data is mainly categorized into three types:
  • Structured Data - Data which has a proper structure and can easily be stored in tabular form in any relational database such as MySQL or Oracle is known as structured data. Example: employee data.
  • Semi-Structured Data - Data which has some structure but cannot be stored in tabular form in a relational database is known as semi-structured data. Examples: XML data, email messages, etc.
  • Unstructured Data - Data which has no structure and cannot be stored in the tabular form of a relational database is known as unstructured data. Examples: video files, audio files, text files, etc.
The diagram below shows the various components of the Hadoop Ecosystem.

Let's try to understand each component in detail.
SQOOP : SQL + HADOOP = SQOOP

Sqoop is the tool for moving structured data between relational databases and Hadoop. When we import structured data from a table in an RDBMS into HDFS, a file is created in HDFS, which we can process either with a MapReduce program directly or with Hive or Pig. Similarly, after processing data in HDFS we can store the processed structured data back into another RDBMS table by exporting it through Sqoop.
HDFS (Hadoop Distributed File System)
HDFS is the main storage component of Hadoop: a technique for storing data in a distributed manner so that it can be processed quickly. HDFS stores data in blocks of 64 MB (the default) or 128 MB, which is a logical splitting of the data across DataNodes (the physical storage) in a Hadoop cluster (a collection of commodity machines connected through a single network). All the information about how the data is split across DataNodes, known as metadata, is kept on the NameNode, which is also part of HDFS.
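As an illustration of how a client reads a file that HDFS has split into blocks, here is a minimal sketch using the Java FileSystem API; the file path is a hypothetical placeholder, and the cluster address is taken from the configuration on the classpath.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // NameNode is resolved from fs.defaultFS
        InputStream in = null;
        try {
            // The client asks the NameNode where the blocks live, then streams
            // them directly from the DataNodes that hold the replicas.
            in = fs.open(new Path("/user/gagan/sample.txt"));   // hypothetical path
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}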
MapReduce Framework
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Apache MapReduce was derived from Google's paper "MapReduce: Simplified Data Processing on Large Clusters". The current Apache MapReduce version is built over the Apache YARN framework. YARN stands for "Yet Another Resource Negotiator". It is a new framework that facilitates writing arbitrary distributed processing frameworks and applications. YARN's execution model is more generic than the earlier MapReduce implementation: it can run applications that do not follow the MapReduce model, unlike the original Apache Hadoop MapReduce (also called MR1). Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data processing. We can write MapReduce programs in languages such as Java, C++ (via Pipes), Python, Ruby, etc. As the name suggests, the Map phase maps the logic onto the data (distributed in HDFS), and once the computation is over the Reduce phase collects the Map results to generate the final output of the MapReduce job. A MapReduce program can be applied to any type of data, structured or unstructured, stored in HDFS. Example: word count using MapReduce, for which a minimal driver is sketched below.
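For the word-count example, a minimal driver sketch is shown below; it assumes the WordCountMapper and WordCountReducer classes sketched in the MapReduce post above, with input and output paths taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);      // sketched earlier
        job.setCombinerClass(WordCountReducer.class);   // optional local aggregation
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output written to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}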
HBASE
Hadoop Database, or HBase, is a non-relational (NoSQL) database that runs on top of HDFS. HBase was created for very large tables that have billions of rows and millions of columns, with fault tolerance and horizontal scalability, and it is based on Google's Bigtable. Hadoop itself can perform only batch processing, with data accessed in a sequential manner; for random access to huge data sets, HBase is used. It can also back Hadoop MapReduce jobs with Apache HBase tables, and it supports random, real-time read/write operations on very large column-oriented tables.
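As a sketch of the random, real-time read/write access described above, here is a minimal Java example assuming the HBase 1.x client API and a pre-created table named users with a column family info (both names are hypothetical).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {
            // Random write: insert one cell into the row keyed "row1".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Gagan"));
            table.put(put);
            // Random read: fetch the same row back immediately.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}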

Apache Pig

Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data. Pig runs on Hadoop. It makes use of both the Hadoop Distributed File System, HDFS, and Hadoop’s processing system, MapReduce. 
Pig uses MapReduce to execute all of its data processing. It compiles the Pig Latin scripts that users write into a series of one or more MapReduce jobs that it then executes. Pig Latin looks different from many of the programming languages you have seen. There are no if statements or for loops in Pig Latin. This is because traditional procedural and object-oriented programming languages describe control flow, and data flow is a side effect of the program. Pig Latin instead focuses on data flow.

Apache Spark

Spark is a data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). However, Spark provides an easier-to-use alternative to Hadoop MapReduce and offers performance up to 10 times faster than previous-generation systems like Hadoop MapReduce for certain applications.
Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does but with a fast in-memory approach and a clean functional style API. With its ability to integrate with Hadoop and inbuilt tools for interactive query analysis (Shark), large-scale graph processing and analysis (Bagel), and real-time analysis (Spark Streaming), it can be interactively used to quickly process and query big data sets.
To make programming faster, Spark provides clean, concise APIs in Scala, Java and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets. Spark is also the engine behind Shark, a fully Apache Hive-compatible data warehousing system that can run 100x faster than Hive.
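To give a feel for the API, here is a minimal word-count sketch in Java, assuming the Spark 1.x Java API (where flatMap returns an Iterable) and Java 8 lambdas; the HDFS paths are hypothetical placeholders.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark-word-count");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs:///user/gagan/input");   // hypothetical path
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")))           // Spark 1.x: returns Iterable
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);                             // in-memory aggregation
        counts.saveAsTextFile("hdfs:///user/gagan/output");                // hypothetical path

        sc.stop();
    }
}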

Apache Flume

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.

Apache Kafka

Kafka is a distributed publish-subscribe system for processing large amounts of streaming data. It is a message queue developed at LinkedIn that persists messages to disk in a very performant manner. Because messages are persisted, clients have the interesting ability to rewind a stream and consume the messages again. Another upside of disk persistence is that bulk-importing the data into HDFS for offline analysis can be done very quickly and efficiently. Storm, developed by BackType (later acquired by Twitter), is more about transforming a stream of messages into new streams.
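A minimal producer sketch using the Kafka Java client (the new producer API available since Kafka 0.8.2) is shown below; the broker address, topic name, and log line are hypothetical placeholders.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // hypothetical broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Each record is appended to the topic's on-disk log, so consumers
            // can later replay the stream by rewinding their offsets.
            producer.send(new ProducerRecord<>("web-logs", "host1", "GET /index.html 200"));
        }
    }
}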

Apache Zookeeper

Writing distributed applications is difficult because partial failures may occur between nodes. To overcome this, Apache ZooKeeper was developed: an open-source server that enables highly reliable distributed coordination. ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. In case of a partial failure, clients can connect to any node and be assured that they will receive correct, up-to-date information. It is a coordination service that gives you the tools you need to write correct distributed applications. ZooKeeper was developed at Yahoo! Research. Several Hadoop projects already use ZooKeeper to coordinate the cluster and provide highly available distributed services; perhaps the most famous of those are Apache HBase, Storm, and Kafka. ZooKeeper is an application library with two principal implementations of the APIs, Java and C, and a service component implemented in Java that runs on an ensemble of dedicated servers. ZooKeeper simplifies the development of distributed systems, making it more agile and enabling more robust implementations.
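A minimal sketch of the ZooKeeper Java client, showing how a small piece of configuration might be stored in a znode and read back; the connection string and znode path are hypothetical.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble; a no-op watcher is enough for this sketch.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // Create a znode holding a small piece of configuration.
        zk.create("/config-demo", "batch.size=100".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any client connected to any server in the ensemble reads the same value.
        byte[] data = zk.getData("/config-demo", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}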

Apache Oozie

It is a workflow scheduler system to manage Hadoop jobs. It is a server-based workflow engine specialized in running workflow jobs with actions that execute Hadoop MapReduce and Pig jobs. Oozie is implemented as a Java web application that runs in a Java servlet container. Hadoop deals with big data, and a programmer often wants to run many jobs in a sequential manner: the output of job A is the input to job B, the output of job B is the input to job C, and the final output is the output of job C. To automate this sequence we need a workflow, and to execute it we need an engine, which is what Oozie provides.
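For programmatic submission, a minimal sketch using the Oozie Java client API might look like the following; the Oozie URL, workflow application path, and cluster addresses are hypothetical, and the workflow.xml at that path would define the chained actions.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        OozieClient client = new OozieClient("http://localhost:11000/oozie");   // hypothetical URL

        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs:///user/gagan/workflows/wordcount");   // hypothetical
        conf.setProperty("nameNode", "hdfs://localhost:8020");
        conf.setProperty("jobTracker", "localhost:8032");

        String jobId = client.run(conf);   // submit and start the workflow
        System.out.println("Workflow job id: " + jobId);

        WorkflowJob job = client.getJobInfo(jobId);
        System.out.println("Status: " + job.getStatus());
    }
}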

Apache Mahout

Mahout is an open source machine learning library from Apache, written in Java. The algorithms it implements fall under the broad umbrella of machine learning or collective intelligence. This can mean many things, but at the moment for Mahout it means primarily recommender engines (collaborative filtering), clustering, and classification. Mahout aims to be the machine learning tool of choice when the collection of data to be processed is very large, perhaps far too large for a single machine. In its current incarnation, these scalable machine learning implementations in Mahout are written in Java, and some portions are built upon Apache's Hadoop distributed computation project.
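As a sketch of the recommender-engine side mentioned above, here is a minimal user-based collaborative filtering example assuming Mahout's Taste API; ratings.csv is a hypothetical file of userID,itemID,preference triples.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv"));   // hypothetical data file
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Recommend 3 items for user 1.
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);
        for (RecommendedItem item : recommendations) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}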

H2O

H2O is a statistical, machine learning, and math runtime tool for big data analysis. Developed by the predictive analytics company H2O.ai, H2O has established itself as a leader in the machine learning scene alongside R and Databricks' Spark. According to the team, H2O is the world's fastest in-memory platform for machine learning and predictive analytics on big data. It is designed to help users scale machine learning, math, and statistics over large datasets.
In addition to H2O's point-and-click web UI, its REST API allows easy integration into various clients. This means exploratory analysis of data can be done in a typical fashion in R, Python, and Scala, and entire workflows can be written up as automated scripts.






Tuesday, August 18, 2015

Hive an Introduction!


Data is growing rapidly, and industry specialists believe that a traditional RDBMS is not going to work for terabytes or petabytes of data. Hadoop came as a savior for Big Data related problems. However, the terms Big Data and Hadoop scare people who are well versed in writing SQL queries to get data from a database, because Hadoop jobs are all about MapReduce, and their SQL knowledge will not help them write MapReduce programs.

Hive is the remedy for this: it bridges the gap between SQL and MapReduce. Hive provides a query language (HiveQL) and converts queries into MapReduce jobs internally. HiveQL is a very powerful, SQL-like query language, so people with no knowledge of MapReduce can also query data in Hadoop. Let's understand Hive from an RDBMS perspective.

How Hive Manages Data?

Hive keeps data in any Hadoop-supported file system, such as HDFS or the local file system. Hive keeps table structure information in the metastore, which can be any JDBC-compliant RDBMS, e.g. Derby, MySQL, or Postgres. By default Hive provides an embedded Derby instance, which can serve only one local user at a time.

Hive doesn't validate data against the table structure while loading it into a table; it only applies the table structure while querying the table. This makes loading data fast, which is very helpful in a Big Data scenario.

Let's explore Hive's features.

Hive supported Data Types
Hive supports all basic data types, along with complex data types like ARRAY, MAP, etc.

Hive Table
A table in Hive is conceptually analogous to an RDBMS table, but Hive really just provides a wrapper that gives the user the feeling that the data is stored in a table structure. The data actually resides in a Hadoop file system such as HDFS or the local file system. Hive stores the table structure and other information in an RDBMS and calls it the metastore.

There are two types of tables in Hive.

Managed Tables
By default, all tables created in Hive are managed tables. As the name suggests, managed tables are tables whose data is managed by Hive itself: once you load data into a managed table, Hive moves the file into its warehouse store (the location in the file system where Hive keeps its data), which involves only a move operation. When you drop the table, Hive removes the file from the warehouse and deletes the metadata.
Let's create a table RESIDENTS with fields ID, NAME, STATE, and CITY.

CREATE TABLE RESIDENTS (ID BIGINT, NAME STRING, STATE STRING, CITY STRING);
This is a sample managed table. We can load data into it using the following statement.


LOAD DATA INPATH '/user/gagan/residents.txt' INTO TABLE RESIDENTS;
If the file system remains the same, this LOAD statement only moves the file residents.txt into the Hive warehouse.

External Table
As the name suggests, external tables are tables whose data is not managed by Hive: Hive neither moves the data into its warehouse on create nor deletes the data on drop.

CREATE EXTERNAL TABLE RESIDENTS_EXTERNAL (ID BIGINT, NAME STRING, STATE STRING, CITY STRING)
LOCATION '/user/gagan/RESIDENTS_EXTERNAL';
LOAD DATA INPATH '/user/gagan/residents.txt' INTO TABLE RESIDENTS_EXTERNAL;
Here we explicitly state that the table is external, which tells Hive not to move the data into the warehouse. Hive doesn't even validate the location while creating the table.

Import Data to Hive table
There are a couple of ways in which we can import data into Hive.
Import data from file
As described above, data can be loaded from a file into Hive using the LOAD statement.
Importing Data from another table
As in an RDBMS, we can import data into one Hive table by querying other Hive tables. The syntax looks like this:

INSERT INTO TABLE RESIDENTS
SELECT *
FROM RESIDENTS_EXTERNAL;
INSERT INTO will append data to the RESIDENTS table. INSERT OVERWRITE can be used to overwrite the existing table data. Hive doesn't support the INSERT INTO TABLE ... VALUES clause of an RDBMS.

CTAS (CREATE TABLE AS SELECT)
Sometimes we need to insert the result of a query while creating a new table; CTAS helps in that scenario. The new table copies its structure from the SELECT statement. The CTAS syntax looks like this:

CREATE TABLE NEW_TABLE
AS
SELECT ID, NAME
FROM RESIDENTS;
Querying Hive tables
Querying Hive tables is very similar to SQL. Hive supports a large number of built-in functions, such as mathematical, date, and string manipulation functions. Some examples are as follows.

SELECT ID,NAME FROM RESIDENTS;
SELECT ID,NAME from RESIDENTS WHERE CITY='DELHI';
SELECT COUNT(*) FROM RESIDENTS WHERE STATE='HARYANA';
SELECT MAX(ID) FROM RESIDENTS;
SELECT ID,NAME FROM RESIDENTS WHERE CITY LIKE '%PUR';
Hive provides an ORDER BY clause as well. As we know, HiveQL is translated into MapReduce jobs, and ORDER BY (which sorts the entire data set globally) forces the number of reducers to be 1, which in turn slows down the operation. We can use SORT BY instead if sorting over the entire data set is not required; SORT BY produces sorted output per reducer.

Joins
Hive supports only equijoins, outer joins, and left semi joins. Join conditions other than equality are very difficult to express as MapReduce jobs. Hive join syntax is a bit different from SQL.
Let's create two tables to understand joins in Hive.

CREATE TABLE EMPLOYEES (ID INT, NAME STRING);
CREATE TABLE PROJECTS (PROJECT_NAME STRING, EMPLOYEE_ID INT);
Equi Join
Also known as an inner join. Let's generate a report of the employees who are allocated to some project.

SELECT NAME
FROM EMPLOYEES
JOIN PROJECTS ON (EMPLOYEES.ID = PROJECTS.EMPLOYEE_ID);
Here only one table name appears in the FROM clause; the other appears in the JOIN ... ON clause.

Outer JOIN
Both LEFT and RIGHT outer joins are supported. The definition of outer joins remains the same as in SQL. The example below lists all employees along with their project information, showing NULL where no project record is available.

SELECT NAME
FROM EMPLOYEES
LEFT OUTER JOIN PROJECTS ON (EMPLOYEES.ID = PROJECTS.EMPLOYEE_ID);
Semi Joins
Hive doesn't support IN/EXISTS subqueries. LEFT SEMI JOIN helps in writing such queries. The restriction on using LEFT SEMI JOIN is that the right-hand-side table may only be referenced in the join condition (the ON clause), not in the WHERE or SELECT clauses.

SELECT *
FROM EMPLOYEES
WHERE ID IN (SELECT EMPLOYEE_ID FROM PROJECTS);
can be written as


SELECT EMPLOYEES.*
FROM EMPLOYEES
LEFT SEMI JOIN PROJECTS ON (EMPLOYEES.ID = PROJECTS.EMPLOYEE_ID);
Alter/Drop tables
Like SQL, tables can be altered or dropped in Hive.
Alter Table
ALTER TABLE helps in renaming a table or altering the structure of a table.

ALTER TABLE RESIDENTS RENAME TO INDIA_RESIDENTS;
ALTER TABLE INDIA_RESIDENTS ADD COLUMNS (FATHER_NAME STRING, MOTHER_NAME STRING);
Drop Table
DROP TABLE deletes both the data and the metadata of a managed table, but only the metadata in the case of an external table.

DROP TABLE RESIDENTS;

Tuesday, April 28, 2015

What is Machine Learning?

Machine learning represents the logical extension of simple data retrieval and storage. It is about developing building blocks that make computers learn and behave more intelligently. Machine learning makes it possible to mine historical data and make predictions about future trends. Search engine results, online recommendations, ad targeting, fraud detection, and spam filtering are all examples of what is possible with machine learning. Machine learning is about making data-driven decisions. While instinct might be important, it is difficult to beat empirical data.

What is the use of Machine Learning?

Machine Learning is found in things we use every day such as Internet search engines, email and online music and book recommendation systems. Credit card companies use machine learning to protect against fraud.
Using adaptive technology, computers recognize patterns and anticipate actions. Machine Learning is also used in more complex applications such as:
  • Self-parking cars
  • Guiding robots
  • Airplane navigation systems (manned and unmanned)
  • Space exploration
  • Medicine
What is Machine Learning best suited for?

Machine Learning is good at replacing labor-intensive decision-making systems that are predicated on hand-coded decision rules or manual analysis. Five types of analysis that Machine Learning is well suited for are:
  • classification (predicting the class/group membership of items)
  • regression (predicting real-valued attributes)
  • clustering (finding natural groupings in data)
  • multi-label classification (tagging items with labels)
  • recommendation engines (connecting users to items)

Sunday, April 12, 2015

Key Skills for a Successful Analytics Career

Companies worldwide are dealing with huge volumes of data, and as companies get more adept at data acquisition they are relying increasingly on analytics professionals to help them mine the data for business insights and to drive strategic growth. Qualified analytics professionals are in great demand, and can command high salaries for specialized skills.

There are some fundamental behaviors that are critical to those looking to build a successful analytics career, including:


  • A high sense of intellectual curiosity:
    People that tend to do well with analytics careers typically have a high sense of curiosity and inquisitiveness. They want to know the whys and hows of any situation, and that is very useful in a professional environment dealing with business challenges. There has to be an interest in understanding the business issue and working out the specifics of the solution, and especially the curiosity to challenge any assumptions.
  • Mathematically oriented:
    To do well in analytics, you need to be comfortable with mathematical concepts, and not be afraid to use mathematical tools. This is not the career for you if the word Mathematics strikes fear in your heart!
  • Big picture vision:
It is important to always remember the larger business issue that is being addressed while working with the data and dealing with minute details.
  • Detail oriented:
    While it is important to remember the big picture, it is critical to pay attention to the details. While working with large volumes of data it is very easy to lose sight of the specifics that add insight and understanding to solving business issues.
  • Ability to differentiate between tools and methods:
    This is a common issue – confusing a tool with a solution. SAS and Excel are “dumb” tools in the sense that the output produced is meaningless unless thought has been applied to the methodology and techniques applied to get at results. Analytics is not SAS; it is using SAS to arrive at results applying analytical thinking and methodology.
  • Interpretation skills:
    Ultimately, every hoop that an analyst jumps through is to enable solving a business problem. Numbers by themselves mean nothing. Experience and domain understanding give one the ability to interpret the results in the business context, assess usefulness of results, and allow the building of strategies based on the outcomes.

Certified Big Data Professional

Big Data success requires professionals who can prove their mastery with the tools and techniques of the Hadoop stack. However, experts predict a major shortage of advanced analytics skills over the next few years.
The Cloudera Certified Professional (CCP) program delivers the most rigorous and recognized Big Data credential.

Cloudera Certified Professional: Data Scientist (CCP:DS)

CCP: Data Scientists have demonstrated the skills of an elite group of specialists working with Big Data. Candidates must prove their abilities under real-world conditions, designing and developing a production-ready data science solution that is peer-evaluated for its accuracy, scalability, and robustness.


Cloudera Certified Developer for Apache Hadoop (CCDH)

Individuals who achieve CCDH have demonstrated their technical knowledge, skill, and ability to write, maintain, and optimize Apache Hadoop development projects.


Cloudera Certified Administrator for Apache Hadoop (CCAH)

Individuals who earn CCAH have demonstrated the core systems administrator skills sought by companies and organizations deploying Apache Hadoop.


Cloudera Certified Specialist in Apache HBase (CCSHB)

Individuals who pass CCSHB have demonstrated a comprehensive knowledge of the technology and skills required by companies using Apache HBase.

For more info, please see http://cloudera.com/content/cloudera/en/training/certification.html

What is Big Data?

Big data really does mean big: it is a collection of large data sets that cannot be processed using traditional computing techniques. Big data is not merely data; rather, it has become a complete subject that involves various tools, techniques, and frameworks.

Due to the advent of new technologies, devices, and communication means like social networking sites, the amount of data produced by mankind is growing rapidly every year. The amount of data produced by us from the beginning of time till 2003 was 5 billion gigabytes; if you piled it up in the form of disks, it might fill an entire football field. The same amount was created every two days in 2011, and every ten minutes in 2013. This rate is still growing enormously.

Big Data Technologies


Big data technologies are important in providing more accurate analysis, which may lead to more concrete decision-making resulting in greater operational efficiencies, cost reductions, and reduced risks for the business.

To harness the power of big data, you would require an infrastructure that can manage and process huge volumes of structured and unstructured data in realtime and can protect data privacy and security.

There are various technologies in the market from different vendors including Amazon, IBM, Microsoft, etc., to handle big data. While looking into the technologies that handle big data, we examine the following two classes of technology:

Operational Big Data
These include systems like MongoDB that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored.

NoSQL Big Data systems are designed to take advantage of new cloud computing architectures that have emerged over the past decade to allow massive computations to be run inexpensively and efficiently. This makes operational big data workloads much easier to manage, cheaper, and faster to implement.

Some NoSQL systems can provide insights into patterns and trends based on real-time data with minimal coding and without the need for data scientists and additional infrastructure.

Analytical Big Data

This includes systems like Massively Parallel Processing (MPP) database systems and MapReduce that provide analytical capabilities for retrospective and complex analysis that may touch most or all of the data.

MapReduce provides a new method of analyzing data that is complementary to the capabilities provided by SQL, and a system based on MapReduce can be scaled up from a single server to thousands of high-end and low-end machines.