DataScience Deep Dive: January 2015

I want to start this blog with a quick overview of the current state of the Big Data playground. There has been a lot of noise around it over the past year, coming from every direction, which makes it nearly impossible for a newcomer to learn this world without being overwhelmed. So let's start.

How does Big Data differ from NoSQL?

NoSQL is a type of database that provides a mechanism for storing and retrieving data modelled in ways other than the tabular relations used in relational databases. The most prominent representatives are Cassandra, MongoDB, Neo4j and Redis, each taking a different approach to data representation. It is important to note that NoSQL has nothing to do with the amount of stored data, only with its representation. Big Data, by contrast, commonly refers to the technologies used to store and operate on huge amounts of data, and usually to the Apache Hadoop ecosystem. Hadoop keeps its data in a distributed file system rather than in relational tables, so in that loose sense it is also a NoSQL store. The difference is that Redis can be used to store any amount of data, whereas Hadoop was specifically designed for very large amounts.
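To make "a different approach to data representation" a bit more concrete, here is a minimal sketch of the same tiny dataset modelled four ways. It uses plain Python structures only, not real database clients, and all names (`users_table`, `document_store` and so on) are made up for illustration.

```python
# Toy illustration only: plain Python structures standing in for four data
# models. Nothing here talks to a real database.

# Relational: fixed columns, relationships expressed through a join table.
users_table = [("u1", "Alice", "Berlin"), ("u2", "Bob", "Paris")]
follows_table = [("u1", "u2")]  # (follower_id, followee_id)

# Key-value (Redis-like): a flat mapping from a key to a value.
key_value_store = {"user:u1:name": "Alice", "user:u2:name": "Bob"}

# Document (MongoDB-like): nested, schema-free records.
document_store = [
    {"_id": "u1", "name": "Alice", "address": {"city": "Berlin"}, "follows": ["u2"]},
    {"_id": "u2", "name": "Bob", "address": {"city": "Paris"}, "follows": []},
]

# Graph (Neo4j-like): nodes plus explicit, labelled edges.
graph_nodes = {"u1": {"name": "Alice"}, "u2": {"name": "Bob"}}
graph_edges = [("u1", "FOLLOWS", "u2")]


def followees_relational(user_id):
    """Answer 'whom does this user follow?' with a join over two tables."""
    names = {uid: name for uid, name, _city in users_table}
    return [names[dst] for src, dst in follows_table if src == user_id]


def followees_document(user_id):
    """Same question: follow the id list embedded in the user's document."""
    by_id = {doc["_id"]: doc for doc in document_store}
    return [by_id[fid]["name"] for fid in by_id[user_id]["follows"]]


def followees_graph(user_id):
    """Same question: walk the outgoing FOLLOWS edges of the node."""
    return [
        graph_nodes[dst]["name"]
        for src, label, dst in graph_edges
        if src == user_id and label == "FOLLOWS"
    ]


print(followees_relational("u1"), followees_document("u1"), followees_graph("u1"))
```

The data is the same in every case; what changes is the shape it is stored in and, as a result, how a question is asked of it.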

Who are Cloudera, Hortonworks and MapR?

Since Hadoop is open source software, several companies have sprung up over the years to provide support and additional useful tools for the platform. These companies include Cloudera, Hortonworks and MapR, each of which ships its own distribution of Hadoop.

Cloudera has been around the longest, since the early days of Hadoop. Hortonworks came later. While the Cloudera and Hortonworks distributions are built on open source Hadoop components, most versions of MapR come with proprietary modules, such as the proprietary MapR-FS file system used instead of HDFS.

Each distribution has its own strengths and weaknesses, and their feature sets overlap to a degree. If you want to make the most of Hadoop's immense data processing power, you should compare them carefully. To help you start, have a look at the feature-by-feature comparison table to the right.

For a more detailed comparison, you can request a free 65-page comparison booklet from Altoros.

Hadoop Ecosystem

Despite being an ingenious piece of software, Hadoop is very difficult to work with and makes its developers' lives a living hell. Even the simplest aggregate operation requires you to implement a Hadoop job using the MapReduce paradigm.
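For readers who have never seen the MapReduce paradigm, here is a minimal single-process sketch of its three phases, using a word count as the aggregate. This is plain Python, not the Hadoop API, and the function names are my own. A real Hadoop job spreads the same phases across many machines and adds a lot of boilerplate on top.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run the three phases in order on an in-memory list of lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


lines = ["big data is big", "data is everywhere"]
word_counts = run_job(lines)
print(word_counts)
```

Even in this toy form, a simple count takes three functions and a driver, which gives a feel for why people wanted something higher level.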

To make life simpler, the good people at Facebook created and open-sourced the Hive project, which translates SQL into a series of MapReduce jobs. It tries to look like MySQL, keeping table schemas in its own metadata database (the metastore).
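To give a rough idea of what "SQL to MapReduce" means, the sketch below runs one GROUP BY query two ways: once as SQL, and once as a hand-written map, shuffle and reduce. SQLite is only a stand-in here so the example runs anywhere; it is not Hive, and the `sales` table and its rows are invented. Hive's real query planner is far more involved than this.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (category, amount) rows of a "sales" table.
rows = [("books", 12.0), ("games", 30.0), ("books", 8.0), ("music", 5.0), ("games", 10.0)]

QUERY = "SELECT category, SUM(amount) FROM sales GROUP BY category"


def run_sql(rows):
    """Run the query as SQL (in-memory SQLite standing in for a SQL engine)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (category TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    result = dict(conn.execute(QUERY).fetchall())
    conn.close()
    return result


def run_as_mapreduce(rows):
    """Run the same query as explicit map, shuffle and reduce steps."""
    # Map: the GROUP BY column becomes the key, the summed column the value.
    mapped = ((category, amount) for category, amount in rows)
    # Shuffle: bring all values with the same key together.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce: SUM(amount) is applied to each group.
    return {key: sum(values) for key, values in groups.items()}


sql_result = run_sql(rows)
mr_result = run_as_mapreduce(rows)
print(sql_result == mr_result)
```

The point is only the correspondence: the grouping column plays the role of the MapReduce key, and the aggregate function plays the role of the reducer.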

The problem with Hive is that it was never designed for real-time, in-memory processing. It was built for offline batch processing and is best suited to long-running, data-heavy jobs such as joins on very large datasets. For those who want interactivity, different vendors developed other tools, each serving more or less the same purpose. These include Impala from Cloudera, Presto from Facebook and Apache Drill, heavily pushed by MapR. Apache Drill has similar goals to Impala and Presto: fast interactive queries over large datasets. Like them, it requires worker nodes to be installed. Drill, however, puts particular emphasis on supporting multiple backing stores (HDFS, HBase, MongoDB).

YARN and Spark

Appetite comes with eating. After seeing what Hadoop could do, people wanted to do even more, and much faster. It turned out that MapReduce wasn't the best architecture for real-time processing, so in 2012 a sub-project called YARN was started, promising a solution. Sometimes called MapReduce 2.0, YARN is a rewrite that decouples resource management and scheduling from the data processing component, enabling Hadoop to support more varied processing approaches and a broader array of applications. The most prominent of these are Spark and Storm.

Apache Spark is an in-memory distributed data analysis platform, primarily targeted at speeding up batch analysis jobs, iterative machine learning jobs, interactive queries and graph processing. One of Spark's primary distinctions is its use of RDDs, or Resilient Distributed Datasets. RDDs lend themselves to pipelining parallel operators, which, according to the project, lets Spark run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
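The idea of pipelining operators can be sketched without Spark at all. The toy class below is not the Spark API; `ToyDataset` and its methods are hypothetical names of my own. It records transformations lazily and only runs them when a result is requested, pushing each record through every step in one pass with no intermediate collection in between.

```python
class ToyDataset:
    """Hypothetical single-machine stand-in for an RDD (not the Spark API)."""

    def __init__(self, source, lineage=()):
        self._source = source    # the original, re-iterable input data
        self._lineage = lineage  # ordered (kind, function) steps recorded so far

    def map(self, fn):
        """Transformation: record the step, compute nothing yet."""
        return ToyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        """Transformation: record the step, compute nothing yet."""
        return ToyDataset(self._source, self._lineage + (("filter", fn),))

    def _iterate(self):
        # Pipelining: each record passes through every recorded step before
        # the next record is read, so no intermediate list is built.
        for record in self._source:
            keep = True
            for kind, fn in self._lineage:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                yield record

    def collect(self):
        """Action: only now is the recorded lineage actually executed."""
        return list(self._iterate())


numbers = ToyDataset(range(1, 11))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
result = even_squares.collect()
print(result)
```

Because the dataset only remembers its source and the steps applied to it, the result can always be recomputed from scratch, which is a very loose analogue of the "resilient" part of the name.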

Apache Storm focuses on stream processing, or what some call complex event processing. Storm implements a fault-tolerant method for performing a computation, or pipelining multiple computations, on an event as it flows into a system.
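As a rough illustration of pipelining computations on events as they arrive, here is a small generator-based sketch. It is not the Storm API, it runs in a single process, and it does not attempt to show fault tolerance; the function names and the event format are assumptions made for the example.

```python
from collections import Counter


def event_source(events):
    """Stands in for a stream source: hands over events one at a time."""
    for event in events:
        yield event


def parse_step(stream):
    """First computation: turn a raw 'user:action' string into a record."""
    for raw in stream:
        user, action = raw.split(":")
        yield {"user": user, "action": action}


def count_step(stream, counts):
    """Second computation: update a running count and emit it immediately."""
    for event in stream:
        counts[event["action"]] += 1
        yield event["action"], counts[event["action"]]


raw_events = ["alice:click", "bob:view", "alice:click", "carol:view", "bob:click"]
counts = Counter()

# Each event flows through both steps before the next one is read.
emitted = list(count_step(parse_step(event_source(raw_events)), counts))
print(emitted)
```

Unlike a batch job, the running counts are available after every single event instead of only once all input has been read.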

Machine Learning with MLlib and Mahout

As data scientists, we are interested in the insights in the data rather than in the way it is stored. However, since large amounts of data were stored in Hadoop, people needed a way to access it and, more importantly, to learn from it. The answer was not long in coming: Apache Mahout for Hadoop MapReduce, and MLlib for the Spark environment.

I hope you found this article useful. You are welcome to comment and share below.
