Apache Hadoop has been the driving force behind the growth of the big data industry. You’ll hear it mentioned often, along with associated technologies such as Hive and Pig. But what does it do, and why do you need all its strangely-named friends, such as Oozie, Zookeeper and Flume?
Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure. By large, we mean from 10-100 gigabytes and above. How is this different from what went before?
Existing enterprise data warehouses and relational databases excel at processing structured data and can store massive amounts of data, though at a cost: This requirement for structure restricts the kinds of data that can be processed, and it imposes an inertia that makes data warehouses unsuited for agile exploration of massive heterogenous data. The amount of effort required to warehouse data often means that valuable data sources in organizations are never mined. This is where Hadoop can make a big difference.
This article examines the components of the Hadoop ecosystem and explains the functions of each.
The core of Hadoop: MapReduce
Created at Google in response to the problem of creating web search indexes, the MapReduce framework is the powerhouse behind most of today’s big data processing. In addition to Hadoop, you’ll find MapReduce inside MPP and NoSQL databases, such as Vertica or MongoDB.
The important innovation of MapReduce is the ability to take a query over a dataset, divide it, and run it in parallel over multiple nodes. Distributing the computation solves the issue of data too large to fit onto a single machine. Combine this technique with commodity Linux servers and you have a cost-effective alternative to massive computing arrays.
At its core, Hadoop is an open source MapReduce implementation. Funded by Yahoo, it emerged in 2006 and, according to its creator Doug Cutting, reached “web scale” capability in early 2008.
As the Hadoop project matured, it acquired further components to enhance its usability and functionality. The name “Hadoop” has come to represent this entire ecosystem. There are parallels with the emergence of Linux: The name refers strictly to the Linux kernel, but it has gained acceptance as referring to a complete operating system.
Hadoop’s lower levels: HDFS and MapReduce
Above, we discussed the ability of MapReduce to distribute computation over multiple servers. For that computation to take place, each server must have access to the data. This is the role of HDFS, the Hadoop Distributed File System.
HDFS and MapReduce are robust. Servers in a Hadoop cluster can fail and not abort the computation process. HDFS ensures data is replicated with redundancy across the cluster. On completion of a calculation, a node will write its results back into HDFS.
There are no restrictions on the data that HDFS stores. Data may be unstructured and schemaless. By contrast, relational databases require that data be structured and schemas be defined before storing the data. With HDFS, making sense of the data is the responsibility of the developer’s code.
Programming Hadoop at the MapReduce level is a case of working with the Java APIs, and manually loading data files into HDFS.
Improving programmability: Pig and Hive
Working directly with Java APIs can be tedious and error prone. It also restricts usage of Hadoop to Java programmers. Hadoop offers two solutions for making Hadoop programming easier.
- Pig is a programming language that simplifies the common tasks of working with Hadoop: loading data, expressing transformations on the data, and storing the final results. Pig’s built-in operations can make sense of semi-structured data, such as log files, and the language is extensible using Java to add support for custom data types and transformations.
- Hive enables Hadoop to operate as a data warehouse. It superimposes structure on data in HDFS and then permits queries over the data using a familiar SQL-like syntax. As with Pig, Hive’s core capabilities are extensible.
Choosing between Hive and Pig can be confusing. Hive is more suitable for data warehousing tasks, with predominantly static structure and the need for frequent analysis. Hive’s closeness to SQL makes it an ideal point of integration between Hadoop and other business intelligence tools.
Pig gives the developer more agility for the exploration of large datasets, allowing the development of succinct scripts for transforming data flows for incorporation into larger applications. Pig is a thinner layer over Hadoop than Hive, and its main advantage is to drastically cut the amount of code needed compared to direct use of Hadoop’s Java APIs. As such, Pig’s intended audience remains primarily the software developer.
Improving data access: HBase, Sqoop and Flume
At its heart, Hadoop is a batch-oriented system. Data are loaded into HDFS, processed, and then retrieved. This is somewhat of a computing throwback, and often, interactive and random access to data is required.
Enter HBase, a column-oriented database that runs on top of HDFS. Modeled after Google’s BigTable, the project’s goal is to host billions of rows of data for rapid access. MapReduce can use HBase as both a source and a destination for its computations, and Hive and Pig can be used in combination with HBase.
In order to grant random access to the data, HBase does impose a few restrictions: Hive performance with HBase is 4-5 times slower than with plain HDFS, and the maximum amount of data you can store in HBase is approximately a petabyte, versus HDFS’ limit of over 30PB.
HBase is ill-suited to ad-hoc analytics and more appropriate for integrating big data as part of a larger application. Use cases include logging, counting and storing time-series data.
The Hadoop Bestiary
|Ambari||Deployment, configuration and monitoring|
|Flume||Collection and import of log and event data|
|HBase||Column-oriented database scaling to billions of rows|
|HCatalog||Schema and data type sharing over Pig, Hive and MapReduce|
|HDFS||Distributed redundant file system for Hadoop|
|Hive||Data warehouse with SQL-like access|
|Mahout||Library of machine learning and data mining algorithms|
|MapReduce||Parallel computation on server clusters|
|Pig||High-level programming language for Hadoop computations|
|Oozie||Orchestration and workflow management|
|Sqoop||Imports data from relational databases|
|Whirr||Cloud-agnostic deployment of clusters|
|Zookeeper||Configuration management and coordination|
Getting data in and out
Improved interoperability with the rest of the data world is provided by Sqoop and Flume. Sqoop is a tool designed to import data from relational databases into Hadoop, either directly into HDFS or into Hive. Flume is designed to import streaming flows of log data directly into HDFS.
Hive’s SQL friendliness means that it can be used as a point of integration with the vast universe of database tools capable of making connections via JBDC or ODBC database drivers.
Coordination and workflow: Zookeeper and Oozie
With a growing family of services running as part of a Hadoop cluster, there’s a need for coordination and naming services. As computing nodes can come and go, members of the cluster need to synchronize with each other, know where to access services, and know how they should be configured. This is the purpose of Zookeeper.
Production systems utilizing Hadoop can often contain complex pipelines of transformations, each with dependencies on each other. For example, the arrival of a new batch of data will trigger an import, which must then trigger recalculations in dependent datasets. The Oozie component provides features to manage the workflow and dependencies, removing the need for developers to code custom solutions.
Management and deployment: Ambari and Whirr
One of the commonly added features incorporated into Hadoop by distributors such as IBM and Microsoft is monitoring and administration. Though in an early stage, Ambari aims to add these features to the core Hadoop project. Ambari is intended to help system administrators deploy and configure Hadoop, upgrade clusters, and monitor services. Through an API, it may be integrated with other system management tools.
Though not strictly part of Hadoop, Whirr is a highly complementary component. It offers a way of running services, including Hadoop, on cloud platforms. Whirr is cloud neutral and currently supports the Amazon EC2 and Rackspace services.
Machine learning: Mahout
Every organization’s data are diverse and particular to their needs. However, there is much less diversity in the kinds of analyses performed on that data. The Mahout project is a library of Hadoop implementations of common analytical computations. Use cases include user collaborative filtering, user recommendations, clustering and classification.
Normally, you will use Hadoop in the form of a distribution. Much as with Linux before it, vendors integrate and test the components of the Apache Hadoop ecosystem and add in tools and administrative features of their own.
Though not per se a distribution, a managed cloud installation of Hadoop’s MapReduce is also available through Amazon’s Elastic MapReduce service.