My earliest introduction to Apache Hadoop, several years ago, came from using Hadoop with one of the popular NoSQL databases to build data acquisition pipelines for a semantic search engine for a potential customer. Originally, I had used classic ETL and database tools, but the resulting data acquisition, cleansing, and entity extraction pipeline took days to run over a workload of several million medium-to-large XML documents. Like many others adopting NoSQL, we could have scaled up our relational approach with expensive server and software purchases, but budget constraints meant we had to look at other alternatives.
I started to experiment with a mix of Hadoop components, open source ETL tools, XSLT (as the source data was archives of XML documents), and NoSQL technologies, along with custom Java components for entity detection and custom plugins for the ETL software. The resulting prototype performed the same corpus preparation in hours, even on a small proof-of-concept cluster.
For many approaching Hadoop solutions for the first time, the natural tendency is to view Hadoop from the perspective of just one facet of existing technologies such as databases, machine learning, cloud computing, distributed storage, or distributed computing. While this can be a useful learning aid, it can lead to misconceptions about the Hadoop components and ecosystem, or about the applicable use cases.
That is, database technologists might focus on the SQL-like components of the Hadoop infrastructure – query engines such as Apache Hive and Cloudera Impala – machine learning scientists might focus on frameworks such as Apache Mahout, and so on. Often, past experience with other systems within a specific domain leads to assumptions about architecture, performance, limitations, or complexity that may not hold true for Hadoop.
Also, different application domains and types of data often have their own tooling and terminology. For example, the data modeling and ETL-based processing of data warehousing differs in tooling and terminology from entity detection on a semantic data corpus, the processing of real-time data, or the techniques used with digital media – yet each of these has something to offer a designer of big data solutions.
For Apache HBase solutions, the challenge can be greater: unless the designer is familiar with other NoSQL technologies, there is less existing contextual knowledge to draw on. The challenge is not just in applying HBase to a problem domain, but also in applying appropriate tools and techniques at every stage of solution development, so that the solution uses HBase efficiently and takes advantage of complementary technologies.
The solution areas to consider include:
- Architectural decisions around choices of technologies to apply
- Data acquisition – for batch, incremental, and real-time or near-real-time data
- Data cleansing and validation
- Data modeling – schema design, choices of file format, column family layout, etc. (HBase is often described as “schema-less,” but we are using the looser interpretation of schema as including choice of column, table, and storage layout rather than the strict definition of schema in DDL.)
- Data access and retrieval patterns – for HBase, data retrieval heavily influences decisions around column family layout and choice of rowkeys
- Data enrichment – computing analytics, aggregations, and other enrichments of the data
- Designing for optimal use of the cluster – avoiding hot spots, etc.
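To make the hot-spotting point concrete, here is a minimal sketch of one common rowkey-design technique: salting. The class name, bucket count, and key format below are hypothetical illustrations (plain Java, no HBase dependency); the idea is to prefix a monotonically increasing key – such as one that begins with a timestamp – with a small, stable hash-derived salt, so sequential writes spread across regions instead of piling onto a single one.

```java
// Hypothetical sketch of rowkey salting for HBase.
// A monotonically increasing rowkey (e.g., timestamp-first) sends all new
// writes to the same region; a stable salt prefix distributes them.
public class SaltedRowKey {

    // Number of salt buckets; in practice this is often chosen to roughly
    // match the number of regions or region servers.
    static final int BUCKETS = 16;

    // Derive a deterministic bucket from the key's hash, so the same
    // logical key always maps to the same salted rowkey (reads can find it).
    public static String salt(String key) {
        int bucket = Math.floorMod(key.hashCode(), BUCKETS);
        return String.format("%02d-%s", bucket, key);
    }

    public static void main(String[] args) {
        // Sequential, timestamp-led keys now scatter across salt buckets
        // rather than landing in one hot region.
        System.out.println(salt("20140211-user42"));
        System.out.println(salt("20140211-user43"));
    }
}
```

The trade-off is that a simple rowkey range scan no longer returns logically contiguous data; scans must fan out across the salt buckets and merge results, which is why rowkey design should be driven by the dominant retrieval patterns noted above.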
One of the key advantages of the Hadoop ecosystem is the potential to treat the same data in structured, unstructured, and raw file based forms, and to apply a variety of techniques from different data and application domains to building solutions.
In the HBase tutorial at Strata Santa Clara 2014, attendees will learn how to build HBase solutions using a variety of tools, technologies, and concepts from the Hadoop ecosystem and other areas to perform data acquisition, data modeling, retrieval / query, and executing analytics and processing over the data.
Ronan Stokes is a Solutions Architect at Cloudera, where he architects, designs, and builds Hadoop-based big data solutions for Cloudera’s customers. Prior to Cloudera, Ronan worked in engineering and consulting roles with startups in unstructured data and semantic search, Informatica, Microsoft, and C++ pioneers Glockenspiel.