ENTRIES TAGGED "strata sc 2014"
My earliest introduction to Apache Hadoop, several years ago, was using Hadoop with one of the popular NoSQL databases to build data acquisition pipelines for a semantic search engine for a potential customer. Originally, I used classic ETL and database tools, but the resulting data acquisition, cleansing, and entity extraction pipeline took days to run over a workload of several million medium to large XML documents. We could have scaled up our relational approach with expensive server and software purchases, but, as for many others who adopted NoSQL, budget constraints meant that we had to look at alternatives.
I started experimenting with a mix of Hadoop components, open source ETL tools, XSLT (as the source data was archives of XML documents), and NoSQL technologies, along with custom Java components to perform entity detection and custom plugins for the particular ETL software. The resulting prototype performed the same corpus preparation in hours, even on a small proof-of-concept cluster.
For many approaching Hadoop solutions for the first time, the natural tendency is to view Hadoop from the perspective of just one facet of existing technologies, such as databases, machine learning, cloud computing, distributed storage, or distributed computing. While this can be very useful as a learning tool, it can lead to misconceptions about the Hadoop components and ecosystem, or about the use cases Hadoop suits.