ENTRIES TAGGED "Hadoop"
Data stores are rolling out easy-to-use analysis tools
Originated by the NSA, Apache Accumulo is a BigTable inspired data store known for being highly scalable and for its interesting security model. Federal agencies and Defense contractors have deployed Accumulo on clusters of a thousand or more servers. It also uses “cell-level” security to control access to values stored in individual cells1.
What Accumulo was lacking were easy-to-use, standard analytic engines that allow users to interact with data. The release of Sqrrl Enterprise this past week fills that gap. Sqrrl Enterprise provides an initial set of analytic engines for the Accumulo ecosystem2. It includes support for interactive SQL, fulltext search, and queries over graph data. Each of these engines takes into account security labels placed on data: since every data object ingested into Sqrrl has a security label, (query & analytic) results incorporate those access levels. Analysts interact with data as they normally would. For example Sqrrl’s indexing technology accounts for security labels, and search queries are written in standard Lucene syntax. Reminiscent of the Phoenix project for HBase3, SQL queries4 in Sqrrl are converted into optimized Accumulo iterators.
Analytic engines on top of Hadoop simplify the creation of interesting, low-cost, scalable applications
Hadoop’s low-cost, scale-out architecture has made it a new platform for data storage. With a storage system in place, the Hadoop community is slowly building a collection of open source, analytic engines. Beginning with batch processing (MapReduce, Pig, Hive), Cloudera has added interactive SQL (Impala), analytics (Cloudera ML + a partnership with SAS), and as of early this week, real-time search. The economics that led to Hadoop dominating batch processing is permeating other types of analytics.
Another collection of open source, Hadoop-compatible analytic engines, the Berkeley Data Analytics Stack (BDAS), is being built just across the San Francisco Bay. Starting with a batch-processing framework that’s faster than MapReduce (Spark), it now includes interactive SQL (Shark), and real-time analytics (Spark Streaming). Sometime this summer, frameworks for machine-learning (MLbase) and graph analytics (GraphX) will be released. A cluster manager (Mesos) and an in-memory file system (Tachyon) allow users of other analytic frameworks to leverage the BDAS platform. (The Python data community is looking at Tachyon closely.)
A new, open source benchmark can be used to track performance improvements over time
As organizations continue to accumulate data, there has been renewed interest in interactive query engines that scale to terabytes (even petabytes) of data. Traditional MPP databases remain in the mix, but other options are attracting interest. For example, companies willing to upload data into the cloud are beginning to explore Amazon Redshift1, Google BigQuery, and Qubole.
A variety of analytic engines2 built for Hadoop are allowing companies to bring its low-cost, scale-out architecture to a wider audience. In particular, companies are rediscovering that SQL makes data accessible to lots of users, and many prefer3 not having to move data to a separate (MPP) cluster. There are many new tools that seek to provide an interactive SQL interface to Hadoop, including Cloudera’s Impala, Shark, Hadapt, CitusDB, Pivotal-HD, PolyBase4, and SQL-H.
An open source benchmark from UC Berkeley’s Amplab
A benchmark for tracking the progress5 of scalable query engines has just been released. It’s a worthy first effort, and its creators hope to grow the list of tools to include other open source (Drill, Stinger) and commercial6 systems. As these query engines mature and features get added, data from this benchmark can provide a quick synopsis of performance improvements over time.
The initial release includes Redshift, Hive, Impala, and Shark (Hive, Impala, Shark were configured to run on AWS). Hive 0.10 and the most recent versions7 of Impala and Shark were used (Hive 0.11 was released in mid-May and has not yet been included). Data came from Intel’s Hadoop Benchmark Suite and CommonCrawl. In the case of Hive/Impala/Shark, data was stored in compressed SequenceFile format using CDH 4.2.0.
At least for the queries included in the benchmark, Redshift is about 2-3x faster than Shark/on-disk, and 0.3-2x faster than Shark/in-memory. Given that it’s built on top of a general purpose engine (Spark), it’s encouraging that Shark’s performance is within range of MPP8 databases (such as Redshift) that are highly optimized for interactive SQL queries. With new frameworks like Shark and Impala providing speedups comparable to those observed in MPP databases, organizations now have the option of using a single system (Hadoop/Spark) instead of two (Hadoop/Spark + MPP database).
Let’s look at some of the results in detail:
Researchers begin to scale up pattern recognition, machine-learning, and data management tools.
My first job after leaving academia was as a quant1 for a hedge fund, where I performed (what are now referred to as) data science tasks on financial time-series. I primarily used techniques from probability & statistics, econometrics, and optimization, with occasional forays into machine-learning (clustering, classification, anomalies). More recently, I’ve been closely following the emergence of tools that target large time series and decided to highlight a few interesting bits.
Time-series and big data:
Over the last six months I’ve been encountering more data scientists (outside of finance) who work with massive amounts of time-series data. The rise of unstructured data has been widely reported, the growing importance of time-series much less so. Sources include data from consumer devices (gesture recognition & user interface design), sensors (apps for “self-tracking”), machines (systems in data centers), and health care. In fact some research hospitals have troves of EEG and ECG readings that translate to time-series data collections with billions (even trillions) of points.
It helps to reduce context-switching during long data science workflows.
An integrated data stack boosts productivity
As I noted in my previous post, Python programmers willing to go “all in”, have Python tools to cover most of data science. Lest I be accused of oversimplification, a Python programmer still needs to commit to learning a non-trivial set of tools1. I suspect that once they invest the time to learn the Python data stack, they tend to stick with it unless they absolutely have to use something else. But being able to stick with the same programming language and environment is a definite productivity boost. It requires less “setup time” in order to explore data using different techniques (viz, stats, ML).
Multiple tools and languages can impede reproducibility and flow
On the other end of the spectrum are data scientists who mix and match tools, and use packages and frameworks from several languages. Depending on the task, data scientists can avail of tools that are scalable, performant, require less2 code, and contain a lot of features. On the other hand this approach requires a lot more context-switching, and extra effort is needed to annotate long workflows. Failure to document things properly makes it tough to reproduce3 analysis projects, and impedes knowledge transfer4 within a team of data scientists. Frequent context-switching also makes it more difficult to be in a state of flow, as one has to think about implementation/package details instead of exploring data. It can be harder to discover interesting stories with your data, if you’re constantly having to think about what you’re doing. (It’s still possible, you just have to concentrate a bit harder.)
Tools slowly democratize many data science tasks
Here are a few observations based on conversations I had during the just concluded Strata Santa Clara conference.
Spark is attracting attention
I’ve written numerous times about components of the Berkeley Data Analytics Stack (Spark, Shark, MLbase). Two Spark-related sessions at Strata were packed (slides here and here) and I talked to many people who were itching to try the BDAS stack. Being able to combine batch, real-time, and interactive analytics in a framework that uses a simple programming model is very attractive. The release of version 0.7 adds a Python API to Spark’s native Scala interface and Java API.
Hortonworks' Data Platform for Windows, Intel's Hadoop distribution, invasive smartphone surveillance, and data-driven "House of Cards."
Windows gets Hadoop, Intel launches Hadoop distribution
Hortonworks released a beta version of its Hortonworks Data Platform for Windows this week. In the press release, the company highlights the mission is to “expand the reach of Apache Hadoop across the enterprise” and notes that the “100% open source Hortonworks Data Platform is the industry’s first and only Apache Hadoop distribution for both Windows and Linux.”
Barb Darrow notes at GigaOm that there’s likely no better way to bring big data to the masses than via Microsoft Excel. Darrow reports that Hortonworks’ VP of corporate strategy Shawn Connolly told her that “[t]he combination should make it easier to integrate data from SQL Server and Hadoop and to funnel all that into Excel for charting and pivoting and all the tasks Excel is good at,” stressing that the same Apache Hadoop distribution will run on both Windows and Linux. Connolly also noted to Darrow that “an analogous Hortonworks Data Platform for Windows Azure is still in the works.”
Describe and run bleeding edge algorithms on massive data sets
In the course of applying machine-learning against large data sets, data scientists face a few pain points. They need to tune and compare several suitable algorithms – a process that may involve having to configure a hodgepodge of tools, requiring different input files, programming languages, and interfaces. Some software tools may not scale to big data, so they first sample and test ideas on smaller subsets, before tackling the problem of having to implement a distributed version of the final algorithm.
To increase productivity, ideally data scientists should be able to quickly test ideas without doing much coding, context switching, tuning and configuration. A research project0 out of UC Berkeley’s Amplab and Brown seems to do just that: MLbase aims to make cutting edge, scalable machine-learning algorithms available to non-experts. MLbase will have four pieces: a declarative language (MQL – discussed below), a library of distributed algorithms (ML-Library), an optimizer and a runtime (ML-Optimizer and ML-Runtime). Read more…
We're launching an investigation into in-memory data technologies.
In a forthcoming report we will highlight technologies and solutions that take advantage of the decline in prices of RAM, the popularity of distributed and cloud computing systems, and the need for faster queries on large, distributed data stores. Established technology companies have had interesting offerings, but what initially caught our attention were open source projects that started gaining traction last year.
An example we frequently hear about is the demand for tools that support interactive query performance. Faster query response times translate to more engaged and productive analysts, and real-time reports. Over the past two years several in-memory solutions emerged to deliver 5X-100X faster response times. A recent paper from Microsoft Research noted that even in this era of big data and Hadoop, many MapReduce jobs fit in the memory of a single server. To scale to extremely large datasets several new systems use a combination of distributed computing (in-memory grids), compression, and (columnar) storage technologies.
Another interesting aspect of in-memory technologies is that they seem to be everywhere these days. We’re looking at tools aimed at analysts (Tableau, Qlikview, Tibco Spotfire, Platfora), databases that target specific workloads or data types (VoltDB, SAP HANA, Hekaton, Redis, Druid, Kognitio, and Yarcdata), frameworks for analytics (Spark/Shark, GraphLab, GridGain, Asterix/Hyracks), and the data center (RAMCloud, memory Iocality).
We’ll be talking to companies and hackers to get a sense of how in-memory solutions fit into their planning. Along these lines, we would love to hear what you think about the rise of these technologies, as well as applications, companies and projects we should look at. Feel free to reach out to us on Twitter (Ben is @bigdata and Roger is @rogerm) or leave a comment on this post. Read more…
Diversity and manageability are big data watchwords for the next 12 months.
Here are some of the key big data themes I expect to dominate 2013, and of course will be covering in Strata.
Emergence of a big data architecture
The coming year will mark the graduation for many big data pilot projects, as they are put into production. With that comes an understanding of the practical architectures that work. These architectures will identify:
- best of breed tools for different purposes, for instance, Storm for streaming data acquisition
- appropriate roles for relational databases, Hadoop, NoSQL stores and in-memory databases
- how to combine existing data warehouses and analytical databases with Hadoop
Of course, these architectures will be in constant evolution as big data tooling matures and experience is gained.
In parallel, I expect to see increasing understanding of where big data responsibility sits within a company’s org chart. Big data is fundamentally a business problem, and some of the biggest challenges in taking advantage of it lie in the changes required to cross organizational silos and reform decision making.
One to watch: it’s hard to move data, so look for a starring architectural role for HDFS for the foreseeable future. Read more…