Describe and run bleeding edge algorithms on massive data sets
In the course of applying machine-learning against large data sets, data scientists face a few pain points. They need to tune and compare several suitable algorithms – a process that may involve having to configure a hodgepodge of tools, requiring different input files, programming languages, and interfaces. Some software tools may not scale to big data, so they first sample and test ideas on smaller subsets, before tackling the problem of having to implement a distributed version of the final algorithm.
To increase productivity, ideally data scientists should be able to quickly test ideas without doing much coding, context switching, tuning and configuration. A research project0 out of UC Berkeley’s Amplab and Brown seems to do just that: MLbase aims to make cutting edge, scalable machine-learning algorithms available to non-experts. MLbase will have four pieces: a declarative language (MQL – discussed below), a library of distributed algorithms (ML-Library), an optimizer and a runtime (ML-Optimizer and ML-Runtime). Read more…
We're launching an investigation into in-memory data technologies.
In a forthcoming report we will highlight technologies and solutions that take advantage of the decline in prices of RAM, the popularity of distributed and cloud computing systems, and the need for faster queries on large, distributed data stores. Established technology companies have had interesting offerings, but what initially caught our attention were open source projects that started gaining traction last year.
An example we frequently hear about is the demand for tools that support interactive query performance. Faster query response times translate to more engaged and productive analysts, and real-time reports. Over the past two years several in-memory solutions emerged to deliver 5X-100X faster response times. A recent paper from Microsoft Research noted that even in this era of big data and Hadoop, many MapReduce jobs fit in the memory of a single server. To scale to extremely large datasets several new systems use a combination of distributed computing (in-memory grids), compression, and (columnar) storage technologies.
Another interesting aspect of in-memory technologies is that they seem to be everywhere these days. We’re looking at tools aimed at analysts (Tableau, Qlikview, Tibco Spotfire, Platfora), databases that target specific workloads or data types (VoltDB, SAP HANA, Hekaton, Redis, Druid, Kognitio, and Yarcdata), frameworks for analytics (Spark/Shark, GraphLab, GridGain, Asterix/Hyracks), and the data center (RAMCloud, memory Iocality).
We’ll be talking to companies and hackers to get a sense of how in-memory solutions fit into their planning. Along these lines, we would love to hear what you think about the rise of these technologies, as well as applications, companies and projects we should look at. Feel free to reach out to us on Twitter (Ben is @bigdata and Roger is @rogerm) or leave a comment on this post. Read more…
Shark is 100X faster than Hive for SQL, and 100X faster than Hadoop for machine-learning
Hadoop’s strength is in batch processing, MapReduce isn’t particularly suited for interactive/adhoc queries. Real-time1 SQL queries (on Hadoop data) are usually performed using custom connectors to MPP databases. In practice this means having connectors between separate Hadoop and database clusters. Over the last few months a number of systems that provide fast SQL access within Hadoop clusters have garnered attention. Connectors between Hadoop and fast MPP database clusters are not going away, but there is growing interest in moving many interactive SQL tasks into systems that coexist on the same cluster with Hadoop.
Having a Hadoop cluster support fast/interactive SQL queries dates back a few years to HadoopDB, an open source project out of Yale. The creators of HadoopDB have since started a commercial software company (Hadapt) to build a system that unites Hadoop/MapReduce and SQL. In Hadapt, a (Postgres) database is placed in nodes of a Hadoop cluster, resulting in a system2 that can use MapReduce, SQL, and search (Solr). Now on version 2.0, Hadapt is a fault-tolerant system that comes with analytic functions (HDK) that one can use via SQL. Read more…
The development team continues to focus on features that will grow the number of Spark users
In an earlier post I listed a few reasons why I’ve come to embrace and use Spark. In particular I described why Spark is well-suited for many distributed Big Data Analytics tasks such as iterative computations and interactive queries, where it outperforms Hadoop. With version 0.6, Spark becomes even0 faster and easier to use. The release notes contain all the detailed changes, but as you’ll see from the highlights1 below, version 0.6 is a substantial release. Another good sign is the growth in number of contributors, with now over a third of the developers coming from outside the core team in Berkeley.
New Deployment Modes
In addition to running on top of Mesos, Spark can now be deployed in standalone mode: users only need to install Spark and a JVM on each node, a simple cluster manager2 that comes with Spark 0.6, handles the rest. Deployment becomes much simpler and allows organizations who aren’t familiar with Mesos (and C++) to run Spark on their clusters. This release also provides an experimental mode for running Spark on Apache YARN.
New Java API
While I use Spark exclusively through Scala, many Java developers have been asking for a Java API. With version 0.6 the wait is over: all of Spark’s features can be accessed from Java through an API. A Python API is coming very soon.
RDD persistence options
RDD’s are distributed objects that can be cached in-memory, across a cluster of compute nodes. They are the fundamental data objects used in Spark. In version 0.6 caching policy can be controlled at the individual RDD level. Users can decide whether to keep an RDD in-memory or on disk, serialize it, or replicate3 it across nodes. Read more…
Quickly perform and interpret the results of routine Small Data analysis
With so much focus on Big Data, the needs of many analysts who work with Small Data tend to get ignored. The default tool for many of these users remains spreadsheets1 and/or statistical packages which come with a lot of features and options. However many analysts need a very small subset of what these tools have to offer.
Enter Statwing, a software-as-a-service provider for routine statistical analysis. While the tool is still in the early stages, it can already do many basic “data analysis” tasks.
Consider the following example of a pivot table constructed in Excel: this required 8 mouse-clicks, if you do everything perfectly, and about 5 decisions (what variables to include, what metric to use, …)
The same task in Statwing required 4 mouse-clicks and 0 decisions! Plus it comes with visuals:
The lack of clutter and the addition of a simple “headline” (“Female tends to have much higher values for satisfaction than Male“), makes the result much easier to interpret. The advanced tab contains detailed statistical analysis (in this case the p-value, counts, values). Many users get confused by the output/results produced by traditional statistical software. Let’s face it, many analysts have had little training in statistics. I welcome a tool that produces readily interpretable results.
The company hopes to replicate the above example across a wide variety of routine data analysis tasks. Their initial focus is on tools for (consumer) survey analysis, a potentially huge market given that online companies have made surveys so much easier to conduct. Users of Statwing pay a small monthly subscription, making it cheaper than most2 statistical packages. For a small monthly fee, their intuitive UI lets analysts get their tasks done quickly. More importantly Statwing may nurture aspiring data scientists in your organization.
(1) As this recent Strata presentation points out: Spreadsheets are the glue that keeps many organizations together.
(2) Open source tools like OpenOffice, R and Octave are free. So is the use of Google spreadsheets.
What's so interesting about another open source virtualization platform?
ZeroVM was the piece of technology that caught my attention during the recent Bay Area Apache Drill Meetup. What’s so interesting about another open source virtualization platform? To find out I did more reading and spoke with LiteStack founder, Camuel Gilyadov.
ZeroVM has its roots in the OpenDremel project. Camuel and his team needed a lightweight virtualization framework but couldn’t find one that suited their requirements for OpenDremel. They created ZeroVM and along the way addressed issues relevant to cloud applications, including security, multi-tenancy, and instant1 elasticity. I’m not claiming ZeroVM is mature technology, but there are two potential applications that data scientists will like: Read more…
Spark is becoming a key part of a big data toolkit.
A large portion of this week’s Amp Camp at UC Berkeley, is devoted to an introduction to Spark – an open source, in-memory, cluster computing framework. After playing with Spark over the last month, I’ve come to consider it a key part of my big data toolkit. Here’s why:
Hadoop integration: Spark can work with files stored in HDFS, an important feature given the amount of investment in the Hadoop Ecosystem. Getting Spark to work with MapR is straightforward.
The Spark interactive Shell: Spark is written in Scala, and has it’s own version of the Scala interpreter. I find this extremely convenient for testing short snippets of code.
The Spark Analytic Suite:
(Figure courtesy of Matei Zaharia)
Spark comes with tools for interactive query analysis (Shark), large-scale graph processing and analysis (Bagel), and real-time analysis (Spark Streaming). Rather than having to mix and match a set of tools (e.g., Hive, Hadoop, Mahout, S4/Storm), you only have to learn one programming paradigm. For SQL enthusiasts, the added bonus is that Shark tends to run faster than Hive. If you want to run Spark in the cloud, there are a set of EC2 scripts available.
Since the first-ever Mechanical Turk meetup a year ago, there has been an explosion in crowdsourcing services and a well-attended conference in San Francisco. I remain enthusiastic about crowdsourcing, but the number of companies has me worried about quality of work. Fortunately specialization is already occurring, so for particular tasks there are companies out there ready to provide high-quality service….
While still a small fraction of data management job postings, the number of job posts that mention "hadoop" continue to grow steadily. Year-over-year, there were 300% more such job posts in the first seven months of 2010 compared to the same period in 2009. The fraction of "hadoop" jobs posted by California companies remain high, but is definitely lower than what it was last year.