ENTRIES TAGGED "Hadoop"
Researchers begin to scale up pattern recognition, machine-learning, and data management tools.
My first job after leaving academia was as a quant for a hedge fund, where I performed (what are now referred to as) data science tasks on financial time-series. I primarily used techniques from probability & statistics, econometrics, and optimization, with occasional forays into machine-learning (clustering, classification, anomalies). More recently, I’ve been closely following the emergence of tools that target large time series and decided to highlight a few interesting bits.
Time-series and big data:
Over the last six months I’ve been encountering more data scientists (outside of finance) who work with massive amounts of time-series data. The rise of unstructured data has been widely reported, the growing importance of time-series much less so. Sources include data from consumer devices (gesture recognition & user interface design), sensors (apps for “self-tracking”), machines (systems in data centers), and health care. In fact, some research hospitals have troves of EEG and ECG readings that translate to time-series data collections with billions (even trillions) of points.
It helps to reduce context-switching during long data science workflows.
An integrated data stack boosts productivity
As I noted in my previous post, Python programmers willing to go “all in” have Python tools to cover most of data science. Lest I be accused of oversimplification, a Python programmer still needs to commit to learning a non-trivial set of tools. I suspect that once they invest the time to learn the Python data stack, they tend to stick with it unless they absolutely have to use something else. But being able to stick with the same programming language and environment is a definite productivity boost: it requires less “setup time” to explore data using different techniques (viz, stats, ML).
Multiple tools and languages can impede reproducibility and flow
On the other end of the spectrum are data scientists who mix and match tools, and use packages and frameworks from several languages. Depending on the task, data scientists can avail themselves of tools that are scalable, performant, require less code, and contain a lot of features. On the other hand, this approach requires a lot more context-switching, and extra effort is needed to annotate long workflows. Failure to document things properly makes it tough to reproduce analysis projects, and impedes knowledge transfer within a team of data scientists. Frequent context-switching also makes it more difficult to be in a state of flow, as one has to think about implementation/package details instead of exploring data. It can be harder to discover interesting stories with your data if you’re constantly having to think about what you’re doing. (It’s still possible, you just have to concentrate a bit harder.)
Tools slowly democratize many data science tasks
Here are a few observations based on conversations I had during the just-concluded Strata Santa Clara conference.
Spark is attracting attention
I’ve written numerous times about components of the Berkeley Data Analytics Stack (Spark, Shark, MLbase). Two Spark-related sessions at Strata were packed (slides here and here) and I talked to many people who were itching to try the BDAS stack. Being able to combine batch, real-time, and interactive analytics in a framework that uses a simple programming model is very attractive. The release of version 0.7 adds a Python API to Spark’s native Scala interface and Java API.
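To give a flavor of that simple programming model, here is a minimal sketch in plain Python. The `MiniRDD` class below is a hypothetical, single-machine stand-in whose method names mirror Spark's (`flatMap`, `map`, `reduceByKey`); it is not PySpark itself, just an illustration of how a word count reads in this style.

```python
class MiniRDD:
    """A toy, single-machine stand-in for a Spark RDD (illustration only)."""
    def __init__(self, data):
        self.data = list(data)

    def flatMap(self, f):
        # Apply f to each element and flatten the results into one collection.
        return MiniRDD(x for item in self.data for x in f(item))

    def map(self, f):
        return MiniRDD(f(x) for x in self.data)

    def reduceByKey(self, f):
        # Combine all values that share a key using the function f.
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return MiniRDD(acc.items())

    def collect(self):
        return self.data

lines = MiniRDD(["to be or not to be"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b)
               .collect())
print(dict(counts))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The appeal is that the same short chain of transformations can run over a laptop-sized list or, in real Spark, a cluster-sized dataset.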
Hortonworks' Data Platform for Windows, Intel's Hadoop distribution, invasive smartphone surveillance, and data-driven "House of Cards."
Windows gets Hadoop, Intel launches Hadoop distribution
Hortonworks released a beta version of its Hortonworks Data Platform for Windows this week. In the press release, the company highlights its mission to “expand the reach of Apache Hadoop across the enterprise” and notes that the “100% open source Hortonworks Data Platform is the industry’s first and only Apache Hadoop distribution for both Windows and Linux.”
Barb Darrow notes at GigaOm that there’s likely no better way to bring big data to the masses than via Microsoft Excel. Darrow reports that Hortonworks’ VP of corporate strategy Shawn Connolly told her that “[t]he combination should make it easier to integrate data from SQL Server and Hadoop and to funnel all that into Excel for charting and pivoting and all the tasks Excel is good at,” stressing that the same Apache Hadoop distribution will run on both Windows and Linux. Connolly also noted to Darrow that “an analogous Hortonworks Data Platform for Windows Azure is still in the works.”
Describe and run bleeding-edge algorithms on massive data sets
In the course of applying machine-learning against large data sets, data scientists face a few pain points. They need to tune and compare several suitable algorithms – a process that may involve having to configure a hodgepodge of tools, requiring different input files, programming languages, and interfaces. Some software tools may not scale to big data, so they first sample and test ideas on smaller subsets, before tackling the problem of having to implement a distributed version of the final algorithm.
To increase productivity, ideally data scientists should be able to quickly test ideas without doing much coding, context switching, tuning and configuration. A research project out of UC Berkeley’s AMPLab and Brown seems to do just that: MLbase aims to make cutting-edge, scalable machine-learning algorithms available to non-experts. MLbase will have four pieces: a declarative language (MQL – discussed below), a library of distributed algorithms (ML-Library), an optimizer and a runtime (ML-Optimizer and ML-Runtime).
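MLbase itself had not been released at the time of writing, so the sketch below is purely hypothetical: a toy `do_classify()` in plain Python that hints at what a declarative interface plus optimizer might do, trying a couple of deliberately trivial candidate algorithms (a majority-class baseline and a nearest-neighbor rule) and keeping whichever validates best. None of these names come from MLbase.

```python
def majority_class(train_X, train_y):
    # Baseline: always predict the most common training label.
    label = max(set(train_y), key=train_y.count)
    return lambda x: label

def nearest_neighbor(train_X, train_y):
    # Predict the label of the closest training point (squared distance).
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in train_X]
        return train_y[dists.index(min(dists))]
    return predict

def do_classify(X, y, holdout=0.25):
    """Declarative-style entry point: the 'optimizer' tries each candidate
    algorithm and returns the model with the best holdout accuracy."""
    split = int(len(X) * (1 - holdout))
    train_X, val_X = X[:split], X[split:]
    train_y, val_y = y[:split], y[split:]
    best_model, best_acc = None, -1.0
    for learner in (majority_class, nearest_neighbor):
        model = learner(train_X, train_y)
        acc = sum(model(x) == t for x, t in zip(val_X, val_y)) / len(val_X)
        if acc > best_acc:
            best_model, best_acc = model, acc
    return best_model, best_acc

# Two small clusters of toy 2-D points, labeled 0 and 1.
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.1), (0.0, 0.1), (1.1, 0.9)]
y = [0, 0, 1, 1, 0, 1]
model, acc = do_classify(X, y)
print(acc)  # 1.0 on this toy data: nearest-neighbor separates the clusters
```

The user states *what* they want (a classifier for `X`, `y`); the system, not the user, picks and tunes the algorithm, which is the pain point MLbase targets.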
We're launching an investigation into in-memory data technologies.
In a forthcoming report we will highlight technologies and solutions that take advantage of the decline in prices of RAM, the popularity of distributed and cloud computing systems, and the need for faster queries on large, distributed data stores. Established technology companies have had interesting offerings, but what initially caught our attention were open source projects that started gaining traction last year.
An example we frequently hear about is the demand for tools that support interactive query performance. Faster query response times translate to more engaged and productive analysts, and real-time reports. Over the past two years several in-memory solutions emerged to deliver 5X-100X faster response times. A recent paper from Microsoft Research noted that even in this era of big data and Hadoop, many MapReduce jobs fit in the memory of a single server. To scale to extremely large datasets several new systems use a combination of distributed computing (in-memory grids), compression, and (columnar) storage technologies.
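The interplay of compression and (columnar) storage is easy to demonstrate in miniature. In this illustrative sketch (synthetic records, with the standard zlib library standing in for the specialized codecs real systems use), the same 10,000 records are serialized row-by-row and column-by-column; grouping similar values together in columns typically compresses better:

```python
import json
import random
import zlib

random.seed(0)  # deterministic synthetic data
rows = [{"token": "%08x" % random.getrandbits(32),   # high-entropy field
         "country": random.choice(["US", "DE", "JP", "BR", "IN"]),
         "status": random.choice(["active", "idle"]),
         "score": random.randrange(100)}
        for _ in range(10_000)]

# Row-oriented vs column-oriented serialization of the same records.
row_blob = json.dumps(rows).encode()
col_blob = json.dumps({k: [r[k] for r in rows] for k in rows[0]}).encode()

row_size = len(zlib.compress(row_blob))
col_size = len(zlib.compress(col_blob))
print(col_size < row_size)  # the columnar layout compresses better here
```

Low-cardinality columns (country, status) collapse to almost nothing when stored together, which is one reason in-memory analytic stores favor columnar layouts.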
Another interesting aspect of in-memory technologies is that they seem to be everywhere these days. We’re looking at tools aimed at analysts (Tableau, Qlikview, Tibco Spotfire, Platfora), databases that target specific workloads or data types (VoltDB, SAP HANA, Hekaton, Redis, Druid, Kognitio, and Yarcdata), frameworks for analytics (Spark/Shark, GraphLab, GridGain, Asterix/Hyracks), and the data center (RAMCloud, memory locality).
We’ll be talking to companies and hackers to get a sense of how in-memory solutions fit into their planning. Along these lines, we would love to hear what you think about the rise of these technologies, as well as applications, companies and projects we should look at. Feel free to reach out to us on Twitter (Ben is @bigdata and Roger is @rogerm) or leave a comment on this post.
Diversity and manageability are big data watchwords for the next 12 months.
Here are some of the key big data themes I expect to dominate 2013, and of course will be covering at Strata.
Emergence of a big data architecture
The coming year will mark the graduation of many big data pilot projects as they are put into production. With that comes an understanding of the practical architectures that work. These architectures will identify:
- best-of-breed tools for different purposes, for instance, Storm for streaming data acquisition
- appropriate roles for relational databases, Hadoop, NoSQL stores and in-memory databases
- how to combine existing data warehouses and analytical databases with Hadoop
Of course, these architectures will be in constant evolution as big data tooling matures and experience is gained.
In parallel, I expect to see increasing understanding of where big data responsibility sits within a company’s org chart. Big data is fundamentally a business problem, and some of the biggest challenges in taking advantage of it lie in the changes required to cross organizational silos and reform decision making.
One to watch: it’s hard to move data, so look for a starring architectural role for HDFS for the foreseeable future.
Steve Francia on alternatives to Hadoop and what lies ahead for MongoDB.
Steve and I sat down during the Strata + Hadoop World conference in New York last month to talk about what he’s most excited about nowadays. He focused on alternatives to Hadoop, what we can expect to see next from MongoDB, and the future of big data.
Highlights from the conversation include:
- Discover alternatives to Hadoop. [Discussed 18 seconds in].
- The new features in MongoDB 2.2. [Discussed at the 1:23 mark].
- How being an open source company helps 10gen connect with its users. [Discussed at the 3:09 mark].
- Long-term goals for MongoDB. [Discussed at the 5:10 mark].
- How new technologies are enabling all of us to participate in big data. [Discussed at the 7:05 mark].
You can view the entire interview in the following video.
Shark is 100X faster than Hive for SQL, and 100X faster than Hadoop for machine-learning
Hadoop’s strength is in batch processing; MapReduce isn’t particularly suited for interactive/ad hoc queries. Real-time SQL queries (on Hadoop data) are usually performed using custom connectors to MPP databases. In practice this means having connectors between separate Hadoop and database clusters. Over the last few months a number of systems that provide fast SQL access within Hadoop clusters have garnered attention. Connectors between Hadoop and fast MPP database clusters are not going away, but there is growing interest in moving many interactive SQL tasks into systems that coexist on the same cluster with Hadoop.
Having a Hadoop cluster support fast/interactive SQL queries dates back a few years to HadoopDB, an open source project out of Yale. The creators of HadoopDB have since started a commercial software company (Hadapt) to build a system that unites Hadoop/MapReduce and SQL. In Hadapt, a (Postgres) database is placed in nodes of a Hadoop cluster, resulting in a system that can use MapReduce, SQL, and search (Solr). Now on version 2.0, Hadapt is a fault-tolerant system that comes with analytic functions (HDK) that one can use via SQL.
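As a toy illustration of the kind of interactive, ad hoc SQL these systems target, here is an in-memory SQLite session standing in for a Hadoop-resident engine such as Shark or Hadapt; the schema and data are hypothetical:

```python
import sqlite3

# In-memory SQLite database standing in for a SQL-on-Hadoop engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, action TEXT, ms INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("alice", "click", 120), ("alice", "view", 45),
    ("bob", "click", 300), ("bob", "click", 180),
])

# An ad hoc aggregate query an analyst might iterate on interactively.
rows = conn.execute("""
    SELECT user, COUNT(*) AS n, AVG(ms) AS avg_ms
    FROM events
    WHERE action = 'click'
    GROUP BY user
    ORDER BY n DESC
""").fetchall()
print(rows)  # [('bob', 2, 240.0), ('alice', 1, 120.0)]
```

The point of systems like Shark is that a query in exactly this shape runs against data already sitting in the Hadoop cluster, with no export step to a separate database.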
How Amazon Web Services and Rackspace measure up; IBM's Watson goes to school; Google researches data; and what will we call really, really big data?
Here are a few stories from the data space that caught my attention this week.
Rackspace vs Amazon
As Rackspace continues to ramp up its services to compete with Amazon Web Services (AWS) — this week, announcing a partnership with Hortonworks to develop a cloud-based enterprise-ready Hadoop platform to compete against Amazon’s Elastic MapReduce — Derrick Harris at GigaOm compared apples to apples.
John Engates, CTO of Rackspace, told Harris the most fundamental difference between the two services is the level of control given to the customer. Harris writes that Rackspace’s new Hadoop service aims to give the customer “granular control over how their systems are configured and how their jobs run,” providing “the experience of owning a Hadoop cluster without actually owning any of the hardware.” Engates pointed out, “It’s not MapReduce as a service; it’s more Hadoop as a service.”
Harris also points out that Rackspace is considering making moves into NoSQL and looks at AWS’ DynamoDB service. He notes that Amazon and Rackspace aren’t the only players on any of these fields, pointing to the likes of Microsoft’s HDInsight, IBM’s BigInsights, Qubole, Infochimps, MongoDB, Cassandra and CouchDB-based services.
In related news, Rackspace announced its new Cloud Networks feature this week that allows customers to design their own networks on Rackspace’s Cloud Servers. In an interview with Jack McCarthy at CRN, Engates explained the background:
“When we went from dedicated physical networks to our public cloud, we lost the ability to segment these networks. We used to have a vLAN. As we moved to OpenStack, we wanted to give our customers the ability to enable segmented networks in the cloud. Cloud Networks gives customers a degree of control over how they build networks in the cloud, whether it’s building networks for application servers or for Web servers or databases.”
Engates also points out the networks are software-defined, “so customers can program their network on the fly.” You can read more about the new feature on the Rackspace blog.