ENTRIES TAGGED "machine learning"
Areas concerned with shapes, invariants, and dynamics in high dimensions are proving useful in data analysis
I’ve been noticing unlikely areas of mathematics pop up in data analysis. While signal processing is a natural fit, topology and differential and algebraic geometry aren’t exactly areas you associate with data science. But upon further reflection, perhaps it shouldn’t be so surprising that areas that deal in shapes, invariants, and dynamics in high dimensions would have something to contribute to the analysis of large data sets. Without further ado, here are a few examples that stood out for me. (If you know of other examples of recent applications of math in data analysis, please share them in the comments.)
Compressed sensing is a signal processing technique that makes efficient data collection possible. For example, using compressed sensing, images can be reconstructed from small amounts of data. In idealized sampling, one collects just enough information to measure the most important components. By vastly decreasing the number of measurements to be collected, less data needs to be stored, and one reduces the amount of time and energy needed to collect signals. Already there have been applications in medical imaging and mobile phones.
The problem is that you don’t know ahead of time which signals/components are important. A series of numerical experiments led Emmanuel Candes to believe that random samples may be the answer. The theoretical foundation for why a random set of signals would work was laid down in a series of papers by Candes and Fields Medalist Terence Tao.
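To make the idea of random measurements concrete, here is a minimal sketch of sparse recovery. It is not the method from the Candes and Tao papers: it swaps in orthogonal matching pursuit, a simple greedy algorithm, and every name and size in it (omp, n, m, k, the Gaussian measurement matrix) is an assumption chosen for illustration.

```python
import numpy as np

def omp(A, y, sparsity):
    """Orthogonal matching pursuit: greedily find a sparse x with y close to A @ x."""
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(sparsity):
        # Pick the column most correlated with what is still unexplained.
        idx = int(np.argmax(np.abs(A.T @ residual)))
        if idx not in support:
            support.append(idx)
        # Least-squares fit using only the columns chosen so far.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x

rng = np.random.default_rng(0)
n, m, k = 400, 100, 4  # signal length, number of measurements, nonzero entries (illustrative)

# A signal that is zero almost everywhere, with k entries of magnitude between 1 and 2.
x_true = np.zeros(n)
positions = rng.choice(n, size=k, replace=False)
x_true[positions] = rng.choice([-1.0, 1.0], size=k) * (1.0 + rng.random(k))

A = rng.normal(size=(m, n)) / np.sqrt(m)  # random measurement matrix
y = A @ x_true                            # only m of a possible n measurements are taken

x_hat = omp(A, y, k)
recovery_error = float(np.linalg.norm(x_hat - x_true))
```

The point of the sketch is that the measurement matrix knows nothing about where the important components sit, yet a quarter as many random measurements as signal entries are typically enough to locate them when the signal is sparse.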
We should raise our collective expectations of what they should provide
There is a tremendous amount of commercial attention on machine learning (ML) methods and applications. This includes product and content recommender systems, predictive models for churn and lead scoring, systems to assist in medical diagnosis, social network sentiment analysis, and on and on. ML often carries the burden of extracting value from big data.
But getting good results from machine learning still requires much art, persistence, and even luck. An engineer can’t yet treat ML as just another well-behaved part of the technology stack. There are many underlying reasons for this, but for the moment I want to focus on how we measure or evaluate ML systems.
Reflecting their academic roots, machine learning methods have traditionally been evaluated in terms of narrow quantitative metrics: precision, recall, RMS error, and so on. The data-science-as-competitive-sport site Kaggle has adopted these metrics for many of its competitions. They are objective and reassuringly concrete.
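For readers who have not computed these metrics by hand, a minimal sketch follows. The function names and the toy labels are made up for illustration; real projects would normally rely on a library implementation.

```python
import math

def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # share of predicted positives that are right
    recall = tp / (tp + fn) if tp + fn else 0.0     # share of actual positives that were found
    return precision, recall

def rms_error(actual, predicted):
    """Root-mean-square error between two equal-length numeric sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

labels      = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up ground truth
predictions = [1, 1, 1, 0, 0, 0, 1, 0]  # made-up classifier output
precision, recall = precision_recall(labels, predictions)
rmse = rms_error([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
```

Each number summarizes one narrow aspect of performance, which is part of why they are easy to compare across competition entries.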
Our tools should make common cases easy and safe, but that's not the reality today.
Recently, the Mathbabe (aka Cathy O’Neil) vented some frustration about the pitfalls in applying even simple machine learning (ML) methods like k-nearest neighbors. As data science is democratized, she worries that naive practitioners will shoot themselves in the foot because these tools can offer very misleading results. Maybe data science is best left to the pros? Mike Loukides picked up this thread, calling for healthy skepticism in our approach to data and implicitly cautioning against a “cargo cult” approach in which data collection and analysis methods are blindly copied from previous efforts without sufficient attempts to understand their potential biases and shortcomings.
Well, arguing against greater understanding of the methods we apply is like arguing against motherhood and apple pie, and Cathy and Mike are spot on in their diagnoses of the current situation. And yet …
The simplest and quickest way to mine your data is to deploy efficient algorithms designed to answer key questions at scale.
For many organizations, real-time analytics entails complex event processing (CEP) systems or newer distributed stream processing frameworks like Storm, S4, or Spark Streaming. The latter have become more popular because they are able to process massive amounts of data, and fit nicely with Hadoop and other cluster computing tools. For these distributed frameworks, peak volume is a function of network topology/bandwidth and the throughput of the individual nodes.
Scaling up machine-learning: Find efficient algorithms
Faced with having to crunch through a massive data set, the first thing a machine-learning expert will try to do is devise a more efficient algorithm. Some popular approaches involve sampling, online learning, and caching. Parallelizing an algorithm tends to be lower on the list of things to try. The key reason is that while there are algorithms that are embarrassingly parallel (e.g., naive Bayes), many others are harder to decouple. But as I highlighted in a recent post, efficient tools that run on single servers can tackle large data sets. In the machine-learning context, recent examples of efficient algorithms that scale to large data sets can be found in the products of startup SkyTree.
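To see why naive Bayes counts as embarrassingly parallel, consider that training reduces to counting, and counts from separate chunks of data simply add. The sketch below is an illustration under that assumption, with made-up data and hypothetical helper names (count_shard, merge); it is not tied to any product mentioned above.

```python
from collections import Counter
from functools import reduce

def count_shard(rows):
    """Count label and (label, word) occurrences for one shard of the training data."""
    label_counts, word_counts = Counter(), Counter()
    for label, words in rows:
        label_counts[label] += 1
        for word in words:
            word_counts[(label, word)] += 1
    return label_counts, word_counts

def merge(a, b):
    """Combine the counts from two shards; the order of merging does not matter."""
    return a[0] + b[0], a[1] + b[1]

shards = [  # made-up documents, split into two shards
    [("spam", ["win", "cash"]), ("ham", ["lunch", "today"])],
    [("spam", ["cash", "now"]), ("ham", ["lunch", "now"])],
]

# Each shard could be counted on a different machine; map() stands in for that here.
partials = list(map(count_shard, shards))
label_counts, word_counts = reduce(merge, partials)

# Counting everything in a single pass gives the same result.
single_pass = count_shard([row for shard in shards for row in shard])
```

Algorithms whose updates depend on the current state of a shared model do not split this cleanly, which is the harder case the paragraph above alludes to.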
The importance of data science tools that let organizations easily combine, deploy, and maintain algorithms
Data science often depends on data pipelines that involve acquiring, transforming, and loading data. (If you’re fortunate, most of the data you need is already in usable form.) Data needs to be assembled and wrangled before it can be visualized and analyzed. Many companies have data engineers (adept at using workflow tools like Azkaban and Oozie) who manage pipelines for data scientists and analysts.
A workflow tool for data analysts: Chronos from Airbnb
A raw bash scheduler written in Scala, Chronos is flexible, fault-tolerant, and distributed (it’s built on top of Mesos). What’s most interesting is that it makes the creation and maintenance of complex workflows more accessible: at least within Airbnb, it’s heavily used by analysts.
Job orchestration and scheduling tools contain features that data scientists would appreciate. They make it easy for users to express dependencies (start a job upon the completion of another job) and retries (particularly in cloud computing settings, jobs can fail for a variety of reasons). Chronos comes with a web UI designed to let business analysts define, execute, and monitor workflows: a zoomable DAG highlights failed jobs and displays stats that can be used to identify bottlenecks. Chronos lets you include asynchronous jobs, a nice feature for data science pipelines that involve long-running calculations. It also lets you easily define repeating jobs over a finite time interval, something that comes in handy for short-lived experiments (e.g., A/B tests or multi-armed bandits).
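The two ideas of dependencies and retries can be sketched in a few lines. This is a toy, single-process illustration and not the Chronos API: run_workflow, the job names, and the retry policy are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def run_workflow(jobs, depends_on, max_retries=2):
    """Run jobs in dependency order, retrying a failed job up to max_retries times.

    jobs: dict mapping a job name to a zero-argument callable that raises on failure
    depends_on: dict mapping a job name to the set of jobs that must finish first
    """
    graph = {name: depends_on.get(name, set()) for name in jobs}
    finished = []
    for name in TopologicalSorter(graph).static_order():
        for attempt in range(max_retries + 1):
            try:
                jobs[name]()
                finished.append(name)
                break
            except Exception as exc:
                if attempt == max_retries:
                    raise RuntimeError(f"job {name!r} failed after {attempt + 1} attempts") from exc
    return finished

attempts = {"transform": 0}

def flaky_transform():
    """Fail on the first call and succeed afterwards, imitating a transient error."""
    attempts["transform"] += 1
    if attempts["transform"] < 2:
        raise IOError("transient failure")

jobs = {"extract": lambda: None, "transform": flaky_transform, "load": lambda: None}
depends_on = {"transform": {"extract"}, "load": {"transform"}}
order = run_workflow(jobs, depends_on)
```

A real scheduler adds what this sketch leaves out: distribution across machines, persistence of job state, timed and repeating schedules, and a UI for inspecting the DAG.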
It helps to reduce context-switching during long data science workflows.
An integrated data stack boosts productivity
As I noted in my previous post, Python programmers willing to go “all in” have Python tools to cover most of data science. Lest I be accused of oversimplification, a Python programmer still needs to commit to learning a non-trivial set of tools. I suspect that once they invest the time to learn the Python data stack, they tend to stick with it unless they absolutely have to use something else. But being able to stick with the same programming language and environment is a definite productivity boost. It requires less “setup time” in order to explore data using different techniques (viz, stats, ML).
Multiple tools and languages can impede reproducibility and flow
On the other end of the spectrum are data scientists who mix and match tools, and use packages and frameworks from several languages. Depending on the task, data scientists can avail themselves of tools that are scalable, performant, require less code, and contain a lot of features. On the other hand, this approach requires a lot more context-switching, and extra effort is needed to annotate long workflows. Failure to document things properly makes it tough to reproduce analysis projects, and impedes knowledge transfer within a team of data scientists. Frequent context-switching also makes it more difficult to be in a state of flow, as one has to think about implementation/package details instead of exploring data. It can be harder to discover interesting stories with your data if you’re constantly having to think about what you’re doing. (It’s still possible; you just have to concentrate a bit harder.)
James Pustejovsky and Amber Stubbs on machine learning best practices.
We sat down to talk about natural language annotation as it relates to machine learning. James and Amber reviewed methods, best practices, and what they see coming in the future.
Highlights from the conversation include:
- Learn why it is important to create your own corpus for machine learning. [Discussed 20 seconds in.]
- Discover different methods for creating a corpus. [Discussed at the 6:15 mark.]
- Understand the MATTER Annotation Development Process. [Discussed at the 9:58 mark.]
- Hear what James and Amber see coming next for machine learning. [Discussed at the 15:23 mark.]
You can view the entire interview in the following video.
Describe and run bleeding edge algorithms on massive data sets
In the course of applying machine-learning against large data sets, data scientists face a few pain points. They need to tune and compare several suitable algorithms – a process that may involve having to configure a hodgepodge of tools, requiring different input files, programming languages, and interfaces. Some software tools may not scale to big data, so they first sample and test ideas on smaller subsets, before tackling the problem of having to implement a distributed version of the final algorithm.
To increase productivity, ideally data scientists should be able to quickly test ideas without doing much coding, context switching, tuning, and configuration. A research project out of UC Berkeley’s AMPLab and Brown seems to do just that: MLbase aims to make cutting edge, scalable machine-learning algorithms available to non-experts. MLbase will have four pieces: a declarative language (MQL – discussed below), a library of distributed algorithms (ML-Library), an optimizer and a runtime (ML-Optimizer and ML-Runtime).
Networked sensors and machine learning make it easy to see when things are out of the ordinary.
Much of health care — particularly for the elderly — is about detecting change, and, as the mobile health movement would have it, computers are very good at that. Given enough sensors, software can model an individual’s behavior patterns and then figure out when things are out of the ordinary — when gait slows, posture stoops or bedtime moves earlier.
Technology already exists that lets users set parameters for households they’re monitoring. Systems are available that send an alert if someone leaves the house in the middle of the night or sleeps past a preset time. Those systems involve context-specific hardware (i.e., a bed-pressure sensor) and conscientious modeling (you have to know what time your grandmother usually wakes up).
The next step would be a generic system. One that, following simple setup, would learn the habits of the people it monitors and then detect the sorts of problems that beset elderly people living alone — falls, disorientation, and so forth — as well as more subtle changes in behavior that could signal other health problems.
A group of researchers from Austria and Turkey has developed just such a system, which they presented at the IEEE’s Industrial Electronics Society meeting in Montreal in October.
Activity as surmised in different rooms by the researchers’ machine-learning algorithms. Source: “Activity Recognition Using a Hierarchical Model.”
In their approach, the researchers train a machine-learning algorithm with several days of routine household activity using door and motion sensors distributed through the living space. The sensors aren’t associated with any particular room at the outset: their software algorithmically determines the relative positions of the sensors, then classifies the rooms that they’re in based on activity patterns over the course of the day.
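The general idea of learning a routine from a few days of data and then flagging departures from it can be sketched very simply. What follows is not the researchers’ hierarchical model: it is a much cruder stand-in that summarizes one habit (wake-up time) by its mean and spread, and the data, function names, and threshold are all made up for illustration.

```python
from statistics import mean, stdev

def learn_routine(history):
    """Summarize past observations (here, wake-up times in hours) as mean and spread."""
    return mean(history), stdev(history)

def is_unusual(value, routine, threshold=3.0):
    """Flag a value that lies more than `threshold` standard deviations from the mean."""
    mu, sigma = routine
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

wake_times = [7.0, 7.25, 6.75, 7.0, 7.5, 6.9, 7.1]  # a made-up week of wake-up times
routine = learn_routine(wake_times)

normal_day = is_unusual(7.2, routine)   # close to the usual time
late_day = is_unusual(10.5, routine)    # hours later than usual
```

Unlike the preset alerts described earlier, nothing here requires knowing the usual wake-up time in advance; it is estimated from the observations themselves.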
How Amazon Web Services and Rackspace measure up; IBM's Watson goes to school; Google researches data; and what will we call really, really big data?
Here are a few stories from the data space that caught my attention this week.
Rackspace vs Amazon
As Rackspace continues to ramp up its services to compete with Amazon Web Services (AWS) — this week, announcing a partnership with Hortonworks to develop a cloud-based enterprise-ready Hadoop platform to compete against Amazon’s Elastic MapReduce — Derrick Harris at GigaOm compared apples to apples.
John Engates, CTO of Rackspace, told Harris the most fundamental difference between the two services is the level of control given to the customer. Harris writes that Rackspace’s new Hadoop service aims to give the customer “granular control over how their systems are configured and how their jobs run,” providing “the experience of owning a Hadoop cluster without actually owning any of the hardware.” Engates pointed out, “It’s not MapReduce as a service; it’s more Hadoop as a service.”
Harris also points out that Rackspace is considering making moves into NoSQL and looks at AWS’ DynamoDB service. He notes that Amazon and Rackspace aren’t the only players in any of these fields, pointing to the likes of Microsoft’s HDInsight, IBM’s BigInsights, Qubole, Infochimps, and services based on MongoDB, Cassandra, and CouchDB.
In related news, Rackspace announced its new Cloud Networks feature this week that allows customers to design their own networks on Rackspace’s Cloud Servers. In an interview with Jack McCarthy at CRN, Engates explained the background:
“When we went from dedicated physical networks to our public cloud, we lost the ability to segment these networks. We used to have a vLAN. As we moved to OpenStack, we wanted to give our customers the ability to enable segmented networks in the cloud. Cloud Networks gives customers a degree of control over how they build networks in the cloud, whether it’s building networks application servers or for Web servers or databases.”
Engates also points out the networks are software-defined, “so customers can program their network on the fly.” You can read more about the new feature on the Rackspace blog.