ENTRIES TAGGED "big data analytics"
Areas concerned with shapes, invariants, and dynamics, in high-dimensions, are proving useful in data analysis
I’ve been noticing unlikely areas of mathematics pop-up in data analysis. While signal processing is a natural fit, topology, differential and algebraic geometry aren’t exactly areas you associate with data science. But upon further reflection perhaps it shouldn’t be so surprising that areas that deal in shapes, invariants, and dynamics, in high-dimensions, would have something to contribute to the analysis of large data sets. Without further ado, here are a few examples that stood out for me. (If you know of other examples of recent applications of math in data analysis, please share them in the comments.)
Compressed sensing is a signal processing technique which makes efficient data collection possible. As an example using compressed sensing images can be reconstructed from small amounts of data. Idealized Sampling is used to collect information to measure the most important components. By vastly decreasing the number of measurements to be collected, less data needs to stored, and one reduces the amount of time and energy1 needed to collect signals. Already there have been applications in medical imaging and mobile phones.
The problem is you don’t know ahead of time which signals/components are important. A series of numerical experiments led Emanuel Candes to believe that random samples may be the answer. The theoretical foundation as to why a random set of signals would work, where laid down in a series of papers by Candes and Fields Medalist Terence Tao2.
The simplest and quickest way to mine your data is to deploy efficient algorithms designed to answer key questions at scale.
For many organizations real-time1 analytics entails complex event processing systems (CEP) or newer distributed stream processing frameworks like Storm, S4, or Spark Streaming. The latter have become more popular because they are able to process massive amounts of data, and fit nicely with Hadoop and other cluster computing tools. For these distributed frameworks peak volume is function of network topology/bandwidth and the throughput of the individual nodes.
Scaling up machine-learning: Find efficient algorithms
Faced with having to crunch through a massive data set, the first thing a machine-learning expert will try to do is devise a more efficient algorithm. Some popular approaches involve sampling, online learning, and caching. Parallelizing an algorithm tends to be lower on the list of things to try. The key reason is that while there are algorithms that are embarrassingly parallel (e.g., naive bayes), many others are harder to decouple. But as I highlighted in a recent post, efficient tools that run on single servers can tackle large data sets. In the machine-learning context recent examples2 of efficient algorithms that scale to large data sets, can be found in the products of startup SkyTree.
Business Intelligence, machine-learning, and graph processing systems tackle large data sets with single servers.
About a year ago a blog post from SAP posited1 that when it comes to analytics, most companies are in the multi-terabyte range: data sizes that are well-within the scope of distributed in-memory solutions like Spark, SAP HANA, ScaleOut Software, GridGain, and Terracotta.
How do the cloud offerings from Amazon, Google and Microsoft compare?
Big data and cloud technology go hand-in-hand: but it's comparatively early days. Strata conference chair Edd Dumbill explains the cloud landscape and compares the offerings of Amazon, Google and Microsoft.