ENTRIES TAGGED "realtime data"
The simplest and quickest way to mine your data is to deploy efficient algorithms designed to answer key questions at scale.
For many organizations, real-time analytics entails complex event processing (CEP) systems or newer distributed stream processing frameworks like Storm, S4, or Spark Streaming. The latter have become more popular because they can process massive amounts of data and fit nicely with Hadoop and other cluster computing tools. For these distributed frameworks, peak volume is a function of network topology/bandwidth and the throughput of the individual nodes.
Scaling up machine-learning: Find efficient algorithms
Faced with having to crunch through a massive data set, the first thing a machine-learning expert will try to do is devise a more efficient algorithm. Some popular approaches involve sampling, online learning, and caching. Parallelizing an algorithm tends to be lower on the list of things to try. The key reason is that while there are algorithms that are embarrassingly parallel (e.g., naive Bayes), many others are harder to decouple. But as I highlighted in a recent post, efficient tools that run on single servers can tackle large data sets. In the machine-learning context, recent examples of efficient algorithms that scale to large data sets can be found in the products of the startup SkyTree.
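Of the approaches above, online learning is perhaps the easiest to illustrate: the model is updated one example at a time, so memory use stays constant no matter how large the data set grows. Here is a minimal sketch using a classic online algorithm, the perceptron (the function names and toy data are illustrative, not taken from any product mentioned above):

```python
# Online learning sketch: a perceptron updated one example at a time.
# Only the current weights are kept in memory, so an arbitrarily long
# stream of examples can be processed in a single pass.

def perceptron_update(weights, bias, x, y, lr=1.0):
    """Update (weights, bias) on one example x with label y in {-1, +1}."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    if y * activation <= 0:  # misclassified: nudge weights toward the example
        weights = [w + lr * y * xi for w, xi in zip(weights, x)]
        bias = bias + lr * y
    return weights, bias

def predict(weights, bias, x):
    """Classify x with the current model."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1

# A (toy) stream of (features, label) pairs -- could be arbitrarily long.
stream = [([1.0, 1.0], 1), ([-1.0, -1.0], -1),
          ([2.0, 0.5], 1), ([-0.5, -2.0], -1)]

w, b = [0.0, 0.0], 0.0
for x, y in stream:
    w, b = perceptron_update(w, b, x, y)
```

The same one-pass, constant-memory pattern underlies many of the streaming and sketching techniques used to avoid full parallelization.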
The second in a series looking at the major themes of this year's TOC conference.
Several overriding themes permeated this year's Tools of Change for Publishing conference. In this second installment of a series looking at five of the major themes, we take a look at data in publishing: how publishers can benefit, practical applications, and innovative ways it can be used.
Hilary Mason on how Bitly applies the Internet's real-time data.
In this interview, Bitly chief scientist and Strata speaker Hilary Mason discusses the application of real-time data and the difference between analytics and data science.