ENTRIES TAGGED "realtime"
The simplest and quickest way to mine your data is to deploy efficient algorithms designed to answer key questions at scale.
For many organizations, real-time analytics entails complex event processing (CEP) systems or newer distributed stream processing frameworks like Storm, S4, or Spark Streaming. The latter have become more popular because they can process massive amounts of data and fit nicely with Hadoop and other cluster computing tools. For these distributed frameworks, peak volume is a function of network topology/bandwidth and the throughput of the individual nodes.
Scaling up machine-learning: Find efficient algorithms
Faced with having to crunch through a massive data set, the first thing a machine-learning expert will try to do is devise a more efficient algorithm. Some popular approaches involve sampling, online learning, and caching. Parallelizing an algorithm tends to be lower on the list of things to try. The key reason is that while there are algorithms that are embarrassingly parallel (e.g., naive Bayes), many others are harder to decouple. But as I highlighted in a recent post, efficient tools that run on single servers can tackle large data sets. In the machine-learning context, recent examples of efficient algorithms that scale to large data sets can be found in the products of the startup SkyTree.
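To illustrate why naive Bayes counts as embarrassingly parallel, here is a minimal Python sketch. It assumes the training data is already split into shards of (label, features) pairs; the function names `count_shard` and `merge_counts` and the toy data are hypothetical, not taken from any particular product or library. Each shard is counted independently, and the per-shard counts are simply added together.

```python
from collections import Counter
from functools import reduce


def count_shard(rows):
    """Count class and (class, feature) occurrences in one shard.

    Each row is a (label, features) pair. No shard needs to see another
    shard's data, so this step can run on separate workers.
    """
    class_counts = Counter()
    feature_counts = Counter()
    for label, features in rows:
        class_counts[label] += 1
        for feature in features:
            feature_counts[(label, feature)] += 1
    return class_counts, feature_counts


def merge_counts(left, right):
    """Combine two partial results by adding their counters."""
    return left[0] + right[0], left[1] + right[1]


# Hypothetical toy data, split into two shards.
shards = [
    [("spam", ["buy", "now"]), ("ham", ["hello"])],
    [("spam", ["buy"]), ("ham", ["hello", "now"])],
]

# map() stands in for a parallel map over workers; reduce() merges the results.
class_counts, feature_counts = reduce(merge_counts, map(count_shard, shards))
```

Because the merge is plain addition, the result is the same as counting the whole data set on one machine. Algorithms whose updates depend on earlier updates (many iterative optimizers, for example) do not decompose this cleanly, which is the sense in which they are harder to decouple.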
Barlow's distilled insights regarding the ever-evolving definition of real-time big data analytics
During a break between offsite meetings that Edd and I were attending the other day, he asked me, “Did you read the Barlow piece?”
“Umm, no,” I replied sheepishly. Insert a sidelong glance from Edd that said much without saying anything aloud. He’s really good at that.
In my utterly meager defense, Mike Loukides is the editor on Mike Barlow’s Real-Time Big Data Analytics: Emerging Architecture. As Loukides is one of the core drivers behind O’Reilly’s book publishing program and someone who I perceive to be an unofficial boss of my own choosing, I am not really inclined to worry about things that I really don’t need to worry about. Then I started getting not-so-subtle inquiries from additional people asking if I would consider reviewing the manuscript for the Strata community site. This resulted in me emailing Loukides for a copy and sitting in a local cafe on a Sunday afternoon to read through the manuscript.
Shark is 100X faster than Hive for SQL, and 100X faster than Hadoop for machine-learning
Hadoop’s strength is in batch processing; MapReduce isn’t particularly suited for interactive/ad hoc queries. Real-time SQL queries (on Hadoop data) are usually performed using custom connectors to MPP databases. In practice, this means having connectors between separate Hadoop and database clusters. Over the last few months, a number of systems that provide fast SQL access within Hadoop clusters have garnered attention. Connectors between Hadoop and fast MPP database clusters are not going away, but there is growing interest in moving many interactive SQL tasks into systems that coexist on the same cluster with Hadoop.
The idea of having a Hadoop cluster support fast/interactive SQL queries dates back a few years to HadoopDB, an open source project out of Yale. The creators of HadoopDB have since started a commercial software company (Hadapt) to build a system that unites Hadoop/MapReduce and SQL. In Hadapt, a (Postgres) database is placed in nodes of a Hadoop cluster, resulting in a system that can use MapReduce, SQL, and search (Solr). Now on version 2.0, Hadapt is a fault-tolerant system that comes with analytic functions (HDK) that one can use via SQL.
Hilary Mason on how Bitly applies the Internet's real-time data.
In this interview, Bitly chief scientist and Strata speaker Hilary Mason discusses the application of real-time data and the difference between analytics and data science.
OSCON's co-chairs dig into the OSCON Data program.
OSCON's co-chairs discuss sessions in the OSCON Data conference and the people who might be interested in the associated topics.
Jud Valeski on how Gnip handles the Twitter fire hose.
Gnip CEO Jud Valeski talks about managing Twitter's fire hose and how the Internet's architecture must adapt to real-time needs.