Big Data and Real-time Structured Data Analytics

The emergence of sensors as sources of Big Data highlights the need for real-time analytic tools. Popular web services like Twitter, Facebook, and blogs also face the challenge of analyzing (mostly unstructured) data in near real-time. But as Truviso founder and UC Berkeley CS Professor Michael Franklin recently noted, there are mountains of structured data generated by web apps that lend themselves to real-time analysis:

The information stream driving the data analytics challenge is orders of magnitude larger than the streams of tweets, blog posts, etc. that are driving interest in searching the real-time web. Most tweets, for example, are created manually by people at keyboards or touchscreens, 140 characters at a time. Multiply that by the millions of active users and the result is indeed an impressive amount of information. The data driving the data analytics tsunami, on the other hand, is automatically generated. Every page view, ad impression, ad click, video view, etc. done by every user on the web generates thousands of bytes of log information. Add in the data automatically generated by the underlying infrastructure (CDNs, servers, gateways, etc.) and you can quickly find yourself dealing with petabytes of data.

In our report on Big Data, we listed some tools that can turn SQL data warehouses into real-time intelligence systems. The typical data warehouse reports on data that are a day, a week, or even a month old. Not every company requires real-time reports, alerts, or exception tracking, but some domains may benefit from dramatically reduced latency. To supplement the post-campaign reports generated by traditional (static) data warehouses, advertisers and content providers could track their campaigns and adjust them in real-time. Web applications that rely on data generated by sensors (e.g., smart grids, location-aware mobile apps, logistics and supply-chain tracking, environmental sensors) could display reports that update continuously. Web site performance and security reports are also natural candidates for real-time analytics.

Traditional SQL databases and MapReduce systems are batch-oriented (load all the data, then analyze), so they may not deliver the low latency that (near) real-time analysis requires. Fortunately, there are tools† that allow structured data sets (such as data warehouses) to be analyzed in real-time.

Recognizing that “data is moving until it gets stored”, the idea behind many real-time analytic engines is to apply the same analytic techniques to moving (streaming) and static (stored) data. Truviso separates the processing and analysis of data, and performs both in real-time. End-users and business analysts can query real-time and historical data using SQL: in Truviso’s case, the underlying Postgres engine and optimizer have been extended with an embedded stream processor, so “live data” can appear in any SQL statement’s FROM clause††. To control how “live data” is processed by the database engine, most real-time analytic vendors provide SQL extensions that let users specify the time windows to be analyzed. As data flows continuously into the system, the results of queries involving “live data” are continuously updated. Because Truviso builds on a popular database (Postgres), existing structured data warehouses can be ported to it and made real-time.
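To make this concrete, here is a minimal sketch of what a continuous query over “live data” could look like. The stream definition follows the CREATE STREAM convention mentioned in the second footnote, but the page_views stream and the window clause are hypothetical, modeled loosely on CQL-style streaming SQL extensions rather than Truviso’s exact syntax:

    -- Declare a stream; streams are created and queried much like tables.
    CREATE STREAM page_views (
        view_time  TIMESTAMP,   -- event timestamp carried by each record
        url        TEXT,
        user_id    BIGINT
    );

    -- Count views per URL over the trailing five minutes. Because page_views is a
    -- stream rather than a stored table, the result set is refreshed continuously
    -- as new records arrive, instead of being computed once over stored rows.
    SELECT url, COUNT(*) AS views
    FROM page_views [RANGE '5 minutes']   -- hypothetical window clause
    GROUP BY url;

Since live data can appear in any FROM clause, the same statement could presumably join the stream against ordinary Postgres tables (say, a table of ad campaigns), which is what makes it straightforward to retrofit an existing warehouse schema.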

A major challenge facing stream databases is what to do with out-of-order data. Streams are timestamped data sets, and most systems expect records to arrive in time order. Unfortunately, when data flows in from multiple sources, delays and hiccups happen, and it is not uncommon for timestamped data to arrive out of order. While some real-time analytic systems simply drop out-of-order data (potentially leading to misleading query results), Truviso has developed algorithms that look for contiguous data and produce query results that correctly handle out-of-order arrivals.
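Truviso’s algorithms aren’t spelled out here, but one common way to think about the problem (not Truviso’s method) can be sketched in plain Postgres-compatible SQL over a hypothetical table of raw timestamped events: don’t treat a time window as final until a grace period for late arrivals has passed.

    -- page_views_raw is a hypothetical table holding raw timestamped events.
    -- Report per-minute counts only for minutes older than the newest event seen
    -- so far minus a two-minute allowed lateness, so late rows still get counted.
    SELECT date_trunc('minute', view_time) AS minute,
           COUNT(*)                        AS views
    FROM   page_views_raw
    WHERE  view_time < (SELECT MAX(view_time) - INTERVAL '2 minutes'
                        FROM   page_views_raw)   -- high-water mark with slack
    GROUP  BY 1
    ORDER  BY 1;

Simply dropping late rows, as some systems do, silently undercounts the minutes those rows belong to, which is exactly the kind of misleading result described above.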

What about real-time analysis of unstructured data? Truviso hasn’t focused on unstructured data, preferring instead to target companies with existing data warehouses. After all, the general notion is that unstructured data doesn’t quite fit into SQL databases like Truviso. But the perception that unstructured data isn’t for relational databases may be changing slightly. Recently, a team at UC Berkeley used a SQL database to perform entity-extraction. They took unstructured text, passed it through a Conditional Random Fields algorithm (coded in SQL), and turned it into structured data.

(†) We recently had the chance to meet with the founders of Truviso. There are many other real-time analytic solutions, including StreamBase and SQLstream.

(††) In Truviso’s system, “live data” or streams can be created (CREATE stream) and accessed in SQL much like static database tables.


  • http://sethgrimes.com Seth Grimes

    Ben, you’ve identified a very promising area. Regarding this, you might like a couple of blog articles I wrote:

    BI on Content Feeds, a.k.a. Continuous (Twitter) Transformation, http://www.intelligententerprise.com/blog/archives/2008/12/bi_on_content_f.html

    Event Processing Meets Text: Reuters at Gartner, http://www.intelligententerprise.com/blog/archives/2008/09/event_processin_1.html

  • http://blog.b3k.us Benjamin Black

    Ben,

    A very interesting, open source stream processing system is Esper, http://esper.codehaus.org/ . Very exciting space!

    b

  • http://www.susiewee.com Susie Wee

    Starting to sound like the same challenges that are faced in streaming video and audio, a.k.a. networked multimedia. The answer to unordered data is buffering, much like how MPEG I, P, and B frames are handled and processed. Unstructured data is another problem, though. You need a network transcoder to take care of that (also in the multimedia world)!

  • RB

    Putting this information in context and in an accessible format is the first step. Often, the delta of time and value between the streams of data is more valuable than the sensor data itself. Next, extracting relevant and meaningful events from this stream must be accomplished. After that, you can then implement sense/respond analytics, process integration, real-world enabled mashups, and a whole host of other apps…

  • http://dailyrider.blogspot.com/ Dan

    Hey Ben – I don’t have any personal project to demonstrate how clever I, too, am on the subject. I just wanted to say great article; much appreciated.

  • http://gregalbrecht.com/ gba

    Luckily, Splunk (http://www.splunk.com) can handle both out-of-order and unstructured data while still offering many of the time-slice and extensible query features you’ve described.

  • http://www.juicytags.com Mike Hennessy

    We’ve created a personal, real-time tracking system for anyone who uses the web. Post a link, track a web page, or even track documents you share with other people.

    The engine is quite sophisticated and built for large volumes of data crunching. For normal web users, real-time reports is where it’s at!

    You can learn more here: http://jucy.tw/d90I

  • http://www.webtrekk.com Benjamin Dageroth

    Very interesting. Web analytics is also well on the way to real-time analytics, and quite a number of our customers use their data in real time to determine the popularity of articles and adjust their homepages accordingly. Especially with larger sites it’s not easy to handle the load of data, but once you have the writing and update process under control, our system can be queried with pretty much any filters and time slices one can imagine. Currently we are down to two-minute updates and could in theory go further; it’s just a matter of price, and two minutes currently seems good enough for most purposes. Especially since you can grab the segment for a user with one request and then write it into a cookie with the next request in order to use segment-based targeting.

    Great to see others pushing into this space; I’ll certainly take a look at Truviso. If anyone needs a real-time web analytics solution, let me know ;-)

    Benjamin

  • Abhay Pimprikar

    Come and join us at the MIT/Stanford VLAB event moderated by none other than Roger Magoulas of O’Reilly Research. The event will focus on new applications and business models that have evolved from all the data that resides on the Internet.
    Data Exhaust Alchemy – Turning the Web’s Waste into Solid Gold
    Time: 6:00 PM, Tuesday, January 19th
    Location: http://www.vlab.org/article.html?aid=304

    Moderator:
    Roger Magoulas, Research Director, O’Reilly

    Presenter:
    Mike (JB) John-Baptiste, CEO, PeerSet

    Panelists:
    Dr. DJ Patil, Chief Scientist and Sr. Director of Product, LinkedIn
    Jeff Hammerbacher, Founder and Vice President of Products, Cloudera
    Mark Breier, General Partner, In-Q-Tel
    Pablos Holman, Futurist, Inventor, Security Expert, and Notorious Hacker