Counting Unique Users in Real-time with Streaming Databases

As the web increasingly becomes real-time, marketers and publishers need analytic tools that can produce real-time reports. As an example, the basic task of calculating the number of unique users is typically done in batch mode (e.g. daily), and in many cases using a random sample from the relevant log files. If unique user counts can be accurately computed in real-time, publishers and marketers can run A/B tests or referral analyses and dynamically adjust their campaigns.

In a previous post I described SQL databases designed to handle data streams. In their latest release, Truviso announced technology that allows companies to track unique users in real-time. Truviso uses the same basic idea I described in my earlier post:

Recognizing that “data is moving until it gets stored”, the idea behind many real-time analytic engines is to start applying the same analytic techniques to moving (streams) and static (stored) data.

Truviso uses (compressed) bitmaps and set theory to compute the number of unique customers in real-time. In the process they are able to handle the standard SQL queries associated with these types of problems: counting the number of distinct users for any given set of demographic filters. Bitmaps are built as data streams into the system and use the same underlying technology that allows Truviso to handle massive data sets from high-traffic web sites.
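To make the bitmap idea concrete, here is a minimal sketch (not Truviso's actual implementation) of how distinct-user counts fall out of bitwise set operations. The `Bitmap` class, the segment names, and the user IDs are all illustrative; Python integers stand in for what a production engine would store as compressed bitmaps.

```python
# Sketch: counting distinct users with bitmaps and set operations.
# Each user ID maps to a bit position; a segment's bitmap has bit i set
# if user i was seen in that segment. Unions and intersections of
# segments are then single bitwise operations.

class Bitmap:
    def __init__(self):
        self.bits = 0

    def add(self, user_id):
        # Setting a bit twice is a no-op, so duplicates are free.
        self.bits |= 1 << user_id

    def count(self):
        # Population count = number of distinct users.
        return bin(self.bits).count("1")

    def union(self, other):
        # Distinct users in either segment.
        out = Bitmap()
        out.bits = self.bits | other.bits
        return out

    def intersect(self, other):
        # Distinct users matching both demographic filters.
        out = Bitmap()
        out.bits = self.bits & other.bits
        return out

# One bitmap per demographic segment, built as events stream in.
male = Bitmap()
over_30 = Bitmap()
for uid in [1, 5, 9, 5]:     # user 5 appears twice; counted once
    male.add(uid)
for uid in [5, 9, 12]:
    over_30.add(uid)

print(male.count())                     # 3 distinct male users
print(male.intersect(over_30).count())  # 2 distinct users in both segments
```

The appeal of this representation is that partial results compose exactly: ORing two bitmaps never double-counts a user, which is what makes `COUNT(DISTINCT ...)` with arbitrary filters tractable on a stream.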


Once companies can do simple counts and averages in real-time, the next step is to use real-time information for more sophisticated analyses. Truviso has customers using their system for “on-the-fly predictive modeling”.

The other main enhancement in this release is Truviso’s move toward parallel processing. Their new execution engine processes runs or blocks of data in parallel on multi-core servers or multi-node environments. Using Truviso’s parallel execution engine is straightforward on a single multi-core server, but on a multi-node cluster it may require considerable attention to configuration.
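A rough sketch of why block-parallel execution pairs well with bitmaps: per-block partial results merge with a single OR, so they combine exactly, unlike raw counts, which would double-count users seen in two blocks. The block splitting, worker pool, and user IDs below are all hypothetical, not Truviso's engine.

```python
# Sketch: merging per-partition distinct counts, as a parallel
# execution engine might. Each worker builds a bitmap (here a plain
# Python int) over its own block of events.
from functools import reduce
from concurrent.futures import ThreadPoolExecutor

def block_bitmap(events):
    """Build a bitmap of distinct user IDs for one block of events."""
    bits = 0
    for uid in events:
        bits |= 1 << uid
    return bits

# Events split into runs/blocks; user 1, 3, and 4 span block boundaries.
blocks = [[1, 2, 3], [3, 4], [4, 5, 1]]

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(block_bitmap, blocks))

# Merge: bitwise OR never double-counts a user.
merged = reduce(lambda a, b: a | b, partials)
print(bin(merged).count("1"))  # 5 distinct users across all blocks
```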

[For my previous posts on real-time analytic tools see here and here.]

UPDATE (11/23/2009): Google acquires Teracent, a San Mateo startup that specializes in real-time analytics for optimizing online banner/display ads.

  • http://shiftmarket.com Vlad Stesin

The reason reach is calculated, say, daily, is that there is always a notion of frequency.

If you only ever analyze 1-hour blocks, you will not know whether this is the user’s 3rd visit today or 10th this week.

    Not only that, but isn’t there a risk of making wrong decisions based on something that is not statistically significant in any way (ex. data over 1 hour which may be during a Yankees game?).

    • http://radar.oreilly.com/ben/ Ben Lorica

      Hi Vlad,

      1. “Not only that, but isn’t there a risk of making wrong decisions based on something that is not statistically significant in any way (ex. data over 1 hour which may be during a Yankees game?).”
>> With the web in general (and web publishing in particular) becoming increasingly real-time, there are legitimate reasons why you may want to measure in the timeframes you alluded to. In fact publishers are already doing real-time A/B testing, and marketers are already tuning and evaluating ad/marketing campaigns over those short timeframes.

2. “If you only ever analyze 1-hour blocks, you will not know whether this is the user’s 3rd visit today or 10th this week.”
      >> Truviso’s technology (compressed bitmaps, set theory, and a real-time SQL parallel processing engine) can actually handle the problem you cited: number of uniques over several periods of varying length.

      Regards,
      Ben Lorica

  • http://www.truviso.com Tom K

Vlad, while that’s the current general practice, it’s done because tracking any more detail is too expensive. In counting daily uniques, you aren’t actually tracking unique IDs individually; you’re tracking whether or not you’ve seen that cookie ID before. That’s a very different magnitude of computation and table reads.

If you track visits and revisits during shorter time periods and associate those with an ID, it’s a huge and generally unmanageable processing effort using a batch-oriented or RDBMS system.

    With the Truviso approach, you can actually tell whether this is a person’s first or 10th visit today, or this month, and create segments and scores based on that. Marketers and business managers can really get a better understanding of loyalty and engagement, and relate that directly to a person – not just as a ‘new’ or ‘returning’ visitor statistic.

  • Vlad Stesin

    Just to be clear, I’m not criticizing Truviso, just trying to understand what the use cases would be for looking at a small chunk of data rather than on aggregate.

I’m all for A/B testing, as long as it is statistically significant and based on a sufficient sample.

Thanks for clarifying.

  • http://www.mit.edu/~y_z/ Yang

    Vlad, stream processing systems aggregate over windows of user-specified length; there’s nothing constraining you to operate on small windows. Don’t confuse processing frequency with window size.
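As an aside, the distinction between processing frequency and window size can be sketched in a few lines. This is a toy illustration (hypothetical hourly "ticks" and user IDs, not any particular engine): the engine keeps a day-long bitmap and simply emits its cardinality every hour, so frequent reporting does not shrink the window.

```python
# Sketch: window size vs. processing frequency in stream aggregation.
# Uniques are accumulated over the whole day, but a count is emitted
# every hour -- the reporting cadence never truncates the window.

daily_bits = 0       # distinct users seen so far today (bitmap as int)
hourly_report = []   # one day-to-date distinct count emitted per hour

hourly_events = [[1, 2], [2, 3], [1, 4, 5]]  # three hours of user IDs
for events in hourly_events:
    for uid in events:
        daily_bits |= 1 << uid
    hourly_report.append(bin(daily_bits).count("1"))

print(hourly_report)  # [2, 3, 5] -- day-long uniques, refreshed hourly
```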

  • Brett Sheppard

Good blog entry. I posted a TrackBack from the Truviso blog at http://www.truviso.com/blog/2009/11/tracking-unique-users-with-truviso/ . The next time your blog admin signs in, he/she should see the TrackBack and be able to accept it, so the count at the end of your blog posting will update from “0 TrackBacks” to “1 TrackBacks”.