Despite the attention big data has received in the media and among the technology community, it is surprising that we are still shortchanging the full capabilities of what data can do for us. At times, we get caught up in the excitement of the technical challenge of processing big data and lose sight of the ultimate goal: to derive meaningful insights that can help us make informed decisions and take action to improve our businesses and our lives.
I recently spoke on the topic of automating content at the O’Reilly Strata Conference. It was interesting to see the various ways companies are attempting to make sense out of big data. Currently, the lion’s share of the attention is focused on ways to analyze and crunch data, but very little has been done to help communicate results of big data analysis. Data can be a very valuable asset if properly exploited. As I’ll describe, there are many interesting applications one can create with big data that can describe insights or even become monetizable products.
To date, the de facto format for representing big data has been visualizations. While visualizations are great for compacting a large amount of data into something that can be interpreted and understood, the problem is just that: visualizations still require interpretation. There were many sessions at Strata about how to create effective visualizations, but the reality is that the quality of visualizations in the real world varies dramatically. Even visualizations that do make intuitive sense often require some expertise and knowledge of the underlying data. That means a large number of people who would be interested in the analysis won’t be able to gain anything useful from it because they don’t know how to interpret the information.
To be clear, I’m a big fan of visualizations, but they are not the end-all in data analysis. They should be considered just one tool in the big data toolbox. I think of data as the seeds for content, whereby data can ultimately be represented in a number of different formats depending on your requirements and target audiences. In essence, data are the seeds that can sprout as large a content tree as your imagination will allow.
Below, I describe each limb of the content tree. The examples I cite are sports related because that’s what we’ve primarily focused on at my company, Automated Insights. But we’ve done very similar things in other content areas rich in big data, such as finance, real estate, traffic and several others. In each case, once we completed our analysis and targeted the type of content we wanted to create, we completely automated the future creation of the content.
By long-form, I mean three or more paragraphs (although it could be several pages or even book length) that use human-readable language to reveal key trends, records and deltas in data. This is the hardest form of content to automate, but technology in this space is rapidly improving. For example, here is a recap of an NFL game generated from box score and play-by-play data.
A long-form sports recap driven by data. See the full story.
Short-form content consists of bullets, headlines, and tweets of insights that can boil a huge dataset down into very actionable bits of language. For example, here is a game notes article that was created automatically from an NCAA basketball box score and historical stats.
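To give a rough sense of how a box score can turn into short notes, here is a minimal Python sketch. It is not how our production system works; the `game_notes` function, the box score layout, the thresholds for a "blowout" or a close game, and the team and player names in `SAMPLE_BOX` are all hypothetical and made up for illustration.

```python
# Hypothetical box score; every team, player and number here is invented.
SAMPLE_BOX = {
    "home": {
        "team": "Home U",
        "points": 78,
        "players": [
            {"name": "Player One", "points": 24},
            {"name": "Player Two", "points": 15},
        ],
    },
    "away": {
        "team": "Visitor State",
        "points": 76,
        "players": [
            {"name": "Player Three", "points": 30},
        ],
    },
}


def game_notes(box):
    """Return a list of short, human-readable notes for one game.

    Assumes the game cannot end in a tie (true for basketball).
    """
    home, away = box["home"], box["away"]
    if home["points"] > away["points"]:
        winner, loser = home, away
    else:
        winner, loser = away, home
    margin = winner["points"] - loser["points"]

    notes = [
        f"{winner['team']} beat {loser['team']} "
        f"{winner['points']}-{loser['points']}."
    ]

    # Illustrative thresholds: 20+ points is a blowout, 3 or fewer is close.
    if margin >= 20:
        notes.append(f"The {margin}-point margin made it a blowout.")
    elif margin <= 3:
        plural = "" if margin == 1 else "s"
        notes.append(f"The game was decided by just {margin} point{plural}.")

    # Highlight the leading scorer on the winning team.
    top = max(winner["players"], key=lambda p: p["points"])
    notes.append(f"{top['name']} led {winner['team']} with {top['points']} points.")
    return notes


if __name__ == "__main__":
    for note in game_notes(SAMPLE_BOX):
        print("-", note)
```

Each rule checks the data for one kind of insight and emits a sentence only when it applies, so adding new kinds of notes means adding new rules rather than rewriting the whole template.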
Mobile and social content
We’ve done a lot of work creating content for mobile applications and various social networks. Last year, we auto-generated more than a half-million tweets. For example, here is the automated Twitter stream we maintain that covers UNC Basketball.
By metrics, I’m referring to the process of creating a single number that’s representative of a larger dataset. Metrics are shortcuts that boil data down into something easier to understand. For instance, we’ve created metrics for various sports, such as a quarterback ranking system that’s based on player performance.
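As a simple illustration of collapsing several stats into one number, here is a Python sketch of a composite quarterback score. This is not our ranking system or the official NFL passer rating; the `PassingLine` type, the `quarterback_score` function and all of the weights are assumptions chosen only to show the idea.

```python
from dataclasses import dataclass


@dataclass
class PassingLine:
    """One quarterback's passing stats for a game (hypothetical layout)."""
    attempts: int
    completions: int
    yards: int
    touchdowns: int
    interceptions: int


def quarterback_score(line: PassingLine) -> float:
    """Collapse a passing line into a single number; higher is better.

    The weights below are arbitrary and for illustration only.
    """
    if line.attempts == 0:
        return 0.0  # no attempts means nothing to rate

    completion_rate = line.completions / line.attempts
    yards_per_attempt = line.yards / line.attempts
    touchdown_rate = line.touchdowns / line.attempts
    interception_rate = line.interceptions / line.attempts

    score = (
        50 * completion_rate
        + 5 * yards_per_attempt
        + 200 * touchdown_rate
        - 250 * interception_rate
    )
    return round(score, 1)


if __name__ == "__main__":
    solid = PassingLine(attempts=30, completions=20, yards=240,
                        touchdowns=2, interceptions=1)
    rough = PassingLine(attempts=30, completions=12, yards=110,
                        touchdowns=0, interceptions=3)
    print(quarterback_score(solid), quarterback_score(rough))
```

Because every input is expressed per attempt, quarterbacks with different workloads land on the same scale, which is what makes a single number usable for ranking.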
Instead of thinking of data as something you crunch and analyze days or weeks after it was created, there are opportunities to turn big data into real-time information that provides interested users with updates as soon as they occur. We have a real-time NCAA basketball scoreboard that updates with new scores.
Few people consider this option, but creating content-based applications is a great way to make use of and monetize data. For example, we created StatSmack, which is an app that allows sports fans to discover 10-20+ statistically based “slams” that enable them to talk trash about any team.
A variation on visualizations
Used in the right context, visualizations can be an invaluable tool for understanding a large dataset. The secret is combining bulleted text-based insights with the graphical visualization so that they work together to truly inform the user. For example, this page has a chart of win probability over the course of game seven of the 2011 World Series. It shows the ebb and flow of the game.
Play-by-play win probability from game seven of the 2011 World Series.
As more people get their heads around how to crunch and analyze data, the issue of how to effectively communicate insights from that data will be a bigger concern. We are still in the very early stages of this capability, so expect a lot of innovation over the next few years related to automating the conversion of data to content.