As the “coming out” year for big data and data science draws to a close, what can we expect over the next 12 months?
More powerful and expressive tools for analysis
This year has seen consolidation and engineering around improving the basic storage and data processing engines of NoSQL and Hadoop. That will doubtless continue, as we see the unruly menagerie of the Hadoop universe increasingly packaged into distributions, appliances and on-demand cloud services. Hopefully it won’t be long before that’s dull, yet necessary, infrastructure.
Looking up the stack, there’s already an early cohort of tools directed at programmers and data scientists (Karmasphere, Datameer), as well as Hadoop connectors for established analytical tools such as Tableau and R. But there’s a way to go in making big data more powerful: that is, to decrease the cost of creating experiments.
Here are two ways in which big data can be made more powerful.
- Better programming language support. As we consider data, rather than business logic, as the primary entity in a program, we must create or rediscover idiom that lets us focus on the data, rather than abstractions leaking up from the underlying Hadoop machinery. In other words: write shorter programs that make it clear what we’re doing with the data. These abstractions will in turn lend themselves to the creation of better tools for non-programmers.
- We require better support for interactivity. If Hadoop has any weakness, it’s in the batch-oriented nature of computation it fosters. The agile nature of data science will favor any tool that permits more interactivity.
Streaming data processing
Hadoop’s batch-oriented processing is sufficient for many use cases, especially where the frequency of data reporting doesn’t need to be up-to-the-minute. However, batch processing isn’t always adequate, particularly when serving online needs such as mobile and web clients, or markets with real-time changing conditions such as finance and advertising.
Over the next few years we’ll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.
For some applications, there just isn’t enough storage in the world to store every piece of data your business might receive: at some point you need to make a decision to throw things away. Having streaming computation abilities enables you to analyze data or make decisions about discarding it without having to go through the store-compute loop of map/reduce.
Rise of data marketplaces
Your own data can become that much more potent when mixed with other datasets. For instance, add in weather conditions to your customer data, and discover if there are weather related patterns to your customers’ purchasing patterns. Acquiring these datasets can be a pain, especially if you want to do it outside of the IT department, and with some exactness. The value of data marketplaces is in providing a directory to this data, as well as streamlined, standardized methods of delivering it. Microsoft’s direction of integrating its Azure marketplace right into analytical tools foreshadows the coming convenience of access to data.
Development of data science workflows and tools
As data science teams become a recognized part of companies, we’ll see a more regularized expectation of their roles and processes. One of the driving attributes of a successful data science team is its level of integration into a company’s business operations, as opposed to being a sidecar analysis team.
Software developers already have a wealth of infrastructure that is both logistical and social, including wikis and source control, along with tools that expose their process and requirements to business owners. Integrated data science teams will need their own versions of these tools to collaborate effectively. One example of this is EMC Greenplum’s Chorus, which provides a social software platform for data science. In turn, use of these tools will support the emergence of data science process within organizations.
Data science teams will start to evolve repeatable processes, hopefully agile ones. They could do worse than to look at the ground-breaking work newspaper data teams are doing at news organizations such as The Guardian and New York Times: given short timescales these teams take data from raw form to a finished product, working hand-in-hand with the journalist.
Increased understanding of and demand for visualization
Visualization fulfills two purposes in a data workflow: explanation and exploration. While business people might think of a visualization as the end result, data scientists also use visualization as a way of looking for questions to ask and discovering new features of a dataset.
If becoming a data-driven organization is about fostering a better feel for data among all employees, visualization plays a vital role in delivering data manipulation abilities to those without direct programming or statistical skills.
Throughout a year dominated by business’ constant demand for data scientists, I’ve repeatedly heard from data scientists about what they want most: people who know how to create visualizations.