Here are some of the key big data themes I expect to dominate 2013, and which I will, of course, be covering at Strata.
Emergence of a big data architecture
The coming year will mark the graduation of many big data pilot projects as they are put into production. With that comes an understanding of the practical architectures that work. These architectures will identify:
- best-of-breed tools for different purposes; for instance, Storm for streaming data acquisition
- appropriate roles for relational databases, Hadoop, NoSQL stores and in-memory databases
- how to combine existing data warehouses and analytical databases with Hadoop
Of course, these architectures will be in constant evolution as big data tooling matures and experience is gained.
In parallel, I expect to see increasing understanding of where big data responsibility sits within a company’s org chart. Big data is fundamentally a business problem, and some of the biggest challenges in taking advantage of it lie in the changes required to cross organizational silos and reform decision making.
One to watch: it’s hard to move data, so look for a starring architectural role for HDFS for the foreseeable future.
Hadoop is not the only fruit
Though deservedly the poster child for big data software, Hadoop is not the only way to process big data. Credible competitors are emerging, especially where specialized applications are concerned. For example, the Berkeley Data Analytics Stack offers an alternative platform that performs much faster than Hadoop MapReduce for some applications focused on data mining and machine learning.
At the same time, Hadoop is reinventing itself. Hadoop distributions this year will embrace Hadoop 2.0, and in particular YARN, which separates cluster resource management from the batch-oriented MapReduce engine and so permits other kinds of workloads to be executed.
For any big data competitor to get traction, it will need both to be open source and to fully support SQL-like access to data, which became an entry-level requirement over the course of 2012. Hadoop’s not going anywhere soon, but a pleasing diversity of tools is emerging.
One to watch: expect to see one or more startups emerging to commercialize the Berkeley Data Analytics Stack.
Turnkey big data platforms
Hadoop has a lot of moving parts. A lot. Even with the administration tools from vendors such as Cloudera and Hortonworks, there’s still significant work required in setting up and running a Hadoop cluster. In our age of cloud services, there’s no reason that should be so, as demonstrated by Amazon’s Elastic MapReduce service.
Expect Hadoop vendors to focus on removing system administration overhead over the course of this year, and expect other companies to provide integrated big data stacks. InfoChimps offers a big data stack managed as a service within private data centers. For those content to run in the public cloud, Qubole takes the concept one level further, with a turnkey Hadoop and Hive analysis platform that runs on Amazon EC2.
Data governance comes into focus
As big data goes into production, it will need to integrate with the rest of the enterprise. Many of the issues concerned with data governance will rise to the fore, including:
- data security
- data consistency
- reducing data duplication
- regulatory compliance
One to watch: data security will become a hot topic this year, including approaches to securing Hadoop, and databases that offer fine-grained security, such as Apache Accumulo.
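To make "fine-grained security" concrete, here is a minimal sketch of the cell-level idea: each stored value carries its own visibility labels, and a scan returns only the cells that the reader's authorizations satisfy. This is plain Python for illustration and not Accumulo's API. The `Cell` class, the `can_read` and `scan` helpers, the labels and the sample data are all hypothetical, and the visibility format is a simplification rather than a real visibility expression syntax.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Cell:
    """One stored value plus the labels needed to read it (hypothetical model)."""
    row: str
    column: str
    value: str
    # Simplified visibility: the reader must hold every label in at least
    # one of these groups. An empty tuple means the cell is visible to all.
    visibility: Tuple[FrozenSet[str], ...] = ()


def can_read(cell: Cell, authorizations: FrozenSet[str]) -> bool:
    """Return True if the authorizations satisfy the cell's visibility."""
    if not cell.visibility:
        return True
    return any(group <= authorizations for group in cell.visibility)


def scan(cells: Iterable[Cell], authorizations: Iterable[str]) -> List[Cell]:
    """Filter at read time, so unauthorized cells never reach the caller."""
    auths = frozenset(authorizations)
    return [cell for cell in cells if can_read(cell, auths)]


# Made-up sample data: three cells in the same row with different labels.
CELLS = [
    Cell("cust-001", "name", "Ada Example"),
    Cell("cust-001", "card_number", "4111-xxxx", (frozenset({"finance"}),)),
    Cell(
        "cust-001",
        "fraud_score",
        "0.12",
        # Readable by (finance AND audit) OR admin.
        (frozenset({"finance", "audit"}), frozenset({"admin"})),
    ),
]

if __name__ == "__main__":
    for auths in (set(), {"finance"}, {"admin"}, {"finance", "audit"}):
        visible = [cell.column for cell in scan(CELLS, auths)]
        print(sorted(auths), "->", visible)
```

The point of the design is that access control travels with the data itself, so two users querying the same row can legitimately see different columns without the application maintaining separate copies.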
End-to-end analytic solutions emerge
There are far more people who want access to analytic capabilities than have the IT resources to set up their own Hadoop clusters and write code for them. For many “big data” applications, the big data comes from outside sources, such as Twitter or GIS data, while the internal data, such as customer or sales data, may be reasonably manageable.
This year will see the growth of SaaS analytics platforms, delivered in the cloud for the swipe of a credit card. Web analytics platforms have pioneered the way here. In 2013, Google intends to expand its analytics offering to address “universal analytics,” a service currently in closed beta.
The Frankenstein nature of current big data and BI offerings, most often involving gluing Tableau to an underlying database and accompanying ETL work, means that there’s a clear gap in the market for compelling end-to-end analytic solutions, especially targeted at marketing applications.
One to watch: the launch of ClearStory Data into public availability in 2013 will provide dynamic competition for analytics incumbents.