Recently I spent a day at the Hadoop Summit in San Jose. One session in particular caught my attention because it hints at a continued merging of the RDBMS and Hadoop worlds. EMC’s Lei Chang gave a detailed talk about his team’s GOH project (Greenplum DB on HDFS). The project is testing the feasibility of running the Greenplum Database directly against its native data structures hosted in the Hadoop Distributed File System.
Convergence of this sort seems inevitable because Hadoop, MPP Databases, and even Linux super-computing clusters all share at least a superficial architectural pattern — horizontally distributed nodes of compute and storage connected either by Gigabit Ethernet or fast interconnects. And more and more it is the heaviness (and host stickiness) of the large scale data hosted on them that is driving design.
I couldn’t find Lei Chang’s slides, but a previous talk given by Donald Miner of EMC makes it clear that data flexibility is driving their work here (see slides 26-28). They are trying to provide an analytics platform that doesn’t require organizations to host separate SQL-specific and M/R-specific copies of their data. Their Unified Analytics Platform includes the MPP Greenplum DB, Hadoop, and tooling, and many of their customers presumably have to do just that today: store the same data twice so that both Hadoop M/R and SQL can access it. You either continuously ETL the data back and forth or rely on slow and inflexible choices like external tables and Hive to access it in place.
At my previous company we sold some work that was designed to demonstrate Hadoop’s power to contribute to corporate strategic analysis. The idea was to combine Hadoop with an MPP RDBMS (in this case we used Cloudera with Greenplum DB) to get the power of each. Hadoop could groom unstructured data for combination with a traditional transactional data warehouse for downstream SQL analysis. Hadoop could also be used to do analysis directly on unstructured or semi-structured data using non-traditional approaches, or to do analysis on very large caches of traditional transactional data on a different timescale. The Greenplum DB environment would then provide SQL access to the combined stores of traditional transactional data and freshly groomed unstructured data.
We proved value in the approach, but it was unnecessarily complex because we had to store everything twice so that the SQL and M/R tribes within the group each had native access to everything. We also made use of GPDB external tables hosted in HDFS, but performance suffered for queries involving that data.
At around the same time I was also working with a customer that already had a significant investment in a Linux super-computing cluster but was looking at moving some of their processing and analysis to a complementary Hadoop cluster. About half of the analysis they were running was amenable to Map/Reduce and the other half still required the more granular parallelism of MPI. If two distinct clusters were required, then all that data was going to have to be moved back and forth between them. It would have been a lot more interesting to leave the data where it was and simply shift the processes that were running on the nodes.
Data is getting heavier (more voluminous) relative to the networks that carry it around the data center, and relative to the chips that process it, which also aren’t really getting faster. So the low-energy state is moving inexorably toward stationary data processed by mobile algorithms. Today algorithm mobility is hampered by the national borders that separate machine clusters that are similar but just different enough. But YARN (M/R 2.0), MPI on Hadoop, experiments like GOH, an evolving and improving Hive, Bulk Synchronous Parallel, and a whole slew of other projects all hint at the possibility of a convergence toward a unified cluster of multi-use machines that will be able to expose all of the different kinds of data management capabilities currently resident in different system types. Or, put another way, we’ll see something like an E.U. of clusters, with materially better algorithm mobility across borders that are defined by the data resident therein rather than by the kinds of algorithms they can host.
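To put a rough number on that "heaviness," here is a back-of-the-envelope sketch in Python. The sizes and link speed are illustrative assumptions of mine, not measurements from any of the clusters mentioned above, and the helper name `transfer_seconds` is made up for this example. It only estimates raw wire time and ignores protocol overhead, disk I/O, and contention.

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Idealized time to push size_bytes across a link (no overhead modeled)."""
    return size_bytes * 8 / link_bits_per_second


# Assumed, illustrative values (not from the original post):
GIGABIT_ETHERNET = 1e9   # 1 Gb/s link
DATASET_BYTES = 10e12    # a 10 TB data set
JOB_BYTES = 10e6         # a 10 MB packaged job (the "algorithm")

move_data = transfer_seconds(DATASET_BYTES, GIGABIT_ETHERNET)
move_job = transfer_seconds(JOB_BYTES, GIGABIT_ETHERNET)

print(f"Moving the data: {move_data / 3600:.1f} hours")
print(f"Moving the job:  {move_job:.2f} seconds")
```

Under these assumptions, shipping the data takes the better part of a day on a single link while shipping the code takes a fraction of a second. Real clusters have many links and faster interconnects, so the absolute numbers will differ, but the ratio is the point.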
It’s easy to imagine a future where large clusters of like machines dynamically adapt between SQL, M/R, MPI, and other programming paradigms depending on a combination of the resident data and the required processing. Regions of nodes will “express” themselves as Hadoop, MPP SQL, or whatever at different times depending on what the data needs, without having to move it across slow networks and I/O.
I guess I’m describing a kind of data utopia where this perfectly homogenized cluster supports every algorithm class against every data type. And we all know that never happens. After all, we never even came close to the master-data-managed every-data-once third-normal-form relational world either. But we did have a well-understood ideal to pursue however imperfectly. Maybe this data-centric convergent cluster becomes that ideal for our emerging era of co-mingled structured/semi-structured/unstructured data.