Timothy M. O'Brien
Did Google just prove the industry wrong? Early thoughts on the Spanner database.
In case you missed it, Google Research published another one of “those” significant research papers — a paper like the BigTable paper from 2006 that had ramifications for the entire industry (that paper was one of the opening volleys in the NoSQL movement).
Google’s new paper is about a distributed relational database called Spanner that was a follow up to a presentation from earlier in the year about a new database for AdWords called F1. If you recall, that presentation revealed Google’s migration of AdWords from MySQL to a new database that supported SQL and hierarchical schemas — two ideas that buck the trend from relational databases.
This new database, Spanner, is a database unlike anything we’ve seen. It’s a database that embraces ACID, SQL, and transactions, that can be distributed across thousands of nodes spanning multiple data centers across multiple regions. The paper dwells on two main features that define this database:
- Schematized Semi-relational Tables — A hierarchical approach to grouping tables that allows Spanner to co-locate related data into directories that can be easily stored, replicated, locked, and managed on what Google calls spanservers. They have a modified SQL syntax that allows for the data to be interleaved, and the paper mentions some changes to support columns encoded with Protobufs.
- “Reification of Clock Uncertainty” — This is the real emphasis of the paper. The missing link in relational database scalability was a strong emphasis on coordination backed by a serious attempt to minimize time uncertainty. In Google’s new global-scale database, the variable that matters is epsilon — time uncertainty. Google has achieved very low overhead (14ms introduced by Spanner in this paper for datacenters at 1ms network distance) for read-write (RW) transactions that span U.S. East Coast and U.S. West Coast (data centers separated by around 2ms of network time) by creating a system that facilitates distributed transactions bound only by network distance (measured in milliseconds) and time uncertainty (epsilon).
RedMonk's Steve O'Grady weighs in on data's pressing issues.
Redmonk analyst Steve O'Grady discusses the demand for data scientists, the problem of using data to asking the right questions, and why you shouldn't rush into a NoSQL investment.