Here are a few of the data stories that caught my attention this week.
Moving an elephant: How Facebook moved to a new datacenter
Migrating data to a new system is always a hassle. But when you're Facebook, dealing with data at petabyte scale and unable, really, to tolerate any downtime, it's more than a hassle. It's a huge engineering challenge. And that's what Facebook recently undertook when it migrated its Hadoop deployment to a new datacenter.
Moving the machines themselves to the new datacenter wasn't an option. Facebook's Paul Yang described the process the team developed to replicate 30 petabytes of its Hadoop cluster to the new location, noting the challenge of copying a live file system at massive scale.
Once the required systems were developed, the replication approach was executed in two steps. First, a bulk copy transferred most of the data from the source cluster to the destination. Yang wrote:
Most of the directories were copied via DistCp — an application shipped with Hadoop that uses a MapReduce job to copy files in parallel. Our Hadoop engineers made code and configuration changes to handle special cases with Facebook’s dataset, including the ability for multiple mappers to copy a single large file, and for the proper handling of directories with many small files. After the bulk copy was done, file changes after the start of the bulk copy were copied over to the destination cluster through the new replication system. File changes were detected through a custom Hive plug-in that recorded the changes to an audit log. The replication system continuously polled the audit log and copied modified files so that the destination would never be more than a couple of hours behind. The plug-in recorded Hive metadata changes as well, so that metadata modifications such as the last accessed time of Hive tables and partitions were propagated. Both the plug-in and the replication system were developed in-house by members of the Hive team.
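The second step Yang describes, continuously polling an audit log and re-copying whatever changed, can be sketched in a few lines. This is only an illustration of the idea, not Facebook's code: their system is an in-house Hive plug-in and replication daemon, and every name and data structure below is hypothetical (the "filesystems" are modeled as plain dicts mapping path to contents).

```python
def poll_and_replicate(audit_log, source_fs, dest_fs, last_offset=0):
    """Replay audit-log entries recorded since last_offset against the
    destination, and return the new offset to resume from next poll.

    audit_log : list of {"path": ...} change records (illustrative schema)
    source_fs, dest_fs : dicts of path -> file contents (stand-ins for HDFS)
    """
    for entry in audit_log[last_offset:]:
        path = entry["path"]
        if path in source_fs:
            # File was created or modified: re-copy its current bytes.
            dest_fs[path] = source_fs[path]
        else:
            # File no longer exists at the source: propagate the delete.
            dest_fs.pop(path, None)
    return len(audit_log)

# State after the bulk copy finished:
source = {"/warehouse/t1/part-0": b"v1"}
dest = dict(source)

# A write lands at the source and is recorded in the audit log...
source["/warehouse/t1/part-0"] = b"v2"
audit = [{"path": "/warehouse/t1/part-0"}]

# ...and the next poll brings the destination back in sync.
offset = poll_and_replicate(audit, source, dest)
```

A real replicator would run this in a loop on a short interval, which is how the destination stays "never more than a couple of hours behind" without ever taking the source offline.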
Yang said the speed of the replication system was key: it kept downtime to a minimum and let the team identify and repair any corrupt files without falling behind schedule. He also noted that the system Facebook devised pointed to a potential disaster-recovery system using Hive. "With replication deployed, operations could be switched over to the replica cluster with relatively little work in case of a disaster. The replication system could increase the appeal of using Hadoop for high-reliability enterprise applications."
Ex-NASA CTO Launches Nebula
At OSCON this week, former NASA CTO Chris Kemp announced his new company, Nebula, which will sell an OpenStack-based appliance designed to let any company implement cloud computing. Nebula builds upon Open Compute, the infrastructure project that Facebook open sourced earlier this year. Nebula shares a name with the computing project that NASA open sourced last year as part of the initial OpenStack initiative, and the new startup aims to offer a turnkey solution to help companies implement OpenStack.
As Kemp told O’Reilly Radar’s Alex Howard in an interview:
As people face this industrial revolution of big data, they can’t use Oracle anymore. It doesn’t scale. We want to be the platform that enables that. We really believe that, if all of this stuff will achieve its potential, in being open, it will reshape the core of computing. We really think there’s this new paradigm of computing where people are building on top of infrastructure services instead of infrastructure.
In announcing the startup at OSCON, Kemp spoke of the democratizing power of Nebula, putting this big data computing power in the hands of everyone, not just large companies with massive infrastructure.
Liking and linking library data
GlueJar's Eric Hellman continues his blog series on libraries, data, and search engines with a post on "Liking Library Data." He offers thoughts on how to implement Facebook's Open Graph Protocol on library sites — not just so that visitors can "like" the library's website, of course, but so that books can be tied to individual social graphs.
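In practice, Hellman's suggestion amounts to adding a handful of Open Graph Protocol <meta> tags to each book's page. Here's a minimal sketch: the og:title, og:type, og:url, and og:image property names come from the OGP spec, but the helper function itself is hypothetical.

```python
def og_book_tags(title, url, image):
    """Build the basic Open Graph Protocol <meta> tags for a book page,
    so that a "like" of the page attaches this book to a user's graph."""
    return "\n".join([
        f'<meta property="og:title" content="{title}" />',
        '<meta property="og:type" content="book" />',
        f'<meta property="og:url" content="{url}" />',
        f'<meta property="og:image" content="{image}" />',
    ])

tags = og_book_tags(
    "Moby-Dick",
    "https://catalog.example.org/record/12345",   # hypothetical catalog URL
    "https://catalog.example.org/covers/12345.jpg",
)
```

These tags go in the page's <head>; Facebook reads them when someone likes the page, which is what lets the book itself, rather than the library's homepage, enter the social graph.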
There are caveats, of course. But Hellman argues that it's important for library resources to become more fully integrated with social networks — it's about "connections, not just collections," he says.
Got data news?
Feel free to email me.