A new open source file system that takes up half the space and runs significantly faster than HDFS is now available for Hadoop thanks to a firm named Quantcast. Their Quantcast File System (QFS) is being released today under an Apache 2 license and is immediately available for free download on GitHub.
If you’re one of those grumblers (I admit to it) who complain about the widespread tracking of web users for marketing purposes, you can pause to thank Quantcast, a big data company in the advertising space, for funding this significant advance out of their own pockets. They started using Hadoop when they launched in 2006, storing a terabyte of data on web audiences each day. Now, using QFS as their primary data store, they add 40 terabytes of new data, and their daily Hadoop processing can exceed 20 petabytes.
As they grew, Quantcast tweaked and enhanced the various tools in the Hadoop chain. In 2008, they adopted the Kosmos File System (KFS) and hired its lead developer, Sriram Rao. After much upgrading for reliability, scalability, and manageability, they are now releasing the file system to the public as QFS. They hope to see other large-scale Hadoop users evaluate and adopt it for their own big data processing needs and collaborate on its ongoing development. The source code is available on GitHub, along with prebuilt binaries for several popular versions of Linux.
The key enhancement in QFS seems simple in retrospect, but was tricky to implement. Standard HDFS achieves fault tolerance by storing three copies of each file; in contrast, QFS uses a technique called Reed-Solomon encoding, which has been in wide use since the 1980s in products such as CDs and DVDs.
According to Jim Kelly, vice president of R&D at Quantcast, HDFS’s optimization approach was well chosen when it was invented. Networks were relatively slow, so data locality was important, and HDFS tried to store a complete copy of each file on the node most likely to access it. But in the intervening years, networks have grown tenfold in speed, leaving disks as the major performance bottleneck, so it’s now possible to achieve better performance, fault tolerance, and disk space efficiency by distributing data more widely.
The form of Reed-Solomon encoding used in QFS stores redundant data in 9 places, called stripes, and is able to reconstruct the file from any 6 of these stripes. Whereas HDFS could lose a file if the 3 disks hosting its copies all happen to fail, QFS loses a file only if at least 4 of its 9 stripes are lost, which makes it more robust.
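To make the "any 6 of 9" property concrete, here is a minimal sketch of a Reed-Solomon style code in Python. It is not QFS's implementation: the prime field of size 257, the function names, and the sample symbols are all assumptions chosen for readability (production codes typically work over GF(2^8) on whole blocks of bytes). The six data symbols are treated as the values of a polynomial at x = 0..5, and the three redundant symbols are the same polynomial's values at x = 6..8, so any six of the nine values pin down the polynomial and hence the data.

```python
from itertools import combinations

P = 257              # small prime field; an assumption for this sketch only
DATA, PARITY = 6, 3  # 6 data stripes plus 3 redundant stripes, as in the article

def interpolate_at(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    passing through the given (xi, yi) points, with arithmetic mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+)
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def encode(data):
    """Return 9 symbols: the 6 data symbols followed by 3 redundant ones."""
    assert len(data) == DATA
    points = list(enumerate(data))
    parity = [interpolate_at(points, x) for x in range(DATA, DATA + PARITY)]
    return list(data) + parity

def decode(survivors):
    """Recover the 6 data symbols from any 6 (index, symbol) pairs."""
    assert len(survivors) >= DATA
    points = survivors[:DATA]
    return [interpolate_at(points, x) for x in range(DATA)]

data = [81, 70, 83, 33, 0, 255]   # hypothetical sample symbols, each < P
stripes = encode(data)

# Try every way of keeping 6 of the 9 stripes and check the data comes back.
recoverable = all(
    decode([(i, stripes[i]) for i in keep]) == data
    for keep in combinations(range(DATA + PARITY), DATA)
)
```

The final check walks through all 84 ways of keeping six stripes out of nine, which is the same as trying every possible loss of three stripes.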
More importantly, Reed-Solomon adds only 50% to the size of the data stored, making it twice as efficient as HDFS in terms of storage space, which also has ripple effects in savings on servers, power, cooling, and more.
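The arithmetic behind the "twice as efficient" claim can be spelled out in a few lines of Python. The helper name below is hypothetical; the only inputs are the figures given above: three full copies for HDFS, and six data stripes plus three redundant stripes for QFS.

```python
def raw_per_logical(data_units, redundant_units):
    """Raw storage consumed per unit of logical data."""
    return (data_units + redundant_units) / data_units

# HDFS: one original plus two extra copies of every file.
hdfs = raw_per_logical(data_units=1, redundant_units=2)

# QFS: six data stripes plus three redundant stripes.
qfs = raw_per_logical(data_units=6, redundant_units=3)

hdfs_overhead_pct = (hdfs - 1) * 100   # extra space beyond the data itself
qfs_overhead_pct = (qfs - 1) * 100
space_ratio = qfs / hdfs               # QFS footprint relative to HDFS

print(f"HDFS: {hdfs:.1f}x raw, {hdfs_overhead_pct:.0f}% overhead")
print(f"QFS:  {qfs:.1f}x raw, {qfs_overhead_pct:.0f}% overhead")
print(f"QFS uses {space_ratio:.0%} of the space HDFS needs")
```

Under these figures HDFS stores 3 units of raw data per unit of logical data and QFS stores 1.5, which is where both the 50% overhead and the half-the-space comparison come from.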
Furthermore, the technique increases performance: writes are faster, because only half as much data needs to be written, and reads are faster, because every read is done by six drives working in parallel. Quantcast’s benchmarks of Hadoop jobs using HDFS and QFS show a 47% performance improvement in reads over HDFS, and a 75% improvement in writes.
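The parallel-read claim rests on striping: consecutive chunks of a file are dealt out across six stripes, so a sequential read can pull from six drives at once. The sketch below illustrates only that layout idea. The function names and the tiny 4-byte chunk size are assumptions for demonstration and do not reflect QFS's actual on-disk format or stripe size.

```python
def stripe(block, n_stripes=6, chunk=4):
    """Deal fixed-size chunks of block round-robin across n_stripes."""
    stripes = [bytearray() for _ in range(n_stripes)]
    for k, start in enumerate(range(0, len(block), chunk)):
        stripes[k % n_stripes] += block[start:start + chunk]
    return [bytes(s) for s in stripes]

def unstripe(stripes, chunk=4):
    """Reassemble the original block by reading chunks round-robin."""
    out = bytearray()
    offsets = [0] * len(stripes)
    total = sum(len(s) for s in stripes)
    k = 0
    while len(out) < total:
        i = k % len(stripes)
        out += stripes[i][offsets[i]:offsets[i] + chunk]
        offsets[i] += chunk
        k += 1
    return bytes(out)

block = b"abcdefghijklmnopqrstuvwxyz"   # hypothetical 26-byte "file"
parts = stripe(block)
restored = unstripe(parts)
```

Each entry in `parts` would live on a different drive, so the six reads in each round of `unstripe` are independent and could be issued concurrently.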
QFS is also a bit more efficient because it is written in C++ instead of Java. Hadoop uses existing JNI bindings to communicate with it.
Quantcast expects QFS to be of most interest to established Hadoop shops processing enough data that cost-efficient use of hardware is a significant concern. Smaller environments, those new to Hadoop, or those needing specific HDFS features will probably find HDFS a better fit. Quantcast has tested QFS intensively in-house, running it in production for over a year, so now it’s time to see how the code holds up in a wider public test.