Here are a few stories from the data space that caught my attention this week.
Rackspace vs Amazon
Rackspace continues to ramp up its services to compete with Amazon Web Services (AWS); this week it announced a partnership with Hortonworks to develop a cloud-based, enterprise-ready Hadoop platform to rival Amazon's Elastic MapReduce, and Derrick Harris at GigaOm compared the two companies' offerings head to head.
John Engates, CTO of Rackspace, told Harris the most fundamental difference between the two services is the level of control given to the customer. Harris writes that Rackspace's new Hadoop service aims to give the customer "granular control over how their systems are configured and how their jobs run," providing "the experience of owning a Hadoop cluster without actually owning any of the hardware." Engates pointed out, "It's not MapReduce as a service; it's more Hadoop as a service."
Harris also points out that Rackspace is considering making moves into NoSQL, and looks at AWS' DynamoDB service. He notes that Amazon and Rackspace aren't the only players in any of these fields, pointing to the likes of Microsoft's HDInsight, IBM's BigInsights, Qubole, Infochimps, and MongoDB-, Cassandra- and CouchDB-based services.
In related news, Rackspace announced its new Cloud Networks feature this week that allows customers to design their own networks on Rackspace’s Cloud Servers. In an interview with Jack McCarthy at CRN, Engates explained the background:
"When we went from dedicated physical networks to our public cloud, we lost the ability to segment these networks. We used to have a vLAN. As we moved to OpenStack, we wanted to give our customers the ability to enable segmented networks in the cloud. Cloud Networks gives customers a degree of control over how they build networks in the cloud, whether it's building networks for application servers or for Web servers or databases."
Engates also points out the networks are software-defined, “so customers can program their network on the fly.” You can read more about the new feature on the Rackspace blog.
Watson goes to med school, Google releases data science research
IBM announced this week that Watson will be heading to medical school at the Cleveland Clinic. Steve Lohr reports at the New York Times that Watson will be fed questions from the United States Medical Licensing Exam, with students on hand to correct its answers and respond to any questions Watson may have. Dr. David Ferrucci, the principal IBM scientist behind the Watson project, told Lohr the idea isn't to certify Watson as a doctor to replace humans, but to feed it enough data and keep it current on new research so it can serve as an effective physician's assistant. Lohr reports:
“In medicine, [Dr. David Ferrucci] said, you have a problem with many variables. For example, a 69-year-old female with certain symptoms, vital signs, family history, medications taken, genetic makeup, diet and exercise regimens. Someday, [he] said, Watson should be able to collect and assess all that patient data, and then construct ‘inference paths’ toward a probable diagnosis — digesting information, missing nothing and winnowing choices for a human doctor.”
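The "inference paths" Ferrucci describes can be loosely pictured as probabilistic updating: each new observation shifts the weight on candidate diagnoses, winnowing the list toward the most probable one. A toy sketch of that idea using Bayes' rule — all of the condition names, symptoms, and probabilities below are invented for illustration and have nothing to do with Watson's actual internals:

```python
# Toy illustration of winnowing candidate diagnoses with Bayes' rule.
# Every condition, symptom, and probability here is invented for illustration.

priors = {"condition_a": 0.02, "condition_b": 0.08, "condition_c": 0.90}

# P(symptom observed | condition), per symptom, per condition (hypothetical).
likelihoods = {
    "fever":   {"condition_a": 0.9, "condition_b": 0.2, "condition_c": 0.05},
    "fatigue": {"condition_a": 0.8, "condition_b": 0.6, "condition_c": 0.2},
}

def update(posterior, symptom):
    """Weight each hypothesis by the symptom's likelihood, then renormalize."""
    scored = {c: p * likelihoods[symptom][c] for c, p in posterior.items()}
    total = sum(scored.values())
    return {c: s / total for c, s in scored.items()}

posterior = dict(priors)
for symptom in ["fever", "fatigue"]:
    posterior = update(posterior, symptom)

# The most probable diagnosis after all the evidence is folded in.
best = max(posterior, key=posterior.get)
```

In this made-up example the evidence overturns the prior: the initially unlikely `condition_a` ends up most probable once both symptoms are weighed, which is the kind of evidence-driven winnowing the quote gestures at.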
In loosely related news, Google released new research this week looking into the data science behind its speech recognition applications. Derrick Harris at GigaOm reports that, unsurprisingly, the research shows the more data you have, the better trained the speech recognition system will be — the larger the datasets and language models, the fewer the errors the system will experience when predicting next words based on previous words, for instance.
Harris also highlights another key finding from the research: the type of data matters. He reports that Google used 230 billion words from “a random sample of anonymized queries from google.com that did not trigger spelling correction” for the voice tests, but that “because people speak and write prose differently than they type searches, the YouTube models were fed data from transcriptions of news broadcasts and large web crawls.” The research also puts the increasing focus on big data and data science into context — “As consumers demand ever smarter applications and more frictionless user experiences,” Harris writes, “every last piece of data and every decision about how to analyze it matters.”
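The "predicting next words based on previous words" approach Harris describes is, at its simplest, an n-gram language model: count how often each word follows a given context in the training data, and predict the most frequent continuation. A minimal bigram sketch — the training text here is a made-up stand-in for Google's query and transcription data:

```python
from collections import Counter, defaultdict

# Minimal bigram language model: predict the next word from the previous one.
# The corpus is a tiny invented stand-in for real query/transcription data.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count, for each word, which words follow it and how often.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent word seen after `word` in training."""
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" more often than "mat" or "fish"
```

More training data means the counts cover more contexts and the frequency estimates get sharper, which is the scaling effect the research reports: larger datasets and language models, fewer next-word errors.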
Really, really big data
Stacey Higginbotham at GigaOm this week reported on a presentation from Shantanu Gupta, director of Connected Intelligent Solutions at Intel, on the new words we'll soon need in order to talk about volumes of data. We've got exabytes, then the larger zettabyte, and the still larger yottabyte, she says, but what then? Gupta has the answer: a brontobyte and a gegobyte. Higginbotham reports:
“A brontobyte, which isn’t an official SI prefix but is apparently recognized by some people in the measurement community, is a 10 followed by 27 zeros. Gupta uses it to describe the type of sensor data we’ll get from the Internet of things. A gegobyte is 10 to the power of 30.”
To put the volumes of data into perspective, Higginbotham highlights some statistics from Gupta’s presentation, including: Facebook databases take in 500 terabytes of new data daily, 72 hours of video are uploaded onto YouTube per minute (that’s a terabyte per four minutes), and Boeing jet engine sensors produce 20 terabytes of data per hour.
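Those figures are easy to put on the prefix ladder with a little arithmetic. A quick sketch using decimal powers of ten (note that "brontobyte" and "gegobyte" are informal terms, not official SI prefixes, and the annualized Facebook figure below is just the article's daily rate extrapolated):

```python
# Decimal byte prefixes mentioned in the piece. "Brontobyte" and "gegobyte"
# are informal coinages, not official SI prefixes.
PREFIXES = {
    "terabyte":   10**12,
    "petabyte":   10**15,
    "exabyte":    10**18,
    "zettabyte":  10**21,
    "yottabyte":  10**24,
    "brontobyte": 10**27,  # informal
    "gegobyte":   10**30,  # informal
}

# Facebook's stated intake: 500 TB of new data per day, extrapolated to a year.
facebook_daily = 500 * PREFIXES["terabyte"]
facebook_yearly = facebook_daily * 365

# Even a full year of that intake is a vanishing slice of one brontobyte.
fraction_of_brontobyte = facebook_yearly / PREFIXES["brontobyte"]
print(f"{facebook_yearly / PREFIXES['exabyte']:.4f} exabytes per year")
```

At 500 TB a day, a year of Facebook's intake comes to roughly 0.18 exabytes, which makes clear just how far beyond today's workloads the brontobyte and gegobyte scales sit.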
Tip us off
News tips and suggestions are always welcome, so please send them along.