ENTRIES TAGGED "Big Data"
Would you let people know about your dandruff problem if it might mean a cure for lupus?
Two weeks ago, I had the privilege of attending the 2013 Genomes, Environments and Traits conference in Boston as a participant in Harvard Medical School’s Personal Genome Project. Several hundred of us attended the conference, eager to learn what new breakthroughs might be in the works using the data and samples we have contributed, and to network with the researchers and each other.
The Personal Genome Project (PGP) is a very different type of beast from the traditional research study model, in several ways. To begin with, it is an Open Consent study, which means that all the data participants donate is available for research by anyone, without further consent from the subject. In other words, because I consented up front to participate in the PGP, anyone can download my genome sequence, look at my phenotypic traits (my physical characteristics and medical history), or even order some of my blood from a cell line that has been established at the Coriell biobank, and they do not need to seek specific consent from me to do so. By contrast, in most research studies, data and samples can be collected only for one specific study and used for no other purpose. That restriction exists to protect the privacy of participants, which was famously violated in the establishment of the HeLa cell line.
The other big difference is that in most studies, the participants rarely receive any information back from the researchers. For example, if the researcher does a brain MRI to gather data about the structure of a part of your brain, and sees a huge tumor, they are under no obligation to inform you about it, or even to give you a copy of the scan. This is because researchers are not certified as clinical laboratories, and thus are not authorized to report medical findings. This makes sense, to a certain extent, with traditional medical tests, as the research version may not be calibrated to detect the same things, and the researcher is not qualified to interpret the results for medical purposes.
Tachyon enables data sharing across frameworks and performs operations at memory speed
In earlier posts I’ve written about how Spark and Shark run much faster than Hadoop and Hive by caching data sets in-memory. But suppose you want to share data sets across jobs and frameworks while retaining the speed gains of staying in-memory. An example would be performing computations in Spark, saving the results, and accessing them from Hadoop MapReduce. An in-memory storage system would speed up sharing across jobs by letting users save at near memory speed. The main challenge, in particular, is being able to do memory-speed “writes” while maintaining fault tolerance.
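One general approach to this challenge is lineage-based recovery: instead of synchronously replicating every write over the network, the storage system records how each data set was derived and recomputes a lost block on demand. The toy Python sketch below illustrates that idea in miniature; the `LineageStore` class and its methods are purely illustrative and are not any real system's API.

```python
# Toy sketch of lineage-based fault tolerance: rather than replicating
# every write (slow), the store records *how* each dataset was derived
# and recomputes it from its inputs if the in-memory copy is lost.
# All names here are illustrative, not a real storage system's API.

class LineageStore:
    def __init__(self):
        self._data = {}     # in-memory blocks: name -> value
        self._lineage = {}  # recipes: name -> (function, input names)

    def put(self, name, fn, inputs=()):
        """A memory-speed 'write': store the result plus its recipe."""
        self._lineage[name] = (fn, tuple(inputs))
        self._data[name] = fn(*(self.get(i) for i in inputs))

    def get(self, name):
        if name not in self._data:            # block lost (e.g. node failure)
            fn, inputs = self._lineage[name]  # recompute from lineage
            self._data[name] = fn(*(self.get(i) for i in inputs))
        return self._data[name]

store = LineageStore()
store.put("raw", lambda: list(range(5)))
store.put("squares", lambda xs: [x * x for x in xs], inputs=("raw",))

del store._data["squares"]   # simulate losing the in-memory copy
result = store.get("squares")  # transparently recomputed from "raw"
```

The write path never touches a replica, which is what makes it fast; the trade-off is that recovery costs recomputation time, which works well when the derivation functions are cheap relative to network replication.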
In-memory storage system from UC Berkeley’s AMPLab
The team behind the BDAS stack recently released a developer preview of Tachyon – an in-memory, distributed file system. The current version of Tachyon is written in Java and supports Spark, Shark, and Hadoop MapReduce. Working data sets can be loaded into Tachyon, where many concurrent users can access them at memory speed. Tachyon implements the HDFS FileSystem interface for standard file operations (such as create, open, read, write, close, and delete).
Business intelligence, machine-learning, and graph-processing systems tackle large data sets on single servers.
About a year ago, a blog post from SAP posited that when it comes to analytics, most companies are in the multi-terabyte range: data sizes that are well within the scope of distributed in-memory solutions like Spark, SAP HANA, ScaleOut Software, GridGain, and Terracotta.
Kate Crawford argues for caution and care in data-driven decision making.
Microsoft principal researcher Kate Crawford (@katecrawford) gave a strong talk at last week’s Strata Conference in Santa Clara, Calif. about the limits of big data. She pointed out potential biases in data collection, questioned who may be excluded from it, and hammered home the constant need for context in conclusions. Video of her talk is embedded below:
Crawford explored many of these same topics in our interview, which follows.
Tools slowly democratize many data science tasks
Here are a few observations based on conversations I had during the just concluded Strata Santa Clara conference.
Spark is attracting attention
I’ve written numerous times about components of the Berkeley Data Analytics Stack (Spark, Shark, MLbase). Two Spark-related sessions at Strata were packed (slides here and here) and I talked to many people who were itching to try the BDAS stack. Being able to combine batch, real-time, and interactive analytics in a framework that uses a simple programming model is very attractive. The release of version 0.7 adds a Python API to Spark’s native Scala interface and Java API.
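To give a flavor of the programming model that makes Spark so attractive, here is a toy, stdlib-only Python sketch in the RDD style the new Python API exposes: lazy, chained transformations that are only evaluated when an action forces them. The `ToyRDD` class and its methods are hypothetical stand-ins, not PySpark's actual classes.

```python
# Toy illustration of the RDD-style programming model: transformations
# (map, filter) build up a lazy pipeline over a data set, and nothing is
# computed until an action (collect) is called. This is a plain-Python
# sketch for illustration, not the real PySpark API.

class ToyRDD:
    def __init__(self, data):
        self._data = data  # an iterable/generator; nothing computed yet

    def map(self, fn):
        # Transformation: returns a new lazy pipeline stage.
        return ToyRDD(fn(x) for x in self._data)

    def filter(self, pred):
        # Transformation: another lazy stage.
        return ToyRDD(x for x in self._data if pred(x))

    def collect(self):
        # Action: forces evaluation of the whole pipeline.
        return list(self._data)

counts = (ToyRDD(["spark", "shark", "hadoop", "spark"])
          .map(lambda w: (w, 1))
          .filter(lambda kv: kv[0].startswith("s"))
          .collect())  # [('spark', 1), ('shark', 1), ('spark', 1)]
```

The appeal of the real thing is that this same concise chaining works whether the pipeline runs as a batch job, over a stream, or interactively from a shell, with the distribution handled by the framework.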
Barlow's distilled insights on the ever-evolving definition of real-time big data analytics
During a break in between offsite meetings that Edd and I were attending the other day, he asked me, “did you read the Barlow piece?”
“Umm, no,” I replied sheepishly. Insert a sidelong glance from Edd that said much without saying anything aloud. He’s really good at that.
In my utterly meager defense, Mike Loukides is the editor on Mike Barlow’s Real-Time Big Data Analytics: Emerging Architecture. As Loukides is one of the core drivers behind O’Reilly’s book publishing program and someone who I perceive to be an unofficial boss of my own choosing, I am not really inclined to worry about things that I really don’t need to worry about. Then I started getting not-so-subtle inquiries from additional people asking if I would consider reviewing the manuscript for the Strata community site. This resulted in me emailing Loukides for a copy and sitting in a local cafe on a Sunday afternoon to read through the manuscript.
The biggest problems will almost always be those for which the size of the data is part of the problem.
A recent VentureBeat article argues that “Big Data” is dead. It’s been killed by marketers. That’s an understandable frustration (and a little ironic to read about it in that particular venue). As I said sarcastically the other day, “Put your Big Data in the Cloud with a Hadoop.”
You don’t have to read much industry news to get the sense that “big data” is sliding into the trough of Gartner’s hype curve. That’s natural. Regardless of the technology, the trough of the hype cycle is driven by a familiar set of causes: it’s fed by over-aggressive marketing, the longing for a silver bullet that doesn’t exist, and the desire to spout the newest buzzwords. All of these phenomena breed cynicism. Perhaps the most dangerous is the technologist who never understands the limitations of data, never understands what data isn’t telling you, or never understands that if you ask the wrong questions, you’ll certainly get the wrong answers.
Big data is not a term I’m particularly fond of. It’s just data, regardless of the size. But I do like Roger Magoulas’ definition of “big data”: big data is when the size of the data becomes part of the problem. I like that definition because it scales. It was meaningful in 1960, when “big data” was a couple of megabytes. It will be meaningful in 2030, when we all have petabyte laptops, or eyeglasses connected directly to Google’s yottabyte cloud. It’s not convenient for marketing, I admit; today’s “Big Data!!! With Hadoop And Other Essential Nutrients Added” is tomorrow’s “not so big data, small data actually.” Marketing, for better or for worse, will deal.
Is data collection entering discriminatory territory? Also, big data's role in crime fighting and its debut in the NBA.
Data mining opens new doors for discrimination, marginalization
“For most of the Internet’s short history, the primary goal of this data collection was classic product marketing: for example, advertisers might want to show me Nikes and my wife Manolo Blahniks. But increasingly, data collection is leapfrogging well beyond strict advertising and enabling insurance, medical and other companies to benefit from analyzing your personal, highly detailed ‘Big Data’ record without your knowledge. Based on this analysis, these companies then make decisions about you — including whether you are even worth marketing to at all.”
The consequences of such detailed data mining run deep. Fertik notes that advances in online data mining are enabling companies to “skirt the spirit of the law” and make discriminatory choices in who receives credit or loan offers, for example, by simply not displaying online offers to less credit-attractive users. “If you live on the wrong side of the digital tracks,” he says, “you won’t even see a credit offer from leading lending institutions, and you won’t realize that loans are available to help you with your current personal or professional priorities.”
A Call for Industry-Standard Benchmarks for Big Data Platforms at Strata SC 2013
Big data systems are characterized by their flexibility in processing diverse data genres, such as transaction logs, connection graphs, and natural language text, with algorithms that exhibit multiple communication patterns, e.g., scatter-gather, broadcast, multicast, pipelines, and bulk-synchronous. A single benchmark that characterizes a single workload cannot be representative of such a multitude of use cases. However, our systematic study of several use cases of current big data platforms indicates that most workloads are composed of a common set of stages, which capture the variety of data genres and algorithms commonly used to implement most data-intensive end-to-end workloads. Our upcoming session at Strata SC discusses the BigData Top 100 List, a new community-based initiative for benchmarking big data systems.