ENTRIES TAGGED "Big Data"

Improving options for unlocking your graph data

Graph data is an area that has attracted many enthusiastic entrepreneurs and developers

The popular open source project GraphLab received a major boost early this week when a new company, composed of its founding developers, raised funding to develop analytic tools for graph data sets. GraphLab Inc. will continue to use the open source GraphLab to “push the limits of graph computation and develop new ideas”, but having a commercial company will accelerate development and allow the hiring of staff dedicated to improving usability and documentation.

While social media placed graph data on the radar of many companies, similar data sets can be found in many domains, including the life and health sciences, security, and financial services. Graph data is different enough that it necessitates special tools and techniques. Because those tools were too complex for casual users, graph data analytics has in the past been the province of specialists. Fortunately, graph data is an area that has attracted many enthusiastic entrepreneurs and developers. The tools have improved, and I expect things to get much easier for users in the future. A great place to learn more about tools for graph data is the upcoming GraphLab Workshop (on July 1st in SF).

Data wrangling: creating graphs
Before you can take advantage of the other tools mentioned in this post, you’ll need to turn your data (e.g., web pages) into graphs. GraphBuilder is an open source project from Intel that uses Hadoop MapReduce (1) to build graphs out of large data sets. Another option is the combination of GraphX/Spark described below. (A startup called Trifacta is building a general-purpose data wrangling tool that could help as well.)
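
As a rough illustration of this graph-construction step, the sketch below turns a handful of (page, outgoing links) records into an adjacency list with a map step and a reduce step in plain Python. The record format and the names `map_links` and `reduce_edges` are assumptions made for this example; this is not GraphBuilder's API, and a real job would run the two steps on a cluster.

```python
from collections import defaultdict

# Hypothetical input: each record is (source page, pages it links to).
pages = [
    ("a.html", ["b.html", "c.html"]),
    ("b.html", ["c.html"]),
    ("c.html", ["a.html", "a.html"]),  # the same link appears twice
]


def map_links(record):
    """Map step: emit one (source, target) pair per hyperlink."""
    source, targets = record
    for target in targets:
        yield source, target


def reduce_edges(pairs):
    """Reduce step: group targets by source and drop duplicate edges."""
    adjacency = defaultdict(set)
    for source, target in pairs:
        adjacency[source].add(target)
    return {source: sorted(targets) for source, targets in adjacency.items()}


edge_pairs = (pair for record in pages for pair in map_links(record))
graph = reduce_edges(edge_pairs)
```

The resulting `graph` dictionary maps each page to a sorted list of the distinct pages it links to, which is the kind of structure the storage and analytics tools below take as input.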

Data management and search
Once you have a graph, there are many options for how to store it. The choice of database largely depends on the amount of data (the number of nodes and edges, along with the size of the data associated with them), the types of tasks (pattern-matching and search, analytics), and the workload. In the course of evaluating alternatives to MySQL (for storing social graph data), Facebook’s engineering team developed and released LinkBench, a benchmark that can be used to study how graph databases handle production workloads.

Most graph databases (such as Neo4j (2), AllegroGraph, YarcData, and InfiniteGraph) come with tools for facilitating and speeding up search: Neo4j comes with a simple query language (Cypher) for search, while other graph databases support SPARQL. The Titan distributed graph database supports different storage engines (including HBase and Cassandra) and comes with tools for search and traversal (based on Lucene and Gremlin). FlockDB, used by Twitter to store graph data, targets operations involving adjacency lists.

Among Hadoop users, HBase is a popular option for storing graph data. Hadapt’s analytic platform (3) integrates Apache Hadoop and SQL, and now also supports graph analysis.

Graph-parallel frameworks: Pregel, PowerGraph, and GraphX
Bulk synchronous parallel (BSP) is a parallel computing model that has inspired many graph analytics tools. Just as Hadoop has map and reduce, Pregel (4), Giraph, and Pregelix come with primitives that let neighboring nodes exchange messages with one another, or change the state of a node (based on the state of its neighboring nodes). Efficient graph algorithms are a sequence of iterations built from such primitives. GraphLab uses similar primitives (called PowerGraph) but allows for asynchronous iterative computations, leading to an expanded set of (potentially) faster algorithms.
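
To make the message-passing idea concrete, here is a minimal single-process sketch of a BSP-style computation: finding connected components by having every vertex repeatedly adopt the smallest label it hears from its neighbors. The function name `bsp_connected_components` and the dictionary-based graph are assumptions for illustration; Pregel, Giraph, and GraphLab each expose their own APIs and distribute the work across machines.

```python
def bsp_connected_components(neighbors):
    """Label each vertex with the smallest vertex id in its component.

    `neighbors` maps every vertex to the vertices it shares an edge with
    (undirected graph, so each edge is listed in both directions).
    """
    labels = {v: v for v in neighbors}  # initial state: own id

    # Superstep 0: every vertex sends its label to its neighbors.
    inbox = {v: [] for v in neighbors}
    for v, label in labels.items():
        for n in neighbors[v]:
            inbox[n].append(label)

    supersteps = 0
    while any(inbox.values()):  # halt once no messages are in flight
        supersteps += 1
        outbox = {v: [] for v in neighbors}
        for v, messages in inbox.items():
            if not messages:
                continue  # vertex stays inactive this superstep
            smallest = min(messages)
            if smallest < labels[v]:
                labels[v] = smallest  # update own state only
                for n in neighbors[v]:
                    outbox[n].append(smallest)
        inbox = outbox  # barrier: messages are delivered next superstep
    return labels, supersteps


example = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
components, steps = bsp_connected_components(example)
```

Each pass through the `while` loop plays the role of one synchronous superstep: vertices read only the messages sent in the previous step, and the swap of `inbox` and `outbox` stands in for the global barrier. An asynchronous engine would relax that barrier.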

GraphX is a new, fault-tolerant framework that runs within Spark. Its core data structure is an immutable graph (5) (Resilient Distributed Graph, or RDG), and GraphX programs are a sequence of transformations on RDGs (with each transformation yielding a new RDG). Transformations on RDGs can affect nodes, edges, or both (depending on the state of neighboring edges and nodes). GraphX greatly enhances productivity by simplifying a range of tasks (graph loading, construction, transformation, and computations). But it does so at the expense of performance: early prototype algorithms written in GraphX were slower (6) than those written in GraphLab/PowerGraph.
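
The idea of programs as chains of transformations on an immutable graph can be sketched in a few lines of Python. The `Graph` class and its methods below are hypothetical and only loosely modeled on the description above; they are not the GraphX API, and they ignore distribution and fault tolerance entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Graph:
    """Immutable graph: every transformation returns a new Graph."""

    vertices: tuple  # of (vertex_id, attribute) pairs
    edges: tuple     # of (source_id, destination_id) pairs

    def map_vertices(self, fn):
        """Return a new graph with each attribute replaced by fn(id, attr)."""
        updated = tuple((vid, fn(vid, attr)) for vid, attr in self.vertices)
        return Graph(updated, self.edges)

    def subgraph(self, keep):
        """Return a new graph with only vertices passing keep(id, attr)."""
        kept = tuple((vid, attr) for vid, attr in self.vertices if keep(vid, attr))
        ids = {vid for vid, _ in kept}
        edges = tuple((s, d) for s, d in self.edges if s in ids and d in ids)
        return Graph(kept, edges)

    def out_degrees(self):
        """Return a new graph whose vertex attribute is its out-degree."""
        counts = {vid: 0 for vid, _ in self.vertices}
        for source, _ in self.edges:
            counts[source] += 1
        return self.map_vertices(lambda vid, attr: counts[vid])


g = Graph(
    vertices=(("a", None), ("b", None), ("c", None)),
    edges=(("a", "b"), ("a", "c"), ("b", "c")),
)
degrees = g.out_degrees()                            # new graph, g is untouched
active = degrees.subgraph(lambda vid, deg: deg > 0)  # drops vertex "c"
```

Because no step mutates its input, any intermediate graph can be recomputed from its parent, which is the property that lineage-based fault tolerance in Spark relies on.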

Machine-learning and analytics
Machine-learning tools that target graph data lead to familiar applications such as detecting influential users (PageRank) and communities, fraud detection, and recommendations (collaborative filtering is popular among GraphLab users). Moreover, techniques developed in one domain are often reused in other settings. Besides GraphLab, distributed analytics have been implemented in Giraph, GraphX, Faunus, and Grappa. In addition, graph databases like Neo4j and YarcData come with some analytic capabilities. As I noted in a recent post, open source, single-node systems like Twitter’s Cassovary (7) are being used for computations involving massive graphs.
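
For readers who have not seen it, PageRank itself fits in a short function. The version below is a plain power-iteration sketch on a small, made-up follower graph; the damping factor of 0.85 and the fixed iteration count are conventional choices assumed here, and the distributed systems named above implement the same idea at far larger scale with their own APIs.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict of node -> list of linked nodes."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node gets the "random jump" share up front.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # A node with no out-links spreads its rank over all nodes.
            targets = out_links[v] or nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical "who follows whom" graph.
follows = {
    "ann": ["bob"],
    "bob": ["cat"],
    "cat": ["ann", "bob"],
    "dan": ["cat"],
}
ranks = pagerank(follows)
most_influential = max(ranks, key=ranks.get)
```

The ranks always sum to one, and a user nobody follows (here `dan`) ends up with only the baseline share.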

Visualization
When you’re dealing with large graphs, being able to zoom in/out helps with clutter, but so do clever layout algorithms. Popular tools for visualizing nodes and edges include Gephi and GraphViz. Users who want to customize their graphs turn to packages like d3.



(1) I would love to see a version of GraphBuilder that’s built on top of Spark.
(2) Many of these systems are quite efficient. For example, a single instance of Neo4j can handle very large graphs (“into the tens of billions of nodes/ relationships/ properties”).
(3) Note that using standard Hadoop for graph processing may not be the most efficient option. This talk by Hadapt co-founder Daniel Abadi describes an advanced approach to graph analysis using Hadoop.
(4) Related frameworks include GoldenOrb and Hama.
(5) Resilient Distributed Graphs (RDG) extend Spark’s Resilient Distributed Dataset (RDD).
(6) As the developers of GraphX note: “We emphasize that it is not our intention to beat PowerGraph in performance. … We believe that the loss in performance may, in many cases, be ameliorated by the gains in productivity achieved by the GraphX system. … It is our belief that we can shorten the gap in the near future, while providing a highly usable interactive system for graph data mining and computation”
(7) On the plus side, being single-node means Cassovary doesn’t have to deal with finding the optimal way to partition a graph. On the other hand, it is limited to graphs that fit in the memory of a server – a limitation it alleviates through the use of efficient data structures.

Genomics and Privacy at the Crossroads

Would you let people know about your dandruff problem if it might mean a cure for Lupus?

Two weeks ago, I had the privilege of attending the 2013 Genomes, Environments and Traits conference in Boston, as a participant in Harvard Medical School’s Personal Genome Project. Several hundred of us attended the conference, eager to learn what new breakthroughs might be in the works using the data and samples we have contributed, and to network with the researchers and each other.

The Personal Genome Project (PGP) is a very different type of beast from the traditional research study model, in several ways. To begin with, it is an Open Consent study, which means that all the data that participants donate is available for research by anyone without further consent by the subject. In other words, because I initially consented to participate in the PGP, anyone can download my genome sequence, look at my phenotypic traits (my physical characteristics and medical history), or even order some of my blood from a cell line that has been established at the Coriell biobank, and they do not need to gain specific consent from me to do so. By contrast, in most research studies, data and samples can only be collected for one specific study, and no other purposes. This is all in an effort to protect the privacy of the participants, which was famously violated in the establishment of the HeLa cell line.

The other big difference is that in most studies, the participants rarely receive any information back from the researchers. For example, if the researcher does a brain MRI to gather data about the structure of a part of your brain, and sees a huge tumor, they are under no obligation to inform you about it, or even to give you a copy of the scan. This is because researchers are not certified as clinical laboratories, and thus are not authorized to report medical findings. This makes sense, to a certain extent, with traditional medical tests, as the research version may not be calibrated to detect the same things, and the researcher is not qualified to interpret the results for medical purposes.

Read more…

Tachyon: An open source, distributed, fault-tolerant, in-memory file system

Tachyon enables data sharing across frameworks and performs operations at memory speed

In earlier posts I’ve written about how Spark and Shark run much faster than Hadoop and Hive by caching data sets in-memory. But suppose one wants to share datasets across jobs/frameworks, while retaining the speed gains that come from being in-memory? An example would be performing computations using Spark, saving the results, and accessing the saved results in Hadoop MapReduce. An in-memory storage system would speed up sharing across jobs by allowing users to save at near memory speeds. In particular, the main challenge is being able to do memory-speed “writes” while maintaining fault-tolerance.

In-memory storage system from UC Berkeley’s AMPLab
The team behind the BDAS stack recently released a developer preview of Tachyon, an in-memory, distributed file system. The current version of Tachyon was written in Java and supports Spark, Shark, and Hadoop MapReduce. Working data sets can be loaded into Tachyon, where they can be accessed at memory speed by many concurrent users. Tachyon implements the HDFS FileSystem interface for standard file operations (such as create, open, read, write, close, and delete).
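
To give a feel for what "standard file operations backed by memory" means, here is a toy, single-process sketch in Python. The class `InMemoryFileSystem` and its methods are hypothetical and written only for this illustration; they are not Tachyon's or HDFS's API, and the sketch leaves out everything that makes the real system hard (distribution, concurrency, and fault tolerance).

```python
import io


class InMemoryFileSystem:
    """Toy file store that keeps every file's bytes in a dictionary."""

    def __init__(self):
        self._files = {}

    def create(self, path):
        """Return a writable handle; contents are stored when it is closed."""
        return _WriteHandle(self, path)

    def open(self, path):
        """Return a readable handle over the stored bytes."""
        return io.BytesIO(self._files[path])

    def exists(self, path):
        return path in self._files

    def delete(self, path):
        """Remove a file; return True if it existed."""
        return self._files.pop(path, None) is not None


class _WriteHandle(io.BytesIO):
    """In-memory buffer that publishes its contents to the store on close."""

    def __init__(self, fs, path):
        super().__init__()
        self._fs = fs
        self._path = path

    def close(self):
        if not self.closed:
            self._fs._files[self._path] = self.getvalue()
        super().close()


fs = InMemoryFileSystem()

# One "job" writes its results...
with fs.create("/results/part-0") as out:
    out.write(b"user_id,score\n42,0.87\n")

# ...and a second "job" reads them back without touching disk.
with fs.open("/results/part-0") as src:
    contents = src.read()

removed = fs.delete("/results/part-0")
```

The point of exposing a familiar create/open/read/write/close/delete surface is that existing frameworks can share data through the store without being rewritten.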

Read more…

Single server systems can tackle big data

Business Intelligence, machine-learning, and graph processing systems tackle large data sets with single servers.

About a year ago, a blog post from SAP posited that when it comes to analytics, most companies are in the multi-terabyte range: data sizes that are well within the scope of distributed in-memory solutions like Spark, SAP HANA, ScaleOut Software, GridGain, and Terracotta.

Read more…

Untangling algorithmic illusions from reality in big data

Kate Crawford argues for caution and care in data-driven decision making.

Microsoft principal researcher Kate Crawford (@katecrawford) gave a strong talk at last week’s Strata Conference in Santa Clara, Calif. about the limits of big data. She pointed out potential biases in data collection, questioned who may be excluded from it, and hammered home the constant need for context in conclusions. Video of her talk is embedded below:

Crawford explored many of these same topics in our interview, which follows.

Read more…

Data Science Tools: Fast, easy to use, and scalable

Tools slowly democratize many data science tasks

Here are a few observations based on conversations I had during the just-concluded Strata Santa Clara conference.

Spark is attracting attention
I’ve written numerous times about components of the Berkeley Data Analytics Stack (Spark, Shark, MLbase). Two Spark-related sessions at Strata were packed (slides here and here) and I talked to many people who were itching to try the BDAS stack. Being able to combine batch, real-time, and interactive analytics in a framework that uses a simple programming model is very attractive. The release of version 0.7 adds a Python API to Spark’s native Scala interface and Java API.

Read more…

On reading Mike Barlow’s “Real-Time Big Data Analytics: Emerging Architecture”

Barlow's distilled insights regarding the ever-evolving definition of real-time big data analytics

Reading Barlow on a Sunday Afternoon

During a break between offsite meetings that Edd and I were attending the other day, he asked me, “Did you read the Barlow piece?”

“Umm, no,” I replied sheepishly. Insert a sidelong glance from Edd that said much without saying anything aloud. He’s really good at that.

In my utterly meager defense, Mike Loukides is the editor on Mike Barlow’s Real-Time Big Data Analytics: Emerging Architecture. As Loukides is one of the core drivers behind O’Reilly’s book publishing program and someone who I perceive to be an unofficial boss of my own choosing, I am not really inclined to worry about things that I really don’t need to worry about. Then I started getting not-so-subtle inquiries from additional people asking if I would consider reviewing the manuscript for the Strata community site. This resulted in me emailing Loukides for a copy and sitting in a local cafe on a Sunday afternoon to read through the manuscript.

Read more…

Big data is dead, long live big data: Thoughts heading to Strata

The biggest problems will almost always be those for which the size of the data is part of the problem.

A recent VentureBeat article argues that “Big Data” is dead. It’s been killed by marketers. That’s an understandable frustration (and a little ironic to read about it in that particular venue). As I said sarcastically the other day, “Put your Big Data in the Cloud with a Hadoop.”

You don’t have to read much industry news to get the sense that “big data” is sliding into the trough of Gartner’s hype curve. That’s natural. Regardless of the technology, the trough of the hype cycle is driven by a familiar set of causes: it’s fed by over-aggressive marketing, the longing for a silver bullet that doesn’t exist, and the desire to spout the newest buzzwords. All of these phenomena breed cynicism. Perhaps the most dangerous is the technologist who never understands the limitations of data, never understands what data isn’t telling you, or never understands that if you ask the wrong questions, you’ll certainly get the wrong answers.

Big data is not a term I’m particularly fond of. It’s just data, regardless of the size. But I do like Roger Magoulas’ definition of “big data”: big data is when the size of the data becomes part of the problem. I like that definition because it scales. It was meaningful in 1960, when “big data” was a couple of megabytes. It will be meaningful in 2030, when we all have petabyte laptops, or eyeglasses connected directly to Google’s yottabyte cloud. It’s not convenient for marketing, I admit; today’s “Big Data!!! With Hadoop And Other Essential Nutrients Added” is tomorrow’s “not so big data, small data actually.” Marketing, for better or for worse, will deal.

Read more…

Strata Week: The data divide is growing

Is data collection entering discriminatory territory? Also, big data's role in crime fighting and its debut in the NBA.

Data mining opens new doors for discrimination, marginalization

In a post at Scientific American, Michael Fertik took a look at how Internet data collection practices are beginning to create an unequal — even discriminatory — online environment. Fertik writes:

“For most of the Internet’s short history, the primary goal of this data collection was classic product marketing: for example, advertisers might want to show me Nikes and my wife Manolo Blahniks. But increasingly, data collection is leapfrogging well beyond strict advertising and enabling insurance, medical and other companies to benefit from analyzing your personal, highly detailed ‘Big Data’ record without your knowledge. Based on this analysis, these companies then make decisions about you — including whether you are even worth marketing to at all.”

The consequences of such detailed data mining run deep. Fertik notes that advances in online data mining are enabling companies to “skirt the spirit of the law” and make discriminatory choices in who receives credit or loan offers, for example, by simply not displaying online offers to less credit-attractive users. “If you live on the wrong side of the digital tracks,” he says, “you won’t even see a credit offer from leading lending institutions, and you won’t realize that loans are available to help you with your current personal or professional priorities.”

Read more…

BigData Top 100 Initiative

A Call for Industry-Standard Benchmarks for Big Data Platforms at Strata SC 2013

By Milind Bhandarkar, Chaitan Baru, Raghunath Nambiar, Meikel Poess, and Dr. Tilmann Rabl

Big data systems are characterized by their flexibility in processing diverse data genres, such as transaction logs, connection graphs, and natural language text, with algorithms characterized by multiple communication patterns, e.g. scatter-gather, broadcast, multicast, pipelines, and bulk-synchronous. A single benchmark that characterizes a single workload could not be representative of such a multitude of use-cases. However, our systematic study of several use-cases of current big data platforms indicates that most workloads are composed of a common set of stages, which capture the variety of data genres and algorithms commonly used to implement most data-intensive end-to-end workloads. Our upcoming session at Strata SC discusses the BigData Top 100 List, a new community-based initiative for benchmarking big data systems.

Read more…
