Improving options for unlocking your graph data
Graph data is an area that has attracted many enthusiastic entrepreneurs and developers
The popular open source project GraphLab received a major boost early this week when a new company, made up of its founding developers, raised funding to develop analytic tools for graph data sets. GraphLab Inc. will continue to use the open source GraphLab to “push the limits of graph computation and develop new ideas”, but having a commercial company will accelerate development and allow the hiring of staff dedicated to improving usability and documentation.
While social media placed graph data on the radar of many companies, similar data sets can be found in many domains, including the life and health sciences, security, and financial services. Graph data is different enough that it necessitates special tools and techniques. Because those tools were too complex for casual users, graph data analytics used to be the province of specialists. Fortunately, graph data is an area that has attracted many enthusiastic entrepreneurs and developers. The tools have improved, and I expect things to get much easier for users in the future. A great place to learn more about tools for graph data is the upcoming GraphLab Workshop (on July 1st in SF).
Data wrangling: creating graphs
Before you can take advantage of the other tools mentioned in this post, you’ll need to turn your data (e.g., web pages) into graphs. GraphBuilder is an open source project from Intel that uses Hadoop MapReduce (1) to build graphs out of large data sets. Another option is the combination of GraphX/Spark described below. (A startup called Trifacta is building a general-purpose data wrangling tool that could help as well.)
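To make "turning data into graphs" concrete, here is a minimal single-machine Python sketch that pulls hyperlinks out of a few web pages and groups them into adjacency lists. The sample pages, the regular expression, and the function names `extract_edges` and `build_adjacency` are all assumptions made for this illustration. This is not GraphBuilder's API; it only mirrors the extract-then-group shape that a MapReduce job would apply at scale.

```python
import re
from collections import defaultdict

# Hypothetical input: page name -> raw HTML (illustrative data only).
PAGES = {
    "a.html": '<a href="b.html">B</a> <a href="c.html">C</a>',
    "b.html": '<a href="c.html">C</a>',
    "c.html": '<a href="a.html">A</a> <a href="a.html">A again</a>',
}

HREF = re.compile(r'href="([^"]+)"')


def extract_edges(pages):
    """'Map' step: emit one (source, target) pair per hyperlink."""
    for source, html in pages.items():
        for target in HREF.findall(html):
            yield source, target


def build_adjacency(edges):
    """'Reduce' step: group targets by source, dropping duplicate edges."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
    return {node: sorted(targets) for node, targets in adjacency.items()}


graph = build_adjacency(extract_edges(PAGES))
```

The resulting `graph` maps each page to the sorted list of pages it links to, which is the form most of the tools discussed later expect as input.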
Data management and search
Once you have a graph, there are many options for how to store it. The choice of database largely depends on the amount of data (the number of nodes and edges, along with the size of the data associated with them), the types of tasks (pattern-matching and search, analytics), and the workload. In the course of evaluating alternatives to MySQL (for storing social graph data), Facebook’s engineering team developed and released LinkBench, a benchmark that can be used to study how graph databases handle production workloads.
Most graph databases, such as Neo4j (2), AllegroGraph, YarcData, and InfiniteGraph, come with tools for facilitating and speeding up search: Neo4j comes with a simple query language (Cypher) for search, and other graph databases support SPARQL. The Titan distributed graph database supports different storage engines (including HBase and Cassandra) and comes with tools for search and traversal (based on Lucene and Gremlin). Used by Twitter to store graph data, FlockDB targets operations involving adjacency lists.
Among Hadoop users, HBase is a popular option for storing graph data. Hadapt’s analytic platform (3) integrates Apache Hadoop and SQL, and now also supports graph analysis.
Graph-parallel frameworks: Pregel, PowerGraph, and GraphX
Bulk Synchronous Parallel (BSP) is a parallel computing model that has inspired many graph analytics tools. Just as Hadoop offers map and reduce, Pregel (4), Giraph, and Pregelix come with primitives that let neighboring nodes send and receive messages, or change the state of a node (based on the state of its neighboring nodes). Efficient graph algorithms are sequences of iterations built from such primitives. GraphLab uses similar primitives (called PowerGraph) but allows for asynchronous iterative computations, leading to an expanded set of (potentially) faster algorithms.
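The message-passing primitives described above can be sketched in a few lines of plain Python. The example below propagates the largest vertex value through a graph in synchronous supersteps: each vertex reads its incoming messages, updates its own state, and sends messages to its neighbors only when its state changed. The function name `pregel_max` and the dictionary-based graph are assumptions for illustration; this is not the API of Pregel, Giraph, or Pregelix, and it runs on one machine with no partitioning.

```python
def pregel_max(neighbors, values, max_supersteps=50):
    """Propagate the maximum value to every reachable vertex, BSP style.

    neighbors: dict mapping each vertex to a list of its out-neighbors
               (every vertex must appear as a key).
    values:    dict mapping each vertex to its initial numeric value.
    """
    values = dict(values)  # work on a copy; the caller's dict is untouched

    # Superstep 0: every vertex sends its value to its neighbors.
    inbox = {v: [] for v in neighbors}
    for v, outs in neighbors.items():
        for u in outs:
            inbox[u].append(values[v])

    for _ in range(max_supersteps):
        outbox = {v: [] for v in neighbors}
        any_change = False
        for v, messages in inbox.items():
            # Vertex program: state depends only on messages from the
            # previous superstep, never on another vertex's current state.
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for u in neighbors[v]:
                    outbox[u].append(values[v])
                any_change = True
        if not any_change:
            break  # every vertex has voted to halt
        inbox = outbox  # synchronization barrier between supersteps
    return values


# A three-vertex path a - b - c, stored as directed edges in both directions.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
result = pregel_max(path, {"a": 3, "b": 6, "c": 2})
```

The swap of `inbox` and `outbox` plays the role of the barrier in BSP. An asynchronous engine would instead let a vertex see updated neighbor state as soon as it is available.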
GraphX is a new, fault-tolerant framework that runs within Spark. Its core data structure is an immutable graph (5), the Resilient Distributed Graph (RDG), and GraphX programs are a sequence of transformations on RDGs, with each transformation yielding a new RDG. Transformations on RDGs can affect nodes, edges, or both (depending on the state of neighboring edges and nodes). GraphX greatly enhances productivity by simplifying a range of tasks (graph loading, construction, transformation, and computation). But it does so at the expense of performance: early prototype algorithms written in GraphX were slower (6) than those written in GraphLab/PowerGraph.
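The idea of a program as a chain of transformations on an immutable graph can be illustrated with a small Python sketch. The `Graph` class and its `map_vertices`, `subgraph`, and `out_degrees` methods are hypothetical names chosen for this example. GraphX itself is a Scala library on Spark, and nothing here reproduces its API or its distributed, fault-tolerant behavior. The only point is that every operation returns a new graph and leaves the original untouched.

```python
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class Graph:
    vertices: MappingProxyType  # vertex id -> attribute (read-only view)
    edges: tuple                # (source, target) pairs

    @staticmethod
    def create(vertices, edges):
        # Copy the inputs so later changes by the caller cannot leak in.
        return Graph(MappingProxyType(dict(vertices)), tuple(edges))

    def map_vertices(self, fn):
        """Return a new graph whose vertex attributes are fn(id, attr)."""
        updated = {v: fn(v, a) for v, a in self.vertices.items()}
        return Graph.create(updated, self.edges)

    def subgraph(self, keep):
        """Return a new graph with only the vertices where keep(id, attr) holds."""
        kept = {v: a for v, a in self.vertices.items() if keep(v, a)}
        edges = [(s, t) for s, t in self.edges if s in kept and t in kept]
        return Graph.create(kept, edges)

    def out_degrees(self):
        """Return a new graph whose vertex attribute is the vertex's out-degree."""
        counts = {v: 0 for v in self.vertices}
        for source, _ in self.edges:
            counts[source] += 1
        return Graph.create(counts, self.edges)


g = Graph.create({"a": 1, "b": 2, "c": 3}, [("a", "b"), ("a", "c"), ("b", "c")])
degrees = g.out_degrees()                      # new graph; g is unchanged
linkers = degrees.subgraph(lambda v, d: d >= 1)  # drop vertices with no out-links
```

Because `g` is never modified, intermediate results such as `degrees` can be reused or recomputed independently, which is the property that makes this style convenient for interactive work.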
Machine-learning and analytics
Machine-learning tools that target graph data lead to familiar applications such as detecting influential users (PageRank) and communities, fraud detection, and recommendations (collaborative filtering is popular among GraphLab users). Moreover, techniques developed in one domain are often reused in other settings. Besides GraphLab, distributed analytics have been implemented in Giraph, GraphX, Faunus, and Grappa. In addition, graph databases like Neo4j and YarcData come with some analytic capabilities. As I noted in a recent post, open source, single-node systems like Twitter’s Cassovary (7) are being used for computations involving massive graphs.
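As a small illustration of the first application mentioned above, here is a plain-Python sketch of PageRank computed by repeated iteration over an in-memory graph. The damping factor of 0.85, the fixed iteration count, the handling of nodes without out-links, and the toy `links` graph are assumptions for this example. The distributed systems named above implement the same idea very differently.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Return a dict of PageRank scores that sum to 1.

    out_links: dict mapping each node to the list of nodes it links to
               (every node must appear as a key).
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}

    for _ in range(iterations):
        # Rank held by nodes with no out-links is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for v in nodes:
            targets = out_links[v]
            for u in targets:
                # Each node splits its current rank equally among its targets.
                new_rank[u] += damping * rank[v] / len(targets)
        rank = new_rank
    return rank


# Toy graph: "c" is linked to by three other nodes, "d" by none.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
most_influential = max(ranks, key=ranks.get)
```

In this toy graph the node with the most incoming links ends up with the highest score, and the node nobody links to ends up with the lowest.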
Visualization
When you’re dealing with large graphs, being able to zoom in/out helps with clutter, but so do clever layout algorithms. Popular tools for visualizing nodes and edges include Gephi and GraphViz. Users who want to customize their graphs turn to packages like d3.
(1) I would love to see a version of GraphBuilder that’s built on top of Spark.
(2) Many of these systems are quite efficient. For example, a single instance of Neo4j can handle very large graphs (“into the tens of billions of nodes/ relationships/ properties”).
(3) Note that using standard Hadoop for graph processing may not be the most efficient option. This talk by Hadapt co-founder Daniel Abadi describes an advanced approach to graph analysis using Hadoop.
(4) Related frameworks include GoldenOrb and Hama.
(5) Resilient Distributed Graphs (RDG) extend Spark’s Resilient Distributed Dataset (RDD).
(6) As the developers of GraphX note: “We emphasize that it is not our intention to beat PowerGraph in performance. … We believe that the loss in performance may, in many cases, be ameliorated by the gains in productivity achieved by the GraphX system. … It is our belief that we can shorten the gap in the near future, while providing a highly usable interactive system for graph data mining and computation”
(7) On the plus side, being single-node means Cassovary doesn’t have to deal with finding the optimal way to partition a graph. On the other hand, it is limited to graphs that fit in the memory of a server – a limitation it alleviates through the use of efficient data structures.
Strata Week: Are customized Google maps a neutrality win or the next “filter bubble”?
Two views on new Google Maps; a look at predictive, intelligent apps; and Aaron Swartz's and Kevin Poulsen's anonymous inbox launches.
Google aims for a new level of map customization
Google introduced a new version of Google Maps at Google I/O this week that learns from each use to customize itself to individual users, adapting based on user clicks and searches. A post on the Google blog outlines the updates, which include recommendations for places you might enjoy (based upon your map activity), ratings and reviews, integrated Google Earth, and tours generated from user photos, to name a few.
On becoming a code artist
An interview with Scott Murray, author of Interactive Data Visualization for the Web
Scott Murray, a code artist, has written Interactive Data Visualization for the Web for nonprogrammers. In this interview, Scott provides some insights on what inspired him to write an introduction to D3 for artists, graphic designers, journalists, researchers, or anyone that is looking to begin programming data visualizations.
What inspired you to become a code artist?
Scott Murray: I had designed websites for a long time, but several years ago was frustrated by web browsers’ limitations. I went back to school for an MFA to force myself to explore interactive options beyond the browser. At MassArt, I was introduced to Processing, the free programming environment for artists. It opened up a whole new world of programmatic means of manipulating and interacting with data — and not just traditional data sets, but also live “data” such as from input devices or dynamic APIs, which can then be used to manipulate the output. Processing let me start prototyping ideas immediately; it is so enjoyable to be able to build something that really works, rather than designing static mockups first, and then hopefully, one day, invest the time to program it. Something about that shift in process is both empowering and liberating — being able to express your ideas quickly in code, and watch the system carry out your instructions, ultimately creating images and experiences that are beyond what you had originally envisioned.
Visualization of the Week: Real-time Wikipedia edits
The Wikipedia Recent Changes Map visualizes Wikipedia edits around the world in real-time.
Stephen LaPorte and Mahmoud Hashemi have put together an addictive visualization of real-time edits on Wikipedia, mapped across the world. Every time an edit is made, the user’s location and the entry they edited are listed along with a corresponding dot on the map.
Big data, cool kids
Making sense of the hype-cycle scuffle.
The big data world is a confusing place. We’re no longer in a market dominated mostly by relational databases, and the alternatives have multiplied in a baby boom of diversity.
These child prodigies of the data scene show great promise but spend a lot of time knocking each other around in the schoolyard. Their egos can sometimes be too big to accept that everybody has their place, and eyeball-seeking media certainly doesn’t help.
POPULAR KID: Look at me! Big data is the hotness!
HADOOP: My data’s bigger than yours!
SCIPY: Size isn’t everything, Hadoop! The bigger they come, the harder they fall. And aren’t you named after a toy elephant?
R: Backward sentences mine be, but great power contains large brain.
EVERYONE: Huh?
SQL: Oh, so you all want to be friends again now, eh?!
POPULAR KID: Yeah, what SQL said! Nobody really needs big data; it’s all about small data, dummy.
Steering the ship that is data science
Ideas on avoiding the data science equivalent of "repair-ware."
Mike Loukides recently recapped a conversation we’d had about leading indicators for data science efforts in an organization. We also pondered where the role of data scientist is headed and realized we could treat software development as a prototype case.
It’s easy (if not eerie) to draw parallels between the Internet boom of the mid 1990s and the Big Data boom of the present day: in addition to the exuberance in the press and the new business models, a particular breed of technical skill became a competitive advantage and a household name. Back then, this was the software developer. Today, it’s the data scientist.
The time in the sun improved software development in some ways, but it also brought its share of problems. Some companies were short on the skill and discipline required to manage custom software projects, and they were equally ill-equipped to discern the true technical talent from the pretenders. That combination led to low-quality software projects that simply failed to deliver business value. (A number of these survive today as “repair-ware” that requires constant, expensive upkeep.)
Evaluating machine learning systems: Kaggle’s not enough
We should raise our collective expectations of what they should provide
There is a tremendous amount of commercial attention on machine learning (ML) methods and applications. This includes product and content recommender systems, predictive models for churn and lead scoring, systems to assist in medical diagnosis, social network sentiment analysis, and on and on. ML often carries the burden of extracting value from big data.
But getting good results from machine learning still requires much art, persistence, and even luck. An engineer can’t yet treat ML as just another well-behaved part of the technology stack. There are many underlying reasons for this, but for the moment I want to focus on how we measure or evaluate ML systems.
Reflecting their academic roots, machine learning methods have traditionally been evaluated in terms of narrow quantitative metrics: precision, recall, RMS error, and so on. The data-science-as-competitive-sport site Kaggle has adopted these metrics for many of its competitions. They are objective and reassuringly concrete.
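For readers who have not worked with these metrics, the following Python sketch shows how precision, recall, and RMS error are computed from predictions. The function names and the sample labels are assumptions made for this illustration. Libraries such as scikit-learn provide tested implementations that would normally be used instead.

```python
import math


def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Precision: of everything predicted positive, how much really was?
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: of everything really positive, how much did we find?
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


def rms_error(actual, predicted):
    """Root-mean-square error between two equal-length numeric sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative data: 2 true positives, 1 false positive, 1 false negative.
precision, recall = precision_recall([1, 0, 1, 1, 0], [1, 1, 0, 1, 0])
error = rms_error([0.0, 0.0], [3.0, 3.0])
```

Each function reduces a model's behavior to a single number, which is what makes these metrics easy to rank on a leaderboard.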
11 Essential Features that Visual Analysis Tools Should Have
Visual analysis tools are adding advanced analytics for big data
After recently playing with SAS Visual Analytics, I’ve been thinking about tools for visual analysis. By visual analysis I mean the type of analysis most recently popularized by Tableau, QlikView, and Spotfire: you encounter a data set for the first time and conduct exploratory data analysis with the goal of discovering interesting patterns and associations. Having used a few visualization tools myself, here’s a quick wish-list of features (culled from tools I’ve used or have seen in action).
Requires little (to no) coding
The viz tools I currently use require programming skills. Coding means switching back-and-forth between a visual (chart) and text (code). It’s nice to be able to customize charts via code, but when you’re in the exploratory phase, not having to think about code syntax is ideal. Plus, GUI-based tools allow you to collaborate with many more users.
Strata Week: President Obama opens up U.S. government data
U.S. opens data, Wong tapped for U.S. chief privacy officer, FBI might read your email sans warrant, and big data spells trouble for anonymity.
U.S. government data to be machine-readable, Nicole Wong may fill new White House chief privacy officer role
The U.S. government took major steps this week to open up government data to the public. U.S. President Obama signed an executive order requiring government data to be made available in machine-readable formats, and the Office of Management and Budget and the Office of Science and Technology Policy released an Open Data Policy memo (PDF) to address the order’s implementation.
The press release announcing the actions notes the benefit the U.S. economy historically has experienced with the release of government data — GPS data, for instance, sparked a flurry of innovation that ultimately contributed “tens of billions of dollars in annual value to the American economy,” according to the release. President Obama noted in a statement that he hopes a similar result will come from this open data order: “Starting today, we’re making even more government data available online, which will help launch even more new startups. And we’re making it easier for people to find the data and use it, so that entrepreneurs can build products and services we haven’t even imagined yet.”
FCW’s Adam Mazmanian notes a bit from the Open Data Policy memo that indicates the open data framework doesn’t only apply to data the government intends to make public.
Genomics and Privacy at the Crossroads
Would you let people know about your dandruff problem if it might mean a cure for Lupus?
Two weeks ago, I had the privilege to attend the 2013 Genomes, Environments and Traits conference in Boston, as a participant in Harvard Medical School’s Personal Genome Project. Several hundred of us attended the conference, eager to learn what new breakthroughs might be in the works using the data and samples we have contributed, and to network with the researchers and each other.
The Personal Genome Project (PGP) is a very different type of beast from the traditional research study model, in several ways. To begin with, it is an Open Consent study, which means that all the data that participants donate is available for research by anyone without further consent by the subject. In other words, now that I have consented to participate in the PGP, anyone can download my genome sequence, look at my phenotypic traits (my physical characteristics and medical history), or even order some of my blood from a cell line that has been established at the Coriell biobank, and they do not need to gain specific consent from me to do so. By contrast, in most research studies, data and samples can only be collected for one specific study, and no other purposes. This is all in an effort to protect the privacy of the participants, as was famously violated in the establishment of the HeLa cell line.
The other big difference is that in most studies, the participants rarely receive any information back from the researchers. For example, if the researcher does a brain MRI to gather data about the structure of a part of your brain, and sees a huge tumor, they are under no obligation to inform you about it, or even to give you a copy of the scan. This is because researchers are not certified as clinical laboratories, and thus are not authorized to report medical findings. This makes sense, to a certain extent, with traditional medical tests, as the research version may not be calibrated to detect the same things, and the researcher is not qualified to interpret the results for medical purposes.





