Extending GraphLab to tables

The popular graph analytics framework extends its coverage of the data science workflow

GraphLab’s SFrame, an interesting and somewhat under-the-radar tool, was unveiled at Strata Santa Clara. It is a disk-based, flat table representation that extends GraphLab to tabular data. With the addition of SFrame, users can apply GraphLab’s many algorithms to data stored as either graphs or tables. More importantly, SFrame increases GraphLab’s coverage of the data science workflow: it allows users with terabyte-sized datasets to clean their data and create new features directly within GraphLab (SFrame performance can scale linearly with the number of available cores).

The beta version of SFrame can read data from local disk, HDFS, S3, or a URL, and save to a human-readable .csv or a more efficient native format. Once an SFrame is created and saved to disk, no reprocessing of the data is needed. Below is Python code that illustrates how to read a .csv file into SFrame, create a new data feature and save it to disk on S3:
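
As a rough stand-in for that read, derive, save pattern, here is a minimal sketch that uses only Python's standard csv module rather than the SFrame API. The column names and the add_ratio_feature helper are hypothetical, and it writes to an in-memory buffer instead of S3. Rows are streamed one at a time, so the whole table never has to sit in memory.

```python
import csv
import io

def add_ratio_feature(src, dst, num_col, den_col, new_col):
    """Stream CSV rows from src to dst, appending new_col = num_col / den_col."""
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames) + [new_col]
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        den = float(row[den_col])
        # Guard against division by zero; 0.0 is an arbitrary fallback.
        row[new_col] = float(row[num_col]) / den if den else 0.0
        writer.writerow(row)

# Hypothetical input data; any file-like objects would work here.
raw = io.StringIO("user,clicks,visits\na,3,6\nb,5,10\n")
out = io.StringIO()
add_ratio_feature(raw, out, "clicks", "visits", "click_rate")
result = out.getvalue()
```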

Big Data solutions through the combination of tools

Applications get easier to build as packaged combinations of open source tools become available

As a user who tends to mix and match many different tools, not having to deal with configuring and assembling a suite of tools is a big win. So I’m really liking the recent trend towards more integrated and packaged solutions. A recent example is the relaunch of Cloudera’s Enterprise Data Hub to include Spark and Spark Streaming. Users benefit by gaining automatic access to the analytic engines that come with Spark. Besides simplifying things for data scientists and data engineers, easy access to analytic engines is critical for streamlining the creation of big data applications.

Another recent example is Dendrite, an interesting new graph analysis solution from Lab41. It combines Titan (a distributed graph database), GraphLab (for graph analytics), and a front end that leverages AngularJS into a graph exploration and analysis tool for business analysts.

Semi-automatic method for grading a million homework assignments

Organize solutions into clusters and “force multiply” feedback provided by instructors

One of the hardest things about teaching a large class is grading exams and homework assignments. In my teaching days a “large class” numbered only a few hundred students (still a challenge for the TAs and instructor). But in the age of MOOCs, classes with a few (hundred) thousand students aren’t unusual.

Researchers at Stanford recently combed through over one million homework submissions from a large MOOC class offered in 2011. Students in the machine-learning course submitted programming code for assignments that consisted of several small programs (the typical submission was about 16 lines of code). While over 120,000 students enrolled, only about 10,000 completed all homework assignments (about 25,000 submitted at least one assignment).

The researchers were interested in figuring out ways to ease the burden of grading the large volume of homework submissions. The premise was that by sufficiently organizing the “space of possible solutions”, instructors could provide feedback on a few submissions, and that feedback could then be propagated to the rest.
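
To make the idea concrete, here is a toy sketch in Python. It is not the researchers' method: it groups submissions that become identical after a crude normalization (comments and whitespace stripped), then copies an instructor's note on one member of a group to every other member. All names and the sample data are hypothetical.

```python
from collections import defaultdict

def normalize(code):
    """Crude canonical form: drop '#' comments, blank lines, and all whitespace."""
    lines = []
    for line in code.splitlines():
        line = line.split("#", 1)[0]      # strip trailing comment
        line = "".join(line.split())      # remove every whitespace character
        if line:
            lines.append(line)
    return "\n".join(lines)

def cluster(submissions):
    """Group submission ids whose normalized code is identical."""
    groups = defaultdict(list)
    for sid, code in submissions.items():
        groups[normalize(code)].append(sid)
    return list(groups.values())

def propagate(clusters, feedback):
    """Copy the first available note in each cluster to all of its members."""
    shared = {}
    for members in clusters:
        notes = [feedback[s] for s in members if s in feedback]
        if notes:
            for s in members:
                shared[s] = notes[0]
    return shared

submissions = {
    "s1": "x = a + b  # sum\nreturn x",
    "s2": "x=a+b\n\nreturn x",
    "s3": "return a - b",
}
clusters = cluster(submissions)
feedback = {"s1": "Correct; consider returning the sum directly."}
shared = propagate(clusters, feedback)
```

One instructor note on s1 reaches s2 as well, while s3 (a different solution) still needs its own review. A real system would need a far more forgiving notion of similarity than exact match.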

Improving options for unlocking your graph data

Graph data is an area that has attracted many enthusiastic entrepreneurs and developers

The popular open source project GraphLab received a major boost early this week when a new company, composed of its founding developers, raised funding to develop analytic tools for graph data sets. GraphLab Inc. will continue to use the open source GraphLab to “push the limits of graph computation and develop new ideas”, but having a commercial company will accelerate development and allow the hiring of resources dedicated to improving usability and documentation.

While social media placed graph data on the radar of many companies, similar data sets can be found in many domains, including the life and health sciences, security, and financial services. Graph data is different enough that it necessitates special tools and techniques. Because the tools were a bit too complex for casual users, graph data analytics was in the past the province of specialists. Fortunately, graph data is an area that has attracted many enthusiastic entrepreneurs and developers. The tools have improved, and I expect things to get much easier for users in the future. A great place to learn more about tools for graph data is the upcoming GraphLab Workshop (on July 1st in SF).

Data wrangling: creating graphs
Before you can take advantage of the other tools mentioned in this post, you’ll need to turn your data (e.g., web pages) into graphs. GraphBuilder is an open source project from Intel that uses Hadoop MapReduce to build graphs out of large data sets. Another option is the combination of GraphX/Spark described below. (A startup called Trifacta is building a general-purpose data wrangling tool that could help as well.)
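
As a small illustration of what "turning web pages into a graph" means, the sketch below extracts links from a handful of in-memory HTML pages and emits a directed edge list. It uses only Python's standard library and is not how GraphBuilder or GraphX work; the page names, contents, and helper names are hypothetical.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def build_edges(pages):
    """Return sorted (source, target) edges between pages in the collection."""
    edges = set()
    for url, html in pages.items():
        parser = LinkCollector()
        parser.feed(html)
        for target in parser.links:
            # Keep only links to known pages and skip self-loops.
            if target in pages and target != url:
                edges.add((url, target))
    return sorted(edges)

pages = {
    "a.html": '<a href="b.html">B</a> <a href="c.html">C</a>',
    "b.html": '<a href="a.html">A</a> <a href="http://example.com/">ext</a>',
    "c.html": "<p>no links</p>",
}
edges = build_edges(pages)
```

The resulting edge list is the kind of input most graph analytics tools expect; at scale the same extraction step is what gets distributed across a cluster.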

GraphChi: Graph analytics over billions of edges using your laptop

A disk-based, single-node, graph analytics system that scales to massive graphs

GraphChi is a spinoff project of GraphLab, an open source, distributed, in-memory software system for analytics and machine-learning.

GraphChi was designed specifically to run on a single computer with limited memory (DRAM), and since its release a few months ago it has been used to analyze graphs with billions of edges. Running on a single machine means deployment and debugging are simpler. In addition, it is no longer necessary to find (optimal) graph partitions that minimize communication between compute nodes, which is the starting point for many distributed graph computations.

The stated goal of GraphChi is to “Compute on graphs with billions of edges, in a reasonable time, on a single PC.” One way to define “reasonable time” is to compare against the results produced by other graph processing systems. That’s exactly what GraphChi’s creators did in a recent paper. They found that GraphChi compared favorably to graph analytics packages such as Pegasus and Stanford GPS. While GraphChi was 2-3X slower in some cases, it is easier to deploy, easier to debug, and way more energy efficient.

Strata Gems: Make beautiful graphs of your Twitter network

Use Gephi and Python to find your personal communities

Using a bit of Python and the Gephi graph tool, exploring your own Twitter network is a great way to learn about analyzing networks, and the results definitely have a "wow" factor.

Strata Gems: Explore and visualize graphs with Gephi

Powerful open source graph manipulation

A Photoshop for data, Gephi is a powerful tool for exploring and presenting data as a graph. It's easy to get started with sample data sets, then import your own by generating files in a standard graph format.
