Curiosity turned loose on GitHub data

Ilya Grigorik's GitHub project shows what happens when questions, data, and tools converge.

I’m fascinated by people who:

1. Ask the question, “I wonder what happens if I do this?” and then follow it all the way through.

2. Start a project on a whim and open it up so anyone can participate.

Ilya Grigorik (@igrigorik) did both of these things, which is why our recent conversation at Strata Conference + Hadoop World was one of my favorite parts of the event.

By day, Grigorik is a developer advocate on Google’s Make the Web Fast team (he’s a perfect candidate for a future Velocity interview). On the side, he likes to track open source projects on GitHub. As he explained during our chat, this can be a time-intensive hobby:

“I follow about 3,000 open source projects, and I try to keep up with what’s going on, what are people contributing to, what are the new interesting sub-branches of work being done … The problem I ran into about six months ago was that, frankly, it was just too much to keep up with. The GitHub timeline was actually overflowing. In order to keep up, I would have to go in every four hours and scan through everything, and then repeat it. That doesn’t give you much time for sleep.” [Discussed 15 seconds into the interview.]

Grigorik built a system, including a newsletter, that lets him stay in the loop efficiently. He worked with GitHub to archive public GitHub activity, and he then made that data available in raw form and through Google BigQuery (the data is updated hourly).
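To give a feel for what exploring the raw form of an archive like this might look like, here is a minimal Python sketch. It assumes each hourly file is a gzip-compressed stream with one JSON event per line, and that each event carries a "type" field; neither detail comes from the interview, and the function names and sample events are invented for illustration. The sample is built in memory, so nothing is downloaded.

```python
import gzip
import io
import json
from collections import Counter


def count_event_types(gz_bytes):
    """Count events by "type" in a gzipped, newline-delimited JSON blob.

    The one-event-per-line layout and the "type" field are assumptions
    made for this sketch, not details confirmed by the article.
    """
    counts = Counter()
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            event = json.loads(line)
            counts[event.get("type", "unknown")] += 1
    return counts


def make_sample_archive():
    """Build a tiny, made-up archive in memory (hypothetical events)."""
    events = [
        {"type": "PushEvent", "repo": "example/alpha"},
        {"type": "WatchEvent", "repo": "example/alpha"},
        {"type": "PushEvent", "repo": "example/beta"},
    ]
    payload = "\n".join(json.dumps(e) for e in events).encode("utf-8")
    return gzip.compress(payload)


if __name__ == "__main__":
    counts = count_event_types(make_sample_archive())
    for event_type, n in counts.most_common():
        print(event_type, n)
```

Swapping the in-memory sample for the bytes of a real hourly file would be the obvious next step, provided the file really follows the assumed layout.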

This is a fun project, no doubt, but it’s also a big deal. Here’s why: When you shorten the distance between questions and answers, you empower people to ask more questions. It’s the liberation of curiosity, and that’s exactly what happened here.

“I solved my problem,” Grigorik said, “but what’s most exciting about this project is because we made that data publicly available to others, it’s the questions that other people started to ask. It’s the kind of stuff I would never think of.” [3:30]

Grigorik pointed to a number of projects that popped up once the dataset was available. Some are tongue in cheek, but others have deeper goals. Academic researchers are tapping the data to identify the DNA of a successful open source project, and another effort is looking for programming languages that inspire joy, frustration or surprise. “If your language has a C — C, C++, C#, Objective-C — it’s anger inducing,” Grigorik laughed. “Which languages generate surprise? Perl — which shouldn’t be surprising to a lot of people who have worked with Perl.” [4:03]

I’ve harped on this topic before, but Grigorik’s work reinforces my belief that access to datasets and useful tools creates vital streams of knowledge. Each stream wanders into different spaces. Many of those spaces will be frivolous or dead ends, but others will yield new insights — insights that wouldn’t have been possible without free-form exploration. Combine all this with open source distribution and the Internet’s dead-simple sharing tools, and you can see how projects like Grigorik’s can open up tremendous learning opportunities. And all the data making these explorations possible is just out there, ready to be mined, explored, combined and re-shared.


You can see the full interview with Grigorik in this video:

More interviews and keynotes from Strata Conference + Hadoop World are available here.


  • http://igvita.com/ Ilya Grigorik

    Thanks for the great writeup Mac! For anyone that’s curious, here are the slides from our Strata presentation: http://www.igvita.com/slides/2012/bigquery-github-strata.pdf

    Would love to see more people playing with the data! If you come across any fun results, definitely let me know.