Strata Week: Behind LinkedIn Signal

Life-size visualizations, how Hadoop is used, and SciDB's first release

Professional social networking site LinkedIn yesterday announced a new service, Signal, which applies the filters of the LinkedIn network to status updates, such as those from Twitter. Signal lets you watch tweets from particular industries, companies or locales, or filter by your professional network, all in real time.

[Screenshot of LinkedIn Signal]

Overlaying LinkedIn's map on the Twitter nation is a great idea, so what's the technology behind Signal? Like fellow social networks Facebook and Twitter, LinkedIn has a strong big data and analytics team, which often creates or builds on open source solutions.

LinkedIn engineer John Wang (@javasoze) gave some clues as to Signal’s infrastructure of “Zoie, Bobo, Sensei and Lucene”, and I thought it would be fascinating to examine the parts in more detail.

Signal uses a variety of open source technologies, some of them developed in-house by LinkedIn's Search, Network and Analytics team.

  • Zoie (source code) is a real-time search and indexing system built on top of the Apache Lucene search platform. As documents are added to the index, they become immediately searchable.
  • Bobo (source code) is another extension to Apache Lucene. While Lucene is great for searching free-text data, Bobo takes it a step further, providing faceted searching and browsing over data sets.
  • Sensei (source code) is a distributed, scalable database offering fast searching and indexing. It is particularly tuned to answer the kind of queries LinkedIn excels at: free text search, restricted over various axes in their social network. Sensei uses Bobo and Zoie, adding clustered, elastic database features.
  • Voldemort is an open source fault-tolerant distributed key-value store, similar to Amazon’s Dynamo.
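
To make the faceting idea concrete, here is a toy sketch in Python of what faceted search over status updates involves: matching documents against a query, then counting how many hits fall under each value of a metadata field such as industry or company. All names and data below are invented for illustration; Bobo itself does this at scale inside Lucene, not like this.

```python
from collections import Counter

# Toy corpus: each "document" is a status update with metadata facets.
# Documents, field names and values here are all hypothetical.
docs = [
    {"text": "hiring java engineers", "industry": "software", "company": "Acme"},
    {"text": "great quarter for sales", "industry": "finance", "company": "BigBank"},
    {"text": "java memory tuning tips", "industry": "software", "company": "Widgetco"},
]

def faceted_search(query, facet_field):
    """Return matching documents plus counts of each facet value among the hits."""
    hits = [d for d in docs if query in d["text"]]
    facets = Counter(d[facet_field] for d in hits)
    return hits, facets

hits, facets = faceted_search("java", "industry")
print(facets)  # Counter({'software': 2})
```

The facet counts are what power the "drill down by industry, company or locale" navigation: each facet value becomes a link that narrows the result set further.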

LinkedIn also uses the Scala and JRuby JVM programming languages alongside Java.

If you’re interested in hearing more about LinkedIn Signal, check out the coverage on TechCrunch, Forbes.com, Mashable and The Daily Beast.

Bringing visualization back to the future

Speaking at this week’s Web 2.0 Expo in New York, Julia Grace of IBM encouraged attendees to raise their game with data visualization. As long ago as the 1980s, movie directors envisioned exciting and dynamic data visualizations, but today most people are still sharing flat two-dimensional charts, which restrict the opportunities for understanding and telling stories with data. Julia decided to make some location-based data very real by projecting it onto a massive globe.

Julia’s talk is embedded below, and you can also read an extended interview with her published earlier this month on O’Reilly Radar.

Hadoop goes viral

Software vendor Karmasphere creates developer tools for data intelligence that work with Hadoop-based SMAQ big data systems. It recently commissioned a study into Hadoop usage. One of the most interesting results of the survey suggests that Hadoop systems tend to start as skunkworks projects inside organizations, then move rapidly into production.

Once used inside an organization, Hadoop appears to spread:

Additionally, organizations are finding that the longer Hadoop is used, the more useful it is found to be; 65% of organizations using Hadoop for a year or more indicated more than three reasons for using Hadoop, as compared to 36% for new users.

There are challenges too. Hadoop offers the benefits of affordable big data processing, but its surrounding ecosystem of tools and expertise is still immature. Respondents to the Karmasphere survey cited pain points including a steep learning curve, difficulty hiring qualified people, and a shortage of tools and educational materials.

This is good news for vendors such as Karmasphere, Datameer and IBM, all of whom are concentrating on making Hadoop work in ways that are familiar to enterprises, through the medium of IDEs and spreadsheet interfaces.

SciDB source released

The SciDB database is an answer to the data and analytics needs of the scientific world, serving fields such as biology, physics, and astronomy. In the words of its website, it is a database “for the toughest problems on the planet.” SciDB Inc., the sponsors of the open source project, say that although science has become steadily more data intensive, scientists have had to use databases intended for commercial, rather than scientific, applications.

One of the most intriguing aspects of SciDB is that it emanates from the work of serial database innovator Michael Stonebraker. Scientific data is inherently multi-dimensional, Stonebraker told The Register earlier this month, and thus ill-suited for use with traditional relational databases.

The SciDB project has now made its source code available. The current release, R0.5, is an early-stage product for the “curious and intrepid”. It features a new array query language, AQL, an SQL-like language extended for SciDB's array data model. The release runs on Linux systems, and is expected to be followed at the end of the year by a more robust and stable version.
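
For a flavor of what "SQL extended for arrays" means, the sketch below shows roughly what an AQL interaction could look like. The syntax is an illustrative approximation based on SciDB's published examples, not a tested session against the R0.5 release, and the array name and schema are invented:

```sql
-- Define a hypothetical 1-D array of ten double-precision cells.
-- The bracketed clause declares a dimension (i), its range, chunk size and overlap.
CREATE ARRAY readings <value:double> [i=0:9,10,0];

-- SELECT looks like SQL, but predicates range over array dimensions, not rows.
SELECT value FROM readings WHERE i > 4;
```

The key departure from relational SQL is that dimensions are first-class: queries can slice, subsample and window along them directly, which is what makes the model a natural fit for multi-dimensional scientific data.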

SciDB is available under the GPL3 free software license, and may be downloaded on application to the SciDB team. According to the authors, more customary use of open source repositories is likely to follow soon.

Send us news

Email us news, tips and interesting tidbits at strataweek@oreilly.com.

  • http://twitter.com/MarketingCMI AFT

    Unfortunately this isn’t new information, as I learned this back in the late ’90s in Geomatics class at Rutgers (crssa.rutgers.edu). Julia Grace should take a look at what ESRI (esri.com) has been up to all these years.
    There are very real reasons why data is displayed in the three types of graphs. Most people don’t know how to do anything else. How many people slept through stats class in college?
    How many business managers will pay for the time it takes to translate data into usable formats for 3D? How many pay for clean data for 2D formats?
    Everyone likes pretty data. It would have been nice to see her recommendations for actually implementing this in the workplace, or at least the reasons why it hasn’t been implemented thus far.

  • http://nz.linkedin.com/in/drllau drllau

    See the Visualisation Toolkit (VTK, http://www.vtk.org). IBM dataExplorer is also pretty neat (http://www.research.ibm.com/dx/). And read any of Tufte’s books (http://en.wikipedia.org/wiki/Edward_Tufte). But I was told that the bulk of time/effort is actually cleaning up the datasets rather than rendering them.

  • http://www.fax.com anonymous

    It makes total sense that she works for IBM Research, clearly a leader in this field.