ENTRIES TAGGED "bigquery"
The inaugural Spark Summit will feature a wide variety of real-world applications
When an interesting piece of big data technology gets introduced, early adopters tend to focus on technical features and capabilities. Applications get built as companies develop confidence that it’s reliable and that it really scales to large data volumes. That seems to be where Spark is today. With over 90 contributors from 25 companies, it has one of the largest developer communities among big data projects (second only to Hadoop MapReduce).
I recently became an advisor to Databricks (a startup commercializing Spark) and a member of the program committee for the inaugural Spark Summit. As I pored over submissions to Spark’s first community gathering, I learned how companies have come to rely on Spark, Shark, and other components of the Berkeley Data Analytics Stack (BDAS). Spark is at that stage where companies are deploying it, and the upcoming Spark Summit in San Francisco will showcase many real-world applications. These applications cut across many domains including advertising, marketing, finance, and academic/scientific research, but can generally be grouped into the following categories:
Data processing workflows: ETL and Data Wrangling
Many companies rely on a wide variety of data sources for their analytic products. That means cleaning, transforming, and fusing (unstructured) external data with internal data sources. Many companies – particularly startups – use Spark for these types of data processing workflows. There are even companies that have created simple user interfaces that open up batch data processing tasks to non-programmers.
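To make the "clean, transform, fuse" pattern concrete, here is a minimal sketch in plain Python. It is not Spark code and not taken from any of the companies mentioned: the records, the field names (`email`, `clicks`, `plan`), and the helper functions are all hypothetical, chosen only to show the shape of such a workflow on a tiny in-memory dataset.

```python
# Hypothetical sample data; every field name here is an assumption for illustration.
external_rows = [
    {"email": "  Ana@Example.com ", "clicks": "12"},
    {"email": "bob@example.com", "clicks": "n/a"},   # count cannot be parsed
    {"email": "", "clicks": "3"},                    # missing join key
]
internal_rows = [
    {"email": "ana@example.com", "plan": "pro"},
    {"email": "bob@example.com", "plan": "free"},
]


def clean(row):
    """Normalize one external record, or return None if it is unusable."""
    email = row.get("email", "").strip().lower()
    if not email:
        return None
    try:
        clicks = int(row.get("clicks", ""))
    except ValueError:
        return None
    return {"email": email, "clicks": clicks}


def fuse(external, internal):
    """Inner-join cleaned external records with internal records on email."""
    plans = {r["email"]: r["plan"] for r in internal}
    fused = []
    for row in external:
        cleaned = clean(row)
        if cleaned is not None and cleaned["email"] in plans:
            fused.append({**cleaned, "plan": plans[cleaned["email"]]})
    return fused


result = fuse(external_rows, internal_rows)
print(result)
```

In a cluster setting the same three steps (normalize, filter, join on a key) would typically be expressed as distributed transformations rather than Python loops; the logic stays the same.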
Data stores are rolling out easy-to-use analysis tools
Originally developed by the NSA, Apache Accumulo is a BigTable-inspired data store known for being highly scalable and for its interesting security model. Federal agencies and defense contractors have deployed Accumulo on clusters of a thousand or more servers. It also uses “cell-level” security to control access to values stored in individual cells.
What Accumulo was lacking were easy-to-use, standard analytic engines that allow users to interact with data. The release of Sqrrl Enterprise this past week fills that gap. Sqrrl Enterprise provides an initial set of analytic engines for the Accumulo ecosystem. It includes support for interactive SQL, full-text search, and queries over graph data. Each of these engines takes into account security labels placed on data: since every data object ingested into Sqrrl has a security label, query and analytic results incorporate those access levels. Analysts interact with data as they normally would. For example, Sqrrl’s indexing technology accounts for security labels, and search queries are written in standard Lucene syntax. Reminiscent of the Phoenix project for HBase, SQL queries in Sqrrl are converted into optimized Accumulo iterators.
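The idea that results "incorporate access levels" can be illustrated with a small sketch. This is a deliberately simplified model, not the Accumulo or Sqrrl API: the `Cell` class, the `required_labels` field, and the `visible_cells` function are hypothetical, and the rule used here (a reader must hold every label on a cell) is an assumption that ignores richer label expressions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One stored value plus the labels a reader must hold to see it (hypothetical model)."""
    row: str
    column: str
    value: str
    required_labels: frozenset


def visible_cells(cells, authorizations):
    """Return only the cells whose required labels are all held by the reader."""
    auths = frozenset(authorizations)
    return [c for c in cells if c.required_labels <= auths]


cells = [
    Cell("patient1", "name", "Ana", frozenset()),                  # unlabeled
    Cell("patient1", "diagnosis", "flu", frozenset({"medical"})),
    Cell("patient1", "billing", "$120", frozenset({"finance", "audit"})),
]

nurse_view = [c.column for c in visible_cells(cells, {"medical"})]
auditor_view = [c.column for c in visible_cells(cells, {"finance", "audit"})]
print(nurse_view, auditor_view)
```

Because the filter is applied per cell rather than per table or per row, two analysts running the same query over the same row can legitimately get different columns back.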
Ilya Grigorik's GitHub project shows what happens when questions, data, and tools converge.
1. Ask the question, “I wonder what happens if I do this?” and then follow it all the way through.
2. Start a project on a whim and open it up so anyone can participate.
By day, Grigorik is a developer advocate on Google’s Make the Web Fast team (he’s a perfect candidate for a future Velocity interview). On the side, he likes to track open source projects on GitHub. As he explained during our chat, this can be a time-intensive hobby:
“I follow about 3,000 open source projects, and I try to keep up with what’s going on, what are people contributing to, what are the new interesting sub-branches of work being done … The problem I ran into about six months ago was that, frankly, it was just too much to keep up with. The GitHub timeline was actually overflowing. In order to keep up, I would have to go in every four hours and scan through everything, and then repeat it. That doesn’t give you much time for sleep.” [Discussed 15 seconds into the interview.]
Grigorik built a system, including a newsletter, that lets him stay in the loop efficiently. He worked with GitHub to archive public GitHub activity, and he then made that data available in raw form and through Google BigQuery (the data is updated hourly).
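As a rough illustration of the kind of question such an archive makes easy to answer, the sketch below tallies event types from a handful of records. The record layout is an assumption made for this example: it treats each record as one JSON object per line with a `type` field, and the sample lines are invented rather than taken from the real archive.

```python
import json
from collections import Counter

# Invented sample records; the "type" and "repo" fields are assumptions.
sample_lines = [
    '{"type": "PushEvent", "repo": "alice/project"}',
    '{"type": "WatchEvent", "repo": "bob/tool"}',
    '{"type": "PushEvent", "repo": "bob/tool"}',
    'not valid json',
]


def count_event_types(lines):
    """Tally events by type, skipping lines that are not JSON objects."""
    counts = Counter()
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(event, dict):
            continue
        counts[event.get("type", "unknown")] += 1
    return counts


counts = count_event_types(sample_lines)
print(counts.most_common())
```

A hosted query service answers the same question with a single grouped count over the full dataset, which is what removes the need to download and scan the raw files by hand.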
This is a fun project, no doubt, but it’s also a big deal. Here’s why: When you shorten the distance between questions and answers, you empower people to ask more questions. It’s the liberation of curiosity, and that’s exactly what happened here.
BigQuery for all, a new resource for data journalists, open data is challenged.
In this week's data news, Google's BigQuery opens up to everyone, the Data Journalism Handbook is released, and the open data movement is called to the mat.