<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Strata</title>
	<atom:link href="http://strata.oreilly.com/feed" rel="self" type="application/rss+xml" />
	<link>http://strata.oreilly.com</link>
	<description>Making Data Work</description>
	<lastBuildDate>Tue, 21 May 2013 19:54:11 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Looking ahead to a world of data-dominated decisions</title>
		<link>http://strata.oreilly.com/2013/05/looking-ahead-to-a-world-of-data-dominated-decisions.html</link>
		<comments>http://strata.oreilly.com/2013/05/looking-ahead-to-a-world-of-data-dominated-decisions.html#comments</comments>
		<pubDate>Tue, 21 May 2013 19:54:11 +0000</pubDate>
		<dc:creator>Andy Oram</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Cukier]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[Mayer-Schoenberger]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=57528</guid>
		<description><![CDATA[Measuring a world-shaking trend with feet planted in every area of human endeavor cannot be achieved in a popular book of 200 pages, but one has to start somewhere. I am happy to recommend the adept efforts of Viktor Mayer-Schönberger &#8230; ]]></description>
				<content:encoded><![CDATA[<p>Measuring a world-shaking trend with feet planted in every area of human endeavor cannot be achieved in a popular book of 200 pages, but one has to start somewhere. I am happy to recommend the adept efforts of Viktor Mayer-Schönberger and Kenneth Cukier as a starting point. Their recent book <em>Big Data: A Revolution That Will Transform How We Live, Work, and Think</em> (recently featured in a video interview on the <a href="http://strata.oreilly.com/2013/03/data-brokers-sensors-search-data-big-brother.html#more-55603">O&#8217;Reilly Strata site</a>) does not quite unravel the mystery of the zeal for recording and measurement that is taking over governments and business, but it does what a good popularization should: alert us to what&#8217;s happening, provide some frameworks for talking about it, and provide a launchpad for us to debate the movement&#8217;s good and evil.</p>
<p>Because readers of this blog have been grappling with these concerns for some time, I&#8217;ll provide the barest summary of topics covered in Mayer-Schönberger and Cukier&#8217;s extensive overview, then offer some complementary ideas of my own.<br />
<span id="more-57528"></span></p>
<h3>Summary of book topics</h3>
<p>Some of the themes of <em>Big Data</em> that grabbed my interest include:</p>
<ul>
<li>New tools for measuring the world and people&#8217;s activities provide data sets that are many orders of magnitude larger than we are used to having, and computers tied together in clusters run novel techniques to find insights never available to us before.</li>
<li>Data crunchers are finding correlations that provide useful guidance for actions. Mere correlation cannot tell us <em>why</em> something is happening, but often it doesn&#8217;t matter. The authors cite numerous examples where correlations by themselves suggested valuable actions.</li>
<li>Because big data opens up efficiencies to those savvy enough to use it, the future of business belongs to huge organizations (including middlemen and aggregators) with the resources to collect both data and experts to manipulate it, or to smaller organizations who are nimble enough to make hay from open or cheaply available data sets.</li>
<li>Control over our own lives may slip more and more from our hands as institutions use statistical insights to determine not only what we are doing, but what we are likely to do in the future. One chapter in <em>Big Data</em> is devoted to policy-related remedies.</li>
<li>Old-timers&#8217; intuitions are challenged by the findings of big data, just as Deep Blue&#8217;s brute-force processing of chess moves can overcome the world&#8217;s best human chess masters. Nevertheless, the authors end affirming the importance of human insight and choice.</li>
</ul>
<p>These represent a grand agenda for one book (nor have I exhausted all its topics), but I&#8217;d like to jump ahead to ideas that <em>Big Data</em> stimulated for me, leaving it up to readers to get the book for themselves if they want to study all its conclusions. In the interest of full disclosure, I&#8217;ll mention that one author&#8211;Cukier, the data editor of The Economist&#8211;helped me get an article published several years ago.</p>
<h3>Other aspects of big data</h3>
<p>Mayer-Schönberger and Cukier&#8217;s view of traditional statistical techniques deserves a bit of examination. They tend to place these in opposition to newer techniques of crunching big data. According to their thesis, the old techniques were developed to deal with small samples and all the uncertainties they presented about representing the whole population. Those outdated assumptions compromise their applicability to a new age, where computers just iterate over the whole population. The authors even recount a suspicion of traditional statisticians voiced by one of their big data experts, New York City&#8217;s Mike Flowers, who was put off by statisticians&#8217; interest in &#8220;arcane concerns about mathematical models&#8221; (p. 186).</p>
<p>Certainly, the authors say, there is a place for traditional statistics. It can even be used to run traditional experiments in order to validate suggestions made by big data crunching. But I think the relationship between old and new techniques is much tighter. This question has an important bearing on the power exerted by big data, because I believe proper techniques will be harder to learn and accurately apply than Mayer-Schönberger and Cukier suggest. While they expect the skills soon to become &#8220;commonplace&#8221; (pp. 125-126), I think there will be a crying shortage for some time, allowing a few large institutions with deep pockets to corner the market.</p>
<p>Let&#8217;s take the common big-data task of clustering, which might help in such situations as an art dealer trying to determine that Leonardo da Vinci is closer to El Greco in style than to Andy Warhol. Clustering algorithms can take a very long time to run, and choosing good starting points is important to reduce compute time. In fact, characteristics of the data can help a data scientist choose which algorithm to run in the first place. So what can provide a starting point for the big data venture? A traditional statistical analysis of a random sample could be a good choice.</p>
<p>This extends throughout the field of data. Even the choice of the best sorting technique&#8211;a common exercise in the first classes for programming students&#8211;can vary depending on the characteristics of the data being sorted.</p>
<p><em>Big Data</em> is not oriented toward this sort of technical discussion. The authors chose quite reasonably to avoid equations or other accoutrements of a mathematical explanation, which I&#8217;m sure would have scared off readers. And yet without some such background (to be sure, I&#8217;m no mathematician or statistician), one can&#8217;t determine the real strengths and weaknesses of the big data movement.</p>
<p>Let&#8217;s turn to the critical question of <em>transparency</em>, which Mayer-Schönberger and Cukier consider a necessity to help people challenge the decisions that others derive from data analysis. Transparency is no panacea, in my view. First, algorithms are incredibly complex. Second, as we&#8217;ve seen, the choice of algorithm (as well as the data to be analyzed) requires some subjective judgment, which is hard to challenge.</p>
<p>Worse still, any calculation affecting humans has winners and losers, and therefore makes some people eager to game the system. <em>Big Data</em> mentions the trivial example of orange used cars being found to be in better shape than other used cars (pp. 66-67), and points out how ridiculous it would be for car owners to paint the cars orange before selling them. This kind of dueling incentive applies across the board. It&#8217;s one reason Google doesn&#8217;t publish its search rank algorithms&#8211;and in fact, one reason it changes those algorithms on a daily basis.</p>
<p>Thus, instead of asking banks or insurers to reveal their decision-making processes, it may be better to give individuals access to data about themselves, and a process such as the &#8220;external algorithmist&#8221; proposed by Mayer-Schönberger and Cukier (p. 181) to allow individuals to present exculpatory evidence.</p>
<p>Both the external algorithmist and the internal algorithmist (similar to an ombudsman) envisioned by the authors are good additions to an organization. Dr. Brigitte Piniewski (whose work with an <a href="http://aimlab.cs.uoregon.edu/SMASH/">NIH-funded experiment in the community collection of health data</a> was covered in a <a href="http://strata.oreilly.com/2012/10/open-source-software-as-a-model-for-health-care.html">Strata article</a>) suggests that the role of an algorithmist is a very difficult one, and in fact too big for a single person to fill. The algorithmist, in her assessment, must have some basic knowledge not only of statistics but of real-world disciplines such as physics and biology, in order to have a sense of what is possible and what analytic results are absurd on their face.</p>
<p>Even more important for algorithmists is a correct attitude: an innate skepticism that may be an inborn trait rather than something teachable. And she says this willingness to constantly challenge accepted beliefs must run throughout the organization, which cannot rely on a single expert to provide this corrective.</p>
<p>Because many institutional decisions take place in the background where the affected individuals never find out they took place at all (when have you ever learned that a marketer decided <em>not</em> to offer you a great deal?), the reactive approach has grave limitations. Furthermore, it&#8217;s hard to trust institutions to take self-corrective actions to preserve privacy and individual autonomy at this historical moment when we&#8217;re reeling from revelations that the IRS targeted institutions based on their political positions and the Justice Department gathered phone records across the board from Associated Press reporters.</p>
<p>My last cluster of concerns relates to the role of human intuition and creativity in an age of big data, where the authors end their book on a high note. It&#8217;s important to recognize that big data analysis consists of applying what happened in the <em>past</em> to what <em>will</em> happen in the future. Had there been no recent influenza outbreaks, Google could not have run the tests that produced their famous flu prediction algorithm.</p>
<p>And thus the data we collected in the past hangs over us. Suppose we analyze arrest and conviction records to determine whom we should target most heavily for policing? Guess what? African-Americans and Hispanics will get the bulk of police scrutiny, because the police have targeted them disproportionately for generations. In short, correlations could turn into textbook cases of self-fulfilling prophecies. As Mayer-Schönberger and Cukier say, we still need human intervention to think outside the box.</p>
<p>I think big data will accentuate today&#8217;s trend to differentiate between the commoditized and the innovative. As in manufacturing, we will see more and more decision-making calling on big data&#8211;but with crucial human correctives, as noted before.</p>
<p>The relation between invention and standardization is a bit like the promise of 3D printers such as the Makerbot. On the one hand, they allow flights of invention by clever tinkerers like never before. But the printers depend on microchips made in sterile labs by the millions, and other materials from large manufacturers. (The biggest US producer of polylactide filament is a <a href="http://www.natureworksllc.com/About-NatureWorks-LLC">subsidiary of Cargill</a>.) Used properly, big data could similarly be the greatest contributor in history to personal innovation.</p>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/05/looking-ahead-to-a-world-of-data-dominated-decisions.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Six disruptive possibilities from big data</title>
		<link>http://strata.oreilly.com/2013/05/six-disruptive-possibilities-from-big-data.html</link>
		<comments>http://strata.oreilly.com/2013/05/six-disruptive-possibilities-from-big-data.html#comments</comments>
		<pubDate>Mon, 20 May 2013 16:00:17 +0000</pubDate>
		<dc:creator>Jeff Needham</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[business]]></category>
		<category><![CDATA[customers]]></category>
		<category><![CDATA[disruption]]></category>
		<category><![CDATA[Disruptive Possibilities]]></category>
		<category><![CDATA[ecosystem]]></category>
		<category><![CDATA[enterprise]]></category>
		<category><![CDATA[vendors]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=56218</guid>
		<description><![CDATA[My new book, Disruptive Possibilities: How Big Data Changes Everything, is derived directly from my experience as a performance and platform architect in the old enterprise world and the new, Internet-scale world. I pre-date the Hadoop crew at Yahoo!, but &#8230; ]]></description>
				<content:encoded><![CDATA[<p><a href="http://oreilly.com/radarreports/disruptive-possibilities.csp"><img class="alignright size-full wp-image-56225" alt="Disruptive Possibilities" src="http://s.radar.oreilly.com/wp-files/5/2013/04/0413-disruptive-possibilities-cover.png" width="250" height="250" /></a>My new book, <a href="http://oreilly.com/radarreports/disruptive-possibilities.csp"><em>Disruptive Possibilities: How Big Data Changes Everything</em></a>, is derived directly from my experience as a performance and platform architect in the old enterprise world and the new, Internet-scale world.</p>
<p>I pre-date the Hadoop crew at Yahoo!, but I intimately understood the grid engineering that made Hadoop possible. For years, the working title of this book was <em>The Art and Craft of Platform Engineering</em>, and when I started working on Hadoop after a stint in the Red Hat kernel group, many of the ideas that were jammed into my head, going back to my experience with early supercomputers, all seemed to make perfect sense for Hadoop. This is why I frequently refer to big data as &#8220;commercial supercomputing.&#8221;</p>
<p>In <em>Disruptive Possibilities</em>, I discuss the implications of the big data ecosystem over the next few years. These implications will inundate vendors and customers in a number of ways, including: <span id="more-56218"></span></p>
<ol>
<li>The disruption to the silo mentality, both in IT organizations and the industry that serves them, will be the big story of big data.</li>
<li>The IT industry will be battered by the new technology of big data because many of the products that pre-date Hadoop are laughably unaffordable at scale. Big data hardware and software are hundreds of times faster than existing enterprise-scale products and often thousands of times cheaper.</li>
<li>Technology as new and disruptive as big data is often resisted by IT organizations because their corporate mandate requires them to obsess about minimizing <a href="http://en.wikipedia.org/wiki/Operating_expense">OPEX</a> and not tolerate innovation, forcing IT to be the big bad wolf of big data.</li>
<li>IT organizations will be affected by the generation that replaces those who invested their careers in Oracle, Microsoft, and EMC. The old adage &#8220;no one ever gets fired for buying (historically) IBM&#8221; only applies to mature, established technology, not to immature and disruptive technology. Big data is the most disruptive force this industry has seen since the introduction of the relational database.</li>
<li>Big data requires data scientists and programmers to develop a better understanding of how the data flows underneath them, including an introduction (or reintroduction) to the computing platform that makes it possible. This may be outside of their comfort zones if they are similarly entrenched within silos. Professionals willing to learn new ways of collaborating, working, and thinking will prosper. That prosperity will be as much about highly efficient and small teams of people as it is about highly efficient and large groups of servers.</li>
<li>Civil liberties and privacy will be compromised as technology improvements make it affordable for any organization (private, public or clandestine) to analyze the patterns of data and behavior of anyone who uses a mobile phone.</li>
</ol>
<p>Much more is covered in <a href="http://oreilly.com/radarreports/disruptive-possibilities.csp"><em>Disruptive Possibilities: How Big Data Changes Everything</em></a>. Download it for free <a href="http://oreilly.com/radarreports/disruptive-possibilities.csp">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/05/six-disruptive-possibilities-from-big-data.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Improving options for unlocking your graph data</title>
		<link>http://strata.oreilly.com/2013/05/improving-options-for-unlocking-your-graph-data.html</link>
		<comments>http://strata.oreilly.com/2013/05/improving-options-for-unlocking-your-graph-data.html#comments</comments>
		<pubDate>Sun, 19 May 2013 16:00:24 +0000</pubDate>
		<dc:creator>Ben Lorica</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[machine]]></category>
		<category><![CDATA[social graph]]></category>
		<category><![CDATA[social network analysis]]></category>
		<category><![CDATA[spark]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=57388</guid>
		<description><![CDATA[The popular open source project GraphLab received a major boost early this week when a new company composed of its founding developers raised funding to develop analytic tools for graph data sets. GraphLab Inc. will continue to use the open &#8230; ]]></description>
				<content:encoded><![CDATA[<p>The popular open source project <a href="http://graphlab.org/">GraphLab</a> received a major boost early this week when a new company composed of its founding developers <a href="http://graphlab.com/press/">raised funding</a> to develop analytic tools for graph data sets. <a href="http://graphlab.com/">GraphLab Inc.</a> will continue to use the open source GraphLab to &#8220;push the limits of graph computation and develop new ideas&#8221;, but having a commercial company will accelerate development, and allow the hiring of resources dedicated to improving usability and documentation.</p>
<p>While social media placed graph data on the radar of many companies, similar data sets can be found in many domains including the life and health sciences, security, and financial services. Graph data is different enough that it necessitates special tools and techniques. Because tools were a bit too complex for casual users, in the past this meant graph data analytics was the province of specialists. Fortunately, graph data is an area that has attracted many enthusiastic entrepreneurs and developers. The tools have improved and I expect things to get much easier for users in the future. A great place to learn more about tools for graph data is the upcoming <a href="http://graphlab.org/graphlab-workshop-2013/">GraphLab Workshop</a> (on July 1st in SF).</p>
<p><b>Data wrangling: creating graphs</b><br />
Before you can take advantage of the other tools mentioned in this post, you&#8217;ll need to turn your data (e.g., web pages) into graphs. <a href="https://01.org/graphbuilder/">GraphBuilder</a> is an open source project from Intel that uses Hadoop MapReduce<sup>1</sup> to build graphs out of large data sets. Another option is the combination of GraphX/Spark <a href="#gx">described below</a>. (A startup called <a href="http://trifacta.com/">Trifacta</a> is building a general-purpose data wrangling tool that could help as well.)</p>
<p><span id="more-57388"></span></p>
<p><b>Data management and search</b><br />
Once you have a graph, there are many options for how to store it. The choice of database largely depends on the amount of data (number of nodes and edges, along with the size of the data associated with them), the types of tasks (pattern-matching and search, analytics), and workload. In the course of evaluating alternatives to MySQL (for storing social graph data), Facebook&#8217;s engineering team developed and released <a href="https://github.com/facebook/linkbench">Linkbench</a> &#8211; a benchmark that can be used to study how graph databases handle production workloads.</p>
<p>Most <a href="http://en.wikipedia.org/wiki/Graph_database#Graph_database_features">graph databases</a> (such as <a href="http://www.neo4j.org/">Neo4j</a><sup>2</sup>, <a href="http://www.franz.com/agraph/allegrograph/">AllegroGraph</a>, <a href="http://yarcdata.com/Products/">Yarcdata</a>, and <a href="http://www.objectivity.com/infinitegraph">InfiniteGraph</a>) come with tools for facilitating and speeding up search &#8211; Neo4j comes with a simple query language (<a href="http://www.neo4j.org/learn/cypher">Cypher</a>) for search, while other graph databases support <a href="http://en.wikipedia.org/wiki/SPARQL">SPARQL</a>. The <a href="http://thinkaurelius.github.io/titan/">Titan</a> distributed graph database supports different storage engines (including HBase and Cassandra) and comes with tools for search and traversal (based on Lucene and <a href="https://github.com/tinkerpop/gremlin/wiki">Gremlin</a>). Used by Twitter to store graph data, <a href="https://github.com/twitter/flockdb">FlockDB</a> targets operations involving <a href="http://engineering.twitter.com/2010/05/introducing-flockdb.html">adjacency</a> lists.</p>
<p>Among Hadoop users, <a href="http://www.cloudera.com/content/cloudera/en/resources/library/hbasecon/video-hbasecon-2012-storing-and-manipulating-graphs-in-hbase.html">HBase is a popular option for storing graph data</a>. Hadapt&#8217;s analytic platform<sup>3</sup> integrates Apache Hadoop and SQL, and now also <a href="http://hadapt.com/product/">supports graph analysis</a>.</p>
<p><b>Graph-parallel frameworks</b>: Pregel, PowerGraph, and <a name="gx">GraphX</a><br />
<a href="http://en.wikipedia.org/wiki/Bulk_Synchronous_Parallel">BSP</a> is a parallel computing model that has inspired many graph analytics tools. Much as Hadoop provides <i>map</i> and <i>reduce</i>, <a href="http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html">Pregel</a><sup>4</sup>, <a href="http://giraph.apache.org/">Giraph</a>, and <a href="http://hyracks.org/projects/pregelix/">Pregelix</a> come with <i>primitives</i> that let neighboring nodes send/receive messages to one another, or change the state of a node (based on the state of its neighboring nodes). Efficient graph algorithms are a sequence of iterations built from such primitives. GraphLab uses similar primitives (called <i><a href="http://www.select.cs.cmu.edu/publications/paperdir/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf">PowerGraph</a></i>) but allows for <a href="http://www.select.cs.cmu.edu/publications/paperdir/vldb2012-low-gonzalez-kyrola-bickson-guestrin-hellerstein.pdf"><i>asynchronous</i> iterative computations</a>, leading to an expanded set of (potentially) faster algorithms.</p>
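<p>To make the idea of message-passing primitives concrete, here is a minimal single-process sketch in Python. It only imitates the synchronous superstep pattern described above and is not the API of Pregel, Giraph, Pregelix, or GraphLab; the function name <code>propagate_max</code> and the toy graph are assumptions made up for illustration. Each vertex repeatedly sends its value to its neighbors and keeps the largest value it has seen, so every connected component converges on its maximum.</p>
<pre>
```python
# Toy imitation of BSP-style supersteps in one process.
# Hypothetical helper for illustration; not any framework's real API.

def propagate_max(edges, values):
    """Spread the largest value through each connected component.

    edges  -- iterable of (a, b) pairs, treated as undirected
    values -- dict mapping every vertex to its starting value
    Returns (final_state, number_of_supersteps).
    """
    # Build an adjacency map so each vertex knows its neighbors.
    neighbors = {v: set() for v in values}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    state = dict(values)
    active = set(state)   # vertices whose value changed in the last superstep
    supersteps = 0
    while active:
        supersteps += 1
        # Message phase: every active vertex sends its value to its neighbors.
        inbox = {v: [] for v in state}
        for v in active:
            for n in neighbors[v]:
                inbox[n].append(state[v])
        # Compute phase: each vertex updates its own state from its inbox.
        active = set()
        for v, messages in inbox.items():
            new_value = max([state[v]] + messages)
            if new_value != state[v]:
                state[v] = new_value
                active.add(v)
    return state, supersteps


if __name__ == "__main__":
    toy_edges = [(1, 2), (2, 3), (4, 5)]
    toy_values = {1: 3, 2: 6, 3: 2, 4: 1, 5: 9}
    print(propagate_max(toy_edges, toy_values))
```
</pre>
<p>In this sketch all messages for a superstep are collected before any vertex updates, which is the synchronous barrier that BSP implies. An asynchronous engine would let a vertex act on a neighbor&#8217;s new value as soon as it is available.</p>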
<p><a href="https://amplab.cs.berkeley.edu/publication/graphx-grades/">GraphX is a new, fault-tolerant framework</a> that runs within <a href="http://spark-project.org/">Spark</a>. Its core data structure is an immutable <i>graph</i><sup>5</sup> (Resilient Distributed Graph &#8211; or RDG), and GraphX programs are a sequence of <i>transformations</i> on RDGs (with each transformation yielding a new RDG). Transformations on RDGs can affect nodes, edges, or both (depending on the state of neighboring edges and nodes). GraphX greatly enhances productivity by simplifying a range of tasks (graph loading, construction, transformation, and computations). But it does so at the expense of performance: early prototype algorithms written in GraphX were slower<sup>6</sup> than those written in GraphLab/PowerGraph.</p>
<p><b>Machine-learning and analytics</b><br />
Machine-learning tools that target graph data lead to familiar applications such as detecting influential users (PageRank) and communities, fraud detection, and recommendations (<a href="http://strata.oreilly.com/2012/12/graphchi-graph-analytics-over-billions-of-edges-using-your-laptop.html">collaborative filtering is popular among GraphLab users</a>). Moreover, techniques developed in one domain are often reused in other settings. Besides GraphLab, distributed analytics have been implemented in <a href="http://engineering.linkedin.com/open-source/apache-giraph-framework-large-scale-graph-processing-hadoop-reaches-01-milestone">Giraph</a>, <a href="https://amplab.cs.berkeley.edu/publication/graphx-grades/">GraphX</a>, <a href="http://hortonworks.com/blog/big-graph-data-on-hortonworks-data-platform/">Faunus</a>, and <a href="http://sampa.cs.washington.edu/grappa/overview.html">Grappa</a>. In addition, graph databases like Neo4j and Yarcdata come with some analytic capabilities. As I noted in a <a href="http://strata.oreilly.com/2013/04/single-server-systems-can-tackle-big-data.html">recent post, open source, single-node systems like Twitter&#8217;s Cassovary</a><sup>7</sup> are being used for computations involving massive graphs.</p>
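<p>For readers who have not seen PageRank spelled out, the following is a small, single-machine sketch in plain Python. It is a textbook power-iteration version under assumed defaults (a damping factor of 0.85 and a fixed number of iterations) and is not how any of the systems named above implement it; the function name <code>pagerank</code> and the three-node example are hypothetical.</p>
<pre>
```python
# Textbook power-iteration PageRank on a tiny in-memory graph.
# Illustrative sketch only; the defaults below are assumptions.

def pagerank(links, damping=0.85, iterations=50):
    """Return a dict of node scores that sum to roughly 1.

    links -- dict mapping each node to the list of nodes it links to
    """
    # Collect every node that appears as a source or as a target.
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}

    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                # Split this node's damped rank evenly among its out-links.
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A node with no out-links spreads its rank over all nodes.
                share = damping * rank[v] / n
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
    for node, score in sorted(pagerank(toy_links).items()):
        print(node, round(score, 3))
```
</pre>
<p>In the toy graph, node <code>c</code> receives links from both other nodes and ends up with the highest score, while <code>b</code>, which nothing links to, ends up with the lowest. Distributed frameworks express the same per-node update as messages exchanged between neighbors.</p>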
<p><b>Visualization</b><br />
When you&#8217;re dealing with large graphs, being able to zoom in/out helps with clutter, but so do <a href="http://www2.research.att.com/~yifanhu/GALLERY/GRAPHS/">clever layout algorithms</a>. Popular tools for visualizing nodes and edges include <a href="https://gephi.org/">Gephi</a> and <a href="http://graphviz.org/">GraphViz</a>. Users who want to customize their graphs turn to packages like <a href="https://github.com/mbostock/d3/wiki/Gallery">d3</a>.</p>
<hr />
<p><small><br />
(1) I would love to see a version of GraphBuilder that&#8217;s built on top of <a href="http://strata.oreilly.com/2012/08/seven-reasons-why-i-like-spark.html">Spark</a>.<br />
(2) Many of these systems are quite efficient. For example, a single instance of Neo4j <a href="http://blog.neo4j.org/2013/01/2013-whats-coming-next-in-neo4j.html">can handle very large graphs</a> (&#8220;into the tens of billions of nodes/ relationships/ properties&#8221;).<br />
(3) Note that using standard Hadoop for graph <i>processing</i> <a href="http://dbmsmusings.blogspot.com/2011/07/hadoops-tremendous-inefficiency-on.html">may not be the most efficient</a> option. This <a href="http://www.slideshare.net/cloudera/hadoop-and-graph-data-management-challenges-and-opportunities-daniel-abadi-yale-university-hadapt">talk by Hadapt co-founder Daniel Abadi</a> describes an advanced approach to graph analysis using Hadoop.<br />
(4) Related frameworks include <a href="http://wwwrel.ph.utexas.edu/Members/jon/golden_orb/">GoldenOrb</a> and <a href="http://hama.apache.org/">Hama</a>.<br />
(5) Resilient Distributed Graphs (RDG) extend Spark’s Resilient Distributed Dataset (RDD).<br />
(6) <a href="https://amplab.cs.berkeley.edu/publication/graphx-grades/">As the developers of GraphX note</a>: <i>&#8220;We emphasize that it is not our intention to beat PowerGraph in performance. &#8230; We believe that the loss in performance may, in many cases, be ameliorated by the gains in productivity achieved by the GraphX system. &#8230; It is our belief that we can shorten the gap in the near future, while providing a highly usable <u>interactive</u> system for graph data mining and computation&#8221;</i><br />
(7) On the plus side, being single-node means Cassovary doesn&#8217;t have to deal with finding the optimal way to partition a graph. On the other hand, it is limited to graphs that fit in the memory of a server &#8211; a limitation it alleviates through the use of efficient data structures.<br />
</small></p>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/05/improving-options-for-unlocking-your-graph-data.html/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Strata Week: Are customized Google maps a neutrality win or the next &#8220;filter bubble&#8221;?</title>
		<link>http://strata.oreilly.com/2013/05/strata-week-are-customized-google-maps-a-neutrality-win-or-the-next-filter-bubble.html</link>
		<comments>http://strata.oreilly.com/2013/05/strata-week-are-customized-google-maps-a-neutrality-win-or-the-next-filter-bubble.html#comments</comments>
		<pubDate>Fri, 17 May 2013 17:30:40 +0000</pubDate>
		<dc:creator>Jenn Webb</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Aaron Swartz]]></category>
		<category><![CDATA[anonymous inbox]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Google Maps]]></category>
		<category><![CDATA[Kevin Poulsen]]></category>
		<category><![CDATA[predictive apps]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=57434</guid>
		<description><![CDATA[Google aims for a new level of map customization Google introduced a new version of Google maps at Google I/O this week that learns from each use to customize itself to individual users, adapting based on user clicks and searches. &#8230; ]]></description>
				<content:encoded><![CDATA[<h2 id="google-maps">Google aims for a new level of map customization</h2>
<p>Google introduced <a href="http://maps.google.com/help/maps/helloworld/desktop/preview/">a new version of Google maps</a> at Google I/O this week that learns from each use to customize itself to individual users, adapting based on user clicks and searches. <a href="http://google-latlong.blogspot.com/2013/05/meet-new-google-maps-map-for-every.html">A post on the Google blog</a> outlines the updates, which include recommendations for places you might enjoy (based upon your map activity), ratings and reviews, integrated Google Earth, and tours generated from user photos, to name a few.</p>
<p><iframe width="640" height="360" src="http://www.youtube.com/embed/THxJHcR1D2c?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p><span id="more-57434"></span></p>
<p><a href="http://www.theatlantic.com/technology/archive/2013/05/why-the-new-google-maps-is-the-most-honest-form-of-cartography/275947/">Leo Mirani at The Atlantic</a> says the update &#8220;fixes the one thing that has always been wrong with maps&#8221; — namely, neutrality. Though maps are typically viewed as neutral objects, Mirani argues, they&#8217;re &#8220;about as impartial as journalism.&#8221; He <a href="http://www.theatlantic.com/technology/archive/2013/05/why-the-new-google-maps-is-the-most-honest-form-of-cartography/275947/">writes</a>:</p>
<blockquote><p>&#8220;Is the hot dog vendor who stands on a street corner as worthy of inclusion as the bank on the same corner? That decision still lay with mapmakers — in this case a giant internet company. Google&#8217;s solution to the problem was to remove itself from the equation.&#8221;</p></blockquote>
<p>Mirani notes that &#8220;<a href="http://en.wikipedia.org/wiki/Mental_mapping">mental maps</a>,&#8221; maps created based on how a person sees the world, are nothing new, but Google has managed to provide the mental map as a service to millions of users.</p>
<p>In <a href="http://www.theatlanticcities.com/technology/2013/05/potential-problem-personalized-google-maps-we-may-never-know-what-were-not-seeing/5617/">a post at The Atlantic Cities, Emily Badger argues</a> that this level of individual customization comes at a cost — &#8220;An algorithm that knows you too well,&#8221; she writes, &#8220;does a terrible job of telling you things you don&#8217;t already know.&#8221; Badger relates the new maps to Eli Pariser&#8217;s &#8220;<a href="http://www.thefilterbubble.com">filter bubble</a>&#8221; concept: as the algorithms get to know you, you increasingly get content that leans toward your established point of view, preventing you from broadening your experiences. Badger also notes the increasing issue with &#8220;the inequality of information online&#8230;rendering some <a href="http://www.theatlanticcities.com/technology/2013/02/how-internet-reinforces-inequality-real-world/4602/">real-world people and places virtually invisible</a>&#8221; and wonders if the new maps won&#8217;t exacerbate the problem. You can read her full report <a href="http://www.theatlanticcities.com/technology/2013/05/potential-problem-personalized-google-maps-we-may-never-know-what-were-not-seeing/5617/">at The Atlantic Cities</a>.</p>
<h2 id="intelligent-apps">App development trends lean toward predictive, intelligent service offerings</h2>
<p><a href="http://www.technologyreview.com/news/514366/with-personal-data-predictive-apps-stay-a-step-ahead/?utm_campaign=socialsync&amp;utm_medium=social-post&amp;utm_source=twitter">MIT Technology Review&#8217;s Tom Simonite took a look</a> at the growing app development trend of providing users with personalized information, service connections, and recommendations before they even ask or search. Noting the departure of this trend from typical &#8220;dumb&#8221; computers and software that waited for human operator interaction, Simonite looks at such apps as Google Now, which aims to predict a user&#8217;s actions in order to provide appropriate assistance, and the Osito iPhone app, which similarly predicts actions to offer helpful information but also provides actionable assistance, such as a button to call a cab when a user&#8217;s flight reminder pops up.</p>
<p>Bit.ly chief data scientist Hilary Mason told Simonite that Google Now is far from perfect in the usefulness of the information it provides, but she uses it anyway and finds the technology important &#8220;because it&#8217;s the first time Google has taken all they know about us to make a product that makes our lives better.&#8221; You can read Simonite&#8217;s full report <a href="http://www.technologyreview.com/news/514366/with-personal-data-predictive-apps-stay-a-step-ahead/?utm_campaign=socialsync&amp;utm_medium=social-post&amp;utm_source=twitter">at MIT Technology Review</a>.</p>
<p>In related news, Google announced new services at its I/O conference this week that will help developers build apps that can track users as well as Google does — and without draining a user&#8217;s battery. <a href="http://www.technologyreview.com/news/514956/google-wants-to-help-apps-track-you/">Jessica Leber reports at MIT Technology Review</a> that the service will allow developers to build apps that tap into the accelerometer of a user&#8217;s device instead of the &#8220;power-hungry&#8221; GPS sensor to determine whether a person is driving, walking or cycling; the apps would run Google algorithms, Leber notes, that would &#8220;learn over time whether a person is stuck in traffic or just out for an evening stroll.&#8221; The new services also include the ability for developers to create geofences to trigger actions based on a user&#8217;s location. You can read Leber&#8217;s full report <a href="http://www.technologyreview.com/news/514956/google-wants-to-help-apps-track-you/">at MIT Technology Review</a>.</p>
<h2 id="anonymous-inbox">The New Yorker gets an anonymous inbox to protect sources</h2>
<p>The New Yorker launched <a href="http://www.newyorker.com/strongbox/">Strongbox</a> this week, a new tool for people to share files, information, and messages with The New Yorker staff with an increased level of anonymity. In <a href="http://www.newyorker.com/online/blogs/closeread/2013/05/introducing-strongbox-anonymous-document-sharing-tool.html">a post announcing the launch, Amy Davidson explains</a> that with the way Strongbox is set up, they won&#8217;t be able to tell where a piece of information came from, and thus won&#8217;t be able to tell anyone where it came from, providing better protection for their sources.</p>
<p>The <a href="http://www.newyorker.com/strongbox/">Strongbox site</a> explains further that the system is only accessible through the Tor network and that when a user submits a message or file, no I.P. address is recorded and no information about a user&#8217;s browser, computer or operating system is gathered. Strongbox also doesn&#8217;t include any third-party content or use cookies. Users are given a randomly generated, unique code name so that New Yorker staff have a way to contact them if necessary via a message left in Strongbox.</p>
<p>Davidson notes that the Strongbox tool was developed by Aaron Swartz and Kevin Poulsen, and is based on their underlying code dubbed <a href="http://deaddrop.github.io">DeadDrop, which will be open source</a>. You can read the development story and history in <a href="http://www.newyorker.com/online/blogs/newsdesk/2013/05/strongbox-and-aaron-swartz.html">a post Poulsen wrote for the New Yorker</a>.</p>
<h2>Tip us off</h2>
<p>News tips and suggestions are always welcome, so please send them <a href="mailto:pitchstrata@oreilly.com">along</a>.</p>
<p><strong>Related:</strong></p>
<ul>
<li><a href="http://strata.oreilly.com/2012/05/google-knowledge-graph-yahoo-census.html">Google unveils its Knowledge Graph</a></li>
<li><a href="http://radar.oreilly.com/2009/05/google-rich-snippets-semantic-web.html">Google&#8217;s Rich Snippets and the Semantic Web</a></li>
<li><a href="http://strata.oreilly.com/2012/06/predictive-data-analytics-big-data-nyc.html">Predictive data analytics is saving lives and taxpayer dollars in New York City</a></li>
<li><a href="http://strata.oreilly.com/2011/05/anonymize-data-limits.html">Why you can&#8217;t really anonymize your data</a></li>
<li><a href="http://strata.oreilly.com/tag/strata-week">More Strata Week coverage</a></li>
</ul>
<div style="float: left;border-top: thin gray solid;border-bottom: thin gray solid;padding: 20px;margin: 20px 2px;clear: both"><p><a href="http://strataconf.com/?intcmp=il-strata-stny13-blog-promo"><img style="float: left;border: none;padding-right: 10px" src="http://cdn.oreilly.com/radar/images/promos/2013-strata-rx-london-ny.gif" /></a><a href="http://strataconf.com/?intcmp=il-strata-stny13-blog-promo"><strong>O&#8217;Reilly Strata Conference</strong></a> &mdash; Strata brings together the leading minds in data science and big data &mdash; decision makers and practitioners driving the future of their businesses and technologies. Get the skills, tools, and strategies you need to make data work.</p>
<p><a href="http://strataconf.com/rx2013?intcmp=il-strata-strx13-strata-blog-banner-148x178">Strata Rx Health Data Conference</a>: September 25-27 | Boston, MA<br /><a href="http://strataconf.com/stratany2013?intcmp=il-strata-stny13-blog-promo">Strata + Hadoop World</a>: October 28-30 | New York, NY<br /><a href="http://strataconf.com/strataeu2013/?intcmp=il-strata-steu13-blog-promo">Strata in London</a>: November 15-17 | London, England</p></div>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/05/strata-week-are-customized-google-maps-a-neutrality-win-or-the-next-filter-bubble.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>On becoming a code artist</title>
		<link>http://strata.oreilly.com/2013/05/becoming-a-code-artist.html</link>
		<comments>http://strata.oreilly.com/2013/05/becoming-a-code-artist.html#comments</comments>
		<pubDate>Thu, 16 May 2013 13:00:06 +0000</pubDate>
		<dc:creator>Ann Spencer</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[d3]]></category>
		<category><![CDATA[D3.js]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data visualization]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=57140</guid>
		<description><![CDATA[Scott Murray, a code artist, has written Interactive Data Visualization for the Web for nonprogrammers. In this interview, Scott provides some insights on what inspired him to write an introduction to D3 for artists, graphic designers, journalists, researchers, or anyone &#8230; ]]></description>
				<content:encoded><![CDATA[<p><a href="http://alignedleft.com/">Scott Murray</a>, a code artist, has written <a href="http://shop.oreilly.com/product/0636920026938.do"><em>Interactive Data Visualization for the Web</em></a> for nonprogrammers. In this interview, Scott provides some insights on what inspired him to write an introduction to D3 for artists, graphic designers, journalists, researchers, or anyone who is looking to begin programming data visualizations.</p>
<h3><strong>What inspired you to become a code artist?</strong></h3>
<div id="attachment_57222" class="wp-caption alignright" style="width: 160px"><a href="http://s.radar.oreilly.com/wp-files/5/2013/05/Scott-Murray.jpg"><img class="size-thumbnail wp-image-57222 " alt="Scott Murray" src="http://s.radar.oreilly.com/wp-files/5/2013/05/Scott-Murray-150x150.jpg" width="150" height="150" /></a><p class="wp-caption-text">Scott Murray</p></div>
<p><strong>Scott Murray:</strong> I had designed websites for a long time, but several years ago was frustrated by web browsers&#8217; limitations. I went back to school for an MFA to force myself to explore interactive options beyond the browser. At <a href="http://www.massart.edu/">MassArt</a>, I was introduced to <a href="http://www.processing.org/">Processing</a>, the free programming environment for artists. It opened up a whole new world of programmatic means of manipulating and interacting with data, and not just traditional data sets but also live &#8220;data&#8221; such as from input devices or dynamic APIs, which can then be used to manipulate the output. Processing let me start prototyping ideas immediately; it is so enjoyable to be able to build something that really works, rather than designing static mockups first and then, hopefully, one day investing the time to program them. Something about that shift in process is both empowering and liberating: being able to express your ideas quickly in code, and watch the system carry out your instructions, ultimately creating images and experiences that are beyond what you had originally envisioned.</p>
<p><span id="more-57140"></span></p>
<h3><strong>Why did you decide to write<em> Interactive Data Visualization for the Web</em>?</strong></h3>
<p><strong>Scott Murray:</strong> <a href="http://d3js.org/">D3.js</a> is the most powerful tool for creating visualizations on the web, hands down. Yet when I tried to use it for a project in late 2011, there wasn&#8217;t a lot of useful information available on how to use it. A few early members of the community had written tutorials, but they assumed a level of JavaScript familiarity that I didn&#8217;t have. The official D3 documentation is excellent, but not necessarily accessible to beginners. So I set about banging my head against the wall for a few weeks, struggling to learn the peculiarities of D3, and taking notes every time I encountered a challenging concept or had a revelation. Once the project was done, I revisited my notes and started writing tutorials that would introduce each of those challenging concepts to beginners: those new to JavaScript, and even HTML and CSS. (I&#8217;m seeing more people drawn to JavaScript via D3, just as my first experiences with JavaScript were thanks to jQuery.) My goal was to spare others the frustrations I experienced. Within months, the tutorials on my site were getting a lot of traffic, and people wrote in to request more tutorials: on the basics, on interactivity, on mapping. So I expanded the tutorials into <a href="http://shop.oreilly.com/product/0636920026938.do">a full-length book</a>.</p>
<h3><strong>Who do you envision reading this book? What will they learn after reading your book?</strong></h3>
<p><strong>Scott Murray:</strong> This book is for anyone interested in learning how to use D3 to create and publish visualizations on the web: journalists, designers, data scientists, statisticians, students, researchers, and would-be mapmakers. The book includes an introductory chapter on web fundamentals — HTML, CSS, JavaScript — so it&#8217;s very accessible to people new to web development, and even programming generally. There are also more than 100 code examples that accompany the book, so it&#8217;s easy to follow along and tweak the examples as you learn. In the end, anyone who works through the book will have a grip on all the basic concepts of D3. You&#8217;ll be able to make charts and graphs, even highly customized geographic maps with data overlays. And hopefully you&#8217;ll be inspired to learn more, experiment, and share your creations with the datavis community.</p>
<h3><strong>Any words of advice for an aspiring code artist?</strong></h3>
<p><strong>Scott Murray:</strong> Start making things now. I get to work with students, and I see them get stuck in their heads, trying to plan out every last detail before they start working on a project. This results in the project never starting at all, or being finished in a very rushed fashion. People (myself included) often have a tendency to over-think things, especially intimidating projects that will involve learning something new, or trying something we&#8217;ve never tried. While thinking is good, over-thinking prevents us from doing. And, in reality, doing is a critical part of thinking — you can&#8217;t really separate the two. So I suggest getting comfortable with not knowing what you&#8217;re doing before you do it. Just start making things now, today, even if you feel underprepared or like you don&#8217;t have all the answers you need yet. Guess what? No one has all the answers (even though we pretend to). We&#8217;re all just here figuring this stuff out as we go. So don&#8217;t over-think it, start producing projects, and get those projects out in the world. You&#8217;ll learn what you need to learn along the way.</p>
<h4><em>This interview was edited and condensed.</em></h4>
<p><strong>Related: </strong></p>
<ul>
<li><a href="http://strataconf.com/strata2013/public/schedule/detail/27425">Slides from Scott Murray&#8217;s D3.js tutorial at Strata Santa Clara 2013</a></li>
<li><a href="http://oreillynet.com/pub/e/2584">Engaging Audiences with Data Visualization &#8211; Webcast</a></li>
<li><a href="http://shop.oreilly.com/product/0636920026938.do">Interactive Data Visualization for the Web &#8211; Book</a></li>
<li><a href="http://shop.oreilly.com/product/0636920025603.do">Data Journalism Handbook</a></li>
<li><a href="http://strata.oreilly.com/2013/04/data-journalism-simon-rogers-twitter-pew-data-intel-mashery.html">Movers and shakers on the data journalism front</a></li>
</ul>
<div style="float: left;border-top: thin gray solid;border-bottom: thin gray solid;padding: 20px;margin: 20px 2px;clear: both">
<p><a href="http://strataconf.com/?intcmp=il-strata-stny13-blog-promo"><img style="float: left;border: none;padding-right: 10px" alt="" src="http://cdn.oreilly.com/radar/images/promos/2013-strata-rx-london-ny.gif" /></a><a href="http://strataconf.com/?intcmp=il-strata-stny13-blog-promo"><strong>O&#8217;Reilly Strata Conference</strong></a> — Strata brings together the leading minds in data science and big data — decision makers and practitioners driving the future of their businesses and technologies. Get the skills, tools, and strategies you need to make data work.</p>
<p><a href="http://strataconf.com/rx2013?intcmp=il-strata-strx13-strata-blog-banner-148x178">Strata Rx Health Data Conference</a>: September 25-27 | Boston, MA<br />
<a href="http://strataconf.com/stratany2013?intcmp=il-strata-stny13-blog-promo">Strata + Hadoop World</a>: October 28-30 | New York, NY<br />
<a href="http://strataconf.com/strataeu2013/?intcmp=il-strata-steu13-blog-promo">Strata in London</a>: November 15-17 | London, England</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/05/becoming-a-code-artist.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Visualization of the Week: Real-time Wikipedia edits</title>
		<link>http://strata.oreilly.com/2013/05/visualization-of-the-week-real-time-wikipedia-edits.html</link>
		<comments>http://strata.oreilly.com/2013/05/visualization-of-the-week-real-time-wikipedia-edits.html#comments</comments>
		<pubDate>Wed, 15 May 2013 21:00:51 +0000</pubDate>
		<dc:creator>Jenn Webb</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data visualization]]></category>
		<category><![CDATA[Wikipedia]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=57376</guid>
		<description><![CDATA[Stephen LaPorte and Mahmoud Hashemi have put together an addictive visualization of real-time edits on Wikipedia, mapped across the world. Every time an edit is made, the user&#8217;s location and the entry they edited are listed along with a corresponding &#8230; ]]></description>
				<content:encoded><![CDATA[<p>Stephen LaPorte and Mahmoud Hashemi have put together <a href="http://rcmap.hatnote.com/#en">an addictive visualization</a> of real-time edits on Wikipedia, mapped across the world. Every time an edit is made, the user&#8217;s location and the entry they edited are listed along with a corresponding dot on the map.</p>
<p><div id="attachment_57377" class="wp-caption aligncenter" style="width: 610px"><a href="http://rcmap.hatnote.com/#en"><img src="http://s.radar.oreilly.com/wp-files/5/2013/05/Wikipedia-Recent-Changes-Map.png" alt="Wikipedia-Recent-Changes-Map" width="600" height="614" class="size-full wp-image-57377" /></a><p class="wp-caption-text"><em>Click here for the full visualization.</em></p></div><br />
<span id="more-57376"></span></p>
<p>On the <a href="http://rcmap.hatnote.com/#en">Wikipedia Recent Changes Map</a>, edits can be viewed on 11 Wikipedia language versions, including English, German, Russian and Japanese. Every time an unregistered user edits a Wikipedia entry, his or her IP address is recorded and translated into an approximate location. The about section of the visualization explains that registered users don&#8217;t have associated IP information, so registered user edits are not shown on this map.</p>
<p>In <a href="http://blog.hatnote.com/post/49342528753/wikipedia-recent-changes-map">a blog post</a> about the map visualization, LaPorte and Hashemi note that the map was built with several libraries and services, including <a href="http://d3js.org/">d3</a>, <a href="http://datamaps.github.io/">DataMaps</a>, and <a href="http://freegeoip.net/">freegeoip.net</a>, and that the &#8220;map listens to live feeds of Wikipedia revisions, broadcast using <a href="https://github.com/hatnote/wikimon">wikimon</a>.&#8221; The map&#8217;s code is open source and available <a href="https://github.com/hatnote/rcmap">on GitHub</a>.</p>
<p><em>Hat tip to <a href="http://arstechnica.com/business/2013/05/live-map-of-recent-changes-to-wikipedia-articles-is-mesmerizing/">Megan Geuss at Ars Technica</a> and to <a href="http://www.theatlanticcities.com/technology/2013/05/live-map-manic-ways-people-edit-wikipedia/5547/">Emily Badger at The Atlantic Cities</a> for highlighting LaPorte&#8217;s and Hashemi&#8217;s work.</em></p>
<p><strong>More visualizations:</strong></p>
<ul>
<li><a href="http://strata.oreilly.com/2013/05/visualization-of-the-week-building-collapse-rescue-efforts.html">Building collapse rescue efforts</a></li>
<li><a href="http://strata.oreilly.com/2013/04/visualization-of-the-week-every-recorded-u-s-terror-attack-1970-2011.html">Visualization of the Week: Every recorded U.S terror attack 1970-2011</a></li>
<li><a href="http://strata.oreilly.com/2013/04/visualization-of-the-week-commuting-through-paris-metropolitain-io.html">Commuting Paris</a></li>
<li><a href="http://strata.oreilly.com/2013/04/visualization-of-the-week-a-day-in-the-life-of-a-bus-line.html">Visualization of the Week: A day in the life of a bus line</a></li>
<li><a href="http://strata.oreilly.com/2013/04/visualization-of-the-week-block-level-electricity-use-in-los-angeles.html">Block-level electricity use in Los Angeles</a></li>
</ul>
<div style="float: left;border-top: thin gray solid;border-bottom: thin gray solid;padding: 20px;margin: 20px 2px;clear: both"><p><a href="http://strataconf.com/?intcmp=il-strata-stny13-blog-promo"><img style="float: left;border: none;padding-right: 10px" src="http://cdn.oreilly.com/radar/images/promos/2013-strata-rx-london-ny.gif" /></a><a href="http://strataconf.com/?intcmp=il-strata-stny13-blog-promo"><strong>O&#8217;Reilly Strata Conference</strong></a> &mdash; Strata brings together the leading minds in data science and big data &mdash; decision makers and practitioners driving the future of their businesses and technologies. Get the skills, tools, and strategies you need to make data work.</p>
<p><a href="http://strataconf.com/rx2013?intcmp=il-strata-strx13-strata-blog-banner-148x178">Strata Rx Health Data Conference</a>: September 25-27 | Boston, MA<br />
<a href="http://strataconf.com/stratany2013?intcmp=il-strata-stny13-blog-promo">Strata + Hadoop World</a>: October 28-30 | New York, NY<br /><a href="http://strataconf.com/strataeu2013/?intcmp=il-strata-steu13-blog-promo">Strata in London</a>: November 15-17 | London, England</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/05/visualization-of-the-week-real-time-wikipedia-edits.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Big data, cool kids</title>
		<link>http://strata.oreilly.com/2013/05/big-data-cool-kids.html</link>
		<comments>http://strata.oreilly.com/2013/05/big-data-cool-kids.html#comments</comments>
		<pubDate>Tue, 14 May 2013 16:38:13 +0000</pubDate>
		<dc:creator>Edd Dumbill</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Events]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[strata]]></category>
		<category><![CDATA[strataconf]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=57360</guid>
		<description><![CDATA[The big data world is a confusing place. We’re no longer in a market dominated mostly by relational databases, and the alternatives have multiplied in a baby boom of diversity. These child prodigies of the data scene show great promise &#8230; ]]></description>
				<content:encoded><![CDATA[<p>The big data world is a confusing place. We’re no longer in a market dominated mostly by relational databases, and the alternatives have multiplied in a baby boom of diversity.</p>
<div id="attachment_57364" class="wp-caption alignright" style="width: 160px"><a href="http://s.radar.oreilly.com/wp-files/5/2013/05/schoolyard2.jpg"><img class="size-thumbnail wp-image-57364" alt="My data is bigger than yours." src="http://s.radar.oreilly.com/wp-files/5/2013/05/schoolyard2-150x150.jpg" width="150" height="150" /></a><p class="wp-caption-text">My data is bigger than yours.</p></div>
<p>These child prodigies of the data scene show great promise but spend a lot of time knocking each other around in the schoolyard. Their egos can sometimes be too big to accept that everybody has their place, and eyeball-seeking media certainly doesn’t help.</p>
<p><strong>POPULAR KID:</strong> Look at me! Big data is the hotness!<br />
<strong>HADOOP:</strong> My data’s bigger than yours!<br />
<strong>SCIPY:</strong> Size isn’t everything, Hadoop! The bigger they come, the harder they fall. And aren’t you named after a toy elephant?<br />
<strong>R:</strong> Backward sentences mine be, but great power contains large brain.<br />
<strong>EVERYONE:</strong> Huh?<br />
<strong>SQL:</strong> Oh, so you all want to be friends again now, eh?!<br />
<strong>POPULAR KID:</strong> Yeah, what SQL said! Nobody really needs big data; it’s all about small data, dummy.</p>
<p><span id="more-57360"></span></p>
<p>The fact is that we’re fumbling toward the adolescence of big data tools, and we’re at an early stage of understanding how data can be used to create value and increase the quality of service people receive from government, business and health care. Big data is trumpeted in mainstream media, but many businesses are better advised to take baby steps with small data.</p>
<p>Data skeptics are not without justification. Our use of “small data” hasn’t exactly worked out uniformly well so far, crude numbers often being misused either knowingly or otherwise. For example, over-reliance by bureaucrats on the results of testing in schools is shaping educational institutions toward a tragically homogeneous mediocrity.</p>
<p>The promise and the gamble of big data is this: that we can advance past the primitive quotas of today’s small data into both a sophisticated statistical understanding of an entire system and insight that focuses down to the level of an individual. Data gives us both telescope and microscope, in detail we’ve never had before.</p>
<p>Inside this tantalizing vision lie many of the debates in today’s data world: the need for highly skilled data scientists to effect this change, and the worry that we’ll inadvertently enslave ourselves to Big Brother, even with the best of intentions.</p>
<p>So, as the data revolution moves forward, it’s important to take the long view. The ferment of tools and job titles and algorithms is significant, but ultimately it’s background to our larger purposes as people, businesses and government. That’s one reason why, at O’Reilly, we’ve taken the motto “Making Data Work” for Strata. Data, not technology, is the heartbeat of our world because it relates directly to ourselves and the problems we want to solve.</p>
<p>This is also the reason that the <a href="http://strataconf.com/stratany2013/">Strata and Hadoop World</a> conferences take a broad view of the subject: ranging from the business topics to the tools and data science. If you talk to Hadoop’s most seasoned advocates, they don’t speak only about the tech; they talk about the problems they’re able to solve. The tools alone are never enough; the real enabler is the framework of people and understanding in which they’re used.</p>
<p>Our mission is to help people make sense of the state of the data world and use this knowledge to become both more competitive and more creative. We believe that’s best served by creating context in which we think about our use of data as well as serving the growing specialist communities in data.</p>
<p>Enjoy the noise and the energy from the growing data ecosystem, but keep your eyes on the problems you want to solve.</p>
<p><em>The Strata and Hadoop World <a href="http://strataconf.com/stratany2013/public/cfp/264">Call for Proposals</a> is open until midnight EDT, Thursday May 16.</em></p>
<h5><em>This post was originally published on <a href="http://radar.oreilly.com/2013/05/big-data-cool-kids.html">Radar</a>.</em></h5>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/05/big-data-cool-kids.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Steering the ship that is data science</title>
		<link>http://strata.oreilly.com/2013/05/steering-the-ship-that-is-data-science.html</link>
		<comments>http://strata.oreilly.com/2013/05/steering-the-ship-that-is-data-science.html#comments</comments>
		<pubDate>Tue, 14 May 2013 13:00:39 +0000</pubDate>
		<dc:creator>Q Ethan McCallum</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data science]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=57252</guid>
		<description><![CDATA[Mike Loukides recently recapped a conversation we’d had about leading indicators for data science efforts in an organization. We also pondered where the role of data scientist is headed and realized we could treat software development as a prototype case. It’s easy (if &#8230; ]]></description>
				<content:encoded><![CDATA[<p><a href="https://www.twitter.com/mikeloukides">Mike Loukides</a> recently recapped a conversation we’d had about <a href="http://strata.oreilly.com/2013/05/leading-indicators.html">leading indicators for data science efforts</a> in an organization. We also pondered where the role of data scientist is headed and realized we could treat software development as a prototype case.</p>
<p>It’s easy (if not eerie) to draw parallels between the Internet boom of the mid-1990s and the Big Data boom of the present day: in addition to the exuberance in the press and the new business models, a particular breed of technical skill became a competitive advantage and a household name. Back then, this was the software developer. Today, it’s the data scientist.</p>
<p>The time in the sun improved software development in some ways, but it also brought its share of problems. Some companies were short on the skill and discipline required to manage custom software projects, and they were equally ill-equipped to discern the true technical talent from the pretenders. That combination led to low-quality software projects that simply failed to deliver business value. (A number of these survive today as “repair-ware” that requires constant, expensive upkeep.)</p>
<p><span id="more-57252"></span></p>
<p>How will the data science field avoid software development’s pitfalls? (As an aside, we shudder to think what would be the data science equivalent of “repair-ware.”) We started to explore some ideas but realized they were all rooted in education, business value, and openness:</p>
<p>Company leaders must educate themselves in order to understand how data analysis can improve their firm. That knowledge will guide them in building out a data science team and establishing its mission. Leaders otherwise risk trivializing the data scientist role or overindulging in analytics for the sake of analytics.</p>
<p>In turn, data scientists must understand how their work is meant to improve the business. That will serve as their compass when they explore new ideas so they can aim to deliver solid value. Without that guidance, it’s too easy to get stuck in rabbit-holes and yak-shaving.</p>
<p>Both parties must be vigilant about any needless barriers forming around the data science team, especially after the initial novelty fades. Open communication between the data science group and the rest of the business will ensure the former doesn’t land in a separate silo, marginalized out of the company mission.</p>
<p>These ideas may be a start, but are they enough? Probably not. What would you recommend to steer data science clear of pitfalls?</p>
<p>This post originally appeared on <a href="http://radar.oreilly.com/2013/05/steering-the-ship-that-is-data-science.html">http://radar.oreilly.com/</a>.</p>
<p><strong>Related:</strong></p>
<ul>
<li><a href="http://strata.oreilly.com/2013/05/leading-indicators.html">Leading indicators</a></li>
<li><a href="http://strata.oreilly.com/2013/04/every-leader-has-their-how-i-got-here-story.html">Every leader has their &#8220;how I got here&#8221; story</a></li>
<li><a href="http://shop.oreilly.com/product/0636920024422.do">Bad Data Handbook &#8211; Book</a></li>
</ul>
<div style="float: left;border-top: thin gray solid;border-bottom: thin gray solid;padding: 20px;margin: 20px 2px;clear: both"><a href="http://strataconf.com/?intcmp=il-strata-stny13-blog-promo"><img style="float: left;border: none;padding-right: 10px" alt="" src="http://cdn.oreilly.com/radar/images/promos/2013-strata-rx-london-ny.gif" /></a><a href="http://strataconf.com/?intcmp=il-strata-stny13-blog-promo"><strong>O&#8217;Reilly Strata Conference</strong></a> — Strata brings together the leading minds in data science and big data — decision makers and practitioners driving the future of their businesses and technologies. Get the skills, tools, and strategies you need to make data work. <a href="http://strataconf.com/rx2013?intcmp=il-strata-strx13-strata-blog-banner-148x178">Strata Rx Health Data Conference</a>: September 25-27 | Boston, MA<br />
<a href="http://strataconf.com/stratany2013?intcmp=il-strata-stny13-blog-promo">Strata + Hadoop World</a>: October 28-30 | New York, NY<br />
<a href="http://strataconf.com/strataeu2013/?intcmp=il-strata-steu13-blog-promo">Strata in London</a>: November 15-17 | London, England</div>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/05/steering-the-ship-that-is-data-science.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Evaluating machine learning systems: Kaggle&#8217;s not enough</title>
		<link>http://strata.oreilly.com/2013/05/different-evaluation-criteria-for-machine-learning-systems.html</link>
		<comments>http://strata.oreilly.com/2013/05/different-evaluation-criteria-for-machine-learning-systems.html#comments</comments>
		<pubDate>Mon, 13 May 2013 16:00:55 +0000</pubDate>
		<dc:creator>Beau Cronin</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[machine learning]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=57168</guid>
		<description><![CDATA[There is a tremendous amount of commercial attention on machine learning (ML) methods and applications. This includes product and content recommender systems, predictive models for churn and lead scoring, systems to assist in medical diagnosis, social network sentiment analysis, and &#8230; ]]></description>
				<content:encoded><![CDATA[<p>There is a tremendous amount of commercial attention on machine learning (ML) methods and applications. This includes product and content recommender systems, predictive models for churn and lead scoring, systems to assist in medical diagnosis, social network sentiment analysis, and on and on. ML often carries the burden of extracting value from big data.</p>
<p>But getting good results from machine learning still requires much art, persistence, and even luck. An engineer can&#8217;t yet treat ML as just another well-behaved part of the technology stack. There are many underlying reasons for this, but for the moment I want to focus on how we measure or evaluate ML systems.</p>
<p>Reflecting their academic roots, machine learning methods have traditionally been evaluated in terms of narrow quantitative metrics: precision, recall, RMS error, and so on. The data-science-as-competitive-sport site Kaggle has adopted these metrics for many of its competitions. They are objective and reassuringly concrete.</p>
<p><span id="more-57168"></span></p>
<p>But scaled, production systems have very different requirements than proof-of-concept academic implementations or prize-winning models. Adopting these metrics from the research world <em>incentivizes one-off, specialized, brittle solutions</em> rather than the <em>reliable, reusable, composable subsystems that form the foundation of good software engineering</em>.</p>
<p>So I&#8217;d like to propose some different evaluation criteria for ML systems, with the hope that we raise our collective expectations of what they should provide and, eventually, build them differently.</p>
<ul>
<li><strong>Encapsulation &amp; abstraction</strong>. An ML system should behave well as a component in a large software system. It should provide an elegant programming interface, use standard data formats, and hide as much complexity as possible from the developer.</li>
<li><strong>Safety &amp; conservatism</strong>. An ML system shouldn&#8217;t place the burden of avoiding overfitting on the user. It should be willing and able to communicate uncertainty about its results, including the possibility of &#8220;shrugging its shoulders&#8221; when the data is insufficient.</li>
<li><strong>Simple and transparent controls</strong>. An ML system should expose its configuration and parameters in a clear, transparent way. The user should not have to perform heuristic searches through the parameter space, and there should be no art or mystery involved. The system should require as little tuning as possible from the user, with sensible defaults that handle the common cases.</li>
<li><strong>Handling messy, real-world data.</strong> Real data has duplicates and missing values; is full of noise, errors, and surprises; and is composed of a mix of numerical, categorical, text, geospatial, and other data types. ML systems should handle datasets as they are found in the wild, rather than forcing the user to perform significant cleanup and heuristic &#8220;feature engineering&#8221;.</li>
</ul>
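<p>As a rough illustration of the &#8220;shrugging its shoulders&#8221; idea in the list above, here is a minimal Python sketch of a prediction function that declines to answer when the evidence is thin. It does not correspond to any particular library: the function name, the <code>min_votes</code> and <code>min_confidence</code> parameters, and their default values are all hypothetical, and how the similar training examples are found is left outside the sketch.</p>
<pre>
```python
from collections import Counter


def predict_or_abstain(neighbor_labels, min_votes=5, min_confidence=0.8):
    """Return (label, confidence), or (None, confidence) when unsure.

    neighbor_labels: labels of the training examples judged most similar
    to the query. Both thresholds are illustrative defaults only.
    """
    total = len(neighbor_labels)
    if total == 0:
        return None, 0.0
    label, count = Counter(neighbor_labels).most_common(1)[0]
    confidence = count / total
    # Abstain when there is too little evidence or the vote is too close.
    if min_votes > total or min_confidence > confidence:
        return None, confidence
    return label, confidence
```
</pre>
<p>A caller always receives the confidence alongside the answer and has to handle the <code>None</code> case explicitly, which keeps the uncertainty visible instead of buried inside the model.</p>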
<p>ML methods and systems that are evaluated and perform well along these lines will help tame the complexity in smart software systems. As a result, more developers will be able to use them successfully and the resulting systems will be more resilient. ML will eventually transition from its current role as the high-maintenance prima donna of the data stack to a workhorse component.</p>
<p>I would love to see more effort devoted to improvements on these fronts, even if that means less emphasis on capturing incremental accuracy improvements on specific problems. These criteria represent a significant change in focus, though, and they may disturb some ML experts because they embrace a more black-box approach in which the internals of the system are less accessible. I would argue that it is exactly this sort of encapsulation &#8211; wisely performed and tastefully engineered &#8211; that is needed for machine learning to have the wide impact that, say, relational databases have achieved in the past decades.</p>
<p><strong>Machine Learning Related Resources:</strong></p>
<ul>
<li>
<a href="http://oreillynet.com/pub/e/2532">Machine Learning for Hackers Webcast</a>
</li>
<li>
<a href="http://shop.oreilly.com/product/0636920018483.do">Machine Learning for Hackers &#8211; Case Studies</a>
</li>
<li>
<a href="http://shop.oreilly.com/product/0636920017493.do">An Introduction to Machine Learning with Web Data Video</a>
</li>
</ul>
<div style="float: left;border-top: thin gray solid;border-bottom: thin gray solid;padding: 20px;margin: 20px 2px;clear: both">
<p><a href="http://strataconf.com/?intcmp=il-strata-stny13-blog-promo"><img style="float: left;border: none;padding-right: 10px" alt="" src="http://cdn.oreilly.com/radar/images/promos/2013-strata-rx-london-ny.gif" /></a><a href="http://strataconf.com/?intcmp=il-strata-stny13-blog-promo"><strong>O&#8217;Reilly Strata Conference</strong></a> — Strata brings together the leading minds in data science and big data — decision makers and practitioners driving the future of their businesses and technologies. Get the skills, tools, and strategies you need to make data work.</p>
<p><a href="http://strataconf.com/rx2013?intcmp=il-strata-strx13-strata-blog-banner-148x178">Strata Rx Health Data Conference</a>: September 25-27 | Boston, MA<br />
<a href="http://strataconf.com/stratany2013?intcmp=il-strata-stny13-blog-promo">Strata + Hadoop World</a>: October 28-30 | New York, NY<br />
<a href="http://strataconf.com/strataeu2013/?intcmp=il-strata-steu13-blog-promo">Strata in London</a>: November 15-17 | London, England</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/05/different-evaluation-criteria-for-machine-learning-systems.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>11 Essential Features that Visual Analysis Tools Should Have</title>
		<link>http://strata.oreilly.com/2013/05/11-essential-features-that-visual-analysis-tools-should-have.html</link>
		<comments>http://strata.oreilly.com/2013/05/11-essential-features-that-visual-analysis-tools-should-have.html#comments</comments>
		<pubDate>Sun, 12 May 2013 16:00:16 +0000</pubDate>
		<dc:creator>Ben Lorica</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[data visualization]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=57269</guid>
		<description><![CDATA[After recently playing with SAS Visual Analytics, I&#8217;ve been thinking about tools for visual analysis. By visual analysis I mean the type of analysis most recently popularized by Tableau, QlikView, and Spotfire: you encounter a data set for the first &#8230; ]]></description>
				<content:encoded><![CDATA[<p>After recently <a href="https://twitter.com/bigdata/status/329328698140553217">playing with SAS Visual Analytics</a>, I&#8217;ve been thinking about tools for visual analysis. By <i>visual analysis</i> I mean the type of analysis most recently popularized by <a href="http://www.tableausoftware.com/">Tableau</a>, <a href="http://www.qlikview.com/">QlikView</a>, and <a href="http://spotfire.tibco.com/">Spotfire</a>: you encounter a data set for the first time, conduct <a href="http://en.wikipedia.org/wiki/Exploratory_data_analysis">exploratory data analysis</a>, with the goal of discovering interesting patterns and associations. Having used a few visualization tools myself, here&#8217;s a quick <span style="text-decoration: underline">wish-list</span> of features (culled from tools I&#8217;ve used or have seen in action).</p>
<p><b>Requires little (to no) coding</b><br />
The viz tools I currently use require programming skills. Coding means switching back and forth between a visual (chart) and text (code). It&#8217;s nice<sup>1</sup> to be able to customize charts via code, but when you&#8217;re in the exploratory phase, not having to think about code syntax is ideal. Plus, GUI-based tools allow you to collaborate with many more users.</p>
<p><span id="more-57269"></span></p>
<p><b>Includes an expanded set of basic charts</b><br />
Aside from statistical graphics (line, bar, scatter, histogram, bubble, boxplot,&#8230;), these days the ability to visualize hierarchical (<a href="http://en.wikipedia.org/wiki/Treemapping">treemap</a>), financial (stock charts), longitudinal, geospatial (maps), and <a href="http://en.wikipedia.org/wiki/Graph_drawing">network</a> data is essential.</p>
<p><b>Charts are easy to customize</b><br />
It should be easy to tweak labels, colors, and other elements. There are times when default labels need to be resized or repositioned, to make them legible. You should also be able to adjust coloring schemes to your liking (colors are usually assigned based on category or, in the case of heat maps, value).</p>
<p><b>Templates can be created</b><br />
Once you create a chart with your preferred color and labeling scheme, you should be able to templatize it for future projects. [Ideally templates support rule-based formatting ("if negative, color = red"), but this starts to involve some coding.]</p>
<p><b>Visual <i>summaries</i> are easy to generate</b> (histograms, <a href="http://en.wikipedia.org/wiki/Distance_correlation">association matrix</a>)<br />
You&#8217;ll be exploring data sets that contain many observations (rows) and variables (columns). SAS Visual Analytics produces a quick <i>summary</i> (average, min/max, histogram) for <i>each</i> variable and displays the results in a compact, scrollable format. This is done entirely through a GUI and doesn&#8217;t require any coding.</p>
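<p>The kind of per-variable summary described above can be approximated in a few lines of Python. This is only a sketch under stated assumptions, not a description of how SAS Visual Analytics works internally: the function name <code>summarize</code>, the list-of-dicts input, and the equal-width binning are all choices made for illustration.</p>
<pre>
```python
from collections import Counter
from statistics import mean


def summarize(rows, bins=5):
    """Build a per-column summary from a list of dict rows.

    Numeric columns get mean, min, max, and equal-width histogram counts;
    all other columns get their number of distinct values.
    """
    summary = {}
    columns = rows[0].keys() if rows else []
    for col in columns:
        # Skip missing values rather than failing on them.
        values = [r.get(col) for r in rows if r.get(col) is not None]
        if values and all(isinstance(v, (int, float)) for v in values):
            lo, hi = min(values), max(values)
            width = (hi - lo) / bins or 1  # fall back to 1 if all values are equal
            # Clamp the maximum value into the last bin.
            counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
            summary[col] = {
                "mean": mean(values),
                "min": lo,
                "max": hi,
                "histogram": [counts.get(i, 0) for i in range(bins)],
            }
        else:
            summary[col] = {"distinct": len(set(values))}
    return summary
```
</pre>
<p>Calling <code>summarize(rows)</code> on a table of unit sales by region would return one small dictionary per column, which a GUI could then render as the compact, scrollable list of summaries.</p>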
<p><b>Drill-down to source points: identify, isolate, and fix minor data errors</b><br />
Visual summaries<sup>2</sup> alert you to potential problems with your data (outliers or errors). A few tools give you the ability to isolate outliers or fix simple data problems through a GUI. More generally, it&#8217;s nice to be able to drill down from the chart to <i>examine</i> (via dynamic rollover or other method) the underlying data.</p>
<p><b>In-place filtering</b><br />
While exploring data, you need to be able to quickly filter by value or category &#8211; using checkboxes, drop-downs, sliders, &#8230;</p>
<p><b>Support for <i>visual pivoting</i></b><br />
Many business analysts are heavy users of <a href="http://en.wikipedia.org/wiki/Pivot_table">pivot tables</a> &#8211; a tabular summarization technique found in spreadsheets and reporting tools. Visual pivoting replaces tabular presentation with charts. My first experience using this type of visual exploration was through the <a href="http://stat.bell-labs.com/project/trellis/display.examples.html">Trellis graphs</a> introduced in S/S-Plus. Thanks to <a href="http://www.tableausoftware.com/">Tableau&#8217;s easy-to-use interface</a>, this form of visual analysis has become a popular way to explore data.</p>
<p><b>Support for analytics</b><br />
Many visualization tools lack analytic capabilities. From simple (<a href="http://en.wikipedia.org/wiki/Error_bar">error bar</a>, <a href="http://en.wikipedia.org/wiki/Quantile">quantiles</a>) to advanced (clustering, forecasting, multidimensional scaling<sup>3</sup>), analytic tools expand what users can do. Case in point: SAS Visual Analytics has tools for conducting <i>sensitivity analysis and forecasting</i> (GUI-based, no coding required). An example is to take a given <i>time-series</i> (unit sales), plot a forecast of its behavior for the next six time periods, and study how the forecast varies when other <i>key variables</i> (customer satisfaction) change.</p>
<p><b>Tools for sharing, collaboration, and replication</b><br />
Several tools let you publish<sup>4</sup> your static or interactive charts, and some tools even let you <i>subscribe</i><sup>5</sup> to the work of other users. For sharing, collaboration, and documentation, it should be possible to annotate your work. Being able to collaborate with others would be nice; at a minimum, one should be able to copy (<span style="text-decoration: underline">and</span> modify) the work of another user.</p>
<p><b>Big Data: Volume and <i>Variety</i></b><sup>6</sup><br />
A tool should produce charts <i>quickly</i> even when it&#8217;s hitting massive data sets. Simply put, it should be truly interactive<sup>7</sup>. Several new tools target larger data sets; some are geared specifically for Hadoop users (a partial list includes <a href="http://www.datameer.com">Datameer</a>, <a href="http://www.platfora.com">Platfora</a>, <a href="http://www.sisense.com">SiSense</a>, and <a href="http://www.sas.com/software/visual-analytics/overview.html">SAS Visual Analytics</a>). But there will be occasions when you&#8217;ll be working with small data sets (or be offline). To that end, you should be able to visually explore small data (locally using your laptop) without having to connect to a more powerful environment (such as a cluster or a beefy server).</p>
<p>I haven&#8217;t come across great viz tools for exploring unstructured data, so I&#8217;ll interpret <i>variety</i> in a different way. <i>Co-existence</i> (usually of Hadoop &amp; data warehouses) means data will continue to reside in different systems. Being able to connect to a variety of data sources is essential. (Among startups, Datameer does a <a href="http://www.datameer.com/product/data-integration.html">good job</a> of this.) Some tools include public data sets (e.g., US Census) and use them to generate examples.</p>
<hr />
<p><small><br />
(0) Thanks to <a href="http://www.ghostweather.com/bio.html">Lynn Cherny</a> for reviewing an early draft of this post and for suggesting a few features.<br />
(1) Unless of course you have killer programming tools, a la <a href="http://www.youtube.com/watch?v=PUv66718DII">Bret Victor</a>. You can do some of the things described in the post using <a href="http://www.revolutionanalytics.com/products/enterprise-big-data.php">ScaleR from Revolution Analytics</a> &#8211; but it&#8217;s a tool that requires coding in R.<br />
(2) A good example: SAS Visual Analytics displays the number of distinct values of categorical variables. If the number of distinct values is unusually large, you likely have a data quality issue.<br />
(3) Or other tools for handling high-dimensional data sets. Still waiting for a next-gen <a href="http://en.wikipedia.org/wiki/GGobi">ggobi</a>!<br />
(4) Datameer takes this a step further: it has <a href="http://www.datameer.com/apps">an app market</a>.<br />
(5) Some tools even send you realtime alerts when data for charts you&#8217;ve subscribed to have changed.<br />
(6) I omitted <i>Velocity</i> &#8211; the ability to handle streaming data. I consider that a nice-to-have, but not a must-have, feature for a visual <i>exploration</i> tool. Having said that, I do think the ability to handle realtime updates is essential when you share your work with others. See (5).<br />
(7) When working with truly massive data sets it&#8217;s natural to have some latency. Rather than having users idle while waiting, visual analysis tools should support multiple tabs or workspaces. Most database query tools have this feature: you can work on other queries while a query is still running.<br />
</small></p>
<div style="float: left;border-top: thin gray solid;border-bottom: thin gray solid;padding: 20px;margin: 20px 2px;clear: both"><p><a href="http://strataconf.com/?intcmp=il-strata-stny13-blog-promo"><img style="float: left;border: none;padding-right: 10px" alt="" src="http://cdn.oreilly.com/radar/images/promos/2013-strata-rx-london-ny.gif" /></a><a href="http://strataconf.com/?intcmp=il-strata-stny13-blog-promo"><strong>O&#8217;Reilly Strata Conference</strong></a> &mdash; Strata brings together the leading minds in data science and big data &mdash; decision makers and practitioners driving the future of their businesses and technologies. Get the skills, tools, and strategies you need to make data work.</p>
<p><a href="http://strataconf.com/rx2013?intcmp=il-strata-strx13-strata-blog-banner-148x178">Strata Rx Health Data Conference</a>: September 25-27 | Boston, MA<br /> <a href="http://strataconf.com/stratany2013?intcmp=il-strata-stny13-blog-promo">Strata + Hadoop World</a>: October 28-30 | New York, NY<br /><a href="http://strataconf.com/strataeu2013/?intcmp=il-strata-steu13-blog-promo">Strata in London</a>: November 15-17 | London, England</p></div>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/05/11-essential-features-that-visual-analysis-tools-should-have.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>