<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Strata &#187; Alex Howard</title>
	<atom:link href="http://strata.oreilly.com/alexh/feed" rel="self" type="application/rss+xml" />
	<link>http://strata.oreilly.com</link>
	<description>Making Data Work</description>
	<lastBuildDate>Fri, 24 May 2013 20:31:52 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Finding and telling data-driven stories in billions of tweets</title>
		<link>http://strata.oreilly.com/2013/04/finding-and-telling-data-driven-stories-in-billions-of-tweets.html</link>
		<comments>http://strata.oreilly.com/2013/04/finding-and-telling-data-driven-stories-in-billions-of-tweets.html#comments</comments>
		<pubDate>Thu, 18 Apr 2013 21:00:20 +0000</pubDate>
		<dc:creator>Alex Howard</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data editor]]></category>
		<category><![CDATA[data journalism]]></category>
		<category><![CDATA[guardian]]></category>
		<category><![CDATA[Twitter]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=56521</guid>
		<description><![CDATA[Twitter has hired its first data editor. Simon Rogers, one of the leading practitioners of data journalism in the world, will join Twitter in May. He will be moving his family from London to San Francisco and applying his skills to &#8230; ]]></description>
				<content:encoded><![CDATA[<div id="attachment_56534" class="wp-caption alignright" style="width: 85px"><a href="http://s.radar.oreilly.com/wp-files/5/2013/04/1212-simon-rogers.jpg"><img class="size-full wp-image-56534" alt="GD*15341872" src="http://s.radar.oreilly.com/wp-files/5/2013/04/1212-simon-rogers.jpg" width="75" height="100" /></a><p class="wp-caption-text">Simon Rogers</p></div>
<p>Twitter has hired its first <a>data editor</a>. <a href="https://twitter.com/smfrogers">Simon Rogers</a>, one of the leading practitioners of data journalism in the world, will <a href="http://www.pressgazette.co.uk/content/guardian-news-editor-simon-rogers-joins-twitter-us-first-data-editor">join Twitter</a> in May. He will be moving his family from London to San Francisco and applying his skills to telling data-driven stories using tweets. <a href="http://www.guardian.co.uk/profile/jamesball">James Ball</a> will replace him as the Guardian’s new data editor.</p>
<p>As a data editor, will Rogers keep editing and producing something that we’ll recognize as journalism? Will his work at Twitter be different from what <a href="http://www.google.com/think/">Google Think</a> or <a href="http://www.facebookstories.com/">Facebook Stories</a> deliver? Will it differ in how he tells stories with data? Or is the difference that Twitter has a lot more revenue coming in or sees data-driven storytelling as core to driving more business? (Rogers wouldn’t comment on those counts.)</p>
<p><span id="more-56521"></span></p>
<p>The gig clearly has potential and Rogers clearly has demonstrable capacity. As he related to me today, in an interview, “what I’m good at is explaining data, simplifying it and making it accessible.”</p>
<p>That&#8217;s a critical set of skills in business, government, or media today. Data-driven journalists have to understand data sources, quality, context, and underlying biases. That&#8217;s equally true of Twitter.  Pew Research reminded us in 2013 that Twitter is not representative of everyone and is <a href="http://www.pewresearch.org/2013/03/04/twitter-reaction-to-events-often-at-odds-with-overall-public-opinion/">often at odds with public opinion</a>.</p>
<p>Tweets aren’t always a reliable source for understanding everything that happens in the world, but it’s undeniable that useful insights can be found there. Twitter has become a core component of the set of digital tools and platforms that journalists apply in their work, alongside smartphones, pens, water bottles and notebooks. News frequently breaks on Twitter first and is shared by millions of users independently of any media organization. Journalists now use Twitter to ply a trade that&#8217;s well over a century old: gather and fact-check reports, add context, and find the truth of what’s happening. (Picking up the phone and going on location still matter, naturally.) The amount of <a href="http://blog.sfgate.com/techchron/2013/04/17/should-twitter-care-about-all-the-misinformation-in-its-product/">misinformation</a> on Twitter during major news events puts a high premium on the media debunking rumors and sharing accurate facts.</p>
<p>Will the primary difference in Rogers’ ability to find truth and meaning in the tweets be access to Twitter’s full firehose, developers, and processing power? His work will have to be judged on its own merits. Until he starts his new gig in May, the following interview offers more insight into why he joined Twitter and how he’s thinking about what he’ll be doing there.</p>
<h2>Why leave the paper now?</h2>
<p><strong>Simon Rogers:</strong> I love the Guardian and have always wanted to work here. I grew up in a house where we read two papers: The Guardian during the week and the Observer on Sundays. I&#8217;ve had offers but this is the first job where it&#8217;s become a serious possibility.</p>
<p>There are a few reasons.</p>
<p>Firstly, Twitter is an amazing phenomenon. It&#8217;s changed every level of how we work as reporters. We really saw that during the &#8220;<a href="http://www.guardian.co.uk/uk/series/reading-the-riots">Reading the Riots</a>&#8221; project. There we had 1.6 million riot-related tweets which Twitter gave us to analyze.</p>
<div id="attachment_56522" class="wp-caption aligncenter" style="width: 610px"><a href="http://s.radar.oreilly.com/wp-files/5/2013/04/london-riots.jpg"><img class="size-full wp-image-56522 " alt="london-riots" src="http://s.radar.oreilly.com/wp-files/5/2013/04/london-riots.jpg" width="600" height="369" /></a><p class="wp-caption-text">London Riots</p></div>
<p>That was important because politicians were agitating about the &#8216;role&#8217; of Twitter during the disturbances. The work that our team did with academics at Manchester and the subsequent <a href="http://www.guardian.co.uk/uk/interactive/2011/dec/07/london-riots-twitter">interactive</a> produced by Alastair Dant and the interactive team here opened my eyes to the facts that:</p>
<ul>
<li>Twitter and the way it&#8217;s used tells us a lot about every aspect of life</li>
<li>The data behind those tweets can really shine a light on the big stories of the moment</li>
<li>If you can combine that data with brilliant developers you have a really powerful tool</li>
</ul>
<p>Secondly, Twitter is an amazing place from what I&#8217;ve seen so far. There&#8217;s a real energy about the place and some brilliant people doing fascinating things. I love the idea of being part of that team.</p>
<p>Thirdly, I&#8217;ve been at the Guardian nearly 15 years. I am so comfortable and confident in what I do there that I need a new challenge. This all just came together at the right time.</p>
<h2>As a data-driven journalist, you&#8217;ve had to understand data sources, quality, context, and underlying biases. How does that apply to Twitter?</h2>
<p><strong>Simon Rogers: </strong>Absolutely. Mark Twain said &#8220;a lie can be halfway around the world before the truth has got its boots on.&#8221; All social media encourages that.</p>
<p>I think the work we did with the riot tweets shows how the truth can catch up fast. What interested me about <a href="http://www.bostonglobe.com/metro/specials/boston-marathon-explosions">Boston</a> was the way that people were tweeting calmness, if you like.</p>
<p>I think we&#8217;ve seen this with the <a href="http://www.guardian.co.uk/news/datablog">Datablog</a> in general: people used to worry that the masses weren&#8217;t clever enough to understand the data that we were publishing. In fact, the community rights itself, correcting errors that other readers, or even we ourselves, had made. That&#8217;s really interesting to me.</p>
<h2>What will you be able to do at Twitter with data that you couldn&#8217;t do at the Guardian data desk?</h2>
<p><strong>Simon Rogers: </strong>Just to be there, in the midst of that data will be amazing. I think it will make me better at what I do. And I hope I have something to offer them too.</p>
<h2>Will you be using the same tools as you&#8217;ve been applying at the Guardian?</h2>
<p><strong>Simon Rogers: </strong>I&#8217;m looking forward to learning some new ones. I&#8217;m comfortable with what I know. It&#8217;s about time I became uncomfortable.</p>
<h2>Twitter has some of the world&#8217;s best data scientists. What makes being a data editor different from being a data scientist?</h2>
<p><strong>Simon Rogers: </strong>I&#8217;m not the world&#8217;s best statistician. I&#8217;m not even very good at maths. I guess what I&#8217;ve been doing at The Guardian is acting as a human bridge between data that&#8217;s tricky to understand and a wider audience that wants to understand it. Isn&#8217;t that what all data journalism is?</p>
<p>My take on being a data editor at the Guardian was that I used it as a way to make data more accessible &#8211; crucially the understanding of it. I need to understand it, to make it clear to others and I want to explain that data in ways that I can understand. Is that the difference between data editors and data scientists? I don&#8217;t know &#8211; I think a lot of these definitions are artificial anyway.</p>
<p>It&#8217;s like people getting data journalism and data visualization mixed up. I think they are probably different things and involve different processes, but in the end, does it matter anyway?</p>
<p><em>This interview was edited and condensed.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/04/finding-and-telling-data-driven-stories-in-billions-of-tweets.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Untangling algorithmic illusions from reality in big data</title>
		<link>http://strata.oreilly.com/2013/03/untangling-algorithmic-illusions-from-reality-in-big-data.html</link>
		<comments>http://strata.oreilly.com/2013/03/untangling-algorithmic-illusions-from-reality-in-big-data.html#comments</comments>
		<pubDate>Wed, 06 Mar 2013 15:30:53 +0000</pubDate>
		<dc:creator>Alex Howard</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data collection]]></category>
		<category><![CDATA[data ethics]]></category>
		<category><![CDATA[research]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=55532</guid>
		<description><![CDATA[Microsoft principal researcher Kate Crawford (@katecrawford) gave a strong talk at last week&#8217;s Strata Conference in Santa Clara, Calif. about the limits of big data. She pointed out potential biases in data collection, questioned who may be excluded from it, &#8230; ]]></description>
				<content:encoded><![CDATA[<p>Microsoft principal researcher Kate Crawford (<a href="http://twitter.com/katecrawford">@katecrawford</a>) gave a strong talk at last week&#8217;s Strata Conference in Santa Clara, Calif. about the limits of big data. She pointed out potential biases in data collection, questioned who may be excluded from it, and hammered home the constant need for context in conclusions. Video of her talk is embedded below:</p>
<p><iframe width="640" height="360" src="http://www.youtube.com/embed/irP5RCdpilc?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p>Crawford explored many of these same topics in our interview, which follows.</p>
<p><span id="more-55532"></span></p>
<h2>What research are you working on now, following up on your <a href="http://www.scribd.com/doc/98034770/Six-Provocations-for-Big-Data-Danah-Boyd-Kate-Crawford">paper on big data</a>?</h2>
<p><strong>Kate Crawford:</strong> I&#8217;m currently researching how big data practices are affecting different industries, from news to crisis recovery to urban design. This talk was based on that upcoming work, touching on questions of smartphones as sensors, on dealing with disasters (like Hurricane Sandy), and new epistemologies &mdash; or ways we understand knowledge &mdash; in an era of big data. </p>
<p>When &#8220;<a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1926431">Six Provocations for Big Data</a>&#8221; came out in 2011, we were critiquing the very early stages of big data and social media. In the two years since, the issues we raised have become even more prominent.</p>
<p>I&#8217;m now looking beyond social media to a range of other areas where big data is raising questions of social justice and privacy. I&#8217;m also editing a special issue on critiques of big data, which will be coming out later this year in the International Journal of Communication.</p>
<h2>As more nonprofits and governments look to data analysis in governing or services, what do they need to think about and avoid?</h2>
<p><strong>Kate Crawford:</strong> Governments have a responsibility to serve all citizens, so it&#8217;s important that big data doesn&#8217;t become a proxy for &#8220;data about everyone.&#8221; There are two problems here: first is the question of who is visible and who isn&#8217;t represented; the second is privacy, or what I call &#8220;privacy practices&#8221; &mdash; because privacy means different things depending on where and who you are. </p>
<p>For example, the <a href="http://streetbump.org/">Streetbump</a> app is brilliant. What city wouldn&#8217;t want to passively draw on data from all those smartphones out there, a constantly moving network of sensors? But, as we know, there are significant percentages of Americans who don&#8217;t have smartphones, particularly older citizens and those with lower disposable incomes. What happens to their neighborhoods if they generate no data? They fall off the map. To be invisible when governments make resource decisions is dangerous. </p>
<p>Then, of course, there&#8217;s the whole issue of people signing up to be passively tracked wherever they go. People may happily opt into it, but we&#8217;d want to be very careful about who gets that data, and how it is protected over the long term &mdash; not just five years, but 50 years and beyond. Governments might be tempted to use that data for other purposes, even civic ones, and this has significant implications for privacy and the expectations citizens have for the use of their data.</p>
<h2>Where else could such biases apply?</h2>
<p><strong>Kate Crawford:</strong> There are many areas where big data bias is a problem from a social equity perspective. One of the key ones at the moment is law enforcement. I&#8217;m concerned by some of the work that seeks to &#8220;profile&#8221; areas, and even people, as likely to be involved in crime. It&#8217;s called &#8220;predictive policing&#8221; (<a href="http://ctolabs.com/wpcontent/uploads/2012/06/120627HadoopForLawEnforcement.pdf">more here</a>). We&#8217;ve already seen some problematic outcomes when profiling was introduced for plane travel. Now, imagine what happens if you or your neighborhood falls on the wrong side of a predictive model. How do you even begin to correct the record? Which algorithm do you appeal to?</p>
<h2>What are the things, as <a href="http://www.nytimes.com/2013/02/19/opinion/brooks-what-data-cant-do.html">David Brooks listed</a> recently, that big data can&#8217;t do?</h2>
<p><strong>Kate Crawford:</strong> There are lots of things that big data can&#8217;t do. It&#8217;s useful to consider the history of knowledge, and then imagine what it would look like if we only used one set of tools, one methodology for getting answers. </p>
<p>This is why I find people like <a href="http://en.wikipedia.org/wiki/Gabriel_Tarde">Gabriel Tarde</a> so interesting &mdash; he was grappling with ideas of method, big data and small data, back in the late 1800s. </p>
<p>He reminds us of what we can lose sight of when we go up orders of magnitude and try to leave small-scale data behind &mdash; like interviewing people, or observing communities, or running limited experiments. Context is key, and it is much easier to be attentive to context when we are surrounded by it. When context is dissolved into so many aggregated datasets, we can start getting mistaken impressions. </p>
<p>When <a href="http://www.nature.com/news/when-google-got-flu-wrong-1.12413">Google Flu Trends mistakenly predicted</a> that 11% of the US had flu this year, it showed how relying on a big data signal alone may give us an exaggerated or distorted result (in that case, more than double the actual figure, which was between 4.5% and 4.8%). Now, imagine how much worse it would be if that data was all that health agencies had to work with.</p>
<p>I&#8217;m really interested in how we might best combine computational social science with traditional qualitative and ethnographic methods. With a range of tools and perspectives, we&#8217;re much more likely to get a three-dimensional view of a problem and be less prone to serious error. This goes beyond tacking a few focus groups onto big datasets; it means conjoining deep, ethnographically informed research with rich data sources.</p>
<h2>What can the history of statistics in social science tell us about correlation vs causation? Does big data change that dynamic?</h2>
<p><strong>Kate Crawford:</strong> This is a gigantic question, and one that could be its own talk! With big datasets, it&#8217;s very tempting for researchers to engage in apophenia &mdash; seeing patterns where none actually exist &mdash; because massive quantities of data can point to a range of correlative possibilities.</p>
<p>For example, <a href="http://www.haas.berkeley.edu/faculty/leinweber">David Leinweber</a> showed back in 2007 that data mining techniques could show a strong but spurious correlation between the changes in the S&amp;P 500 stock index and butter production in Bangladesh. There&#8217;s another <a href="http://www.businessweek.com/magazine/correlation-or-causation-12012011-gfx.html">great correlation</a> between the use of Facebook and the rise of the Greek debt crisis.</p>
<p>With big data techniques, some people argue you can get much closer to being able to predict causal relations. But even here, big data tends to need several steps of preparation (data &#8220;cleaning&#8221; and pre-processing) and several steps in interpretation (deciding which of many analyses shows a positive result versus a null-result). </p>
<p>Basically, humans are still in the mix, and thus it&#8217;s very hard to escape false positives, strained correlations and cognitive bias.</p>
<p><em>This interview originally appeared on <a href="http://radar.oreilly.com/2013/03/untangling-algorithmic-illusions-from-reality-in-big-data.html">O&#8217;Reilly Radar</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/03/untangling-algorithmic-illusions-from-reality-in-big-data.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>On the power and perils of &#8220;preemptive government&#8221;</title>
		<link>http://strata.oreilly.com/2013/02/preemptive-government-predictive-data.html</link>
		<comments>http://strata.oreilly.com/2013/02/preemptive-government-predictive-data.html#comments</comments>
		<pubDate>Thu, 28 Feb 2013 16:00:52 +0000</pubDate>
		<dc:creator>Alex Howard</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[government]]></category>
		<category><![CDATA[predictive analysis]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=55367</guid>
		<description><![CDATA[The last time I spoke with Stephen Goldsmith, he was the Deputy Mayor of New York City, advocating for increased use of &#8220;citizensourcing,&#8221; where government uses technology tools to tap into the distributed intelligence of residents to understand – and &#8230; ]]></description>
				<content:encoded><![CDATA[<p>The last time I spoke with Stephen Goldsmith, he was the Deputy Mayor of New York City, advocating for increased use of &#8220;<a href="http://radar.oreilly.com/2011/03/nyc-smart-government.html">citizensourcing</a>,&#8221; where government uses technology tools to tap into the distributed intelligence of residents to understand – and fix – issues around its streets, on its services and even within institutions. In the years since, as a professor at the Ash Center for Democratic Governance and Innovation at the John F. Kennedy School of Government at Harvard University, the former mayor of Indianapolis has advanced the notion of &#8220;<a href="http://www.governing.com/blogs/bfc/preemptive-government-cross-agency-data-prevent-problems.html">preemptive government</a>.&#8221;</p>
<p><span id="more-55367"></span></p>
<p>That focus caught my attention, given that my colleague, <a href="http://strata.oreilly.com/alistairc">Alistair Croll</a>, had published several posts on Radar looking at the <a href="http://strata.oreilly.com/2012/08/follow-up-on-big-data-and-civil-rights.html">ethics around big data</a>. The increasing use of data mining and algorithms by government to make decisions based upon pattern recognition and assumptions regarding future actions is a trend worth watching. Will guaranteeing government data quality become mission-critical once high-profile mistakes are made? Any assessment of the <a href="http://bits.blogs.nytimes.com/2013/02/25/the-promise-and-peril-of-the-data-driven-society">perils and promise of a data-driven society</a> will have to include a reasoned examination of the growing power of these applied algorithms over the markets and the lives of our fellow citizens.</p>
<p><a href="http://s.radar.oreilly.com/wp-files/5/2012/10/DCHomicideMap.png"><img class="size-full wp-image-52688 alignnone" alt="DCHomicideMap" src="http://s.radar.oreilly.com/wp-files/5/2012/10/DCHomicideMap.png" width="600" height="318" /></a></p>
<p>Given some of those concerns, I called Goldsmith up this winter to learn more about what he meant.</p>
<p>Our interview, lightly edited for content and clarity, follows.</p>
<h2>When you say &#8220;preemptive government,&#8221; what do you mean?</h2>
<p><strong>Stephen Goldsmith:</strong> I’m thinking about the intersection of trends here. One is what we might talk about as big data and data analytics. Inside, government has massive amounts of information located in all sorts of different places. If we looked at it analytically, we could figure out which restaurants are most likely to have problems, which contractors are most likely to build bad buildings and the like.</p>
<p>For the first time, through the combination of digital processes, mobile tools and big data analytics, government can make preemptive solutions. Government generally responds to problems and <em>then</em> measures its performance by the number of activities it conducts, as contrasted to the problems it solves. New York City and Chicago have begun to take the lead in this area in specific places. When I was in New York City, we were trying to figure out how to set up a data analytics center. <a href="http://strata.oreilly.com/2012/11/3-big-ideas-for-big-data-in-the-public-sector.html">New York has started</a> to do some of that, so that we can predict where the next event’s going to occur and then solve it. That eventually needs to be merged with community sentiment mining, but it’s a slightly different issue.</p>
<h2>What substantive examples exist of this kind of approach making a difference?</h2>
<p><strong>Stephen Goldsmith:</strong> We are now operating a <a href="http://www.ash.harvard.edu/Home/Programs/Innovations-in-Government/Mayoral-Performance-Analytics-Initiative">mayoral performance analytics initiative</a> at the Kennedy School, trying to create energy around the solutions. We are featuring people who are doing it well, bringing folks together.</p>
<p>New York City, through a fellow named <a href="http://strata.oreilly.com/2012/11/3-big-ideas-for-big-data-in-the-public-sector.html">Mike Flowers</a>, has begun to solve specific problems in building violations and restaurant inspections. He&#8217;s overcoming the obstacles where all of the legacy CIOs say they can’t share data. He’s showing that you can.</p>
<p><a href="http://radar.oreilly.com/2011/08/chicago-data-apps-open-government.html">Chicago</a> is just doing remarkable stuff in a range of areas, from land use planning to crime control, like deciding how to intervene earlier in solving crimes.</p>
<p>Indiana has begun working on child welfare, using analytics to figure out the best outcomes for children in tough circumstances. They&#8217;re using analytics to measure which providers are best for which young adults who are in trouble, and what type of kid is most successful with what type of mental health treatment, drug treatment, mentoring or the like.</p>
<p>I think these all just scratch the surface. They need to be highlighted so that city and state leaders can understand how they can have dramatically better returns on their investments by identifying these issues in advance.</p>
<h2>Who else is sharing and consuming data to do predictive data analytics in the way that you&#8217;re describing?</h2>
<p><strong>Stephen Goldsmith:</strong> A lot of well-known stat programs, like CompStat or CitiStat, do a really good job of measuring activities. When combined with analytics, they’ll be able to move into managing outcomes more effectively. There are a lot of folks like San Francisco beginning to look at these issues, but I think really, New York City and Chicago are in the lead.</p>
<h2>Based upon their example, what kinds of budget, staffing and policy decisions would other cities need to make to realize similar returns?</h2>
<p><strong>Stephen Goldsmith:</strong> The most restrictive element in government today is the no-longer-accurate impression that legacy data can’t be easily integrated. Every agency has a CIO who often believes it’s his or her job to protect that data. I’m not talking about privacy; I’m just talking about data integrity. We know that there’s a range of tools that will allow relatively easy integration and data mining.</p>
<p>Another lesson is that this really needs to be driven by the mayor or the governor. The answers to problems come from picking up data across agencies, not just managing the data inside your agency. Without city hall or gubernatorial leadership, it’s very difficult to drive data analytics.</p>
<h2>What about the risks of &#8216;preemptive government&#8217; leading to false positives or worse?</h2>
<p><strong>Stephen Goldsmith: </strong>There is a risk, but let me talk about it in the following way: Government can no longer afford to operate the way it operates. You cannot afford to regulate every business as if it’s equally bad or equally good. Every restaurant is not equally good or equally bad. Every contractor’s not bad or good. There are bad guys and good guys, and good performers and bad performers. There are families that need help and families that don’t need help. We need to allocate our resources most effectively to create solutions. That means we need to look at which solutions work for which problems.</p>
<p>What do we know about which contractors have a history of being bad? I don’t mean &#8220;bad&#8221; like just how they build &#8212; I mean have they paid their taxes right, do they discriminate in the marketplace, whatever those factors are in order to target our resources.</p>
<p>That means that when Flowers did this in New York, we got several hundred buildings that were the most likely to burn down. We knew that from analytics. We’re going to go out and mitigate those buildings. Could we make a mistake and say that ten of those 300 buildings really aren’t that bad? Absolutely, but it’s a much better targeting of resources and it’s the only way government can afford to effectively operate.</p>
<p>There are other issues, too, like personalization, where we have a lot of privacy issues, and &#8220;opt-in&#8221; and &#8220;opt out&#8221; where people may want a personal relationship with their government. That’s a little different than predictive analytics, but it raises privacy issues.</p>
<p>Then we have a fascinating question, one that social work communities and criminal justice communities worry about, which is, “Okay, you can predict the likelihood that somebody can be hurt, or that somebody will commit a crime and adjust resources accordingly &#8211; but we better be pretty careful because it raises a lot of ethics questions and profiling questions.”</p>
<p>My short answer is that these are important, legitimate questions. We can&#8217;t ignore them, but continuing to do business the way we do has more negative tradeoffs than changing does.</p>
<h2>Speaking of personalization and privacy, has mining social media for situational awareness during natural disasters or other times of crisis become a part of the toolkit for cities?</h2>
<p><strong>Stephen Goldsmith: </strong>The conversation we&#8217;ve had has been about how to use enterprise data to make better decisions, right? That’s basically going to open up a lot of insight, but that model is pretty arrogant. It basically ignores crowd sourcing. It assumes that really smart people in government with a lot of data will make better solutions. But we know that people in communities can co-produce information together. They can crowd source solutions.</p>
<p>In New York City, we actually had some experience with this. One thread was the work that Rachel [Haot] was doing to communicate, but we were also using social media on the operations side. I think we’re barely getting started on how to mine community sentiment, how to integrate that with 311 data for better solutions, how to prioritize information where people have problems, and how to anticipate the problems early.</p>
<p>You may know that Indianapolis, during the 2012 Super Bowl, had a group of college students and a couple of local providers looking at Twitter conversations in order to intervene earlier. The conversations were geotagged by name and curated to figure out where there was a crime problem, where somebody needed parking, where people were looking for tickets and where there was too much trash on the ground. It didn’t require people to call government. Government was watching the conversation, participating in it and solving the problem.</p>
<p>I think that where we are has lots of potential and is a little bit immature. The work now is to incorporate the community sentiment into the analytics and the mobile tools.</p>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/02/preemptive-government-predictive-data.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>U.S. House makes legislative data more open to the people in XML</title>
		<link>http://strata.oreilly.com/2013/01/u-s-house-open-data-open-government.html</link>
		<comments>http://strata.oreilly.com/2013/01/u-s-house-open-data-open-government.html#comments</comments>
		<pubDate>Fri, 11 Jan 2013 18:36:34 +0000</pubDate>
		<dc:creator>Alex Howard</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Congress]]></category>
		<category><![CDATA[open data]]></category>
		<category><![CDATA[open government]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=53897</guid>
		<description><![CDATA[It was a good week for open government data in the United States Congress. On Tuesday, the Clerk of the House made House floor summaries available in bulk XML format. Yesterday, the House of Representatives announced that it will make &#8230; ]]></description>
				<content:encoded><![CDATA[<p>It was a good week for open government data in the United States Congress. On Tuesday, the Clerk of the House made <a href="http://techpresident.com/news/23345/house-floor-summaries-available-bulk-xml-format">House floor summaries available in bulk XML format</a>. Yesterday, the House of Representatives announced that it will make all of its <a href="http://go.usa.gov/4cEx">legislation available for bulk download</a> in a machine-readable format, XML, in cooperation with the U.S. Government Printing Office. As Nick Judd observes at TechPresident, such <a href="http://techpresident.com/news/23355/house-republicans-release-more-data-catnip-developers">data is catnip for developers</a>. While full <a href="http://radar.oreilly.com/2009/03/bulk-data-downloads-government-transparency-breakthrough.html">bulk data from THOMAS.gov</a> is still not available, this incremental progress deserves mention.</p>
<p><span id="more-53897"></span></p>
<p align="center"><img alt="image" src="http://www.speaker.gov/sites/speaker.house.gov/files/styles/spkr_node_full/public/opengov_130108_floorsummary.png" /></p>
<p>This change has been a long time coming, although more needs to be done to fully open the People&#8217;s House to the People. In April 2011, Speaker of the House John Boehner and Majority Leader Eric Cantor sent a letter to the House Clerk regarding <a href="http://sunlightfoundation.com/blog/2011/04/29/speaker-boehner-and-majority-leader-cantor-on-legislative-data-release/">legislative data release</a>. In September 2011, a <a href="http://sunlightlabs.com/blog/2011/house-revamps-floor-feed/">live XML feed for the House floor</a> went online. In September 2012, Congress launched a beta version of <a href="http://Congress.gov">Congress.gov</a> but <a href="http://radar.oreilly.com/2012/09/congress-launches-congress-gov-in-beta-doesnt-open-the-data.html">failed to open the data</a>.</p>
<p>&#8220;Thanks to GPO, all House bills for this Congress will be available in one XML file that can be downloaded by anyone,&#8221; said Speaker Boehner, in a joint statement with Majority Leader Cantor at Speaker.gov. &#8220;This is a win for every American who believes in open government. Making legislative data easier to use for third parties, developers, and anyone interested in how Congress is tackling current challenges is a priority for House leaders. We&#8217;re going to keep working to make the legislative process more transparent and to better connect lawmakers with the people we serve.&#8221;</p>
<p>In a <a>post</a> on Tuesday at Speaker.gov, Don Seymour, digital communications director for the Speaker of the House, detailed the progress made during the 112th Congress:</p>
<blockquote><p>This project is the first of several to be rolled out in the 113th Congress that were coordinated or initiated by the <a href="http://www.speaker.gov/general/house-moving-forward-bulk-legislative-info">Legislative Branch Bulk Data Task Force</a>. The task force was created to <a href="http://www.speaker.gov/press-release/house-leaders-back-bulk-access-legislative-information">expedite the process of providing bulk access to legislative information</a> and to increase transparency for the American people. It includes the House Clerk, legislative branch agencies such as the Government Printing Office and Library of Congress, representatives from House leadership and key committees, and the House Chief Administrative Officer.</p>
<p>Open government is and has been a priority for House leaders. In fact, the Clerk began offering <a href="http://www.speaker.gov/general/open-house-updated-clerk-website-offers-better-access-real-time-info">real-time updates on House floor proceedings</a> in XML back in 2011. The <a href="http://clerk.house.gov/floorsummary/floor.aspx">feed of real-time information</a> complemented <a href="http://houselive.gov/">HouseLive.gov</a>, a new video streaming feature they set up for <a href="http://www.speaker.gov/general/house-floor-now-streams-your-mobile-device-houselivegov">desktop and mobile devices</a>. The House also began utilizing new <a href="http://www.speaker.gov/Blog/Default.aspx?postid=249349">low-cost video conferencing tools</a>, <a href="http://www.speaker.gov/general/new-one-stop-site-live-video-house-committee-hearings">streaming committee hearings online</a>, working with <a href="http://majorityleader.gov/Facebook/">developers</a> and <a href="http://cha.house.gov/about/contact-us/legislative-data-conference">transparency advocates</a>, and more.</p>
</blockquote>
<p>As Speaker Boehner said, this is good news for every American. Despite the abysmal public perception of Congress, genuine institutional changes in the House of Representatives driven by the <a href="http://www.nationaljournal.com/congress/will-the-gop-embrace-innovation-and-transparency--20101125">GOP embracing innovation and transparency</a> have been happening over the last three years. Open government in the House also enjoys a rare status in Washington these days: bipartisan comity. These improvements built upon bipartisan progress made while Representative Nancy Pelosi held the Speaker&#8217;s gavel, including putting committee hearings online, putting <a href="http://disbursements.house.gov/">expenditures</a> and lobbying disclosure online, and changing the franking rules to allow for the use of new media, like YouTube, Facebook and Twitter.</p>
<p>Democratic Whip Steny Hoyer <a href="http://www.democraticwhip.gov/content/democratic-whip-hoyer-praises-gpo-and-house-clerk-providing-bulk-access-house-bills-and-floo">praised the GPO and House Clerk</a> for providing bulk access to House bills in XML:</p>
<blockquote><p>&#8220;The actions this week by the GPO and the House Clerk are significant steps towards making the legislative branch more open and transparent,&#8221; said Whip Hoyer. &#8220;Congress has a duty to share information about legislation being developed and deliberated, and this new effort will allow the public to follow and engage with Congress in innovative new ways. I commend GPO and the House Clerk for their actions, and hope that other legislative branch entities like the Library of Congress and the Senate will follow suit by including additional legislative information that is already publicly available, yet not accessible, on Thomas.gov.&#8221;</p>
</blockquote>
<h2>A more open road ahead?</h2>
<p>As Tim O&#8217;Reilly observed in 2011, the current <a href="http://www.cato-at-liberty.org/house-leaderships-transparency-leadership/">leadership of the House</a> do seem to be doing a better job on transparency and open data than their predecessors. Jim Harper&#8217;s <a href="http://www.cato.org/publications/policy-analysis/grading-governments-data-publication-practices">analysis of the government&#8217;s data publication process</a> substantiates that progress. Writing at the Cato Institute, where he is the director of information policy studies, Harper <a href="http://www.cato.org/blog/very-good-house-keep-it-comin">praised the House</a> for this step forward:</p>
<blockquote>
<p>I believe the public has an Internet-fueled expectation that they should understand what happens in Congress. It&#8217;s one explanation for rock-bottom esteem for government in opinion polls. Access to good data helps produce better public understanding of what goes on in Washington and also, I believe, more felicitous policy outcomes &mdash; not only reduced demand for government, but better administered government in the areas the public wants it.</p>
</blockquote>
<p>Harper also offered some constructive criticism for improvement:</p>
<blockquote>
<p>That I&#8217;ve been able to find, the XML is not well documented. What each of the technical codes means is understood by several people in Washington&#8217;s transparency community, but the idea is to make it available very broadly, so the documentation should be very strong. The information at <a href="http://xml.house.gov/"><span class="s1">xml.house.gov</span></a> should be updated, tightened up, and made easily available to the people gathering bill data on FDsys.</p>
<p class="p1">The XML data structures put in bills are limited in terms of what they convey. There is rudimentary information about who introduced and cosponsored bills, what committees they were referred to, and other procedural information. That&#8217;s good. But the effects of bills&mdash;on agencies, existing law, programs, places&mdash;this is not available in machine-readable code. That would be great.</p>
</blockquote>
<p>Josh Tauberer, the author of <em><a href="http://opengovdata.io/">Open Government Data</a></em>, added some caveats on the House&#8217;s move to <a href="http://razor.occams.info/blog/2013/01/10/on-the-new-bulk-bill-xml-from-gpo/">bulk bill XML</a>. Tauberer is the civic hacker behind <a href="http://Govtrack.us">GovTrack.us</a>, which has been scraping and making legislative data more open for years. He also contributed a chapter to <em><a href="http://www.amazon.com/Open-Government-Collaboration-Transparency-Participation/dp/0596804350">Open Government</a></em> in 2010.</p>
<p>In his comments, excerpted below, he notes that &#8220;there&#8217;s no new data here, and thus not the data that the bulk legislative data advocates have been asking for.&#8221; In other words, this is evolutionary change, not revolutionary change.</p>
<blockquote>
<p>What we&#8217;re seeing with the bills bulk data project is how the wave of culture change is moving through government. Over the last two years the House Republican leadership has embraced open government in many ways (<a href="http://razor.occams.info/blog/2013/01/04/transparency-in-the-112th-house/">my 112th Congress recap</a> | the new <a href="http://www.speaker.gov/general/opengov-xml-house-floor-summaries-now-available-bulk">House floor feed</a>). With this bills XML project, we&#8217;re seeing more legislative support agencies being involved in how the House does open government.</p>
<p>This isn&#8217;t a technical feat by any means, but it is a cultural feat. The House and GPO worked together to institutionalize a new way for the House to publish bulk data.</p>
<p>Because of the way Data.gov is managed in the executive branch, we&#8217;ve become accustomed to big announcements. The bills bulk data project and the other recent projects show that the House is taking a different approach, an incremental approach, to open government data: publish early and often, gather feedback, then go on to bigger projects. This is something open government advocates have been asking for.</p>
</blockquote>
<p>Daniel Schuman, the legislative counsel of the Sunlight Foundation, echoed Harper and Tauberer&#8217;s balanced praise for <a href="http://sunlightfoundation.com/blog/2013/01/10/access-to-legislation-gets-better-promise-of-more-to-come/">improved access</a>, including a <a href="http://sunlightfoundation.com/blog/2013/01/10/access-to-legislation-gets-better-promise-of-more-to-come/">recent history</a> of progress and suggestions for what remains to do:</p>
<blockquote><p>&#8220;Ultimately, this path should take us past the point where all legislative information published on THOMAS (or its successor Congress.gov) is available online, in real time, as structured data that is capable of being downloaded in bulk. The most requested data is legislative status information, which is held by the Library of Congress and still is not available today as structured data, bulk or otherwise. That includes when the bill was introduced, who co-sponsored it, a summary of the legislation, and so on.</p>
<p>Status information is prepared by the Library of Congress, which has been historically recalcitrant to make this information available to the public in any other formats besides a series of web pages. But we know based on a <a href="http://sunlightfoundation.com/blog/2012/05/18/two-steps-forward-on-improving-public-access-to-legislative-information/">March 2008 memo</a> that the hurdle here is political will, not technology. That&#8217;s why this new announcement is encouraging. The task force is starting to crack open the vault. Let&#8217;s hope that the Senate and the Library of Congress are coming to share the House&#8217;s enthusiasm for transparency.&#8221;</p>
</blockquote>
<p>As we head further into 2013, here&#8217;s hoping that the entire Congress takes more steps to make the content and status of proposed laws more accessible to the hundreds of millions of people its members represent around the country. As more progress is made toward freeing the data, it will enable the nation to track the progress of legislation in real time through the <a href="http://radar.oreilly.com/tag/data-journalism">media</a>, and enable <a href="http://radar.oreilly.com/tag/data-econonmy">civic entrepreneurs</a> to build better interfaces for understanding the proposals before Congress.</p>
<p><strong>Related Stories:</strong></p>
<ul>
<li><a href="http://radar.oreilly.com/2012/09/congress-launches-congress-gov-in-beta-doesnt-open-the-data.html">Congress launches Congress.gov in beta, doesn&#8217;t open the data</a></li>
<li><a href="http://radar.oreilly.com/2011/12/congressional-hackathon-2011.html">Can the People&#8217;s House become a platform for the People?</a></li>
<li><a href="http://radar.oreilly.com/2009/03/bulk-data-downloads-government-transparency-breakthrough.html">Bulk data downloads: a breakthrough for government transparency</a></li>
</ul>
<p><em>This post is part of our series investigating <a href="http://radar.oreilly.com/2012/12/making-dollars-and-sense-of-the-open-data-economy.html">open data</a>. An earlier version of this post appeared on the <a href="http://oreillyradar.tumblr.com/post/40186606375/open-data-of-u-s-house-legislation-now-available-in">O&#8217;Reilly Radar Tumblr</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/01/u-s-house-open-data-open-government.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Panjiva uses government data to build a global search engine for commerce</title>
		<link>http://strata.oreilly.com/2012/12/panjiva-government-data-platform.html</link>
		<comments>http://strata.oreilly.com/2012/12/panjiva-government-data-platform.html#comments</comments>
		<pubDate>Thu, 06 Dec 2012 18:00:53 +0000</pubDate>
		<dc:creator>Alex Howard</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[data economy]]></category>
		<category><![CDATA[ecommerce]]></category>
		<category><![CDATA[government as a platform]]></category>
		<category><![CDATA[government data]]></category>
		<category><![CDATA[open data]]></category>
		<category><![CDATA[Panjiva]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[trade]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=53286</guid>
		<description><![CDATA[&#8220;If you go back to how we got started,&#8221; mused Josh Green, &#8220;government data really is at the heart of that story.&#8221; Green, who co-founded Panjiva with Jim Psota in 2006, was demonstrating the newest version of Panjiva.com to me &#8230; ]]></description>
				<content:encoded><![CDATA[<p>&#8220;If you go back to how we got started,&#8221; mused Josh Green, &#8220;government data really is at the heart of that story.&#8221; Green, who co-founded <a href="http://panjiva.com">Panjiva</a> with Jim Psota in 2006, was demonstrating the newest version of Panjiva.com to me over the web, thinking back to the startup&#8217;s origins in Cambridge, Mass.</p>
<p><a href="http://s.radar.oreilly.com/wp-files/5/2012/12/1.Global-Search-Box.png"><img class="size-medium wp-image-53287 alignright" src="http://s.radar.oreilly.com/wp-files/5/2012/12/1.Global-Search-Box-300x156.png" alt="" width="300" height="156" /></a>At first blush, the search engine for products, suppliers and shipping services didn&#8217;t have a clear connection to the open data movement I&#8217;d been chronicling over the past several years. His account of the back story of the startup is a case study that aspiring civic entrepreneurs, Congress and the <a href="http://strata.oreilly.com/2012/05/us-cto-seeks-to-scale-agile-te.html">White House</a> should take to heart.</p>
<p>&#8220;I think there are a lot of entrepreneurs who start with datasets,&#8221; said Green, &#8220;but it&#8217;s hard to start with datasets and build business. You&#8217;re better off starting with a problem that needs to be solved and <em>then</em> going hunting for the data that will solve it. That&#8217;s the experience I had.&#8221;</p>
<p>The problem that the founders of Panjiva wanted to help address was one that many other entrepreneurs face: how do you connect with companies in faraway places? Green realized that a better solution was needed in the same way that many people with an innovative idea do: he had a frustrating experience and wanted to scratch his own itch. When he was working at an electronics company earlier in his career, his boss asked him to find a supplier they could do business with in China.</p>
<p>&#8220;I thought I could do that, but I was stunned by the lack of reliable information,&#8221; said Green. &#8220;At that moment, I realized we were talking about a problem that should be solvable. At a time when people are interested in doing business globally, there should be reliable sources of information. So, let&#8217;s build that.&#8221;</p>
<p>Today, Panjiva has created a <a href="http://boss.blogs.nytimes.com/2012/02/28/a-higher-tech-way-to-find-overseas-suppliers/">higher tech way to find overseas suppliers</a>. The way they built it, however, deserves more attention.</p>
<p><span id="more-53286"></span></p>
<h2>Government data as a platform</h2>
<p>By 2009, the startup had an initial product they could bring to market and launched a search engine that used <a href="http://venturebeat.com/2009/07/20/panjiva-using-government-data-as-a-platform-for-international-trade/">government data as a platform for international trade</a>. An importer could type in &#8220;patio furniture&#8221; and determine who shipped it and who their customers were. The company chose a freemium model, where search is available for free but relationships between buyers and suppliers are only available to subscribers. The mapping of those relationships is where Panjiva delivered added value on top of public data.</p>
<p>That <em>added</em> value is crucial, given that competitors can also request and use the dataset. &#8220;Companies have been packaging and reselling this data in one way or another for years,&#8221; said Green. &#8220;If you looked at this data, people are going to find value. It&#8217;s typically folks in the shipping industry, who want to know what&#8217;s going into ports or moving on different shipping lines. For us, the central purpose of the data was something different and required more work.&#8221;</p>
<p>That work paid off. In 2010, Panjiva built a <a href="http://money.cnn.com/2010/02/19/smallbusiness/global_ecommerce_search_engine/index.htm">search engine for global commerce</a> that worked. Today, it has more than 100,000 users in 190 countries using its free service and some 3,700 companies subscribing to the paid version, including 42 Fortune 500 companies.</p>
<p style="text-align: center"><a href="http://s.radar.oreilly.com/wp-files/5/2012/12/2.Search-Results.Objective-+Web-Data.png"><img src="http://s.radar.oreilly.com/wp-files/5/2012/12/2.Search-Results.Objective-+Web-Data-1024x851.png" alt="" width="640" height="531" /></a></p>
<p>Notably, the Department of Homeland Security (DHS) itself is also a paying subscriber. Green declined to disclose the terms of relationships with all of Panjiva&#8217;s partners or data suppliers, some of which include nonprofits. Some users &#8220;do a revenue share, some are paying for data, others are providing data because they think there&#8217;s public good for that data being on the platform,&#8221; he said.</p>
<p>Panjiva competes with <a href="http://techcrunch.com/2008/05/28/importgenius-the-disruptive-shipping-database/">ImportGenius</a>, <a href="http://www.zepol.com/">Zepol</a>, <a href="http://www.alibaba.com/">Alibaba</a> and <a href="http://www.piers.com/">PIERS</a>. Green credits PIERS for extracting similar value from customs datasets.</p>
<h2>The turning point</h2>
<p>When they started, the first approach that Green and his co-founder decided to take was to build a &#8220;Yelp for global trade&#8221; that would be based on feedback from people who work with companies. Unfortunately for the young startup, they couldn&#8217;t get off the starting blocks in generating reviews, much less reach critical mass.</p>
<p>They also encountered a new problem: even if they were able to get ratings of exporters and suppliers, how would they ensure the reviews came from people who had actually done business with the entities being rated? In retrospect, that focus was a bit silly, said Green, because they couldn&#8217;t get engagement, but talking about how to solve it led them to an unexpected answer: government data.</p>
<p>That direction came from a meeting where a staffer for a trade promotion organization told them it was straightforward to get shipping data on what&#8217;s coming into the country from the United States Customs Agency, which is now part of the Department of Homeland Security.</p>
<p>&#8220;It was a turning point for the company,&#8221; said Green. &#8220;We realized there was a dataset available to the public for a fee. They make available data about shipments that enter the U.S. While not all data is made available to the public and there are a bunch of limitations, the data that is made available is amazing. There&#8217;s about 10 million shipping records every year, typically including who is sending goods, who is receiving goods, what&#8217;s inside, and how much is inside a container.&#8221;</p>
<p>While useful, these government datasets do come with inherent limitations, cautioned Green. For one, they only contain data about shipments coming into the United States, not what&#8217;s going into Europe or Asia. For another, the data made available to the public only covers shipments made by boat, which is about half of the shipments that come into the United States.</p>
<p>&#8220;It&#8217;s unfortunate that government cannot make available data on other modes of transport,&#8221; observed Green, with a hint of frustration in his voice. &#8220;That leaves out truck, rail, and air. Congress actually attempted to clarify that the regulations that govern this data weren&#8217;t just about boats but applied to air. Thus far, DHS hasn&#8217;t acted.&#8221;</p>
<p>Given the lens that has been focused on trade deficits between other countries and the United States in recent decades, there&#8217;s also a political angle to the market intelligence Panjiva provides that Congress and taxpayers may find of interest. For instance, <a href="http://www.logisticsmgmt.com/article/panjiva_data_shows_global_trade_growth_from_may_to_june_at_a_reduced_rate/">Panjiva data showed global trade growth</a> slowing in the first part of 2012.</p>
<p>&#8220;What we&#8217;ve organized, by its nature, gives us insight on companies around the world that serve the U.S. market,&#8221; said Green. &#8220;We&#8217;re helping people find overseas suppliers. Why not help find suppliers here at home? It turns out there&#8217;s a similar story on export data that&#8217;s supposed to be made available to the public as well. DHS has a hard time with that as well. We can&#8217;t get the data.&#8221;</p>
<p>Data availability is also affected by the actions of the companies themselves, which have the ability to petition the government to hide shipments that are coming to them. &#8220;In about a third of the cases, you cannot see who is sending and receiving the goods,&#8221; said Green. &#8220;Government can see, but what&#8217;s released to the public has information pulled from it.&#8221;</p>
<h2>This government data comes at a cost</h2>
<p>Accessing this public data comes at a cost of some $100 per day, which is the service fee DHS charges for providing a daily CD-ROM. Each disc includes one day&#8217;s worth of shipments, which is generally around 30,000 shipping records. Panjiva started requesting data on July 1, 2007, and now has a little over five years of records.</p>
<p>&#8220;This data, on a record-by-record basis, is interesting,&#8221; said Green. &#8220;If you can organize, it&#8217;s phenomenal. If you can associate with companies, can say this company has experience with these supplies and this company has experience with these customers, it&#8217;s very useful in deciding if a company is a good fit. You can see by customers if they&#8217;re reasonably high quality.&#8221;</p>
<p>Making those CD-ROMs into a useful, searchable resource, however, was far from a simple matter of just inserting them into an optical drive and moving their contents into a structured database.</p>
<p>&#8220;Jim and a team of engineers went to work organizing the datasets initially,&#8221; said Green. &#8220;They were very hard to work with &mdash; absurdly messy. Think about the number of ways you can misname a Chinese factory. It was really problematic. You need to build company profiles, correct for misspellings and variations on names. We spent years getting that right.&#8221; Eventually, Panjiva was able to automate the process of ingesting the data from the CD-ROMs, building an algorithm to take the data and clean it up.</p>
<h2>Making data a strategic asset</h2>
<p>Panjiva&#8217;s initial foray, which created a search engine for customs data, didn&#8217;t meet with strong demand out of the gate. As they refined the product, it generated what Green described as a &#8220;nice business.&#8221; The startup was profitable, in other words, but its leadership aspired to build something bigger.</p>
<p>The direction they took was driven by user feedback. When Panjiva asked its users how they were making buying decisions, the founders saw a pattern emerge that looked like a bigger opportunity.</p>
<p>&#8220;Users started with Panjiva then went to search for additional information on B2B sites or on Google,&#8221; said Green. &#8220;We heard this process and it sounded a lot like the experience consumers had searching for flights before search engines or Kayak.com &mdash; except that instead of airline sites, people are going to B2B sites. The difference is it&#8217;s not just every airline. It&#8217;s like every flight has its own website.&#8221;</p>
<p>The founders have now raised just under $10 million from Battery Ventures and Harrison Metal and invested it in technology and data acquisition. They&#8217;ve grown their engineering team to 10 people, out of a total of 50 or so employees. The engineering team is focused on improving search and enriching Panjiva&#8217;s data with other sources, beyond government data.</p>
<p>This October, the startup <a href="http://pandodaily.com/2012/10/02/panjiva-launches-global-search-so-customers-can-stop-dancing-the-google-shuffle/">relaunched Panjiva.com</a> with another layer: data supplied by the companies themselves.</p>
<p style="text-align: center"><a href="http://s.radar.oreilly.com/wp-files/5/2012/12/4.Product-Detail.Objective+Web-Data.png"><img src="http://s.radar.oreilly.com/wp-files/5/2012/12/4.Product-Detail.Objective+Web-Data-803x1024.png" alt="" width="640" height="816" /></a></p>
<p>&#8220;We have a database of six million companies spread around the world and contact information on four million companies,&#8221; said Green. &#8220;We have product photos for 34 million products. There was a lot of investment required to do that, but none of this would be possible if we hadn&#8217;t had a backbone of data that came from the U.S. government.&#8221;</p>
<p>Since <a href="http://techcrunch.com/2012/10/04/panjiva-adds-global-search-to-its-social-network-for-world-manufacturing/">Panjiva added global search</a>, Green said that traffic to the search engine has gone up 50%.</p>
<p>The <a href="http://panjiva.com/why-panjiva/multiple-data-sources">data sources</a> that Panjiva integrated were also driven by customer interest. As the founders shared their product with potential subscribers, they kept hearing the same thing: 1) &#8220;that&#8217;s awesome&#8221; and 2) &#8220;I&#8217;d like more data.&#8221;</p>
<p>&#8220;We loved the first one and hated the second,&#8221; said Green. &#8220;In retrospect, we should have loved both. The second one was a roadmap for us to build them a really great differentiated product.&#8221;</p>
<p>When they asked users exactly which kinds of data would make the service more useful, a map to the future of the company emerged.</p>
<p>The first was <strong>operational data</strong>. &#8220;Customs data is a perfect example,&#8221; said Green. &#8220;It gives you a sense of what companies have done and their track record.&#8221;</p>
<p>The second was <strong>financial data</strong>. &#8220;Sure, a company has experience, but are they financially healthy?&#8221; asked Green. &#8220;Some of that you can infer, but there&#8217;s other things you can use. We&#8217;ve partnered with Dun &amp; Bradstreet and Experian to pull that data into our platform.&#8221;</p>
<p>The third was <strong>positive and negative data</strong> about a company. &#8220;That includes getting certified as financially responsible,&#8221; said Green. &#8220;We&#8217;ve partnered with nonprofits and added that data, showing you information about companies doing wrong, including a blacklist of illicit global trade.&#8221;</p>
<p>The key insight that anyone interested in building a business on top of government data should take away here is to go <em>beyond</em> the government data itself and combine it with the other sources customers ask for.</p>
<h2>What happens if government data becomes open?</h2>
<p>Green thinks that Panjiva is well-positioned to be both competitive and profitable, even if DHS decides to start publishing customs data online. &#8220;We don&#8217;t worry that much about data becoming more accessible,&#8221; he said, &#8220;even if government data becomes free. It&#8217;s not the $36,500 per year to buy the data &mdash; it&#8217;s the engineering talent to clean it up. That&#8217;s a massive problem, and it wouldn&#8217;t be as simple as getting the data.&#8221;</p>
<p>Panjiva is betting that the investments they&#8217;ve made in technology, talent and &mdash; crucially &mdash; combining so many different data sources have created a differentiated product that solves a problem for its customers.</p>
<p>&#8220;We&#8217;re not trying to build out a data business where we&#8217;re reselling government data,&#8221; said Green. &#8220;We&#8217;re trying to build a platform where serious buyers and sellers can connect. We&#8217;re now going to the world&#8217;s most important buyers. We have two revenue streams: selling premium access to data and selling access to suppliers who want it. The starting point for customers is $99 per month, going up to $10,000 per month for unlimited access for an unlimited number of users, then services that we sell on the top.&#8221;</p>
<p>The experience that Panjiva has had with government data and building a business using it has left Green with a strong perspective on what works &mdash; and what doesn&#8217;t.</p>
<p>&#8220;We don&#8217;t think there are infinite numbers of possibilities in terms of ways to build sustainable value with public data,&#8221; he said. &#8220;One is to take datasets that are commoditizeable and add value. Another is to feed the creation of more data. Another is to build a service. Another is to create network effects, where the data is the honey that attracts the bees.&#8221;</p>
<p>Most important, Green suggested, is to <strong>use public data to solve a problem that&#8217;s both hard and important</strong>. For Panjiva, that means making global trade more efficient and more transparent.</p>
<p>&#8220;There is a future where information is consolidated and accessible to people making key decisions, from a buying or regulatory standpoint,&#8221; he said. &#8220;Once that happens &mdash; and we&#8217;re close &mdash; there&#8217;s potentially a place where there&#8217;s a race to the top instead of the bottom, in terms of supply chain records. That will make a difference when you&#8217;re under scrutiny. Right now, the fragmentation of data is the ally of bad behavior. Our hope is to change that reality.&#8221;</p>
<div style="float: left;border-top: thin gray solid;border-bottom: thin gray solid;padding: 20px;margin: 20px 2px;clear: both"><a href="http://strataconf.com/strata2013?_discount=STRATA20&amp;intcmp=il-strata-stsc13-panjiva-profile"><img style="float: left;border: none;padding-right: 10px" src="http://cdn.oreilly.com/radar/images/promos/strataca13-148x178.jpg" /></a><a href="http://strataconf.com/strata2013?_discount=STRATA20&amp;intcmp=il-strata-stsc13-panjiva-profile"><strong>Strata Conference Santa Clara</strong></a> &mdash;  Strata Conference Santa Clara, being held Feb. 26-28, 2013 in California, gives you the skills, tools, and technologies you need to make data work today.</p>
<p><a href="http://strataconf.com/strata2013?_discount=STRATA20&amp;intcmp=il-strata-stsc13-panjiva-profile"><strong>Save 20% on registration with the code STRATA20</strong></a></div>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2012/12/panjiva-government-data-platform.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>3 big ideas for big data in the public sector</title>
		<link>http://strata.oreilly.com/2012/11/3-big-ideas-for-big-data-in-the-public-sector.html</link>
		<comments>http://strata.oreilly.com/2012/11/3-big-ideas-for-big-data-in-the-public-sector.html#comments</comments>
		<pubDate>Thu, 15 Nov 2012 20:49:56 +0000</pubDate>
		<dc:creator>Alex Howard</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[criminal justice]]></category>
		<category><![CDATA[distributed intelligence]]></category>
		<category><![CDATA[New York City]]></category>
		<category><![CDATA[public good]]></category>
		<category><![CDATA[public sector]]></category>
		<category><![CDATA[UN Global Pulse]]></category>
		<category><![CDATA[United Nations]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=52928</guid>
		<description><![CDATA[If you&#8217;re going to try to apply the lessons of &#8220;Moneyball&#8221; to New York City, you&#8217;ll need to get good data, earn the support of political leaders and build a team of data scientists. That&#8217;s precisely what Mike Flowers has &#8230; ]]></description>
				<content:encoded><![CDATA[<p>If you&#8217;re going to try to apply the lessons of <a href="http://strataconf.com/stratany2012/public/schedule/detail/26619">&#8220;Moneyball&#8221; to New York City</a>, you&#8217;ll need to get good data, earn the support of political leaders and build a team of data scientists. That&#8217;s precisely what Mike Flowers has done in the Big Apple, and his team has helped to <a href="http://strata.oreilly.com/2012/06/predictive-data-analytics-big-data-nyc.html">save lives and taxpayer dollars</a>. At the Strata + Hadoop World conference held in New York in October, Flowers, the director of analytics for the Office of Policy and Strategic Planning in the Office of the Mayor of New York City, gave a keynote talk about how <a href="http://strata.oreilly.com/2012/06/predictive-data-analytics-big-data-nyc.html">predictive data analytics</a> have made city government more efficient and productive.</p>
<p><iframe width="640" height="360" src="http://www.youtube.com/embed/_M_20UjRvr0?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p>While the story that Flowers told is a compelling one, the role of big data in the public sector was in evidence in several other sessions at the conference. Here are three more ways that big data is relevant to the public sector that stood out from my trip to New York City.</p>
<p><span id="more-52928"></span></p>
<h2>There&#8217;s justice in the data</h2>
<p>&#8220;Data and technology will change how criminal justice works in America,&#8221; said Anne Milgram. Milgram, the former attorney general of New Jersey, is now the <a href="http://www.arnoldfoundation.org/our-team">vice president of criminal justice</a> at the Arnold Foundation, where she&#8217;s working on building better risk assessment tools for courts, including a pilot in Manhattan. </p>
<p>&#8220;You can&#8217;t fix a problem you don&#8217;t know you have,&#8221; said Milgram. &#8220;We need to understand when and where violent crime is happening.&#8221; Most institutional actors, she asserted, are not using tech effectively. They need access to information that other actors hold, and they face cultural hurdles to data sharing.</p>
<p>When she asked a judge which criminals his court was actually putting in jails, he said it was the most dangerous and violent offenders. The data shows that the majority of people being incarcerated are low-level, <em>non</em>-violent offenders, said Milgram. The judges she talked with, however, couldn&#8217;t believe that outcome, given that their intention in sentencing was otherwise. What she showed them also highlighted an important system error: highly violent people were ending up out on the streets, not incarcerated.</p>
<p>Applying a data-driven lens to improve the efficiency and effectiveness of the criminal justice system has a potentially huge incentive: improving the bottom lines of state budgets around the country. According to Milgram, corrections ranks among the highest expenditures in state budgets. (As of 2012, corrections was No. 4 in <a href="http://www.ebudget.ca.gov/pdf/Enacted/BudgetSummary/SummaryCharts.pdf">California</a>, for instance.) It costs $45,000 per year in New Jersey to incarcerate someone, she said, and the cost of local systems nationally is estimated by the Department of Justice at more than $130 billion a year. The cost of pretrial incarceration alone is $9 billion.</p>
<p>Milgram spoke about <a href="http://m.theatlantic.com/national/archive/2012/06/moneyballing-criminal-justice/258703/">&#8220;moneyballing&#8221; criminal justice</a> earlier this year at the Code for America Summit. Video of that talk is embedded below:</p>
<p><a href="http://www.youtube.com/watch?v=_XN5RaInhD64">http://www.youtube.com/watch?v=_XN5RaInhD64</a></p>
<p>The returns for better applying technology in criminal justice extend far beyond reducing crime or costs to something that government officials are sworn to uphold: justice. </p>
<h2>&#8220;It&#8217;s too expensive&#8221; is no longer an excuse</h2>
<p>While Milgram is focused on getting a pilot up and running in New York City, <a href="http://radar.oreilly.com/2011/08/chicago-data-apps-open-government.html">Chicago&#8217;s data-driven approach to open government</a> has been underway since Mayor Rahm Emanuel was elected in February 2011. At Strata + Hadoop World, Q Ethan McCallum, the author of the newly published <a href="http://shop.oreilly.com/product/0636920024422.do"><em>Bad Data Handbook</em></a>, joined Chicago chief information officer and chief data officer Brett Goldstein to talk about <a href="http://strataconf.com/stratany2012/public/schedule/detail/25129">text mining and civic engagement</a>. </p>
<blockquote class="twitter-tweet"><p>.@<a href="https://twitter.com/chicagocdo">chicagocdo</a>: how could we gather data to complement existing inputs? <a href="https://twitter.com/search/%23strataconf">#strataconf</a> cc @<a href="https://twitter.com/oreillymedia">oreillymedia</a> <a href="http://t.co/Shl6A6PB" title="http://twitter.com/digiphile/status/261495345098420226/photo/1">twitter.com/digiphile/stat…</a></p>
<p>&mdash; Alex Howard (@digiphile) <a href="https://twitter.com/digiphile/status/261495345098420226">October 25, 2012</a></p></blockquote>
<p>There&#8217;s much that can and should be said about what Chicago has accomplished with data in the past years, from <a href="http://open311.org/2012/09/the-launch-of-open311-in-chicago/">the launch of Open 311 in Chicago</a>, which creates public data infrastructure for civic apps like this <a href="http://www.cityofchicago.org/city/en/depts/cdph/iframe/scc_app.html">flu shot finder</a>, to the city&#8217;s embrace of its civic hacking community, which creates useful apps like <a href="http://wasmycartowed.com/">Was my car towed?</a>. </p>
<p>What jumped out for me at Strata, however, wasn&#8217;t the quality of Chicago&#8217;s data, how it has consumed it internally or the ecosystem of nonprofits, developers and startups that the city has worked to create around it. Instead, it was the importance of political leadership, results and the pursuit of internal capacity, not just the number of open datasets published online. &#8220;Mayor Emanuel wanted Chicago to be the standard for open data, analytics and prediction,&#8221; said Goldstein.</p>
<p>Chicago is a political place, he allowed with a laugh at the press conference at the Strata Conference, but the mandate from the mayor is to do it right. &#8220;We&#8217;re not creating pretty pictures,&#8221; he said. &#8220;We&#8217;re building a solid foundation.&#8221; </p>
<p>Goldstein, who added &#8220;CIO&#8221; to his title earlier this year, has spent a lot of time working on architecture since. The question has been how to bring the data together. Chicago has taken an open source approach to doing so, trying to make it universally easier to use and standardizing across the enterprise. He&#8217;s also been working with the community: Chicago is not only making municipal data available, but it&#8217;s also sponsoring R classes to help people understand how to put it to good use.</p>
<p>Goldstein&#8217;s team is dealing with short-term deliverables. Traditionally in IT projects, cities send out a request for proposals and then spend money on a big box solution &mdash; and that can take years. &#8220;My mandate is to give our residents every value for their taxes,&#8221; he said at the press conference. &#8220;By having an agile team, we could stand up in weeks what would take months or years. By having an agile mentality, you can get a rapid return. There are classic IT things that should go to RFP &mdash; like an ERP system &mdash; but why not build other things?&#8221;</p>
<p>For Goldstein, &#8220;showing that you can use R in a municipal government is a game changer.&#8221; As a result of his team&#8217;s work in Chicago, &#8220;it&#8217;s too expensive&#8221; is no longer an excuse for not using big data in the public sector.</p>
<p>To help other cities use Hadoop, MongoDB and R, Goldstein is collaborating with Michael Flowers on <a href="http://www.g-analytics.org/">G-Analytics</a>, a group focused on building capacity in this nascent field of urban data analytics around the United States and beyond.</p>
<p>&#8220;I have a close relationship with Flowers,&#8221; said Goldstein, at the press conference. &#8220;We trade code. If I write something, I want someone to be able to download and use it.&#8221;</p>
<h2>Balance public good with human rights protection</h2>
<p>In August, my colleague Alistair Croll provocatively wrote that <a href="http://radar.oreilly.com/2012/08/big-data-is-our-generations-civil-rights-issue-and-we-dont-know-it.html">big data is our generation&#8217;s civil rights issue</a>. </p>
<p>Robert Kirkpatrick, director of U.N. Global Pulse, broadened that concern when he delivered remarks at Strata: &#8220;Big data is a human rights issue,&#8221; he said. &#8220;We must never analyze personally identifiable information, never analyze confidential data, and never seek to re-identify data.&#8221;</p>
<p>He described three big opportunities in big data for the United Nations and governments in general:</p>
<ol>
<li> better early warning, to enable faster response</li>
<li> real-time awareness, to know what&#8217;s happening on the ground <em>now</em></li>
<li> real-time feedback, perhaps &#8220;most important,&#8221; to see what&#8217;s not happening versus what was intended</li>
</ol>
<p>You can view Kirkpatrick&#8217;s <a href="http://www.slideshare.net/unglobalpulse/strata-14934034">presentation</a> at Slideshare. In his talk on <a href="http://www.unglobalpulse.org/strataconf2012">big data and development</a>, Kirkpatrick appealed to a packed room for help on the challenges and big questions that U.N. Global Pulse faces, a need he articulated in an op-ed in the <a href="http://www.liebertpub.com/mcontent/files/Big%20Data%20Preview%20Issue.pdf">first issue of &#8220;Big Data&#8221;</a>, a peer-reviewed journal that launched at Strata:</p>
<blockquote><p>How does the United Nations gain access to the data it needs in order to do the research necessary to answer those other questions? We believe the answer to the latter, crucial question is what we call &#8220;data philanthropy,&#8221; where data-rich companies donate to research projects. For example, I have been spending a lot of time lately talking to private sector companies about how they can safely and anonymously share with Global Pulse some of what they know about customers, to help give us a badly needed leg-up in our quest to better protect the vulnerable. The companies that are most open to the message are the ones that recognize that data philanthropy is not charity. These companies know that population well-being is key to the growth and  continuity of  business. No company wants to invest in a promising emerging market only to find out it is being threatened by a food crisis that could leave customers unable to afford products and services. And it would be sadly ironic if it turned out that expert analysis of patterns in a company’s own data could have revealed that people were headed for trouble while there was still time to act.</p></blockquote>
<p>Kirkpatrick hopes that the data science community and corporations will donate their time, expertise and data for the <a href="http://strata.oreilly.com/2012/02/data-public-good.html">public good</a>, enabling better <a href="http://techpresident.com/news/wegov/23075/crisis-tracker-open-source-map-curates-crowdsourced-information">crisis tracking using crowdsourced information</a>.</p>
<p>He&#8217;s particularly interested in the potential of social and mobile data for their predictive value. For instance, said Kirkpatrick, <a href="http://www.guardian.co.uk/global-development/2012/jan/12/haiti-twitter-tracked-cholera-outbreak">Twitter data accurately predicted the cholera outbreak in Haiti</a> two weeks earlier than official records. Mobile networks can also act as drought sensors in the Sahel region of Africa.</p>
<p>The tools keep improving, too: the U.S. Geological Survey&#8217;s (USGS) <a href="http://earthquake.usgs.gov/earthquakes/ted/">Twitter earthquake detector</a> (<a href="https://twitter.com/usgsted">@USGSted</a>) has a false positive rate of less than 10%.</p>
<p>Such tools complement existing systems rather than replace them, said Kirkpatrick. After an alert, the USGS can wake up seismologists in that part of the world to go check their data centers. By doing so, they can reduce the time to detect the epicenter of a quake from 20 minutes to four minutes.</p>
<p>This kind of data analysis has considerable potential for more than natural disasters or epidemics. In Jakarta, the &#8220;tweetingest city on Earth,&#8221; with more than 9 million tweets sent daily, UN Global Pulse analyzed 14 million tweets during a period of inquiry and found that people talk differently about food when they&#8217;re not getting it, which suggests potential predictive value for food security. Social media can be used to predict food price inflation, though Kirkpatrick warned that signals in the data are culturally contextual. For instance, for every 5,000 more tweets from Indonesians about eggs, UN Global Pulse found a 2-3% <em>decrease</em> in the food <a href="http://www.ers.usda.gov/topics/food-markets-prices/consumer-price-index-%28cpi%29.aspx">consumer price index</a>.</p>
<p>UN Global Pulse is now working on building its networks around the world, partnering with governments, the private sector, agencies and academia on research. When they find something that works, they turn to the open source paradigm to help local populations tap into distributed intelligence.</p>
<p><iframe width="640" height="360" src="http://www.youtube.com/embed/JhB2fRCyD7M?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p>We expect to see more map layers of proxy indicators, said Kirkpatrick. &#8220;The signals are getting stronger, with an increase in the volume of relevant conversations,&#8221; he said. &#8220;The social media food index and food price index started matching up in October 2011. We believe more people using social media to talk about basic needs is leading to a correlation with official statistics. We think there&#8217;s a huge opportunity to have socioeconomic weather stations that show trends in poverty and food in every community in the world.&#8221;</p>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2012/11/3-big-ideas-for-big-data-in-the-public-sector.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>In the 2012 election, big data-driven analysis and campaigns were the big winners</title>
		<link>http://strata.oreilly.com/2012/11/2012-election-big-data-journalism-obama-data-campaign.html</link>
		<comments>http://strata.oreilly.com/2012/11/2012-election-big-data-journalism-obama-data-campaign.html#comments</comments>
		<pubDate>Thu, 08 Nov 2012 18:38:35 +0000</pubDate>
		<dc:creator>Alex Howard</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[campaign tech]]></category>
		<category><![CDATA[data journalism]]></category>
		<category><![CDATA[Dreamcatchers]]></category>
		<category><![CDATA[election 2012]]></category>
		<category><![CDATA[Obama campaign]]></category>
		<category><![CDATA[political data science]]></category>
		<category><![CDATA[quant]]></category>
		<category><![CDATA[Romney Campaign]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=52873</guid>
		<description><![CDATA[On Tuesday night, President Barack Obama was elected to a second term in office. In a world of technology and political punditry, the big winner is Nate Silver, the New York Times blogger at Five Thirty Eight. (Break out your &#8230; ]]></description>
				<content:encoded><![CDATA[<p>On Tuesday night, President Barack Obama was elected to a second term in office. In a world of technology and <a href="http://www.slate.com/articles/news_and_politics/politics/2012/11/pundit_scorecard_checking_pundits_predictions_against_the_actual_results.html">political punditry</a>, the <a href="http://www.huffingtonpost.com/2012/11/07/nate-silver-obama-reelection_n_2086556.html">big winner</a> is <a href="http://en.wikipedia.org/wiki/Nate_Silver">Nate Silver</a>, the New York Times blogger at <a href="http://fivethirtyeight.blogs.nytimes.com/">Five Thirty Eight</a>. (Break out your dictionaries: a <a href="http://en.wikipedia.org/wiki/Psephology">psephologist</a> is a national figure.)</p>
<p>After he <a href="http://venturebeat.com/2012/11/07/nate-silver/">correctly called all 50 states</a>, <a href="http://www.huffingtonpost.com/2012/11/07/nate-silver-obama-reelection_n_2086556.html">Silver is being celebrated</a> as the &#8220;<a href="http://news.cnet.com/8301-13510_3-57546161-21/obamas-win-a-big-vindication-for-nate-silver-king-of-the-quants/">king of the quants</a>&#8221; by CNET and the &#8220;<a href="http://www.wired.com/underwire/2012/11/nate-silver-facts-election/">nerdy Chuck Norris</a>&#8221; by Wired. The combined success of statistical models from Silver, <a href="http://polltracker.talkingpointsmemo.com/">TPM PollTracker</a>, <a href="http://elections.huffingtonpost.com/2012/romney-vs-obama-electoral-map">HuffPost Pollster</a>, <a href="http://www.realclearpolitics.com/epolls/2012/president/2012_elections_electoral_college_map_no_toss_ups.html">RealClearPolitics Average</a>, and the <a href="http://election.princeton.edu/">Princeton Election Consortium</a> makes traditional &#8220;horse race journalism,&#8221; which uses insider information from the campaign trail to explain what&#8217;s <em>really</em> going on, look a bit, well, antiquated. With the <a href="http://www.huffingtonpost.com/tarun-wadhwa/nate-silver-election-predictions_b_2090909.html">rise of political data science</a>, the Guardian even went so far as to say that <a href="http://www.guardian.co.uk/media-network/media-network-blog/2012/nov/07/big-data-punditary-us-election">big data may sound the death knell for punditry</a>.</p>
<p>This election season should serve, in general, as a <a href="http://onlinejournalismblog.com/2012/11/07/the-us-election-was-a-wake-up-call-for-data-illiterate-journalists/">wake-up call for data-illiterate journalists</a>. It was certainly a triumph of <a href="http://readwrite.com/2012/11/07/nate-silvers-model-proves-to-be-stunning-portrait-of-logic-over-punditry">logic over punditry</a>. At this point, it&#8217;s fair to &#8220;predict&#8221; that Silver&#8217;s reputation and the role of data analysis will endure long after 2012.</p>
<div class="wp-caption aligncenter" style="width: 460px"><a href="http://xkcd.com/1131/"><img src="http://imgs.xkcd.com/comics/math.png" alt="XKCD on math" width="450" height="158" /></a><p class="wp-caption-text">&#8220;As of this writing, the only thing that&#8217;s &#8216;razor-thin&#8217; or &#8216;too close to call&#8217; is the gap between the consensus poll forecast and the result&#8221; &mdash; Randall Munroe</p></div>
<h2>The data campaign</h2>
<p>The other big tech story to emerge from the electoral fray, however, is how the campaigns themselves used technology. What social media was to 2008, data-driven campaigning was in 2012. In the wake of this election, people who understand <a href="http://techpresident.com/news/23092/revenge-math-nerds">math</a>, programming and data science will be in even higher demand as a strategic advantage in campaigns, from <a href="http://www.slate.com/articles/news_and_politics/victory_lab/2012/11/obama_s_get_out_the_vote_effort_why_it_s_better_than_romney_s.html">getting out the vote</a> to <a href="http://www.slate.com/articles/news_and_politics/victory_lab/2012/10/obama_s_secret_weapon_democrats_have_a_massive_advantage_in_targeting_and.html">targeting and persuading voters</a>.</p>
<p>For political scientists and campaign staff, the story of the <a href="http://swampland.time.com/2012/11/07/inside-the-secret-world-of-quants-and-data-crunchers-who-helped-obama-win/print/">quants and data crunchers who helped President Obama win</a> will be pored over and analyzed for years to come. For those wondering how the first big data election played out, Sarah Lai Stirland&#8217;s analysis of how <a href="http://techpresident.com/news/23104/help-digital-infrastructure-obama-wins-re-election">Obama&#8217;s digital infrastructure helped him win re-election</a> is a must-read, as is Nick Judd&#8217;s breakdown of former Massachusetts governor <a href="http://techpresident.com/news/23106/romneys-digital-campaign-second-place-finish">Mitt Romney&#8217;s digital campaign</a>. The <a href="http://www.nytimes.com/2012/11/08/us/politics/obama-campaign-clawed-back-after-a-dismal-debate.html">Obama campaign found voters</a> in battleground states that their opponents apparently didn&#8217;t know existed. The <a href="http://www.washingtonpost.com/wp-srv/special/politics/2012-exit-polls/">exit polls</a> suggest that finding and turning out the winning coalition of young people, minorities and women was critical, and data-driven campaigning clearly played a role.</p>
<p>For added insight on the role of data in this campaign, watch the archive of O&#8217;Reilly Media&#8217;s special <a href="http://oreillynet.com/pub/e/2311">online conference on big data and elections</a>, from earlier this year. (It&#8217;s still quite relevant.)</p>
<p>For more resources and analysis of the growing role of big data in elections and politics, read on.</p>
<p><span id="more-52873"></span></p>
<p>If you&#8217;re new to the topic, this list of videos and articles should be useful.</p>
<p>In October, I joined other journalists and digital media experts on Al Jazeera&#8217;s &#8220;The Stream&#8221; for a show on &#8220;<a href="http://stream.aljazeera.com/story/datamining-us-election">data mining the U.S. Election</a>.&#8221;</p>
<p>The PBS NewsHour collaborated with Frontline to produce a feature on this year&#8217;s <a href="http://www.pbs.org/newshour/rundown/2012/10/digitial-campaigns-may-decide-the-election.html">digital campaigns</a>, including an interactive on <a href="http://www.pbs.org/wgbh/pages/frontline/campaign-targeting/">targeting the electorate</a>.</p>
<p>For a look back at some of the best news and commentary on big data and politics, read the following links:</p>
<p>Patrick Ruffini: &#8220;<a href="http://www.engagedc.com/2010/11/10/goodbye-polling-hello-big-data/">Goodbye, polling. Hello, Big Data.</a>&#8221;</p>
<p>Tech President: &#8220;<a href="http://techpresident.com/blog-entry/election-2012-its-not-facebook-its-data-stupid">Election 2012: It&#8217;s Not Facebook. It&#8217;s The Data, Stupid.</a>&#8221;</p>
<p>Slate: &#8220;<a href="http://www.slate.com/articles/news_and_politics/victory_lab/2012/01/project_dreamcatcher_how_cutting_edge_text_analytics_can_help_the_obama_campaign_determine_voters_hopes_and_fears_.html">Project Dreamcatcher: How cutting-edge text analytics can help the Obama campaign determine voters’ hopes and fears.</a>&#8221;</p>
<p>Guardian: &#8220;<a href="http://www.guardian.co.uk/world/2012/feb/17/obama-digital-data-machine-facebook-election">Obama, Facebook and the power of friendship: the 2012 data election</a>&#8221;</p>
<p>Campaigns &amp; Elections: &#8220;<a href="http://www.campaignsandelections.com/print/315777/big-data-is-a-big-factor-in-2012-by-brett-bell-.thtml">Big Data Is A Big Factor in 2012</a>.&#8221;</p>
<p>Guardian: &#8220;<a href="http://www.guardian.co.uk/world/2012/apr/04/obama-campaign-romney-trails-november-election">Obama campaign leaves Mitt Romney trailing as focus shifts to November</a>&#8221;</p>
<p>Politico: &#8220;<a href="http://dyn.politico.com/printstory.cfm?uuid=0223FCC7-01F8-4999-90BD-644EEB503F1B">Obama&#8217;s Data Advantage</a>&#8221;</p>
<p>Guardian: &#8220;<a href="http://www.guardian.co.uk/world/2012/jun/14/romney-campaign-digital-data-obama">Mitt Romney&#8217;s campaign closing gap on Obama in digital election race</a>&#8221;</p>
<p>Tech President: &#8220;<a href="http://techpresident.com/news/22345/pdf12-zac-moffatt-talks-digital-strategy">Zac Moffatt Talks Digital Strategy</a>&#8221;</p>
<p>New York Times: &#8220;<a href="http://www.nytimes.com/2012/10/14/us/politics/campaigns-mine-personal-lives-to-get-out-vote.html?pagewanted=all">Campaigns Mine Personal Lives to Get Out Vote</a>&#8221;</p>
<p>Slate: &#8220;<a href="http://www.slate.com/articles/news_and_politics/victory_lab/2012/10/obama_campaign_its_secret_for_getting_voters_to_the_polls_.html">Are You Going To Vote? Do You Promise?</a>&#8221;</p>
<p>Slate: &#8220;<a href="http://www.slate.com/articles/news_and_politics/victory_lab/2012/10/obama_s_secret_weapon_democrats_have_a_massive_advantage_in_targeting_and.html">Obama does it better</a>.&#8221;</p>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2012/11/2012-election-big-data-journalism-obama-data-campaign.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Tracking the data storm around Hurricane Sandy</title>
		<link>http://strata.oreilly.com/2012/10/real-time-data-storm-in-hurricane-sandy-open-data.html</link>
		<comments>http://strata.oreilly.com/2012/10/real-time-data-storm-in-hurricane-sandy-open-data.html#comments</comments>
		<pubDate>Mon, 29 Oct 2012 17:20:55 +0000</pubDate>
		<dc:creator>Alex Howard</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[crisis data]]></category>
		<category><![CDATA[Gov 2.0]]></category>
		<category><![CDATA[government as a platform]]></category>
		<category><![CDATA[Hurricane Sandy]]></category>
		<category><![CDATA[open data]]></category>
		<category><![CDATA[open government data]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=52720</guid>
		<description><![CDATA[Just over fourteen months ago, social, mapping and mobile data told the story of Hurricane Irene. As a larger, more unusual late October storm churns its way up the East Coast, the people in its path are once again acting &#8230; ]]></description>
				<content:encoded><![CDATA[<p>Just over fourteen months ago, <a href="http://radar.oreilly.com/2011/08/social-mapping-and-crisis-data.html">social, mapping and mobile data told the story of Hurricane Irene</a>. As a larger, more unusual late October storm churns its way up the East Coast, the people in its path are once again acting as <a href="http://radar.oreilly.com/2011/08/social-mapping-and-crisis-data.html">sensors</a> and media, creating crisis data as this &#8220;<a href="http://writerzim.com/2012/10/25/who-coined-the-phrase-frankenstorm/">Frankenstorm</a>&#8221; moves over them.</p>
<p><div id="attachment_52728" class="wp-caption alignright" style="width: 510px"><a href="http://s.radar.oreilly.com/wp-files/5/2012/10/nasa-handout-satellite-image-10-29-2012.jpeg"><img class="size-full wp-image-52728" src="http://s.radar.oreilly.com/wp-files/5/2012/10/nasa-handout-satellite-image-10-29-2012.jpeg" alt="Hurricane Sandy is seen on the east coast of the United States in this NASA handout satellite image taken at 0715 GMT, October 29, 2012." width="500" height="417" /></a><p class="wp-caption-text">[Photo Credit: NASA]</p></div>As citizens look for hurricane information online, government websites are in high demand. In late 2012, media, government, the private sector and citizens will all play an important role in sharing information about what&#8217;s happening and providing help to one another.</p>
<p>In that context, it&#8217;s key to understand that it&#8217;s government weather data, gathered and shared from satellites high above the Earth, that&#8217;s being used by a huge number of infomediaries to forecast, predict and instruct people about what to expect and what to do. In perhaps the most impressive mashup of social and government data now online, an interactive <a href="http://google.org/crisismap/2012-sandy">Google Crisis Map for Hurricane Sandy</a> pictured below <a href="http://io9.com/5955612/google-maps-predict-the-future-of-frankenstorm-in-real-time">predicts the future of the &#8216;Frankenstorm&#8217;</a> in real-time, including a <a href="http://google.org/crisismap/2012-sandy-nyc">NYC-specific version</a>.</p>
<p><a href="http://s.radar.oreilly.com/wp-files/5/2012/10/google-sandy-mashup.jpg"><img class="alignnone size-full wp-image-52745" src="http://s.radar.oreilly.com/wp-files/5/2012/10/google-sandy-mashup.jpg" alt="" width="600" height="373" /></a></p>
<p>If you&#8217;re looking for a great example of <a href="http://strata.oreilly.com/2012/02/data-public-good.html">public data for public good</a>, maps like <a href="http://www.wunderground.com/wundermap/?lat=35.90000&amp;lon=-70.50000&amp;zoom=5&amp;type=hybrid&amp;units=english&amp;rad=0&amp;sat=1&amp;sat.num=8&amp;sat.spd=25&amp;sat.opa=85&amp;sat.gtt1=109&amp;sat.gtt2=109&amp;sat.type=IR4&amp;stormreports=0&amp;svr=0&amp;pix=0&amp;cams=0&amp;tor=1&amp;tor.show=now&amp;riv=0&amp;wxsn=0&amp;ski=0&amp;tfk=0&amp;mm=0&amp;ndfd=0&amp;fire=0&amp;firewfas=0&amp;pep=0&amp;extremes=0&amp;nycevac=0&amp;hurrevac=0&amp;dir=0&amp;hur=1&amp;hur.wr=0&amp;hur.cod=1&amp;hur.fx=1&amp;hur.obs=1&amp;hur.hd=0&amp;hur.mdl=0&amp;hur.gpce=0&amp;hur.img=0&amp;hur.opa=70&amp;hur.opa2=40">Weather Underground&#8217;s interactive</a> show what&#8217;s possible.</p>
<p><a href="http://s.radar.oreilly.com/wp-files/5/2012/10/wunderground-map.jpg"><img class="alignnone size-full wp-image-52741" src="http://s.radar.oreilly.com/wp-files/5/2012/10/wunderground-map.jpg" alt="" width="600" height="364" /></a></p>
<p><span id="more-52720"></span></p>
<p>Matt Lira, the director of digital for the Majority Leader in the U.S. House of Representatives, made an important, clear connection between open government, weather data and a <a href="http://hint.fm/wind/">gorgeous wind visualization</a> that has been getting passed around today.</p>
<blockquote class="twitter-tweet"><p>This dynamic wind map is an example of how open government data can be utilized in effective &amp; creative ways: <a title="http://hint.fm/wind/" href="http://t.co/0ZVCainI">hint.fm/wind/</a> <a href="https://twitter.com/search/%23opengov">#opengov</a></p>
<p>— Matt Lira (@MattLira) <a href="https://twitter.com/MattLira/status/262921756514324480">October 29, 2012</a></p></blockquote>
<p>In the context of the utility of weather data, it will be interesting to see if Congress takes action to fund <a href="http://www.pacificnewscenter.com/index.php?option=com_content&amp;view=article&amp;id=28514:cnmi-congressman-sablan-renews-call-for-weather-satellite-replacements&amp;catid=45:guam-news&amp;Itemid=156">weather satellite replacements</a>.</p>
<p>In New York City, as the city&#8217;s websites faced heavy demand when residents went to its <a href="http://gis.nyc.gov/oem/he/search.htm">hurricane evacuation finder</a> on Sunday, residents could also consult WNYC&#8217;s beautiful evacuation map. (Civiguard also activated an instant <a href="http://civiguard.com/sandy">evacuation zone checker</a> for smartphones and modern browsers.) WNYC <a href="http://strata.oreilly.com/2012/05/profile-of-the-data-journalist-10.html">data news editor</a> John Keefe is responsible for the map below, which puts the city&#8217;s open government data into action.</p>
<p><a href="http://s.radar.oreilly.com/wp-files/5/2012/10/WYNC-evac-zones-map.jpg"><img class="alignnone size-full wp-image-52746" src="http://s.radar.oreilly.com/wp-files/5/2012/10/WYNC-evac-zones-map.jpg" alt="" width="600" height="311" /></a></p>
<p>By releasing open data for uses in these apps, New York City and the U.S. federal government are acting as a platform for public media, civic entrepreneurs and nonprofits to enable people to help themselves and one another at a crucial time. When natural disasters loom, public data feeds can become critical infrastructure.</p>
<p>For one more example of how this looks in practice, look at <a href="http://project.wnyc.org/storm-surge/">WNYC&#8217;s storm surge map</a> for New York and New Jersey.</p>
<p><a href="http://s.radar.oreilly.com/wp-files/5/2012/10/sandy-storm-surge-flood-zones.jpg"><img class="alignnone size-full wp-image-52747" src="http://s.radar.oreilly.com/wp-files/5/2012/10/sandy-storm-surge-flood-zones.jpg" alt="" width="600" height="501" /></a></p>
<p>If you&#8217;re a coder interested in working with the tech community, MIT Media Lab Director <a href="http://twitter.com/joi">Joi</a> Ito is helping to coordinate #HurricaneHackers working on <a href="http://bit.ly/hurricanehackers-gdoc">projects and resources for Hurricane Sandy</a>. The group has made a <a href="http://bit.ly/hurricanehackers-sandytimeline-test">timeline of events</a> and a <a href="http://sandystreamsmap.tirl.org">list of livestreams</a>, and is aggregating links to official data and social streams, like <a href="http://instacane.com/">Instacane</a>, a site that collects Instagram images about the hurricane.</p>
<h2>Stay safe, keep informed</h2>
<p><a href="http://www.theatlantic.com/technology/archive/2012/10/why-sandy-has-meteorologists-scared-in-4-images/264198/">Hurricane Sandy has meteorologists scared</a>, and for good reason. The federal government is providing information on Hurricane Sandy at <a href="http://Hurricanes.gov">Hurricanes.gov</a> and <a href="http://www.nhc.noaa.gov/graphics_at3.shtml?5-daynl#contents">NOAA</a> and sharing news and advisories in real-time on the radio, television, mobile devices and online using social media channels like <a href="http://twitter.com">@FEMA</a>.</p>
<p>As the storm comes in, FEMA recommends <a href="http://m.fema.gov">m.fema.gov</a> to mobile users and <a href="http://ready.gov">ready.gov</a> for desktops. The <a href="http://stream.wsj.com/story/world-stream/SS-2-44156/SS-2-82812/">Wall Street Journal</a> and <a href="http://live.reuters.com/Event/Tracking_Storm_Sandy">Reuters</a> are both live-blogging the news. Like WNYC, the <a href="http://hosted.ap.org/interactives/2012/superstorm/">Associated Press</a> and Reuters used weather data to populate interactive <a href="http://www.reuters.com/subjects/hurricanes/hurricane-tracker">Hurricane Tracker</a> maps.</p>
<p>People in the path of the storm can download smartphone apps from the <strong>Red Cross</strong>: <a href="http://rdcrss.org/R4gjDV">http://rdcrss.org/R4gjDV</a> and from <strong>FEMA</strong> on Android: <a href="http://bit.ly/ToDgqB">http://bit.ly/ToDgqB</a>, iOS: <a href="http://bit.ly/sNZNJI">http://bit.ly/sNZNJI</a> or BlackBerry: <a href="http://bit.ly/wUiqHL">http://bit.ly/wUiqHL</a>.</p>
<p>If you do not have a smartphone, save 43362 (4FEMA) to your mobile phone and charge it up. If, after #Sandy, you cannot return home and have immediate housing needs, text SHELTER + ZIP code to 43362.</p>
<p><strong>UPDATE</strong>: Bob Rudis (<a href="http://twitter.com/hrbrmstr">@hrbrmstr</a>) wrote in to share his <a href="https://github.com/hrbrmstr/sandy">R code for live tracking</a> on GitHub, which he <a href="http://rud.is/b/">blogged about here</a>. David Smith has already put the code to use in <a href="http://blog.revolutionanalytics.com/2012/10/tracking-hurricane-sandy-with-open-data-and-r.html">tracking Hurricane Sandy with open data and R</a> at Revolution Analytics.</p>
<p>If you have more examples of data, maps, apps, code or services relevant to the hurricane or its aftermath, please share them in the comments or write to <a href="mailto:alex@oreilly.com">alex@oreilly.com</a>. And if you&#8217;re in the path of the storm, please stay safe.</p>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2012/10/real-time-data-storm-in-hurricane-sandy-open-data.html/feed</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>A startup takes on &#8220;the paper problem&#8221; with crowdsourcing and machine learning</title>
		<link>http://strata.oreilly.com/2012/10/captricity-digitizing-documents-crowdsource.html</link>
		<comments>http://strata.oreilly.com/2012/10/captricity-digitizing-documents-crowdsource.html#comments</comments>
		<pubDate>Fri, 05 Oct 2012 15:02:34 +0000</pubDate>
		<dc:creator>Alex Howard</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[character recognition]]></category>
		<category><![CDATA[crowdsource]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data company]]></category>
		<category><![CDATA[data product]]></category>
		<category><![CDATA[disruption]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[OCR]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=52231</guid>
		<description><![CDATA[Unlocking data from paper forms is the problem that optical character recognition (OCR) software is supposed to solve. Two issues persist, however. First, the hardware and software involved are expensive, creating challenges for cash-strapped nonprofits and government. Second, all of &#8230; ]]></description>
				<content:encoded><![CDATA[<p>Unlocking data from paper forms is the problem that optical character recognition (OCR) software is supposed to solve. Two issues persist, however. First, the hardware and software involved are expensive, creating challenges for cash-strapped nonprofits and government. Second, <em>all</em> of the information on a given document is scanned into a system, including sensitive details like Social Security numbers and other personally identifiable information. This is a particularly difficult issue with respect to health care or <a href="http://radar.oreilly.com/2010/09/applying-open-government-to-th.html">bringing open government to courts</a>: privacy by obscurity will no longer apply.</p>
<p>The process of converting paper forms into structured data still hasn&#8217;t been significantly disrupted by the rapid growth of the Internet, distributed computing and mobile devices. Fields that range from research science to medicine to law to education to consumer finance to government all need better, cheaper bridges from the analog to the digital sphere.</p>
<p><a href="http://s.radar.oreilly.com/wp-files/5/2012/10/captricity.jpg"><img class="size-medium wp-image-52232 alignright" src="http://s.radar.oreilly.com/wp-files/5/2012/10/captricity-300x281.jpg" alt="" width="300" height="281" /></a>Enter <a href="http://captricity.com">Captricity</a>. The startup, which was co-founded by Jeff J. Lin and <a href="http://strataconf.com/rx2012/public/schedule/speaker/140775">Kuang Chen</a>, has its roots in the fieldwork on rural health Chen did as part of his PhD program. </p>
<p>&#8220;I was looking at the information systems that were available to these low-resource organizations,&#8221; Chen said in a recent phone interview. &#8220;I saw that they&#8217;re very much bound in paper. There&#8217;s actually a lot of efforts to modernize the infrastructure and put in mobile phones. Now that there&#8217;s mobile connectivity, you can run a health clinic on solar panels and long distance Wi-Fi. At the end of the day, however, business processes are still on paper because they had to be essentially fail-proof. Technology fails all the time. From that perspective, paper is going to stick around for a very long time. If we&#8217;re really going to tackle the challenge of the availability of data, we shouldn&#8217;t necessarily be trying to change the technology infrastructure first &mdash; bringing mobile phones and iPads to where there&#8217;s paper &mdash; but really to start with solving the paper problem.&#8221;</p>
<p>When Chen saw that data entry was a chokepoint for digitizing health indicators, he started working on developing a better, cheaper way to ingest data on forms.<span id="more-52231"></span></p>
<p>Captricity&#8217;s approach to that paper problem is fascinating, as I saw in an online demo of the technology. They&#8217;ve found a way to use the Internet to quickly and cheaply digitize handwritten forms into structured data. </p>
<p>&#8220;Our target user is an office admin who&#8217;s really good at, let&#8217;s say, Excel and doing mail merges in Word,&#8221; said Chen. &#8220;We make sure that person or a school teacher or an existing database administrator can just go and do it themselves.&#8221;</p>
<p>The process looks relatively simple for the end user, at least on the surface. Scan a form and then outline fields on it to create a template. When you subsequently scan a batch of forms, the system breaks up the designated fields into images that crowdsourced workers can identify online. The fields from each form are then output as structured data, exportable as a CSV file or into Google documents. Each digitized document represents a row. The provenance of the data is preserved, with the original image connected to a given cell. Most jobs take 10 to 20 minutes. (The demo I saw took 11 minutes or so.) </p>
<p>&#8220;Under the covers, the approach is to take a page and cut it up into little pieces,&#8221; said Chen. &#8220;We reorganize each little piece and then we send it out to workers on the Internet. They don&#8217;t see the context of anything else. They give us a set of answers that we make sure are correct by essentially doing it in triplicate. Then we take a small set of these triple-verified values and build essentially a machine-learning vision engine that predicts the value for that box.&#8221;</p>
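<p>The triplicate check Chen describes can be pictured as a simple majority vote over independent transcriptions of the same cropped field. What follows is only an illustrative sketch in Python, not Captricity's implementation: the function names, the sample field names and values, and the two-of-three agreement threshold are all assumptions made for the example.</p>

```python
from collections import Counter


def verify_field(answers, min_agreement=2):
    """Return the agreed value for one form field, or None if workers disagree.

    `answers` holds the transcriptions that independent workers gave for the
    same cropped field image. The two-of-three threshold is an assumption.
    """
    # Normalize whitespace and case so trivial differences do not split votes.
    normalized = [a.strip().lower() for a in answers]
    if not normalized:
        return None
    value, count = Counter(normalized).most_common(1)[0]
    return value if count >= min_agreement else None


def digitize_form(fields):
    """Turn one scanned form into one row of structured data."""
    return {name: verify_field(answers) for name, answers in fields.items()}


# Hypothetical transcriptions from three workers per field.
form = {
    "patient_age": ["42", "42", "47"],
    "town": ["Springfield", "springfield ", "Springfield"],
    "visit_date": ["3/1", "8/1", "5/1"],
}
row = digitize_form(form)
print(row)
```

<p>In this sketch, a field on which no two workers agree comes back empty rather than guessed, which is one plausible way to keep unverified values out of the exported row.</p>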
<p>Captricity has effectively positioned itself as a &#8220;digitization-as-a-service&#8221; provider. Instead of buying equipment, organizations can pay as they go.</p>
<p>&#8220;It&#8217;s a place on the Internet, like a tap, that can turn on digitization,&#8221; Chen said. &#8220;You pay for exactly how much you use. You don&#8217;t have to spend $300,000 and buy a very high-speed scanner and self-service on premise to get what you need to get done. What we have runs on Amazon AWS and uses its Elastic Computing Cloud to crowdsource labor from Mechanical Turk. We are in talks to use other more private and specialized crowds as well.&#8221; </p>
<p>The retail cost to use Captricity is about $0.20 per page, after the first 25 pages, which are free. Larger volume customers will pay a bit more, based on the type of data and the volume of data. By comparison, OCR-only solutions are in the ballpark of $0.01 to $0.03 per page, said Chen, but require an expensive software license and equipment.</p>
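<p>As a rough illustration of the retail pricing quoted above, the short Python sketch below estimates a bill from a page count. The function name and defaults are assumptions for the example, and it ignores the volume pricing mentioned for larger customers.</p>

```python
def estimated_cost(pages, free_pages=25, price_per_page=0.20):
    """Estimate the retail cost in dollars for digitizing `pages` pages.

    Defaults mirror the figures quoted in the article: the first 25 pages
    are free and each later page costs about $0.20.
    """
    billable = max(0, pages - free_pages)
    return round(billable * price_per_page, 2)


print(estimated_cost(20))
print(estimated_cost(1000))
```

<p>Under these assumptions, a 1,000-page job has 975 billable pages, or about $195.</p>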
<p>After the recent launch of an <a href="http://captricity.com/2012/09/mobile-capture-application/">iOS app</a>, Captricity went mobile. It&#8217;s now possible to scan using an iPhone, iPad or iPod Touch, integrate with an existing template and then sync the information to a Dropbox account. An Android app is in development and will be launched early next year. They&#8217;ve also rolled out integration with Salesforce in the mobile app, along with Box.com and Constant Contact.</p>
<p>The startup, however, holds the potential to be something bigger than just a better mobile digitization provider: Captricity&#8217;s application programming interface (API) will help developers tap into its digitized data more easily.</p>
<p>&#8220;We have an API that has been in private beta for about a month-and-a-half now,&#8221; said Chen. &#8220;We extract away complexity and allow an application developer to just say, &#8216;We have forms. Let&#8217;s enter in the data. Go.&#8217; And you&#8217;re up and running in a day.&#8221; </p>
<p>And now, developers can share that information. On Tuesday, Captricity announced a new <a href="http://captricity.com/opendata">open data platform</a> that will publicly share digitized, structured datasets. The first dataset that they&#8217;ve published comes from the U.S. Census, as a demonstration of the concept.</p>
<p>This technology is of particular interest to health care, which is full of forms. That&#8217;s why Chen will join representatives from West Health and ElationEMR at this month&#8217;s StrataRx Conference to talk about the <a href="http://strataconf.com/rx2012/public/schedule/detail/26275?intcmp=il-strata-strx12-captricity-interview">untapped potential of structured data on paper</a>. The ability for users to only scan certain parts of forms &mdash; which would enable fields containing personally identifiable information to be left out &mdash; could be a key component of health data infrastructure. There will be special challenges, given <a href="http://www.hhs.gov/ocr/privacy/hipaa/understanding/summary/index.html">HIPAA</a> rules that govern patient data, but selective digitization might be a way to address them.</p>
<p>&#8220;We have fairly strict and cautious controls over this process, so it&#8217;s not automatic, but rather handheld, to make sure that only data intended to be public becomes public,&#8221; said Chen.</p>
<p>The idea of breaking up jobs for many people to work on online has been around for a while, of course, with <a href="http://www.wired.com/wired/archive/14.06/crowds.html">crowdsourcing</a> breaking big in the middle of the last decade. The innovation here is applying it to something that machines still can&#8217;t do as well as humans &mdash; reading handwriting &mdash; and then solving a problem that has been a real chokepoint for digitizing data. Even if this particular startup doesn&#8217;t end up taking off, they&#8217;ve approached a critical need in a way that has implications for multiple industries. For the <a href="http://www.slate.com/articles/technology/future_tense/2012/09/brightscope_castlight_new_businesses_built_on_open_government_data_.html">data economy to grow, it will need more feedstock</a>.</p>
<p>&#8220;This roughly falls under this flag of human-guided machine learning,&#8221; said Chen. &#8220;I think that with the advent of crowdsourcing, this is going to be a really powerful force in improving what machine learning algorithms can do and the types of problems it can solve. In six months &mdash; plus the time I spent on my PhD &mdash; we built a production system to do OCR that is better than any OCR system out there today, by probably an order of magnitude. I have the most respect for the computer vision researchers that came up with those OCR algorithms. It&#8217;s just that we solved a different problem than they did. This is the approach that we&#8217;re taking to solve the other problems of this domain as well.&#8221;</p>
<div style="float: left; border-top: thin gray solid; border-bottom: thin gray solid; padding: 20px; margin: 20px 2px; clear: both;"><a href="https://en.oreilly.com/stratany2012/public/regwith/RADAR20?intcmp=il-strata-stny12-captricity-interview"><img style="float: left; border: none; padding-right: 10px;" src="http://cdn.oreilly.com/radar/images/promos/2012-strata-ny-promo.gif" /></a><a href="https://en.oreilly.com/stratany2012/public/regwith/RADAR20"><strong>Strata Conference + Hadoop World</strong></a> &mdash;  The O&#8217;Reilly Strata Conference, being held Oct. 23-25 in New York City, explores the changes brought to technology and business by big data, data science, and pervasive computing. This year, Strata has joined forces with Hadoop World.</p>
<p><a href="https://en.oreilly.com/stratany2012/public/regwith/RADAR20?intcmp=il-strata-stny12-captricity-interview"><strong>Save 20% on registration with the code RADAR20</strong></a></div>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2012/10/captricity-digitizing-documents-crowdsource.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>DataMarket charges up with open energy data</title>
		<link>http://strata.oreilly.com/2012/10/datamarket-charges-up-with-open-energy-data.html</link>
		<comments>http://strata.oreilly.com/2012/10/datamarket-charges-up-with-open-energy-data.html#comments</comments>
		<pubDate>Mon, 01 Oct 2012 13:55:43 +0000</pubDate>
		<dc:creator>Alex Howard</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[data market]]></category>
		<category><![CDATA[energy]]></category>
		<category><![CDATA[energy data]]></category>
		<category><![CDATA[Gov 2.0]]></category>
		<category><![CDATA[open data]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=52194</guid>
		<description><![CDATA[Hjalmar Gislason commented earlier this year that open data has been all about apps. In the future, it should be about much more than consumer-facing tools. &#8220;Think also about the less sexy cases that can help a few people save &#8230; ]]></description>
				<content:encoded><![CDATA[<p>Hjalmar Gislason commented earlier this year that <a href="http://blog.datamarket.com/2012/05/22/why-open-data-is-all-about-apps-and-why-it-shouldnt/">open data has been all about apps</a>. In the future, it should be about much more than consumer-facing tools. &#8220;Think also about the less sexy cases that can help a few people save us millions of dollars in aggregate, generate new insights and improve decision making on various levels,&#8221; he suggested. </p>
<p>Today, the founder and CEO of DataMarket told the audience of the first White House Energy Datapalooza that his company would make energy data more discoverable and usable. In doing so, DataMarket will be tapping into an <a href="http://techcrunch.com/2012/09/30/data-markets-the-emerging-data-economy/">emerging data economy</a> of <a href="http://www.slate.com/articles/technology/future_tense/2012/09/brightscope_castlight_new_businesses_built_on_open_government_data_.html">businesses using open government data</a>.</p>
<p>“We are honored to have been invited to take part in this fantastic initiative,&#8221; said Gislason in a prepared statement. &#8220;At DataMarket we focus on doing one thing well: aggregating vast amounts of heterogeneous data to help business users with their planning and decision-making. Our new energy portal applies this know-how to the US government’s energy data, for the first time enabling these valuable resources to be searched, visualized and shared through one gateway and in combination with other domestic and worldwide open data sources.&#8221;</p>
<p><a href="http://energy.datamarket.com/">Energy.datamarket.com</a>, which won&#8217;t go live officially until mid-October, will offer search for 10,000 data sets, 2 million time series and 50 million energy facts. DataMarket.com is based upon data from thirteen data providers, including the U.S. Department of Energy&#8217;s Energy Information Administration (EIA), Oak Ridge National Laboratory, Energy Efficiency and Renewable Energy program, National Renewable Energy Laboratory, the Environmental Protection Agency (EPA), the Bureau of Transportation Statistics, the World Bank and the United Nations.</p>
<p>Last week, I interviewed Gislason about his company and why they&#8217;re focusing on energy data.</p>
<p><span id="more-52194"></span></p>
<h2>What itch were you scratching when you founded DataMarket?</h2>
<p><strong>Gislason:</strong> When we wanted the best data to base sales plans and decision making upon, we would search on Google. We&#8217;d go to websites, where we found heterogeneous Excel files and Power Points. We had to spend time cleaning it up, manipulating it, before we could get a trend line. Wouldn&#8217;t it be great if there were a &#8220;Google for numbers?&#8221; We aggregate quantifiable data into a single database that enables comparisons. Most of the business is licensing the technology to other companies. It&#8217;s used by more than 40,000 people a month. </p>
<h2>Why add energy data?</h2>
<p><strong>Gislason:</strong> DataMarket is a &#8216;nice to have&#8217; for many people but not a must-have for anybody. We realized that we need to go for narrower audiences, starting with a more vertical approach. Energy is the first target. The Energy Datapalooza is a great venue to kick off this first initiative.</p>
<h2>What service are you providing?</h2>
<p><strong>Gislason:</strong> This is an aggregation service, mostly based upon public data. In the energy space, a key data provider is the Energy Information Administration (EIA). They have their own systems and way of publishing. EIA has 8 different systems and no unified way to search through it. That&#8217;s the service that we&#8217;re selling: aggregating, making it super easy to search, download, and compare. All the providers out there have been super helpful in helping.</p>
<h2>What value are you adding to the open data?</h2>
<p><strong>Gislason:</strong> Lots of data has been made available already but there are two issues: discoverability and usability. If you&#8217;re a person that uses energy data, you have to be able to find it. You also have to spend a lot of time to clean it up. Now you can use our technology. We add value on top of it: we normalize it and provide services. We&#8217;re not taking away anything. We&#8217;re adding value for tens of thousands of people using this data every day.</p>
<h2>What are the value added services?</h2>
<p><strong>Gislason:</strong> The magic actually happens in search. Take &#8220;oil production in Colorado,&#8221; with 16 data sets. You can visualize data, do charts or export the data. You can download the chart or the data view. You can use the API. You can connect in R. Data from the EIA, EPA, or UN all comes through the same interface.</p>
<p>You can get all these data sources for free. The business case is to gather a large audience by having data for free. The business model is a subscription service, on an annual basis. Pricing is not set yet. For professional use, it will probably be a few thousand dollars a year. We make a lot more data usable. It&#8217;s not initially commercially available but people can sign up. There will be a free 2-week trial for anybody.</p>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2012/10/datamarket-charges-up-with-open-energy-data.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>