<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Strata &#187; Edd Dumbill</title>
	<atom:link href="http://strata.oreilly.com/edd/feed" rel="self" type="application/rss+xml" />
	<link>http://strata.oreilly.com</link>
	<description>Making Data Work</description>
	<lastBuildDate>Sun, 16 Jun 2013 16:18:16 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Big data, cool kids</title>
		<link>http://strata.oreilly.com/2013/05/big-data-cool-kids.html</link>
		<comments>http://strata.oreilly.com/2013/05/big-data-cool-kids.html#comments</comments>
		<pubDate>Tue, 14 May 2013 16:38:13 +0000</pubDate>
		<dc:creator>Edd Dumbill</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Events]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[strata]]></category>
		<category><![CDATA[strataconf]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=57360</guid>
		<description><![CDATA[The big data world is a confusing place. We’re no longer in a market dominated mostly by relational databases, and the alternatives have multiplied in a baby boom of diversity. These child prodigies of the data scene show great promise &#8230; ]]></description>
				<content:encoded><![CDATA[<p>The big data world is a confusing place. We’re no longer in a market dominated mostly by relational databases, and the alternatives have multiplied in a baby boom of diversity.</p>
<div id="attachment_57364" class="wp-caption alignright" style="width: 160px"><a href="http://s.radar.oreilly.com/wp-files/5/2013/05/schoolyard2.jpg"><img class="size-thumbnail wp-image-57364" alt="My data is bigger than yours." src="http://s.radar.oreilly.com/wp-files/5/2013/05/schoolyard2-150x150.jpg" width="150" height="150" /></a><p class="wp-caption-text">My data is bigger than yours.</p></div>
<p>These child prodigies of the data scene show great promise but spend a lot of time knocking each other around in the schoolyard. Their egos can sometimes be too big to accept that everybody has their place, and eyeball-seeking media certainly doesn’t help.</p>
<p><strong>POPULAR KID:</strong> Look at me! Big data is the hotness!<br />
<strong>HADOOP:</strong> My data’s bigger than yours!<br />
<strong>SCIPY:</strong> Size isn’t everything, Hadoop! The bigger they come, the harder they fall. And aren’t you named after a toy elephant?<br />
<strong>R:</strong> Backward sentences mine be, but great power contains large brain.<br />
<strong>EVERYONE:</strong> Huh?<br />
<strong>SQL:</strong> Oh, so you all want to be friends again now, eh?!<br />
<strong>POPULAR KID:</strong> Yeah, what SQL said! Nobody really needs big data; it’s all about small data, dummy.</p>
<p><span id="more-57360"></span></p>
<p>The fact is that we’re fumbling toward the adolescence of big data tools, and we’re at an early stage of understanding how data can be used to create value and increase the quality of service people receive from government, business and health care. Big data is trumpeted in mainstream media, but many businesses are better advised to take baby steps with small data.</p>
<p>Data skeptics are not without justification. Our use of “small data” hasn’t exactly worked out uniformly well so far, crude numbers often being misused either knowingly or otherwise. For example, over-reliance by bureaucrats on the results of testing in schools is shaping educational institutions toward a tragically homogeneous mediocrity.</p>
<p>The promise and the gamble of big data is this: that we can advance past the primitive quotas of today’s small data into both a sophisticated statistical understanding of an entire system and insight that focuses down to the level of an individual. Data gives us both telescope and microscope, in detail we’ve never had before.</p>
<p>Inside this tantalizing vision lie many of the debates in today’s data world: the need for highly skilled data scientists to effect this change, and the worry that we’ll inadvertently enslave ourselves to Big Brother, even with the best of intentions.</p>
<p>So, as the data revolution moves forward, it’s important to take the long view. The ferment of tools and job titles and algorithms is significant, but ultimately it’s background to our larger purposes as people, businesses and government. That’s one reason why, at O’Reilly, we’ve taken the motto “Making Data Work” for Strata. Data, not technology, is the heartbeat of our world because it relates directly to ourselves and the problems we want to solve.</p>
<p>This is also the reason that the <a href="http://strataconf.com/stratany2013/">Strata and Hadoop World</a> conferences take a broad view of the subject: ranging from the business topics to the tools and data science. If you talk to Hadoop’s most seasoned advocates, they don’t speak only about the tech; they talk about the problems they’re able to solve. The tools alone are never enough; the real enabler is the framework of people and understanding in which they’re used.</p>
<p>Our mission is to help people make sense of the state of the data world and use this knowledge to become both more competitive and more creative. We believe that’s best served both by creating a context in which to think about our use of data and by serving the growing specialist communities in data.</p>
<p>Enjoy the noise and the energy from the growing data ecosystem, but keep your eyes on the problems you want to solve.</p>
<p><em>The Strata and Hadoop World <a href="http://strataconf.com/stratany2013/public/cfp/264">Call for Proposals</a> is open until midnight EDT, Thursday May 16.</em></p>
<h5><em>This post was originally published on <a href="http://radar.oreilly.com/2013/05/big-data-cool-kids.html">Radar</a>.</em></h5>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/05/big-data-cool-kids.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Get the best start for data in your business</title>
		<link>http://strata.oreilly.com/2013/02/get-the-best-start-for-data-in-your-business.html</link>
		<comments>http://strata.oreilly.com/2013/02/get-the-best-start-for-data-in-your-business.html#comments</comments>
		<pubDate>Wed, 06 Feb 2013 19:32:32 +0000</pubDate>
		<dc:creator>Edd Dumbill</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Big Data in Enterprise IT]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data business]]></category>
		<category><![CDATA[enterprise]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[strata]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=54441</guid>
		<description><![CDATA[In a world where technology and business are ever more intertwined, IT leaders aspire to key roles in their organizations. Sadly, industry conferences can lag behind, assuming IT is all about making the right buying decisions. Not so at Strata. Our &#8230; ]]></description>
				<content:encoded><![CDATA[<p>In a world where technology and business are ever more intertwined, IT leaders aspire to key roles in their organizations. Sadly, industry conferences can lag behind, assuming IT is all about making the right buying decisions.</p>
<p>Not so at Strata.</p>
<div id="attachment_54459" class="wp-caption alignright" style="width: 330px"><a href="http://www.flickr.com/photos/imuttoo/2631466945/"><img class="size-full wp-image-54459" alt="Turning data into focused advantage requires strategy and planning over the whole business" src="http://s.radar.oreilly.com/wp-files/5/2013/02/2631466945_de1bbc2cfd_n.jpg" width="320" height="213" /></a><p class="wp-caption-text">Turning data into focused advantage requires strategy and planning over the whole business.<br />Photo credit: <a href="http://www.flickr.com/photos/imuttoo/2631466945/" target="_new">Ian Muttoo</a></p></div>
<p><a href="http://strataconf.com/strata2013/public/content/enterprise-it">Our approach</a> is to take a view of data for business that centers around the problems you need to solve. The excitement around big data isn&#8217;t really about large volumes of data; it&#8217;s about <b>smart use of data</b>. It&#8217;s about using data to make your products better, help you be significantly more efficient, and create new products and businesses.</p>
<p>Getting the most from big data and data science is a lot more than a software choice. The business aims come first, along with a good understanding of the problems you want to solve. Then you need to understand the capabilities of the technology and where data science can be best applied. Finally, you need to know how to run successful data projects, and how to hire and manage data teams.</p>
<p>Working with analytics and BI expert <a href="http://strataconf.com/strata2013/public/schedule/speaker/1305">Mark Madsen</a>, I&#8217;ve compiled a day-long program at Strata called <a href="http://strataconf.com/strata2013/public/content/enterprise-it">Big Data in Enterprise IT</a> that will take you through big data strategy, the issues of managing data, and how data science can be used effectively in your organization.<span id="more-54441"></span></p>
<p>In this day-long session on <a href="http://strataconf.com/strata2013/public/schedule/grid/2013-02-26?schedule=public">February 26</a> in Santa Clara, Calif., we&#8217;ll cover:</p>
<ul>
<li> What&#8217;s different about big data in contrast to traditional BI and data warehousing</li>
<li> How big data contributes to a business, and how we can measure that</li>
<li> Why big data is more than an IT project alone</li>
<li> How headline-grabbing data science like IBM Watson can be made to work for you</li>
<li> Human productivity bottlenecks in data analysis</li>
<li> How to interview and manage data scientists</li>
<li> How to keep data science efforts from derailing</li>
</ul>
<p>Check out the <a href="http://strataconf.com/strata2013/public/content/enterprise-it">full program on the conference website</a>.</p>
<p>Speakers include Jeremy Howard, president and CEO of data science competition company Kaggle, leading data scientists Daniel Tunkelang and Joe Hellerstein, big data and BI consultants Marc Demarest, Krish Krishnan and Mark Madsen, engineering management expert Kate Matsudaira, Teradata&#8217;s EMEA Director of Data Science Duncan Ross, and data science consultants Marck Vaisman and Sean Murphy.</p>
<p><a href="http://strataconf.com/strata2013">Join us in Santa Clara this February</a> to talk about making data work in your business.</p>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/02/get-the-best-start-for-data-in-your-business.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Five big data predictions for 2013</title>
		<link>http://strata.oreilly.com/2013/01/five-big-data-predictions-for-2013.html</link>
		<comments>http://strata.oreilly.com/2013/01/five-big-data-predictions-for-2013.html#comments</comments>
		<pubDate>Wed, 16 Jan 2013 18:00:49 +0000</pubDate>
		<dc:creator>Edd Dumbill</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[2013 data]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[big data architecture]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[predictions]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=54031</guid>
		<description><![CDATA[Here are some of the key big data themes I expect to dominate 2013, and of course will be covering in Strata. Emergence of a big data architecture The coming year will mark the graduation for many big data pilot &#8230; ]]></description>
				<content:encoded><![CDATA[<p>Here are some of the key big data themes I expect to dominate 2013, and of course will be covering in <a href="http://strataconf.com">Strata</a>.</p>
<h2>Emergence of a big data architecture</h2>
<p><a href="http://www.flickr.com/photos/mdpettitt/7802234258/"><img src="http://s.radar.oreilly.com/wp-files/5/2013/01/0113-skyscraper-under-construction.jpg" alt="Leadenhall Building skyscraper Under Construction by Martin Pettitt, on Flickr" width="400" height="263" class="alignright size-full wp-image-54043" /></a>The coming year will mark the graduation for many big data pilot projects, as they are put into production. With that comes an understanding of the practical architectures that work. These architectures will identify:</p>
<ul>
<li> best of breed tools for different purposes, for instance, <a href="https://github.com/nathanmarz/storm/wiki">Storm</a> for streaming data acquisition</li>
<li> appropriate roles for relational databases, Hadoop, NoSQL stores and in-memory databases</li>
<li> how to combine existing data warehouses and analytical databases with Hadoop</li>
</ul>
<p>Of course, these architectures will be in constant evolution as big data tooling matures and experience is gained.</p>
<p>In parallel, I expect to see increasing understanding of where big data responsibility sits within a company&#8217;s org chart. Big data is fundamentally a business problem, and some of the biggest challenges in taking advantage of it lie in the changes required to cross organizational silos and reform decision making.</p>
<p><strong>One to watch:</strong> it&#8217;s hard to move data, so look for a starring architectural role for HDFS for the foreseeable future.<span id="more-54031"></span></p>
<h2>Hadoop is not the only fruit</h2>
<p>Though deservedly the poster child for big data software, Hadoop is not the only way to process big data. Credible competitors are emerging, especially where specialized applications are concerned. For example, the <a href="https://amplab.cs.berkeley.edu/bdas/">Berkeley Data Analytics Stack</a> offers an alternative platform that performs much faster than Hadoop MapReduce for some applications focused on data mining and machine learning.</p>
<p>At the same time, Hadoop is reinventing itself. Hadoop distributions this year will embrace Hadoop 2.0, and in particular <a href="http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/">YARN</a>, a replacement for the batch-oriented MapReduce part of Hadoop that will permit other kinds of workloads to be executed.</p>
<p>For any big data competitor to get traction, it will need both to be open source and to fully support SQL-like access to data, which has become an entry-level requirement over the course of 2012. Hadoop&#8217;s not going anywhere soon, but a pleasing diversity of tools is emerging.</p>
<p><strong>One to watch:</strong> expect to see one or more startups emerging to commercialize the Berkeley Data Analytics Stack.</p>
<h2>Turnkey big data platforms</h2>
<p>Hadoop has a lot of moving parts. A lot. Even with the administration tools from vendors such as Cloudera and Hortonworks, there&#8217;s still significant work required in setting up and running a Hadoop cluster. In our age of cloud services, there&#8217;s no reason that should be so, as demonstrated by Amazon&#8217;s Elastic MapReduce service.</p>
<p>Expect Hadoop vendors to focus on removing system administration overhead over the course of this year, and other companies to provide integrated big data stacks. <a href="http://www.infochimps.com/">InfoChimps</a> offers a big data stack managed as a service within private data centers. For those content to run in the public cloud, <a href="http://www.qubole.com/">Qubole</a> takes the concept one level further, with a turnkey Hadoop and Hive analysis platform that runs on Amazon EC2.</p>
<p><strong>One to watch:</strong> new entries into enterprise Hadoop infrastructure will include <a href="http://www.wandisco.com/">WANdisco</a>, following their <a href="http://www.wandisco.com/altostor">acquisition of AltoStor</a>.</p>
<h2>Data governance comes into focus</h2>
<p>As big data goes into production, it will need to integrate with the rest of the enterprise. Many of the issues concerned with <a href="http://en.wikipedia.org/wiki/Data_governance">data governance</a> will rise to the fore, including:</p>
<ul>
<li> data security</li>
<li> data consistency</li>
<li> reducing data duplication</li>
<li> regulatory compliance</li>
</ul>
<p><strong>One to watch:</strong> data security will become a hot topic this year, including approaches to securing Hadoop and databases with fine-grained security, such as <a href="http://accumulo.apache.org/">Apache Accumulo</a>.</p>
<h2>End-to-end analytic solutions emerge</h2>
<p>There are far more people who want to access analytic capabilities than have the IT resource to set up their own Hadoop clusters and code for them. For many &#8220;big data&#8221; applications, the big data comes from outside sources such as Twitter, or GIS data, but the internal data might be reasonably manageable, such as customer or sales data.</p>
<p>This year will see the growth of SaaS analytics platforms, delivered in the cloud for the swipe of a credit card. Web analytics platforms have pioneered the way here. In 2013, Google intends to expand their analytics offering to address &#8220;<a href="http://cutroni.com/blog/2012/10/29/universal-analytics-the-next-generation-of-google-analytics/">universal analytics</a>,&#8221; a service currently in closed beta-test.</p>
<p>The Frankenstein nature of current big data and BI offerings, most often involving gluing Tableau to an underlying database and accompanying ETL work, means that there&#8217;s a clear gap in the market for compelling end-to-end analytic solutions, especially targeted at marketing applications.</p>
<p><strong>One to watch:</strong> the launch of <a href="http://clearstorydata.com/">ClearStory Data</a> into public availability in 2013 will provide dynamic competition for analytics incumbents.</p>
<div style="float: left;border-top: thin gray solid;border-bottom: thin gray solid;padding: 20px;margin: 20px 2px;clear: both"><a href="http://strataconf.com/strata2013?intcmp=il-strata-stsc13-5-big-data-predictions-2013"><img style="float: left;border: none;padding-right: 10px" src="http://cdn.oreilly.com/radar/images/promos/strataca13-148x178-2.jpg" /></a><a href="http://strataconf.com/strata2013?intcmp=il-strata-stsc13-5-big-data-predictions-2013"><strong>Strata Conference Santa Clara</strong></a> &mdash; Strata Conference Santa Clara, being held Feb. 26-28, 2013 in California, gives you the skills, tools, and technologies you need to make data work today. <a href="http://strataconf.com/strata2013?intcmp=il-strata-stsc13-5-big-data-predictions-2013"><strong>Learn more</strong></a></div>
<p><em>Photo: <a href="http://www.flickr.com/photos/mdpettitt/7802234258/" title="Leadenhall Building skyscraper Under Construction by Martin Pettitt, on Flickr">Leadenhall Building skyscraper Under Construction by Martin Pettitt, on Flickr</a></em></p>
<p><strong>Related</strong></p>
<ul>
<li> <a href="http://strata.oreilly.com/2011/12/5-big-data-predictions-2012.html">Five big data predictions for 2012</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/01/five-big-data-predictions-for-2013.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Why big data is big: the digital nervous system</title>
		<link>http://strata.oreilly.com/2012/08/digital-nervous-system-big-data.html</link>
		<comments>http://strata.oreilly.com/2012/08/digital-nervous-system-big-data.html#comments</comments>
		<pubDate>Wed, 29 Aug 2012 18:30:48 +0000</pubDate>
		<dc:creator>Edd Dumbill</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[@editpick]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[data application]]></category>
		<category><![CDATA[data use]]></category>
		<category><![CDATA[digital nervous system]]></category>
		<category><![CDATA[exoskeleton]]></category>

		<guid isPermaLink="false">http://radar.oreilly.com/?p=51285</guid>
		<description><![CDATA[Where does all the data in &#8220;big data&#8221; come from? And why isn&#8217;t big data just a concern for companies such as Facebook and Google? The answer is that the web companies are the forerunners. Driven by social, mobile, and &#8230; ]]></description>
				<content:encoded><![CDATA[<p>Where does all the data in &#8220;<a title="What is big data?" href="http://radar.oreilly.com/2012/01/what-is-big-data.html">big data</a>&#8221; come from? And why isn&#8217;t big data just a concern for companies such as Facebook and Google? The answer is that the web companies are the forerunners. Driven by social, mobile, and cloud technology, there is an important transition taking place, leading us all to the data-enabled world that those companies inhabit today.</p>
<h2>From exoskeleton to nervous system</h2>
<p>Until a few years ago, the main function of computer systems in society, and business in particular, was as a digital support system. Applications digitized existing real-world processes, such as word-processing, payroll and inventory. These systems had interfaces back out to the real world through stores, people, telephone, shipping and so on. The now-quaint phrase &#8220;paperless office&#8221; alludes to this transfer of pre-existing paper processes into the computer. These computer systems formed a <em>digital exoskeleton</em>, supporting a business in the real world.</p>
<p>The arrival of the Internet and web has added a new dimension, bringing in an era of entirely digital business. Customer interaction, payments and often product delivery can exist entirely within computer systems. Data doesn&#8217;t just stay inside the exoskeleton any more, but is a key element in the operation. We&#8217;re in an era where business and society are acquiring a <em>digital nervous system</em>.</p>
<p><span id="more-51285"></span>As my sketch below shows, an organization with a digital nervous system is characterized by a large number of inflows and outflows of data, a high level of networking, both internally and externally, increased data flow, and consequent complexity.</p>
<p>This transition is why big data is important. Techniques developed to deal with interlinked, heterogeneous data acquired by massive web companies will be our main tools as the rest of us transition to digital-native operation. We see early examples of this, from catching fraud in financial transactions to debugging and <a title="People Analytics: Using Data to Drive HR Strategy and Action" href="http://strataconf.com/jumpstart2011/public/schedule/detail/21341">improving the hiring process in HR</a>, and almost everybody already pays attention to the massive flow of social network information concerning them.</p>
<p class="image-box-580"><img style="margin-bottom: 15px" src="http://radar.oreilly.com/wp-files/2/2012/08/image-11-620x465.jpg" alt="From digital exoskeleton to nervous system" width="580" border="0" /><br /><em>From digital exoskeleton to nervous system.</em></p>
<h2>Charting the transition</h2>
<p>As technology has progressed within business, each step taken has resulted in a leap in data volume. For people looking at big data now, a reasonable question is why, when their business isn&#8217;t Google or Facebook, big data applies to them.</p>
<p>The answer lies in the ability of web businesses to conduct 100% of their activities online. Their digital nervous system easily stretches from the beginning to the end of their operations. If you have factories, shops and other parts of the real world within your business, you&#8217;ve further to go in incorporating them into the digital nervous system.</p>
<p>But &#8220;further to go&#8221; doesn&#8217;t mean it won&#8217;t happen. The drive of the web, social media, mobile, and the cloud is bringing more of each business into a data-driven world. In the UK, the <a href="http://digital.cabinetoffice.gov.uk/">Government Digital Service</a> is unifying the delivery of services to citizens. The results are a radical improvement of citizen experience, and for the first time many departments are able to get a real picture of how they&#8217;re doing. For any retailer, companies such as <a href="https://squareup.com/">Square</a>, <a href="https://businessinsights.americanexpress.com/">American Express</a> and <a href="https://foursquare.com/business/">Foursquare</a> are bringing payments into a social, responsive data ecosystem, liberating that information from the silos of corporate accounting.</p>
<p>What does it mean to have a digital nervous system? The key trait is to make an organization&#8217;s feedback loop entirely digital. That is, a direct connection from sensing and monitoring inputs through to product outputs. That&#8217;s straightforward on the web. It&#8217;s getting increasingly easy in retail. Perhaps the biggest shifts in our world will come as sensors and robotics bring the advantages web companies have now to domains such as <a href="http://bits.blogs.nytimes.com/2011/11/21/81057/">industry</a>, <a href="http://blogs.wsj.com/cio/2012/03/30/union-pacific-using-predictive-software-to-reduce-train-derailments/">transport</a>, and the <a href="http://military.discovery.com/videos/drones-uavs-robots/">military</a>.</p>
<p>The reach of the digital nervous system has grown steadily over the past 30 years, and each step brings gains in agility and flexibility, along with an order of magnitude more data. First, from specific application programs to general business use with the PC. Then, direct interaction over the web. Mobile adds awareness of time and place, along with instant notification. The next step, to the cloud, breaks down data silos and adds storage and compute elasticity. Now, we&#8217;re integrating smart agents, able to act on our behalf, and connections to the real world through sensors and automation.</p>
<h2>Coming, ready or not</h2>
<p>If you&#8217;re not contemplating the advantages of taking more of your operation digital, you can bet your competitors are. As Marc Andreessen <a href="http://online.wsj.com/article/SB10001424053111903480904576512250915629460.html">wrote last year</a>, &#8220;software is eating the world.&#8221; Everything is becoming programmable.</p>
<p>It&#8217;s this growth of the digital nervous system that makes the techniques and tools of big data relevant to us today. The challenges of massive data flows, and the erosion of hierarchy and boundaries, will lead us to the statistical approaches, <a href="http://en.wikipedia.org/wiki/Systems_thinking">systems thinking</a> and machine learning we need to cope with the future we&#8217;re inventing.</p>
<div style="float: left;border-top: thin gray solid;border-bottom: thin gray solid;padding: 20px;margin: 20px 2px;clear: both"><p><a href="https://en.oreilly.com/stratany2012/public/regwith/RADAR20?intcmp=il-strata-stny12-why-big-data-is-big"><img style="float: left;border: none;padding-right: 10px" src="http://cdn.oreilly.com/radar/images/promos/2012-strata-ny-promo.gif" /></a><a href="https://en.oreilly.com/stratany2012/public/regwith/RADAR20?intcmp=il-strata-stny12-why-big-data-is-big"><strong>Strata Conference + Hadoop World</strong></a> &mdash; The O&#8217;Reilly Strata Conference, being held Oct. 23-25 in New York City, explores the changes brought to technology and business by big data, data science, and pervasive computing. This year, Strata has joined forces with Hadoop World.</p>
<p><a href="https://en.oreilly.com/stratany2012/public/regwith/RADAR20?intcmp=il-strata-stny12-why-big-data-is-big"><strong>Save 20% on registration with the code RADAR20</strong></a></p></div>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2012/08/digital-nervous-system-big-data.html/feed</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Now available: &#8220;Planning for Big Data&#8221;</title>
		<link>http://strata.oreilly.com/2012/03/planning-big-data.html</link>
		<comments>http://strata.oreilly.com/2012/03/planning-big-data.html#comments</comments>
		<pubDate>Wed, 14 Mar 2012 13:00:00 +0000</pubDate>
		<dc:creator>Edd Dumbill</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[data product]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[data scientists]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Planning for Big Data]]></category>
		<category><![CDATA[Radar Report]]></category>

		<guid isPermaLink="false">http://blogs.oreilly.com/radar/2012/03/planning-big-data.html</guid>
		<description><![CDATA[&#34;Planning for Big Data&#34; is a new book that helps you understand what big data is, why it matters, and where to get started.  ]]></description>
				<content:encoded><![CDATA[<p><a href="http://shop.oreilly.com/product/0636920025559.do?cmp=il-radar-ebooks-planning-for-big-data-radar-announcement"><img src="http://cdn.oreilly.com/radar/images/posts/0312-planning-big-data-cover.png" border="0" alt="Planning for Big Data" width="201" style="float: right;margin: 3px 0 10px 10px" /></a>Earlier this month, more than 2,500 people came together for the <a href="http://strataconf.com/strata2012">O&#8217;Reilly Strata Conference in Santa Clara, Calif</a>. Though representing diverse fields, from insurance to media and high-tech to healthcare, attendees buzzed with a new-found common identity: they are data scientists. Entrepreneurial and resourceful, combining programming skills with math, data scientists have emerged as a new profession leading the march toward data-driven business.</p>
<p>This new profession rides on the wave of big data. Our businesses are creating ever more data, and as consumers we are sources of massive streams of information, thanks to social networks and smartphones. In this raw material lies much of value: insight about businesses and markets, and the scope to create new kinds of hyper-personalized products and services.</p>
<p>Five years ago, only big business could afford to profit from big data: Walmart and Google, specialized financial traders. Today, thanks to an open source project called <a href="http://hadoop.apache.org/">Hadoop</a>, commodity Linux hardware and cloud computing, this power is in reach for everyone. A data revolution is sweeping business, government and science, with consequences as far reaching and long lasting as the web itself.</p>
<h2>Where to start?</h2>
<p> Every revolution has to start somewhere, and the question for many is &#8220;how can data science and big data help my organization?&#8221; After years of data processing choices being straightforward, there&#8217;s now a diverse landscape to negotiate. What&#8217;s more, to become data driven, you must grapple with changes that are cultural as well as technological.</p>
<p>Our aim with Strata is to help you understand what big data is, why it matters, and where to get started. In the wake of the recent conference, we&#8217;re delighted to announce the publication of our &#8220;<a href="http://shop.oreilly.com/product/0636920025559.do?cmp=il-radar-ebooks-planning-for-big-data-radar-announcement">Planning for Big Data</a>&#8221; book. Available as a <a href="http://shop.oreilly.com/product/0636920025559.do?cmp=il-radar-ebooks-planning-for-big-data-radar-announcement">free download</a>, the book contains the best insights from O&#8217;Reilly Radar authors over the past three months, including myself, Alistair Croll, Julie Steele and Mike Loukides.</p>
<p>&#8220;Planning for Big Data&#8221; is for anybody looking to get a concise overview of the opportunity and technologies associated with big data. If you&#8217;re already working with big data, hand this book to your colleagues or executives to help them better appreciate the issues and possibilities.</p>
<div style="float: left;border-top: thin gray solid;border-bottom: thin gray solid;padding: 20px;margin: 20px 2px;clear: both"><a href="http://www.microsoft.com/sql"><img style="float: left;border: none;padding-right: 10px" src="http://s.radar.oreilly.com/wp-files/2/2011/12/sponsor-ms-sql-server.png" /></a><a href="http://www.microsoft.com/sql"><strong>Microsoft SQL Server</strong></a> is a comprehensive information platform offering enterprise-ready technologies and tools that help businesses derive maximum value from information at the lowest TCO. SQL Server 2012 launches next year, offering a cloud-ready information platform delivering mission-critical confidence, breakthrough insight, and cloud on your terms; find out more at <a href="http://www.microsoft.com/sql">www.microsoft.com/sql</a>.</div>
<p><strong>Related data reports and ebooks</strong></p>
<ul>
<li> <a href="http://radar.oreilly.com/2010/06/what-is-data-science.html">What is data science?</a></li>
<li> <a href="http://radar.oreilly.com/2011/09/building-data-science-teams.html">Building data science teams</a></li>
<li> <a href="http://radar.oreilly.com/2011/09/big-data-now-oreilly-data-ebook.html">The &#8220;Big Data Now&#8221; anthology</a></li>
<li> <a href="http://radar.oreilly.com/2011/09/evolution-of-data-products.html">The evolution of data products</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2012/03/planning-big-data.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Data markets compared</title>
		<link>http://strata.oreilly.com/2012/03/data-markets-survey.html</link>
		<comments>http://strata.oreilly.com/2012/03/data-markets-survey.html#comments</comments>
		<pubDate>Wed, 07 Mar 2012 14:00:00 +0000</pubDate>
		<dc:creator>Edd Dumbill</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[data market]]></category>
		<category><![CDATA[data platform]]></category>
		<category><![CDATA[data product]]></category>
		<category><![CDATA[DataMarket]]></category>
		<category><![CDATA[Factual]]></category>
		<category><![CDATA[Infochimps]]></category>
		<category><![CDATA[Planning for Big Data]]></category>
		<category><![CDATA[Windows Azure Data Marketplace]]></category>

		<guid isPermaLink="false">http://blogs.oreilly.com/radar/2012/03/data-markets-survey.html</guid>
		<description><![CDATA[Strata chair Edd Dumbill  provides an overview of the most mature data markets (Infochimps, Factual, Windows Azure Data Marketplace, DataMarket), and contrasts their different approaches and facilities. ]]></description>
				<content:encoded><![CDATA[<p>
<div style="float: right;margin: 3px 0 10px 10px;padding: 2px 4px 0 15px;border-left: 1px solid #ddd">
<p style="background: #990000;width: 250px;color: #fff;font-size: .9em;font-weight: bold;padding: 2px 0 2px 4px;margin: 0 0 3px 0">Sections</p>
<ul style="margin-top: 10px;padding-right: 4px">
<li> <a href="#sec-1-2">What do marketplaces do?</a></li>
<li> <a href="#sec-1-3">Infochimps</a></li>
<li> <a href="#sec-1-4">Factual</a></li>
<li> <a href="#sec-1-5">Windows Azure Data Marketplace</a></li>
<li> <a href="#sec-1-6">DataMarket</a></li>
<li> <a href="#compared">Data markets compared</a></li>
<li> <a href="#sec-1-8">Other data suppliers</a></li>
</ul></div>
<p>   The sale of data is a venerable business, and has existed since the middle of the 19th century, when Paul Reuter began providing telegraphed stock exchange prices between Paris and London, and New York newspapers founded the Associated Press.</p>
<p> The web has facilitated a blossoming of information providers. As the ability to discover and exchange data improves, the need to rely on aggregators such as Bloomberg or Thomson Reuters is declining. This is a good thing: the business models of large aggregators do not readily scale to web startups, or casual use of data in analytics. </p>
<p> Instead, data is increasingly offered through online marketplaces: platforms that host data from publishers and offer it to consumers. This article provides an overview of the most mature data markets, and contrasts their different approaches and facilities. </p>
<h2 id="sec-1-2">What do marketplaces do?</h2>
<p> Most of the consumers of data from today&#8217;s marketplaces are developers. By adding another dataset to your own business data, you can create insight. To take an example from web analytics: by mixing an IP address database with the logs from your website, you can understand where your customers are coming from, then if you add demographic data to the mix, you have some idea of their socio-economic bracket and spending ability. </p>
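<p> The web analytics example above can be sketched in a few lines of Python. This is a minimal illustration rather than any marketplace&#8217;s API: the <code>IP_DATABASE</code> and <code>WEB_LOG</code> values, and the <code>locate</code> and <code>visits_by_region</code> helpers, are hypothetical stand-ins for a purchased IP intelligence dataset and your own server logs. </p>

```python
import ipaddress
from collections import Counter

# Hypothetical IP intelligence rows: (network, region).
# A dataset bought from a marketplace would be far larger.
IP_DATABASE = [
    (ipaddress.ip_network("192.0.2.0/24"), "Paris"),
    (ipaddress.ip_network("198.51.100.0/24"), "London"),
]

# Hypothetical web log entries: (client IP, requested path).
WEB_LOG = [
    ("192.0.2.17", "/pricing"),
    ("198.51.100.5", "/"),
    ("192.0.2.200", "/signup"),
    ("203.0.113.9", "/"),
]


def locate(ip_text):
    """Return the region whose network contains the address, or 'unknown'."""
    address = ipaddress.ip_address(ip_text)
    for network, region in IP_DATABASE:
        if address in network:
            return region
    return "unknown"


def visits_by_region(log):
    """Count log entries per region by joining each client IP to the database."""
    return Counter(locate(ip) for ip, _path in log)


print(visits_by_region(WEB_LOG))
```

<p> The join key here is the IP address. Adding demographic data would be a second join of the same kind, keyed on the region that this step produces. </p>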
<p> Such insight isn&#8217;t limited to analytic use: you can also use it to provide value back to a customer, for instance by recommending restaurants near a lunchtime appointment in their calendar. While many datasets are useful, few are as potent as that of location in the way they provide context to activity. </p>
<p> Marketplaces are useful in three major ways. First, they provide a point of discoverability and comparison for data, along with indicators of quality and scope. Second, they handle the cleaning and formatting of the data, so it is ready for use (often 80% of the work in any data integration). Finally, marketplaces provide an economic model for broad access to data that would otherwise prove difficult to either publish or consume. </p>
<p> In general, one of the important barriers to the development of the data marketplace economy is the ability of enterprises to store and make use of the data. A principle of big data is that it&#8217;s often easier to move your computation to the data, rather than the reverse. Because of this, we&#8217;re seeing increasing integration between cloud computing facilities and data markets: Microsoft&#8217;s data market is tied to its Azure cloud, and Infochimps offers hosted compute facilities. In short-term cases, it&#8217;s probably easier to export data from your business systems to a cloud platform than to try to expand internal systems to integrate external sources. </p>
<p> While cloud solutions offer a route forward, some marketplaces also make the effort to target end-users. Microsoft&#8217;s data marketplace can be accessed directly through Excel, and DataMarket provides online visualization and exploration tools. </p>
<p> The four most established data marketplaces are Infochimps, Factual, Microsoft Windows Azure Data Marketplace, and DataMarket. A table comparing these providers is presented at the end of this article, and a brief discussion of each marketplace follows. </p>
<h2 id="sec-1-3">Infochimps</h2>
<p> According to founder Flip Kromer, <a href="http://infochimps.com/">Infochimps</a> was created to give data life in the same way that code hosting projects such as SourceForge or GitHub give life to code. You can improve code and share it: Kromer wanted the same for data. The driving goal behind Infochimps is to connect every public and commercially available database in the world to a common platform. </p>
<p> Infochimps realized that there&#8217;s an important network effect of &#8220;data with the data,&#8221; that the best way to build a data commons and a data marketplace is to put them together in the same place. The proximity of other data makes all the data more valuable, because of the ease with which it can be found and combined. </p>
<p> The biggest challenge in the two years Infochimps has been operating is that of bootstrapping: a data market needs both supply and demand. Infochimps&#8217; approach is to go for a broad horizontal range of data, rather than specialize. According to Kromer, this is because they view data&#8217;s value as being in the context it provides: in giving users more insight about their own data. To join up data points into a context, common identities are required (for example, a web page view can be given a geographical location by joining the IP address of the page request with the matching entry in an IP intelligence database). The benefit of common identities and data integration is where hosting data together really shines, as Infochimps only needs to integrate the data once for customers to reap continued benefit: Infochimps sells datasets which are pre-cleaned and integrated mash-ups of those from their providers. </p>
<p> By launching a big data cloud hosting platform alongside its marketplace, Infochimps is seeking to build on the importance of data locality. </p>
<h2 id="sec-1-4">Factual</h2>
<p> <a href="http://factual.com/">Factual</a> was envisioned by founder and CEO Gil Elbaz as an open data platform, with tools that could be leveraged by community contributors to improve data quality. The vision is very similar to that of Infochimps, but in late 2010 Factual elected to concentrate on one area of the market: geographical and place data. Rather than pursue a broad strategy, the idea is to become a proven and trusted supplier in one vertical, then expand. With customers such as <a href="http://blog.factual.com/factual-facebook-spatialmatch-mytown">Facebook</a>, Factual&#8217;s strategy is paying off. </p>
<p> According to Elbaz, Factual will look to expand into verticals other than local information in 2012. It is moving one vertical at a time due to the marketing effort required in building quality community and relationships around the data. </p>
<p> Unlike the other main data markets, Factual does not offer reselling facilities for data publishers. Elbaz hasn&#8217;t found that the cash on offer is attractive enough for many organizations to want to share their data. Instead, he believes that the best way to get data you want is to trade other data, which could provide business value far beyond the returns of publishing data in exchange for cash. Factual offers incentives to its customers to share data back, improving the quality of the data for everybody. </p>
<h2 id="sec-1-5">Windows Azure Data Marketplace</h2>
<p> Launched in 2010, Microsoft&#8217;s <a href="https://datamarket.azure.com/">Windows Azure Data Marketplace</a> sits alongside the company&#8217;s Applications marketplace as part of the Azure cloud platform. Microsoft&#8217;s data market is positioned with a very strong integration story, both at the cloud level and with end-user tooling. </p>
<p> Through use of a standard data protocol, <a href="http://www.odata.org/">OData</a>, Microsoft offers a well-defined web interface for data access, including queries. As a result, programs such as Excel and PowerPivot can directly access marketplace data: giving Microsoft a strong capability to integrate external data into the existing tooling of the enterprise. In addition, OData support is available for a broad array of programming languages. </p>
<p> Azure Data Marketplace has a strong emphasis on connecting data consumers to publishers, and most closely approximates the popular concept of an &#8220;<a href="http://radar.oreilly.com/2011/04/itunes-for-data.html">iTunes for Data</a>.&#8221; Big name data suppliers such as Dun &amp; Bradstreet and ESRI can be found among the publishers. The marketplace contains a good range of data across many commercial use cases, and tends to be limited to one provider per dataset &mdash; Microsoft has maintained a strong filter on the reliability and reputation of its suppliers. </p>
<h2 id="sec-1-6">DataMarket</h2>
<p> Where the other three main data marketplaces put a strong focus on the developer and IT customers, <a href="http://datamarket.com/">DataMarket</a> caters to the end-user as well. Realizing that interacting with bland tables wasn&#8217;t engaging users, founder Hjalmar Gislason worked to add interactive visualization to his platform. </p>
<p> The result is a data marketplace that is immediately useful for researchers and analysts.  The range of DataMarket&#8217;s data follows this audience too, with a strong emphasis on country data and economic indicators. Much of the data is available for free, with premium data paid at the point of use. </p>
<p> DataMarket has recently made a significant play for data publishers, with the emphasis on publishing, not just selling data. Through a variety of <a href="http://datamarket.com/plans-and-pricing/">plans</a>, customers can use DataMarket&#8217;s platform to publish and sell their data, and embed charts in their own pages. At the enterprise end of their packages, DataMarket offers an interactive branded data portal integrated with the publisher&#8217;s own web site and user authentication system. Initial customers of this plan include Yankee Group and Lux Research. </p>
<h2 id="compared">Data markets compared</h2>
<table border="0" cellspacing="10" cellpadding="10">
<thead>
<tr>
<td width="20%">&nbsp;</td>
<td width="20%"><a href="https://datamarket.azure.com/"><strong>Azure</strong></a></td>
<td width="20%"><a href="http://datamarket.com/"><strong>DataMarket</strong></a></td>
<td width="20%"><a href="http://factual.com/"><strong>Factual</strong></a></td>
<td width="20%"><a href="http://infochimps.com/"><strong>Infochimps</strong></a></td>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Data sources</strong></td>
<td>Broad range</td>
<td>Range, with a focus on country and industry stats</td>
<td>Geo-specialized, some other datasets</td>
<td>Range, with a focus on geo, social and web sources</td>
</tr>
<tr>
<td><strong>Free data</strong></td>
<td>Yes</td>
<td>Yes</td>
<td>-</td>
<td>Yes</td>
</tr>
<tr>
<td><strong>Free trials of paid data</strong></td>
<td>Yes</td>
<td>-</td>
<td>Yes, <a href="http://www.factual.com/pricing">limited free use of APIs</a></td>
<td>-</td>
</tr>
<tr>
<td><strong>Delivery</strong></td>
<td><a href="http://www.odata.org/">OData</a> API</td>
<td>API, downloads</td>
<td>API, downloads for heavy users</td>
<td>API, downloads</td>
</tr>
<tr>
<td><strong>Application hosting</strong></td>
<td><a href="http://www.windowsazure.com/en-us/">Windows Azure</a></td>
<td>-</td>
<td>-</td>
<td><a href="http://www.infochimps.com/how-it-works">Infochimps Platform</a></td>
</tr>
<tr>
<td><strong>Previewing</strong></td>
<td><a href="http://msdn.microsoft.com/en-us/library/ff717671.aspx">Service Explorer</a></td>
<td>Interactive visualization</td>
<td>Interactive search</td>
<td>-</td>
</tr>
<tr>
<td><strong>Tool integration</strong></td>
<td><a href="https://datamarket.azure.com/addin">Excel</a>, <a href="http://www.youtube.com/watch?v=n6OGyd63w-s">PowerPivot</a>, <a href="http://www.tableausoftware.com/about/blog/2010/10/welcome-windows-azure-datamarket">Tableau</a> and <a href="http://www.odata.org/consumers">other OData consumers</a></td>
<td>-</td>
<td><a href="http://wiki.developer.factual.com/w/page/12298852/start">Developer tool integrations</a></td>
<td>-</td>
</tr>
<tr>
<td><strong>Data publishing</strong></td>
<td>Via <a href="http://msdn.microsoft.com/en-us/library/hh563871.aspx">database connection or web service</a></td>
<td><a href="http://datamarket.com/tour/publish/">Upload or web/database connection</a>.</td>
<td><a href="http://www.factual.com/FAQ#contributeUpload">Via upload or web service</a>.</td>
<td>Upload</td>
</tr>
<tr>
<td><strong>Data reselling</strong></td>
<td>Yes, 20% commission on non-free datasets</td>
<td>Yes. <a href="http://datamarket.com/p/data_providers/">Fees and commissions vary</a>. Ability to create branded data market</td>
<td>-</td>
<td>Yes.  <a href="http://www.infochimps.com/faq#revshare">30% commission on non-free datasets</a>.</td>
</tr>
<tr>
<td><strong>Launched</strong></td>
<td>2010</td>
<td>2010</td>
<td>2007</td>
<td>2009</td>
</tr>
</tbody>
</table>
<h2 id="sec-1-8">Other data suppliers</h2>
<p> While this article has focused on the more general purpose marketplaces, several other data suppliers are worthy of note. </p>
<p> <strong>Social data</strong> &mdash; <a href="http://gnip.com/">Gnip</a> and <a href="http://datasift.com/">Datasift</a> specialize in offering social   media data streams, in particular Twitter.</p>
<p> <strong>Linked data</strong> &mdash; <a href="http://kasabi.com/">Kasabi</a>, currently in beta, is a marketplace that is   distinctive for hosting all its data as <a href="http://linkeddata.org/">Linked Data</a>, accessible via   web standards such as SPARQL and RDF. </p>
<p> <strong>Wolfram Alpha</strong> &mdash; Perhaps the most prolific integrator of diverse   databases, <a href="http://www.wolframalpha.com/">Wolfram Alpha</a> recently added a <a href="http://www.wolframalpha.com/pro/">Pro</a> subscription level   that permits the end user to download the data resulting from a   computation.</p>
<p><strong>Related:</strong></p>
<ul>
<li> <a href="http://radar.oreilly.com/2011/04/itunes-for-data.html">An iTunes model for data</a></li>
<li> <a href="http://radar.oreilly.com/2011/01/data-markets-resellers-gnip.html">Data markets aren&#8217;t coming. They&#8217;re already here</a></li>
<li> <a href="http://radar.oreilly.com/2012/01/big-data-ecosystem.html">Big data market survey: Hadoop solutions</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2012/03/data-markets-survey.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Big data in the cloud</title>
		<link>http://strata.oreilly.com/2012/02/big-data-in-the-cloud-microsoft-amazon-google.html</link>
		<comments>http://strata.oreilly.com/2012/02/big-data-in-the-cloud-microsoft-amazon-google.html#comments</comments>
		<pubDate>Wed, 22 Feb 2012 15:00:00 +0000</pubDate>
		<dc:creator>Edd Dumbill</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[@editpick]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[@radaronly]]></category>
		<category><![CDATA[big data analytics]]></category>
		<category><![CDATA[cloud providers]]></category>
		<category><![CDATA[Planning for Big Data]]></category>
		<category><![CDATA[strata conference]]></category>

		<guid isPermaLink="false">http://blogs.oreilly.com/radar/2012/02/big-data-in-the-cloud-microsoft-amazon-google.html</guid>
		<description><![CDATA[Big data and cloud technology go hand-in-hand: but it&apos;s comparatively early days. Strata conference chair Edd Dumbill explains the cloud landscape and compares the offerings of Amazon, Google and Microsoft. ]]></description>
				<content:encoded><![CDATA[<p>
<div style="float: right;margin: 3px 0 10px 10px;padding: 2px 4px 0 15px;border-left: 1px solid #ddd">
<p style="background: #990000;width: 250px;color: #fff;font-size: .9em;font-weight: bold;padding: 2px 0 2px 4px;margin: 0 0 3px 0">Sections</p>
<ul style="margin-top: 10px;padding-right: 4px">
<li> <a href="#sec-2">IaaS and private clouds</a></li>
<li><a href="#sec-3">Platform solutions</a></li>
<li> <a href="#comparison_table">Big data cloud platforms compared</a></li>
<li> <a href="#sec-5">Conclusion</a></li>
</ul></div>
<p> Big data and cloud technology go hand-in-hand. Big data needs clusters of servers for processing, which clouds can readily provide. So goes the marketing message, but what does that look like in reality? Both &#8220;cloud&#8221; and &#8220;big data&#8221; have broad definitions, obscured by considerable hype. This article breaks down the landscape as simply as possible, highlighting what&#8217;s practical, and what&#8217;s to come. </p>
<h2 id="sec-2">IaaS and private clouds</h2>
<p> What is often called &#8220;cloud&#8221; amounts to virtualized servers: computing resource that presents itself as a regular server, rentable per consumption. This is generally called <a href="http://en.wikipedia.org/wiki/Infrastructure_as_a_service#Infrastructure">infrastructure as a service</a> (IaaS), and is offered by platforms such as Rackspace Cloud or Amazon EC2. You buy time on these services, and install and configure your own software, such as a Hadoop cluster or NoSQL database. Most of the solutions I described in my <a href="http://radar.oreilly.com/2012/01/big-data-ecosystem.html">Big Data Market Survey</a> can be deployed on IaaS services. </p>
<p> Using IaaS clouds doesn&#8217;t mean you must handle all deployment manually: good news for the clusters of machines big data requires. You can use orchestration frameworks, which handle the management of resources, and automated infrastructure tools, which handle server installation and configuration. <a href="http://www.rightscale.com/">RightScale</a> offers a commercial multi-cloud management platform that mitigates some of the problems of managing servers in the cloud. </p>
<p> Frameworks such as <a href="http://openstack.org/">OpenStack</a> and <a href="http://open.eucalyptus.com/">Eucalyptus</a> aim to present a uniform interface to both private data centers and the public cloud. Attracting a strong flow of cross-industry support, OpenStack currently addresses computing resource (akin to Amazon&#8217;s EC2) and storage (paralleling Amazon S3). </p>
<p> The race is on to make private clouds and IaaS services more usable: over the next two years using clouds should become much more straightforward as vendors adopt the nascent standards. There&#8217;ll be a uniform interface, whether you&#8217;re using public or private cloud facilities, or a hybrid of the two. </p>
<p> Particular to big data, several configuration tools already target Hadoop explicitly: among them <a href="http://content.dell.com/us/en/gen/d/cloud-computing/crowbar-software-framework">Dell&#8217;s Crowbar</a>, which aims to make deploying and configuring clusters simple, and <a href="http://whirr.apache.org/">Apache Whirr</a>, which is specialized for running Hadoop services and <a href="http://whirr.apache.org/docs/0.7.0/supported-services-and-clouds.html">other clustered data processing systems</a>. </p>
<p> Today, using IaaS gives you a broad choice of cloud supplier, the option of using a private cloud, and complete control: but you&#8217;ll be responsible for deploying, managing and maintaining your clusters. </p>
<h2 id="sec-3">Platform solutions</h2>
<p> IaaS only takes you so far with big data applications: it handles the creation of computing and storage resources, but doesn&#8217;t address anything at a higher level. The setup of Hadoop and Hive or a similar solution is down to you. </p>
<p> Beyond IaaS, several cloud services provide application layer support for big data work. Sometimes referred to as managed solutions, or <a href="http://en.wikipedia.org/wiki/Platform_as_a_service">platform as a service</a> (PaaS), these services remove the need to configure or scale things such as databases or MapReduce, reducing your workload and maintenance burden. Additionally, PaaS providers can realize great efficiencies by hosting at the application level, and pass those savings on to the customer. </p>
<p> The general PaaS market is burgeoning, with major players including VMware (<a href="http://www.cloudfoundry.com/">Cloud Foundry</a>) and Salesforce (<a href="http://heroku.com/">Heroku</a>, <a href="http://force.com/">force.com</a>). As big data and machine learning requirements percolate through the industry, these players are likely to add their own big-data-specific services. For the purposes of this article, though, I will be sticking to the vendors who already have implemented big data solutions. </p>
<p> Today&#8217;s primary providers of such big data platform services are Amazon, Google and Microsoft. You can see their offerings summarized in the <a href="#comparison_table">table toward the end of this article</a>. Both Amazon Web Services and Microsoft&#8217;s Azure blur the lines between infrastructure as a service and platform: you can mix and match. By contrast, Google&#8217;s philosophy is to skip the notion of a server altogether, and focus only on the concept of the application. Among these, only Amazon can lay claim to extensive experience with their product. </p>
<h3 id="sec-3-1">Amazon Web Services</h3>
<p> Amazon has significant experience in hosting big data processing. Use of Amazon EC2 for Hadoop was a popular and natural move for many early adopters of big data, thanks to Amazon&#8217;s expandable supply of compute power. Building on this, Amazon launched <a href="http://aws.amazon.com/elasticmapreduce/">Elastic Map Reduce</a> in 2009, providing a hosted, scalable Hadoop service. </p>
<p> Applications on Amazon&#8217;s platform can  pick from the best of both the IaaS and PaaS worlds.  General purpose EC2 servers host applications that can then access the appropriate special purpose managed solutions provided by Amazon. </p>
<p> As well as Elastic Map Reduce, Amazon offers several other services relevant to big data, such as the <a href="http://aws.amazon.com/sqs/">Simple Queue Service</a> for coordinating distributed computing, and a hosted <a href="http://aws.amazon.com/rds/">relational database service</a>. At the specialist end of big data, Amazon&#8217;s <a href="http://aws.amazon.com/hpc-applications/">High Performance Computing</a> solutions are tuned for low-latency cluster computing, of the sort required by scientific and engineering applications. </p>
<h4 id="sec-3-1-1">Elastic Map Reduce</h4>
<p> Elastic Map Reduce (EMR) can be programmed in the <a href="http://radar.oreilly.com/2012/02/what-is-apache-hadoop.html">usual Hadoop ways</a>, through Pig, Hive or another programming language, and uses Amazon&#8217;s S3 storage service to get data in and out. </p>
<p> Access to Elastic Map Reduce is through Amazon&#8217;s SDKs and tools, or with GUI analytical and IDE products such as those <a href="http://karmasphere.com/Products-Information/karmasphere-analytics-for-amazon-elastic-mapreduce.html">offered by Karmasphere</a>. In conjunction with these tools, EMR represents a strong option for experimental and analytical work. Amazon&#8217;s pricing makes EMR a much more attractive option than configuring EC2 instances yourself to run Hadoop. </p>
<p> When integrating Hadoop with applications generating structured data, using S3 as the main data source can be unwieldy. This is because, similar to Hadoop&#8217;s HDFS, S3 works at the level of storing blobs of opaque data. Hadoop&#8217;s answer to this is HBase, a NoSQL database that integrates with the rest of the Hadoop stack. Unfortunately, Amazon does not currently offer HBase with Elastic Map Reduce. </p>
<h4 id="sec-3-1-2">DynamoDB</h4>
<p> Instead of HBase, Amazon provides <a href="http://aws.amazon.com/dynamodb/">DynamoDB</a>, its own managed, scalable NoSQL database. As this is a managed solution, it represents a better choice than running your own database on top of EC2, in terms of both performance and economy. </p>
<p> DynamoDB data can be exported to and imported from S3, providing interoperability with EMR. </p>
<h3 id="sec-3-2">Google</h3>
<p> Google&#8217;s  cloud platform stands out as distinct from its competitors. Rather than offering virtualization, it provides an application container with defined APIs and services. Developers do not need to concern themselves with the concept of machines: applications execute in the cloud, getting access to as much processing power as they need, within defined resource usage limits. </p>
<p> To use Google&#8217;s platform, you must work within the constraints of its APIs. However, if that fits, you can reap the benefits of the security, tuning and performance improvements inherent to the way Google develops all its services. </p>
<p> AppEngine, Google&#8217;s cloud application hosting service, offers a MapReduce facility for parallel computation over data, but this is more of a feature for use as part of complex applications rather than for analytical purposes. Instead, BigQuery and the Prediction API form the core of Google&#8217;s big data offering, respectively offering analysis and machine learning facilities. Both these services are available exclusively via REST APIs, consistent with Google&#8217;s vision for web-based computing. </p>
<h4 id="sec-3-2-1">BigQuery</h4>
<p> <a href="https://developers.google.com/bigquery/docs/overview">BigQuery</a> is an analytical database, suitable for interactive analysis over datasets of the order of 1TB. It works best on a small number of tables with a large number of rows. BigQuery offers a familiar SQL interface to its data. In that, it is comparable to Apache Hive, but the typical performance is faster, making BigQuery a good choice for exploratory data analysis. </p>
<p> Getting data into BigQuery is a matter of directly uploading it, or importing it from Google&#8217;s Cloud Storage system. This is the aspect of BigQuery with the biggest room for improvement. Whereas Amazon&#8217;s S3 lets you mail in disks for import, Google doesn&#8217;t currently have this facility. Streaming data into BigQuery isn&#8217;t viable either, so regular imports are required for constantly updating data. Finally, as BigQuery only accepts data formatted as comma-separated value (CSV) files, you will need to use external methods to clean up the data beforehand. </p>
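<p> The kind of external clean-up this implies can be sketched with Python&#8217;s standard <code>csv</code> module. This is an illustrative assumption about what cleaning might involve (trimming whitespace, dropping blank or malformed rows), not a description of BigQuery&#8217;s own import rules; the <code>clean_rows</code> and <code>to_csv</code> helpers and the sample data are hypothetical. </p>

```python
import csv
import io


def clean_rows(raw_text, expected_columns):
    """Yield trimmed rows, skipping blank lines and rows with the wrong width."""
    reader = csv.reader(io.StringIO(raw_text))
    for row in reader:
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # blank line or row of empty cells
        if len(cells) != expected_columns:
            continue  # malformed row: wrong number of columns
        yield cells


def to_csv(rows):
    """Serialize rows back to CSV text with consistent quoting and newlines."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical raw export: stray whitespace, a blank line, and a short row.
RAW = "alice , 34,  Paris\n\nbob,27\n carol,41,London \n"

print(to_csv(clean_rows(RAW, 3)))
```

<p> In practice you would likely log or quarantine the rejected rows rather than drop them silently, so that a regular import does not quietly lose data. </p>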
<p> Rather than provide end-user interfaces itself, Google wants an ecosystem to grow around BigQuery, with vendors incorporating it into their products, in the same way Elastic Map Reduce has acquired tool integration. Currently in beta test, to which anybody can apply, BigQuery is expected to be publicly available during 2012. </p>
<h4 id="sec-3-2-2">Prediction API</h4>
<p> Many uses of machine learning are well defined, such as classification, sentiment analysis, or recommendation generation. To meet these needs, Google offers its <a href="https://developers.google.com/prediction">Prediction API</a> product. </p>
<p> Applications using the Prediction API work by creating and training a model hosted within Google&#8217;s system. Once trained, this model can be used to make predictions, such as spam detection. Google is working on allowing these models to be shared, optionally with a fee. This will let you take advantage of <a href="https://developers.google.com/prediction/docs/gallery#hosted_model">previously trained models</a>, which in many cases will save you time and expertise with training. </p>
<p> Though promising, Google&#8217;s offerings are in their early days. Further integration between its services is required, as well as time for ecosystem development to make their tools more approachable. </p>
<h3 id="sec-3-3">Microsoft</h3>
<p> I have written in some detail about Microsoft&#8217;s big data strategy in <a href="http://radar.oreilly.com/2012/01/microsoft-big-data.html">Microsoft&#8217;s plan for Hadoop and big data</a>. By offering its data platforms on Windows Azure in addition to Windows Server, Microsoft&#8217;s aim is to make either on-premise or cloud-based deployments equally viable with its technology. Azure parallels Amazon&#8217;s web service offerings in many ways, offering a mix of IaaS services with managed applications such as SQL Server. </p>
<p> Hadoop is the central pillar of Microsoft&#8217;s big data approach, surrounded by the ecosystem of its own database and business intelligence tools. For organizations already invested in the Microsoft platform, Azure will represent the smoothest route for integrating big data into the operation. Azure itself is pragmatic about language choice, supporting technologies such as Java, PHP and Node.js in addition to Microsoft&#8217;s own. </p>
<p> As with Google&#8217;s BigQuery, Microsoft&#8217;s Hadoop solution is currently in closed beta test, and is expected to be generally available sometime in the middle of 2012. </p>
<h2 id="comparison_table">Big data cloud platforms compared</h2>
<p> The following table summarizes the data storage and analysis capabilities of Amazon, Google and Microsoft&#8217;s cloud platforms. Intentionally excluded are IaaS solutions without dedicated big data offerings. </p>
<table border="1" cellspacing="10" cellpadding="10">
<tr>
<td>&nbsp;</td>
<td><strong>Amazon</strong></td>
<td><strong>Google</strong></td>
<td><strong>Microsoft</strong></td>
</tr>
<tr>
<td>Product(s)</td>
<td><a href="http://aws.amazon.com/">Amazon Web Services</a></td>
<td><a href="http://www.google.com/enterprise/cloud/">Google Cloud Services</a></td>
<td><a href="http://www.windowsazure.com/en-us/">Windows Azure</a></td>
</tr>
<tr>
<td>Big data storage</td>
<td><a href="http://aws.amazon.com/s3/">S3</a></td>
<td><a href="http://www.google.com/enterprise/cloud/storage/">Cloud Storage</a></td>
<td><a href="https://www.hadooponazure.com/">HDFS on Azure</a></td>
</tr>
<tr>
<td>Working storage</td>
<td><a href="http://aws.amazon.com/ebs/">Elastic Block Store</a></td>
<td><a href="http://www.google.com/enterprise/cloud/appengine/">AppEngine</a> (Datastore, Blobstore)</td>
<td><a href="http://www.windowsazure.com/en-us/home/tour/storage/">Blob, table, queues</a></td>
</tr>
<tr>
<td>NoSQL database</td>
<td><a href="http://aws.amazon.com/dynamodb/">DynamoDB</a><sup><a class="footref" name="fnr.1" href="#fn.1">1</a></sup></td>
<td><a href="http://code.google.com/appengine/docs/python/datastore/">AppEngine Datastore</a></td>
<td><a href="http://www.windowsazure.com/en-us/home/tour/storage/">Table storage</a></td>
</tr>
<tr>
<td>Relational database</td>
<td><a href="http://aws.amazon.com/rds/">Relational Database Service</a> (MySQL or Oracle)</td>
<td><a href="https://developers.google.com/cloud-sql/">Cloud SQL</a> (MySQL compatible)</td>
<td><a href="http://www.windowsazure.com/en-us/home/tour/sql-azure/">SQL Azure</a></td>
</tr>
<tr>
<td>Application hosting</td>
<td><a href="http://aws.amazon.com/ec2/">EC2</a></td>
<td><a href="http://www.google.com/enterprise/cloud/appengine/">AppEngine</a></td>
<td><a href="http://www.windowsazure.com/en-us/home/tour/compute/">Azure Compute</a></td>
</tr>
<tr>
<td>Map/Reduce service</td>
<td><a href="http://aws.amazon.com/elasticmapreduce/">Elastic MapReduce</a> (Hadoop)</td>
<td><a href="http://www.google.com/enterprise/cloud/appengine/">AppEngine</a> (limited capacity)</td>
<td><a href="https://www.hadooponazure.com/">Hadoop on Azure</a><sup><a class="footref" name="fnr.2" href="#fn.2">2</a></sup></td>
</tr>
<tr>
<td>Big data analytics</td>
<td><a href="http://aws.amazon.com/elasticmapreduce/">Elastic MapReduce</a> (Hadoop interface<sup><a class="footref" name="fnr.3" href="#fn.3">3</a></sup>)</td>
<td><a href="https://developers.google.com/bigquery/">BigQuery</a><sup><a class="footref" name="fnr.2.2" href="#fn.2">2</a></sup> (TB-scale, SQL interface)</td>
<td><a href="https://www.hadooponazure.com/">Hadoop on Azure</a> (Hadoop interface<sup><a class="footref" name="fnr.3.2" href="#fn.3">3</a></sup>)</td>
</tr>
<tr>
<td>Machine learning</td>
<td>Via Hadoop + Mahout on EMR or EC2</td>
<td><a href="https://developers.google.com/prediction/">Prediction API</a></td>
<td>Mahout with Hadoop</td>
</tr>
<tr>
<td>Streaming processing</td>
<td>Nothing prepackaged: use custom solution on EC2</td>
<td><a href="http://code.google.com/appengine/docs/python/prospectivesearch/">Prospective Search API</a> <sup><a class="footref" name="fnr.4" href="#fn.4">4</a></sup></td>
<td><a href="http://blogs.msdn.com/b/streaminsight/archive/2011/05/24/streaminsight-project-codename-austin.aspx">StreamInsight</a><sup><a class="footref" name="fnr.2.3" href="#fn.2">2</a></sup> (&#8220;Project Austin&#8221;)</td>
</tr>
<tr>
<td>Data import</td>
<td>Network, <a href="http://aws.amazon.com/importexport/">physically ship drives</a></td>
<td>Network</td>
<td>Network</td>
</tr>
<tr>
<td>Data sources</td>
<td><a href="http://aws.amazon.com/publicdatasets/">Public Data Sets</a></td>
<td>A few <a href="https://developers.google.com/bigquery/docs/sample-datasets">sample datasets</a></td>
<td><a href="https://datamarket.azure.com/">Windows Azure Marketplace</a></td>
</tr>
<tr>
<td>Availability</td>
<td>Public production</td>
<td>Some services in private beta</td>
<td>Some services in private beta</td>
</tr>
</table>
<h2 id="sec-5">Conclusion</h2>
<p> Cloud-based big data services offer considerable advantages in removing the overhead of configuring and tuning your own clusters, and in ensuring you pay only for what you use. The biggest issue is always going to be data locality, as it is slow and expensive to ship data. The most effective big data cloud solutions will be the ones where the data is also collected in the cloud. This is an incentive to investigate EC2, Azure or AppEngine as a primary application platform, and an indicator that PaaS competitors such as Cloud Foundry and Heroku will have to address big data as a priority. </p>
<p> It is early days yet for big data in the cloud, with only Amazon offering battle-tested solutions at this point. Cloud services themselves are at an early stage, and we will see both increasing standardization and innovation over the next two years. </p>
<p> However, the twin advantages of not having to worry about infrastructure and economies of scale mean it is well worth investigating cloud services for your big data needs, especially for an experimental or green-field project. Looking to the future, there&#8217;s no doubt that big data analytical capability will form an essential component of utility computing solutions. </p>
<h3 class="footnotes">Notes: </h3>
<p class="footnote"><sup><a class="footnum" name="fn.1" href="#fnr.1">1</a></sup> In public beta. </p>
<p class="footnote"><sup><a class="footnum" name="fn.2" href="#fnr.2">2</a></sup> In controlled beta test. </p>
<p class="footnote"><sup><a class="footnum" name="fn.3" href="#fnr.3">3</a></sup> Hive and Pig compatible. </p>
<p class="footnote"><sup><a class="footnum" name="fn.4" href="#fnr.4">4</a></sup> Experimental status.</p>
<div style="float: left;border-top: thin gray solid;border-bottom: thin gray solid;padding: 20px;margin: 20px 2px;clear: both"><a href="https://en.oreilly.com/strata2012/public/regwith/radar20?cmp=il-radar-st12-big-data-cloud-edd"><img style="float: left;border: none;padding-right: 10px" src="http://radar.oreilly.com/2011-strata-ca-promo.png" /></a><a href="https://en.oreilly.com/strata2012/public/regwith/radar20?cmp=il-radar-st12-big-data-cloud-edd"><strong>Strata 2012</strong></a> &mdash;  The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.</p>
<p> <a href="https://en.oreilly.com/strata2012/public/regwith/radar20?cmp=il-radar-st12-big-data-cloud-edd"><strong>Save 20% on registration with the code RADAR20</strong></a></div>
<p><strong>Related:</strong></p>
<ul>
<li> <a href="http://radar.oreilly.com/2012/01/big-data-ecosystem.html">Big data market survey: Hadoop solutions</a></li>
<li> <a href="http://radar.oreilly.com/2012/01/microsoft-big-data.html">Microsoft&#8217;s plan for Hadoop and big data</a></li>
<li> <a href="http://radar.oreilly.com/2011/06/getting-started-with-hadoop.html">Get started with Hadoop: From evaluation to your first production cluster</a></li>
<li> <a href="http://radar.oreilly.com/2010/06/on-the-performance-of-clouds.html">On the performance of clouds</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2012/02/big-data-in-the-cloud-microsoft-amazon-google.html/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>What is Apache Hadoop?</title>
		<link>http://strata.oreilly.com/2012/02/what-is-apache-hadoop.html</link>
		<comments>http://strata.oreilly.com/2012/02/what-is-apache-hadoop.html#comments</comments>
		<pubDate>Thu, 02 Feb 2012 14:00:00 +0000</pubDate>
		<dc:creator>Edd Dumbill</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[@editpick]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[@radaronly]]></category>
		<category><![CDATA[@top]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[data tool]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Planning for Big Data]]></category>

		<guid isPermaLink="false">http://blogs.oreilly.com/radar/2012/02/what-is-apache-hadoop.html</guid>
		<description><![CDATA[Apache Hadoop has been the driving force behind the growth of the big data industry.  But what does it do, and why do you need all its strangely-named friends, such as Oozie, Zookeeper and Flume? ]]></description>
				<content:encoded><![CDATA[<p><a href="http://hadoop.apache.org/"><img src="http://s.radar.oreilly.com/wp-files/2/2011/12/hadoop.png" border="0" alt="Hadoop" width="300" style="float: right;margin: 5px 0 10px 15px" /></a><a href="http://hadoop.apache.org/">Apache Hadoop</a> has been   the driving force behind the growth of the big data industry. You&#8217;ll   hear it mentioned often, along with associated technologies such as   Hive and Pig. But what does it do, and why do you need all its   strangely-named friends, such as Oozie, Zookeeper and Flume?</p>
<p>Hadoop brings the ability to cheaply process large amounts of   data, regardless of its structure. By large, we mean from 10-100   gigabytes and above. How is this different from what went before?   </p>
<p>Existing enterprise data warehouses and relational databases excel at processing structured data and can store massive amounts of data, though at a cost: This requirement for structure restricts the kinds of data that can be processed, and it imposes an inertia that makes data warehouses unsuited for agile exploration of massive heterogeneous data. The amount of effort required to warehouse data often means that valuable data sources in organizations are never mined. This is where Hadoop can make a big difference.</p>
<p>This article examines the components of the Hadoop ecosystem and   explains the functions of each.</p>
<h2>The core of Hadoop: MapReduce</h2>
<p><a href="http://labs.google.com/papers/mapreduce.html">Created at   Google</a> in response to the problem of creating web search   indexes, the MapReduce framework is the powerhouse behind most of   today&#8217;s big data processing. In addition to Hadoop, you&#8217;ll find   MapReduce inside MPP and NoSQL databases, such as Vertica or MongoDB.   </p>
<p>   The important innovation of MapReduce is the ability to take a query   over a dataset, divide it, and run it in parallel over multiple   nodes. Distributing the computation solves the issue of data too large to fit   onto a single machine. Combine this technique with commodity Linux   servers and you have a cost-effective alternative to massive   computing arrays.</p>
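<p>To make the divide-and-combine idea concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, using word counting as the query. It is an illustration under simplifying assumptions, not Hadoop&#8217;s API: the function names are made up for this example, and the loop over chunks stands in for work that a cluster would spread across nodes.</p>

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit (word, 1) pairs for one chunk of the input."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)

def word_count(chunks):
    # On a cluster each chunk would be mapped on a different node; here we loop.
    emitted = []
    for chunk in chunks:
        emitted.extend(map_phase(chunk))
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())

counts = word_count(["big data big", "data tools", "Big ideas"])
print(counts)
```

<p>Because each call to <code>map_phase</code> depends only on its own chunk, the chunks can be processed in any order and on any machine, which is what makes the approach scale.</p>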
<p>At its core, Hadoop is an open source MapReduce   implementation. Funded by Yahoo, it emerged in 2006 and, <a href="http://research.yahoo.com/files/cutting.pdf">according to its   creator Doug Cutting</a>, reached &#8220;web scale&#8221; capability in early   2008.</p>
<p>As the Hadoop project matured, it acquired further components to enhance   its usability and functionality. The name &#8220;Hadoop&#8221; has   come to represent this entire ecosystem. There are parallels   with the emergence of Linux: The name refers strictly to the Linux   kernel, but it has gained acceptance as referring to a complete   operating system.</p>
<h2>Hadoop&#8217;s lower levels: HDFS and MapReduce</h2>
<p>Above, we discussed  the ability of MapReduce to distribute   computation over multiple servers. For that computation to take   place, each server must have access to the data. This is the role of   HDFS, the Hadoop Distributed File System.</p>
<p>HDFS and MapReduce are robust. Servers in a Hadoop cluster can   fail and not abort the computation process. HDFS ensures data is   replicated with redundancy across the cluster. On completion of a   calculation, a node will write its results back into HDFS.</p>
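<p>The effect of replication can be sketched with a toy model. The placement rule below is a deliberately simple round-robin and is an assumption of this example; real HDFS placement also takes the rack layout into account. The node and block names are hypothetical.</p>

```python
def place_blocks(block_ids, nodes, replicas=3):
    """Assign each block to `replicas` distinct nodes, round-robin.

    Assumes replicas is no larger than the number of nodes.
    """
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replicas)]
    return placement

def readable_blocks(placement, failed):
    """Return blocks that still have at least one replica on a live node."""
    return [block for block, holders in placement.items()
            if any(node not in failed for node in holders)]

nodes = ["n1", "n2", "n3", "n4"]
blocks = ["b0", "b1", "b2", "b3", "b4"]
placement = place_blocks(blocks, nodes)

# With three replicas per block, losing two nodes leaves every block readable.
print(readable_blocks(placement, failed={"n1", "n2"}))
# Losing three nodes finally makes some blocks unavailable.
print(readable_blocks(placement, failed={"n1", "n2", "n3"}))
```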
<p>There are no restrictions on the data that HDFS stores. Data may   be unstructured and schemaless. By contrast, relational databases   require that data be structured and schemas be defined before storing   the data. With HDFS, making sense of the data is the responsibility   of the developer&#8217;s code.</p>
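<p>In practice this means the structure lives in the code that reads the data rather than in the storage layer. A minimal sketch, assuming a hypothetical log format of date, time, HTTP method, path and status code:</p>

```python
import re

RAW_LINES = [
    "2012-02-02 14:00:01 GET /index.html 200",
    "corrupted line without structure",
    "2012-02-02 14:00:05 POST /login 302",
]

# The expected shape of a line is declared by the reader, not by the file system.
LINE_PATTERN = re.compile(r"(\S+) (\S+) (GET|POST) (\S+) (\d{3})$")

def read_requests(lines):
    """Yield one dict per line that matches the pattern; skip the rest."""
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match is None:
            continue  # storage accepted the line; the reader decides to skip it
        date, time, method, path, status = match.groups()
        yield {"date": date, "time": time, "method": method,
               "path": path, "status": int(status)}

requests = list(read_requests(RAW_LINES))
print(requests)
```

<p>A relational database would have rejected the malformed line at load time. Here it is stored regardless, and each program that reads the file chooses how to interpret or ignore it.</p>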
<p>Programming Hadoop at the MapReduce level is a case of working with the   Java APIs, and manually loading data files into HDFS.</p>
<div style="float: left;border-top: thin gray solid;border-bottom: thin gray solid;padding: 20px;margin: 20px 2px;clear: both"><a href="http://www.microsoft.com/sql"><img style="float: left;border: none;padding-right: 10px" src="http://s.radar.oreilly.com/wp-files/2/2011/12/sponsor-ms-sql-server.png" /></a><a href="http://www.microsoft.com/sql"><strong>Microsoft SQL Server</strong></a> is a comprehensive information platform offering enterprise-ready technologies and tools that help businesses derive maximum value from information at the lowest TCO. SQL Server 2012 launches next year, offering a cloud-ready information platform delivering mission-critical confidence, breakthrough insight, and cloud on your terms; find out more at <a href="http://www.microsoft.com/sql">www.microsoft.com/sql</a>.</div>
<h2>Improving programmability: Pig and Hive</h2>
<p>Working directly with Java APIs can be tedious and error-prone. It also restricts usage of Hadoop to Java programmers. The Hadoop ecosystem offers two solutions for making programming easier.</p>
<ul>
<li> <a href="http://pig.apache.org/">Pig</a> is a programming     language that simplifies the common tasks of working with Hadoop:     loading data, expressing transformations on the data, and storing     the final results. Pig&#8217;s built-in operations can make sense of     semi-structured data, such as log files, and the language is     extensible using Java to add support for custom data types and     transformations.</li>
<li> <a href="http://hive.apache.org/">Hive</a> enables Hadoop     to operate as a data warehouse. It superimposes structure on data in HDFS     and then permits queries over the data using a familiar SQL-like     syntax. As with Pig, Hive&#8217;s core capabilities are     extensible.</li>
</ul>
<p>Choosing between Hive and Pig can be confusing. Hive   is more suitable for data warehousing tasks, with predominantly   static structure and the need for frequent analysis. Hive&#8217;s closeness   to SQL makes it an ideal point of integration between Hadoop and   other business intelligence tools.</p>
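<p>The Hive idea of declaring a table over raw files and then querying it with SQL can be sketched using Python&#8217;s built-in <code>sqlite3</code> module as a stand-in. This is an analogy only: SQLite copies the rows into its own storage, whereas Hive leaves the files in HDFS, and the table and column names here are hypothetical.</p>

```python
import csv
import io
import sqlite3

# Stand-in for a raw delimited file sitting in storage.
RAW_FILE = "ann,search,3\nbob,checkout,1\nann,checkout,2\n"

conn = sqlite3.connect(":memory:")
# Declare the structure we want to see in the raw data.
conn.execute("CREATE TABLE events (user_name TEXT, action TEXT, n INTEGER)")
rows = csv.reader(io.StringIO(RAW_FILE))
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# Once the structure exists, analysis is an ordinary SQL query.
totals = conn.execute(
    "SELECT user_name, SUM(n) FROM events GROUP BY user_name ORDER BY user_name"
).fetchall()
print(totals)
```

<p>The query is the kind of statement a business intelligence tool can generate by itself, which is why SQL-like access makes a convenient integration point.</p>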
<p>Pig gives the developer more agility for the exploration of large datasets, allowing the development of succinct scripts for transforming   data flows for incorporation into larger applications. Pig is a   thinner layer over Hadoop than Hive, and its main advantage is to   drastically cut the amount of code needed compared to direct   use of Hadoop&#8217;s Java APIs. As such, Pig&#8217;s intended audience remains   primarily the software developer.</p>
<h2>Improving data access: HBase, Sqoop and Flume</h2>
<p>At its heart, Hadoop is a batch-oriented system. Data are loaded   into HDFS, processed, and then retrieved. This is somewhat of a   computing throwback, and often, interactive and random access to data   is required.</p>
<p>Enter <a href="http://hbase.apache.org/">HBase</a>, a column-oriented database that runs on top of HDFS. Modeled after Google&#8217;s   <a href="http://research.google.com/archive/bigtable.html">BigTable</a>,   the project&#8217;s goal is to host billions of rows of data for rapid access.   MapReduce   can use HBase as both a source and a destination for its   computations, and Hive and Pig can be used in combination with   HBase.</p>
<p>In order to grant random access to the data, HBase does impose a   few restrictions: Hive performance with HBase is 4-5 times slower than with plain   HDFS, and the maximum amount of data you can store in HBase is approximately   a petabyte, versus HDFS&#8217; limit of over 30PB.</p>
<p>HBase is ill-suited to ad-hoc analytics and more appropriate for   integrating big data as part of a larger application.  Use cases   include logging, counting and storing time-series data.</p>
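<p>The time-series use case relies on two access patterns: fetching a single row by key, and scanning a contiguous range of keys. The toy class below sketches both in memory. It is not the HBase API; the class name, the row key layout and the column names are assumptions made for illustration.</p>

```python
import bisect

class TinyColumnStore:
    """Toy sorted key/column store; illustrative only, not the HBase API."""

    def __init__(self):
        self._keys = []   # row keys kept sorted for range scans
        self._rows = {}   # row key -> {column name: value}

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Random access to a single row."""
        return self._rows.get(row_key, {})

    def scan(self, start, stop):
        """Yield (key, columns) for keys from start (inclusive) to stop (exclusive)."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]

store = TinyColumnStore()
# Row keys combine a metric name and a timestamp so related rows sort together.
store.put("cpu#2012-02-02T14:00", "m:load", 0.7)
store.put("cpu#2012-02-02T14:05", "m:load", 0.9)
store.put("disk#2012-02-02T14:00", "m:free", 120)
store.put("cpu#2012-02-02T14:00", "m:temp", 61)

# "$" sorts immediately after "#", so this range covers every "cpu#" key.
cpu_rows = list(store.scan("cpu#", "cpu$"))
print(cpu_rows)
```

<p>Rows need not share the same columns, and choosing a row key that keeps related records adjacent is the main design decision in this style of store.</p>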
<div>
<h3>The Hadoop Bestiary</h3>
<table>
<tr>
<td><strong>Ambari</strong></td>
<td> 	Deployment, configuration and monitoring       </td>
</tr>
<tr>
<td><strong>Flume</strong></td>
<td> 	Collection and import of log and event data       </td>
</tr>
<tr>
<td><strong>HBase</strong></td>
<td>Column-oriented database scaling to billions of rows       </td>
</tr>
<tr>
<td><strong>HCatalog</strong></td>
<td>Schema and data type sharing over Pig, Hive and MapReduce       </td>
</tr>
<tr>
<td><strong>HDFS</strong></td>
<td> 	Distributed redundant file system for Hadoop       </td>
</tr>
<tr>
<td><strong>Hive</strong></td>
<td> 	Data warehouse with SQL-like access       </td>
</tr>
<tr>
<td><strong>Mahout</strong></td>
<td> 	Library of machine learning and data mining algorithms       </td>
</tr>
<tr>
<td><strong>MapReduce</strong></td>
<td> 	Parallel computation on server clusters       </td>
</tr>
<tr>
<td><strong>Pig</strong></td>
<td> 	High-level programming language for Hadoop computations       </td>
</tr>
<tr>
<td><strong>Oozie</strong></td>
<td> 	Orchestration and workflow management       </td>
</tr>
<tr>
<td><strong>Sqoop</strong></td>
<td> 	Imports data from relational databases       </td>
</tr>
<tr>
<td><strong>Whirr</strong></td>
<td> 	Cloud-agnostic deployment of clusters       </td>
</tr>
<tr>
<td><strong>Zookeeper</strong></td>
<td> 	Configuration management and coordination       </td>
</tr>
</table></div>
<h3>Getting data in and out</h3>
<p>Improved interoperability with the rest of the data world is   provided by <a href="https://github.com/cloudera/sqoop/wiki">Sqoop</a> and <a href="https://cwiki.apache.org/FLUME/">Flume</a>. Sqoop is a tool designed to import data from   relational databases into Hadoop, either directly into HDFS or into   Hive. Flume is designed to import streaming flows of log data   directly into HDFS.</p>
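<p>The relational import step amounts to reading a table and writing it out as delimited text that Hadoop jobs can consume. The sketch below shows that shape with SQLite and a local temporary directory standing in for the source database and for HDFS. It does not use Sqoop itself, and the function and table names are hypothetical.</p>

```python
import csv
import os
import sqlite3
import tempfile

def export_table(conn, table, out_dir):
    """Dump every row of `table` to a comma-delimited text file and return its path."""
    # The table name is interpolated directly; acceptable only for a trusted, fixed name.
    cursor = conn.execute(f"SELECT * FROM {table}")
    path = os.path.join(out_dir, f"{table}.csv")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerows(cursor)
    return path

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "ann", 9.5), (2, "bob", 20.0)])

out_dir = tempfile.mkdtemp()  # stand-in for a directory in HDFS
exported = export_table(conn, "orders", out_dir)
with open(exported) as handle:
    exported_text = handle.read()
print(exported_text)
```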
<p>Hive&#8217;s SQL friendliness means that it can be used as a point of integration with the vast universe of database tools capable of making connections via JDBC or ODBC database drivers.</p>
<h2>Coordination and workflow: Zookeeper and Oozie</h2>
<p>With a growing family of services running as part of a Hadoop   cluster, there&#8217;s a need for coordination and naming services. As   computing nodes can come and go,  members of the cluster need   to synchronize with each other, know where to access services, and   know how they should be configured. This is the purpose of <a href="http://zookeeper.apache.org/">Zookeeper</a>.</p>
<p>Production systems utilizing Hadoop can often contain complex pipelines of transformations, with dependencies on one another. For example, the arrival of a new batch of data will trigger an import, which must then trigger recalculations in dependent datasets. The <a href="http://incubator.apache.org/oozie/">Oozie</a> component provides features to manage the workflow and dependencies, removing the need for developers to code custom solutions.</p>
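<p>The heart of such a workflow is a dependency graph that has to be executed in a valid order. The following sketch shows that ordering problem using Python&#8217;s standard <code>graphlib</code> module (Python 3.9 or later). It illustrates the concept rather than Oozie&#8217;s own workflow format, and the step names are hypothetical.</p>

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it starts.
WORKFLOW = {
    "import_batch": set(),
    "clean": {"import_batch"},
    "daily_totals": {"clean"},
    "user_segments": {"clean"},
    "report": {"daily_totals", "user_segments"},
}

def run_order(workflow):
    """Return one execution order that respects every dependency."""
    return list(TopologicalSorter(workflow).static_order())

order = run_order(WORKFLOW)
print(order)
```

<p>The two middle steps depend only on <code>clean</code>, so a scheduler is free to run them in either order or in parallel. A cycle in the graph raises <code>graphlib.CycleError</code> instead of producing an order.</p>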
<h2>Management and deployment: Ambari and Whirr</h2>
<p>One of the commonly added features incorporated into Hadoop by   distributors such as IBM and Microsoft is monitoring and   administration. Though in an early stage, <a href="http://incubator.apache.org/ambari/">Ambari</a> aims   to add these features to the core Hadoop project. Ambari is intended to help system   administrators deploy and configure Hadoop, upgrade clusters, and   monitor services. Through an API, it may be integrated with other   system management tools.</p>
<p>Though not strictly part of Hadoop, <a href="http://whirr.apache.org/">Whirr</a> is a highly complementary   component. It offers a way of running services, including Hadoop, on   cloud platforms. Whirr is cloud neutral and   currently supports the Amazon EC2 and Rackspace services.</p>
<h2>Machine learning: Mahout</h2>
<p>Every organization&#8217;s data are diverse and particular to its needs. However, there is much less diversity in the kinds of analyses performed on that data. The <a href="http://mahout.apache.org/">Mahout</a> project is a library of Hadoop implementations of common analytical computations. Use cases include user collaborative filtering, user recommendations, clustering and classification.</p>
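<p>As a taste of what a recommendation computation involves, here is a small in-memory sketch based on item co-occurrence: items that often appear together are suggested for each other. This is one simple technique chosen for illustration, not Mahout&#8217;s implementation, and the sample data is invented.</p>

```python
from collections import Counter
from itertools import combinations

def cooccurrence(baskets):
    """Count how often each ordered pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def recommend(item, counts, top_n=2):
    """Rank other items by co-occurrence with `item`; ties broken alphabetically."""
    scored = [(other, n) for (first, other), n in counts.items() if first == item]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return [other for other, _ in scored[:top_n]]

# Hypothetical usage data: each list is the set of tools one team uses.
baskets = [
    ["hadoop", "hive", "pig"],
    ["hadoop", "hive"],
    ["hadoop", "hbase"],
    ["hive", "pig"],
]
suggestions = recommend("hadoop", cooccurrence(baskets))
print(suggestions)
```

<p>Counting pairs is itself a natural MapReduce job, with baskets as the map input and pair counts as the reduce output, which is why this family of algorithms suits Hadoop.</p>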
<h2>Using Hadoop</h2>
<p>Normally, you will use Hadoop <a href="http://radar.oreilly.com/2012/01/big-data-ecosystem.html">in     the form of a distribution</a>. Much as with Linux before it,     vendors integrate and test the components of the Apache Hadoop     ecosystem and add in tools and administrative features of their     own.</p>
<p>Though not <em>per se</em> a distribution, a managed cloud installation     of Hadoop&#8217;s MapReduce is also available through Amazon&#8217;s <a href="http://aws.amazon.com/elasticmapreduce/">Elastic     MapReduce service</a>.</p>
<p><strong>Related:</strong></p>
<ul>
<li> <a href="http://radar.oreilly.com/2012/01/big-data-ecosystem.html">Big data market survey: Hadoop solutions</a></li>
<li> <a href="http://radar.oreilly.com/2012/02/hadoop-doug-cutting-apache-data-processing.html">Why Hadoop caught on</a></li>
<li> <a href="http://radar.oreilly.com/2012/01/microsoft-big-data.html">Microsoft&#8217;s plan for Hadoop and big data</a></li>
<li> <a href="http://radar.oreilly.com/2011/01/what-is-hadoop.html">Hadoop: What it is, how it works, and what it can do</a></li>
<li> <a href="http://radar.oreilly.com/2011/06/getting-started-with-hadoop.html">Get started with Hadoop: From evaluation to your first production cluster</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2012/02/what-is-apache-hadoop.html/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Microsoft&#8217;s plan for Hadoop and big data</title>
		<link>http://strata.oreilly.com/2012/01/microsoft-big-data.html</link>
		<comments>http://strata.oreilly.com/2012/01/microsoft-big-data.html#comments</comments>
		<pubDate>Wed, 25 Jan 2012 16:00:00 +0000</pubDate>
		<dc:creator>Edd Dumbill</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Microsoft]]></category>
		<category><![CDATA[Planning for Big Data]]></category>

		<guid isPermaLink="false">http://blogs.oreilly.com/radar/2012/01/microsoft-big-data.html</guid>
		<description><![CDATA[Strata conference chair Edd Dumbill takes a look at Microsoft&apos;s plans for big data. By embracing Hadoop, the company aims to keep Windows and Azure as a standards-friendly option for data developers. ]]></description>
				<content:encoded><![CDATA[<p>Microsoft has placed <a href="http://hadoop.apache.org/">Apache Hadoop</a> at the core of its big data   strategy.  It&#8217;s a move that might seem surprising to the casual   observer, being a somewhat enthusiastic adoption of   a significant open source product.</p>
<p>The reason for this move is that Hadoop, by its sheer popularity, has   become the de facto standard for distributed data crunching. By   embracing Hadoop, Microsoft allows its customers to access the   rapidly-growing Hadoop ecosystem and take advantage of a growing   talent pool of Hadoop-savvy developers.</p>
<p>Microsoft&#8217;s goals go beyond integrating Hadoop into Windows. It intends to contribute the adaptations it makes back to the Apache Hadoop project, so that anybody can run a purely open source Hadoop on Windows.</p>
<h2>Microsoft&#8217;s Hadoop distribution</h2>
<p>The Microsoft <a href="http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/big-data-solution.aspx">distribution of Hadoop</a> is currently in &#8220;Customer Technology Preview&#8221;   phase. This means it is undergoing evaluation in the field by groups   of customers.  The expected release time is toward the middle of   2012, but will be influenced by the results of the technology   preview program.</p>
<p>Microsoft&#8217;s Hadoop distribution is usable either on-premise with   Windows Server, or in Microsoft&#8217;s cloud platform, Windows Azure. The   core of the product is in the MapReduce, HDFS, Pig and Hive   components of Hadoop. These are certain to ship in the 1.0   release.</p>
<p>As Microsoft&#8217;s aim is for 100% Hadoop compatibility, it is likely   that additional components of the Hadoop ecosystem such as   Zookeeper, HBase, HCatalog and Mahout will also be shipped.</p>
<p>Additional components integrate Hadoop with   Microsoft&#8217;s ecosystem of business intelligence and analytical products:</p>
<ul>
<li> Connectors for Hadoop, integrating it with SQL Server and SQL Server Parallel Data Warehouse.</li>
<li> An ODBC driver for Hive, permitting any Windows application to     access and run queries against the Hive data warehouse.</li>
<li> An Excel Hive Add-in, which enables the movement of data directly     from Hive into Excel or PowerPivot.</li>
</ul>
<p>On the back end, Microsoft offers Hadoop performance improvements,   integration with Active Directory to facilitate access control, and   with System Center for administration and management.</p>
<p class="image-box-580"><a href="http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/big-data-solution.aspx"><img src="http://radar.oreilly.com/2012/01/25/0112-ms-hadoop.png" border="0" alt="How Hadoop integrates with the Microsoft ecosystem" width="580" style="margin-bottom: 15px" /></a><br /><em>How Hadoop integrates with the Microsoft ecosystem. (Source: <a href="http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/big-data-solution.aspx">microsoft.com</a>.)</em></p>
<h2>Developers, developers, developers</h2>
<p>One of the most interesting features of Microsoft&#8217;s work with   Hadoop is the addition of a JavaScript API. Working with Hadoop at   a programmatic level can be tedious: this is why higher-level languages   such as Pig emerged.</p>
<p>Driven by its focus on the software developer as an important   customer, Microsoft chose to add a <a href="http://strataconf.com/strata2012/public/schedule/detail/22669">JavaScript layer</a> to the Hadoop   ecosystem. Developers can use it to create MapReduce jobs, and even   interact with Pig and Hive from a browser environment.</p>
<p>The real advantage of the JavaScript layer should show itself in   integrating Hadoop into a business environment, making it easy for   developers to create intranet analytical environments accessible by   business users. Combined with Microsoft&#8217;s focus on bringing   server-side JavaScript to Windows and Azure <a href="http://www.zdnet.com/blog/microsoft/microsoft-joyent-deliver-first-stable-build-of-nodejs-on-windows/11178">through Node.js</a>, this gives an interesting glimpse into Microsoft&#8217;s view   of where developer enthusiasm and talent will lie.</p>
<p>It&#8217;s also good news for the broader Hadoop community, as   Microsoft intends to contribute its JavaScript API to the Apache   Hadoop open source project itself.</p>
<p>The other half of Microsoft&#8217;s software development environment is   of course the .NET platform. With Microsoft&#8217;s Hadoop distribution,   it will be possible to create MapReduce jobs from .NET, though using   the Hadoop APIs directly. It is likely that higher-level interfaces   will emerge in future releases. The same applies to Visual Studio,   which over time will get increasing levels of Hadoop project   support.</p>
<h2>Streaming data and NoSQL</h2>
<p>Hadoop covers part of the big data problem, but what about   streaming data processing or NoSQL databases? The answer comes in   two parts, covering existing Microsoft products and future   Hadoop-compatible solutions.</p>
<p>Microsoft has some established products: its streaming data solution, <a href="http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/complex-event-processing.aspx">StreamInsight</a>, and, for NoSQL, a Windows Azure product called <a href="http://www.windowsazure.com/en-us/home/tour/storage/">Azure Tables</a>.</p>
<p>Looking to the future, the commitment of Hadoop compatibility   means that streaming data solutions and NoSQL databases designed to   be part of the Hadoop ecosystem should work with the Microsoft   distribution &mdash; HBase itself will ship as a core offering. It seems   likely that solutions such as <a href="http://incubator.apache.org/s4/">S4</a> will prove   compatible.</p>
<h2>Toward an integrated environment</h2>
<p>Now that Microsoft is on the way to integrating the major   components of big data tooling, does it intend to   join it all together to provide an integrated data science platform   for businesses?</p>
<p>That&#8217;s certainly the vision, according to Madhu Reddy, senior   product planner for Microsoft Big Data: &#8220;Hadoop is primarily for   developers. We want to enable people to use the tools they   like.&#8221;</p>
<p>The strategy to achieve this involves entry points at multiple levels: for developers, analysts and business users. Instead of choosing one particular analytical platform, Microsoft will focus on interoperability with existing tools. Excel is an obvious priority, but other tools are also important to the company.</p>
<p>According to Reddy, data scientists represent a spectrum of   preferences. While Excel is a ubiquitous and popular choice, other   customers use Matlab, SAS, or R, for example.</p>
<h2>The data marketplace</h2>
<p>One thing unique to Microsoft as a big data and cloud platform is its data market, <a href="https://datamarket.azure.com/">Windows   Azure Marketplace</a>. Mixing external data, such as geographical or   social, with your own, can generate revealing insights. But it&#8217;s   hard to find data, be confident of its quality, and purchase it   conveniently. That&#8217;s where data marketplaces meet a need.</p>
<p>The availability of the Azure marketplace integrated with Microsoft&#8217;s   tools gives analysts a ready source of external data with some   guarantees of quality. Marketplaces are in their infancy now, but   will play a <a href="http://radar.oreilly.com/2011/12/5-big-data-predictions-2012.html">growing role</a> in the future of data-driven business.</p>
<h2>Summary</h2>
<p>The Microsoft approach to big data has ensured the continuing   relevance of its Windows platform for web-era organizations, and   makes its cloud services a competitive choice for data-centered   businesses.</p>
<p>Appropriately enough for a company with a large and diverse   software ecosystem of its own, the Microsoft approach is one of   interoperability. Rather than laying out a golden path   for big data, as suggested by the appliance-oriented approach of   others, Microsoft is focusing heavily on integration.</p>
<p>The guarantee of this approach lies in Microsoft&#8217;s choice to   embrace and work with the Apache Hadoop community, enabling the   migration of new tools and talented developers to its   platform.</p>
<div style="float: left;border-top: thin gray solid;border-bottom: thin gray solid;padding: 20px;margin: 20px 2px;clear: both"><a href="http://www.microsoft.com/sql"><img style="float: left;border: none;padding-right: 10px" src="http://s.radar.oreilly.com/wp-files/2/2011/12/sponsor-ms-sql-server.png" /></a><a href="http://www.microsoft.com/sql"><strong>Microsoft SQL Server</strong></a> is a comprehensive information platform offering enterprise-ready technologies and tools that help businesses derive maximum value from information at the lowest TCO. SQL Server 2012 launches next year, offering a cloud-ready information platform delivering mission-critical confidence, breakthrough insight, and cloud on your terms; find out more at <a href="http://www.microsoft.com/sql">www.microsoft.com/sql</a>.</div>
<p><strong>Related:</strong></p>
<ul>
<li> <a href="http://radar.oreilly.com/2012/01/big-data-ecosystem.html">Big data market survey: Hadoop solutions</a></li>
<li> <a href="http://radar.oreilly.com/2011/01/what-is-hadoop.html">Hadoop: What it is, how it works, and what it can do</a></li>
<li> <a href="http://radar.oreilly.com/2011/06/getting-started-with-hadoop.html">Get started with Hadoop: From evaluation to your first production cluster</a></li>
<li> <a href="http://radar.oreilly.com/2011/12/david-campbell.html">Tapping into a world of ambient data</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2012/01/microsoft-big-data.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Big data market survey: Hadoop solutions</title>
		<link>http://strata.oreilly.com/2012/01/big-data-ecosystem.html</link>
		<comments>http://strata.oreilly.com/2012/01/big-data-ecosystem.html#comments</comments>
		<pubDate>Thu, 19 Jan 2012 16:00:00 +0000</pubDate>
		<dc:creator>Edd Dumbill</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[@editpick]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Planning for Big Data]]></category>
		<category><![CDATA[strataconf]]></category>

		<guid isPermaLink="false">http://blogs.oreilly.com/radar/2012/01/big-data-ecosystem.html</guid>
		<description><![CDATA[In this survey, Edd Dumbill explores the Hadoop-based big data solutions available on the market, contrasts the approaches of EMC Greenplum, IBM, Microsoft and Oracle and provides an overview of Hadoop distributions. ]]></description>
				<content:encoded><![CDATA[<style type="text/css">
    div.infobox {
    float: right;
    margin-left: 1em;
    margin-right: 1em;
    margin-bottom: 1em;
    padding: 0.5em;
    border: 1px solid #888;
    background-color: #eee;
    font-size: 90%;
    width: 240px;
    }
    td div.infobox {
    float: none;
    }
    div.ihdr { font-weight: bold; }
    div.ifield { margin-bottom: 0.5em; padding-left: 0.5em; }
    div.ititle { font-weight: bold; font-size: 120%; }
    div.ibody { border-top: 1px solid #888;
    margin-top: 0.5em;
    padding-top: 0.75em; }
    span.ifootnote { border-bottom: 1px dotted #888; cursor: help }
    div.idesc { margin-top: 0.3em; font-size: 85%; color: #888; }
    .ismaller { font-size: 85%; }
    #hadoop_comparison { border-collapse: collapse; margin-top: 15px; }
    #hadoop_comparison th { font-size: 90%; vertical-align: top;
    border-right: 1px solid #ddd; }
    #hadoop_comparison td { font-size: 90%; vertical-align: top;
    margin-right: .5em; padding: 0.5em;
    border-right: 1px solid #ddd; border-top: 1px solid #ddd;
    margin-top: 0; margin-bottom: 0;
    }
    #hadoop_comparison th.ivendor { padding: 0 0.25em 0.25em 0.25em; }
    #hadoop_comparison td.hcat { font-weight: bold; background-color: #eee; }
    #mpp_dbs td { vertical-align: top; font-size: 90%; }
</style>
<p> The big data ecosystem can be confusing. The popularity of &#8220;big data&#8221; as an industry buzzword has created a broad category. As Hadoop steamrolls through the industry, solutions from the business intelligence and data warehousing fields are also attracting the big data label. To confuse matters, Hadoop-based solutions such as Hive are at the same time evolving toward being a competitive data warehousing solution. </p>
<p> Understanding the nature of your big data problem is a helpful first step in evaluating potential solutions. Let&#8217;s remind ourselves of <a href="http://radar.oreilly.com/2012/01/what-is-big-data.html">the     definition of big data</a>:</p>
<blockquote><p> &#8220;Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn&#8217;t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.&#8221;     </p>
</blockquote>
<p> Big data problems vary in how heavily they weigh in on the axes     of volume, velocity and variability. Predominantly structured yet     large data, for example, may be most suited to an analytical     database approach.     </p>
<p> This survey makes the assumption that a data warehousing solution alone is not the answer to your problems, and concentrates on analyzing the commercial Hadoop ecosystem.  We&#8217;ll focus on the solutions that incorporate storage and data processing, excluding those products which only sit above those layers, such as the visualization or analytical workbench software.     </p>
<p>Getting started with Hadoop doesn&#8217;t require a large     investment as the software is open source, and is also available     instantly through the Amazon Web Services cloud. But for     production environments, support, professional services and     training are often required.</p>
<h2>Just Hadoop?</h2>
<p> Apache Hadoop is unquestionably the center of the latest iteration of big data solutions. At its heart, Hadoop is a system for distributing computation among commodity servers. It is often used with the Hadoop Hive project, which layers data warehouse technology on top of Hadoop, enabling ad-hoc analytical queries. </p>
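<p>To make the idea of distributing computation concrete, here is a minimal sketch of the MapReduce pattern that Hadoop implements, written in plain Python rather than against any Hadoop API. The function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration, and everything runs in a single process; on a real cluster the map and reduce calls would be spread across servers, with the framework handling the grouping step in between.</p>

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Assumption: each document stands in for an input split that a
    # cluster would hand to a different server; here we simply loop.
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters", "data data"]))
```

<p>Because each map call looks at only one record and each reduce call at only one key, the work can be split across many commodity machines without the pieces needing to coordinate, which is the property that lets this style of processing scale.</p>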
<p> Big data platforms divide along the lines of their approach to Hadoop. The big data offerings from familiar enterprise vendors incorporate a Hadoop distribution, while other platforms offer Hadoop connectors to their existing analytical database systems. This latter category tends to comprise massively parallel processing (MPP) databases that made their name in big data before Hadoop matured: Vertica and Aster Data. Hadoop&#8217;s strength in these cases is in processing unstructured data in tandem with the analytical capabilities of the existing database on structured data. </p>
<p> Practical big data implementations don&#8217;t in general fall neatly into either structured or unstructured data categories. You will invariably find Hadoop working as part of a system with a relational or MPP database.     </p>
<p> Much as with Linux before it, no Hadoop solution incorporates the raw Apache Hadoop code. Instead, it&#8217;s packaged into distributions. At a minimum, these distributions have been through a testing process, and often include additional components such as management and monitoring tools. The most widely used distributions now come from Cloudera, Hortonworks and MapR. Not every distribution will be commercial, however: the <a href="https://cwiki.apache.org/BIGTOP/index.html">BigTop project</a> aims to create a Hadoop distribution under the Apache umbrella. </p>
<div style="float: left;border-top: thin gray solid;border-bottom: thin gray solid;padding: 20px;margin: 20px 2px;clear: both"><a href="http://www.microsoft.com/sql"><img style="float: left;border: none;padding-right: 10px" src="http://s.radar.oreilly.com/wp-files/2/2011/12/sponsor-ms-sql-server.png" /></a><a href="http://www.microsoft.com/sql"><strong>Microsoft SQL Server</strong></a> is a comprehensive information platform offering enterprise-ready technologies and tools that help businesses derive maximum value from information at the lowest TCO. SQL Server 2012 launches next year, offering a cloud-ready information platform delivering mission-critical confidence, breakthrough insight, and cloud on your terms; find out more at <a href="http://www.microsoft.com/sql">www.microsoft.com/sql</a>.</div>
<h2>Integrated Hadoop systems</h2>
<p> The leading Hadoop enterprise software vendors have aligned their Hadoop products with the rest of their database and analytical offerings. These vendors don&#8217;t require you to source Hadoop from another party, and offer it as a core part of their big data solutions. Their offerings integrate Hadoop into a broader enterprise setting, augmented by analytical and workflow tools.     </p>
<h3>EMC Greenplum</h3>
<div style="float: right;margin: 0 1em 1em 1em;padding: 0.5em;border: 1px solid #888;background-color: #eee;width: 240px">
<div class="ititle">EMC Greenplum</div>
<div class="ibody">
<div class="ihdr">Database	</div>
<div class="ifield"><a href="http://www.greenplum.com/products/greenplum-database">Greenplum Database</a></div>
<div class="ihdr">Deployment options</div>
<div class="ifield">Appliance 	(<a href="http://www.greenplum.com/products/greenplum-dca">Modular Data Computing Appliance</a>),  	    Software 	(Enterprise Linux)  	</div>
<div class="ihdr">Hadoop</div>
<div class="ifield"> 	  Bundled distribution 	    (<a href="http://www.greenplum.com/products/greenplum-hd">Greenplum HD</a>);  	<span class="ifootnote" title="Data warehouse system for Hadoop, supporting querying through an SQL-like language">Hive</span>,  	<span class="ifootnote" title="High level language for driving Hadoop">Pig</span>,  	<span class="ifootnote" title="Distributed coordination service for Hadoop components">Zookeeper</span>,  	<span class="ifootnote" title="NoSQL database for interactive random access to data">HBase</span>   	</div>
<div class="ihdr">NoSQL component</div>
<div class="ifield">HBase</div>
<div class="ihdr">Links</div>
<div class="ifield">  	    <a href="http://www.greenplum.com/products/greenplum-uap">Home page</a>,  	    <a href="http://www.greenplum.com/customers/customer-spotlight">case study</a>  	</div>
</p></div>
</p></div>
<p>   Acquired by EMC, and rapidly taken to the heart of the     company&#8217;s strategy, Greenplum is a relative newcomer to the     enterprise, compared     to other companies in this section. They have turned that to     their advantage in creating an analytic platform, positioned as     taking analytics &#8220;beyond BI&#8221; with agile data science teams.</p>
<p>Greenplum&#8217;s Unified Analytics Platform (UAP) comprises three     elements: the Greenplum MPP database, for structured data; a     Hadoop distribution, Greenplum HD; and <a href="http://www.greenplum.com/products/chorus">Chorus</a>, a     productivity and groupware layer for data science teams.</p>
<p>The HD Hadoop layer builds on MapR&#8217;s Hadoop compatible     distribution, which replaces the file system with a faster     implementation and provides other features for     robustness. Interoperability between HD and Greenplum Database     means that a single query can access both database and Hadoop data.</p>
<p>Chorus is a unique feature, and is indicative of Greenplum&#8217;s commitment     to the idea of data science and the importance of the agile team     element to effectively exploiting big data. It supports     organizational roles from analysts, data scientists and DBAs     through to executive business stakeholders. </p>
<p>As befits EMC&#8217;s role in the data center market, Greenplum&#8217;s UAP is     available in a modular appliance configuration.</p>
<h3>IBM</h3>
<div style="float: right;margin: 0 1em 1em 1em;padding: 0.5em;border: 1px solid #888;background-color: #eee;width: 240px">
<div class="ititle">IBM InfoSphere</div>
<div class="ibody">
<div class="ihdr"> 	  Database 	</div>
<div class="ifield">  	    <a href="http://www.ibm.com/software/data/db2/">DB2</a>  	</div>
<div class="ihdr">Deployment options</div>
<div class="ifield">  	    Software 	(Enterprise Linux), Cloud  	</div>
<div class="ihdr">Hadoop</div>
<div class="ifield"> 	  Bundled distribution 	    (<a href="http://www-01.ibm.com/software/data/infosphere/biginsights/">InfoSphere BigInsights</a>);  	<span class="ifootnote" title="Data warehouse system for Hadoop, supporting querying through an SQL-like language">Hive</span>,  	<span class="ifootnote" title="Workflow and coordination for managing Hadoop jobs">Oozie</span>,  	<span class="ifootnote" title="High level language for driving Hadoop">Pig</span>,  	<span class="ifootnote" title="Distributed coordination service for Hadoop components">Zookeeper</span>,  	<span class="ifootnote" title="Rich data serialization">Avro</span>,  	<span class="ifootnote" title="Log data importer">Flume</span>,  	<span class="ifootnote" title="NoSQL database for interactive random access to data">HBase</span>,  	<span class="ifootnote" title="Full text indexing and search">Lucene</span>   	</div>
<div class="ihdr">NoSQL component</div>
<div class="ifield">HBase</div>
<div class="ihdr">Links</div>
<div class="ifield">  	    <a href="http://www-01.ibm.com/software/data/">Home page</a>,  	    <a href="http://www-01.ibm.com/software/success/cssdb.nsf/advancedsearchVW?SearchView&amp;Query=%5BWebSiteProfileListTX%5D=dmmain+AND+(InfoSphere%20BigInsights)+AND+%5BCompletedDate%5D%3E01-01-2002&amp;site=dmmain&amp;cty=en_us&amp;frompage=ts&amp;start=1&amp;count=10">case study</a>  	</div>
</p></div>
</p></div>
<p>IBM&#8217;s <a href="http://www-01.ibm.com/software/data/infosphere/biginsights/">InfoSphere     BigInsights</a> is their Hadoop distribution, and part of a suite     of products offered under the &#8220;InfoSphere&#8221; information management     brand. Everything big data at IBM is helpfully labeled     Big, appropriately enough for a company affectionately known as &#8220;Big     Blue.&#8221;</p>
<p>BigInsights augments Hadoop with a variety of features,     including     management and administration tools. It also offers textual analysis tools     that aid with entity resolution &mdash; identifying people, addresses,     phone numbers and so on.</p>
<p>IBM&#8217;s Jaql query language provides a point of integration     between Hadoop and other IBM products, such as relational databases     or Netezza data warehouses.</p>
<p>InfoSphere BigInsights is interoperable with IBM&#8217;s other     database and warehouse products, including DB2, Netezza and its     InfoSphere warehouse and analytics lines. To aid analytical     exploration, BigInsights ships with BigSheets, a spreadsheet     interface onto big data.</p>
<p>IBM addresses streaming big data separately through its <a href="http://www-01.ibm.com/software/data/infosphere/streams/">InfoSphere     Streams</a> product. BigInsights is not currently offered in an     appliance form, but can be used in the cloud via Rightscale, Amazon, Rackspace, and IBM Smart Enterprise Cloud.</p>
<h3>Microsoft</h3>
<div style="float: right;margin: 0 1em 1em 1em;padding: 0.5em;border: 1px solid #888;background-color: #eee;width: 240px">
<div class="ititle">Microsoft</div>
<div class="ibody">
<div class="ihdr">Database	</div>
<div class="ifield"> 	    <a href="http://www.microsoft.com/sqlserver/en/us/default.aspx">SQL Server</a> 	  </div>
<div class="ihdr">Deployment options</div>
<div class="ifield">  	    Software 	(Windows Server),  	    Cloud 	(<span class="ifootnote" title="Azure can also be used to extend on-premise resources.">Windows Azure Cloud</span>)  	</div>
<div class="ihdr">Hadoop</div>
<div class="ifield"> 	  Bundled distribution 	    (<a href="http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/big-data-solution.aspx">Big Data Solution</a>);  	<span class="ifootnote" title="Data warehouse system for Hadoop, supporting querying through an SQL-like language">Hive</span>,  	<span class="ifootnote" title="High level language for driving Hadoop">Pig</span>   	</div>
<div class="ihdr">Links</div>
<div class="ifield">  	    <a href="http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence.aspx">Home page</a>,  	    <a href="http://corp.klout.com/blog/2011/11/big-data-bigger-brains/">case study</a>  	</div>
</p></div>
</p></div>
<p>Microsoft have adopted Hadoop as the center of their big data offering, and are pursuing an integrated approach aimed at making big data available through their analytical tool suite, including the familiar tools of Excel and PowerPivot. </p>
<p>Microsoft&#8217;s <a href="http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/big-data-solution.aspx">Big Data Solution</a> brings Hadoop to the Windows Server platform, and in elastic form to their cloud platform Windows Azure. Microsoft have packaged their own distribution of Hadoop, integrated with Microsoft System Center and Active Directory. They intend to contribute back changes to Apache Hadoop to ensure that an open source version of Hadoop will run on Windows. </p>
<p>On the server side, Microsoft offer integrations to their SQL Server database and their data warehouse product. Using their warehouse solutions isn&#8217;t mandated, however. The Hadoop Hive data warehouse is part of the Big Data Solution, including connectors from Hive to ODBC and Excel.</p>
<p>Microsoft&#8217;s focus on the developer is evident in their creation     of a JavaScript API for Hadoop. Using JavaScript, developers can     create Hadoop jobs for MapReduce, Pig or Hive, even from a     browser-based environment. Visual Studio and .NET integration     with Hadoop is also provided.</p>
<p>Deployment is possible either on the server or in the cloud, or as a hybrid combination. Jobs written against the Apache Hadoop distribution should migrate with minimal changes to Microsoft&#8217;s environment.</p>
<h3>Oracle</h3>
<div style="float: right;margin: 0 1em 1em 1em;padding: 0.5em;border: 1px solid #888;background-color: #eee;width: 240px">
<div class="ititle">Oracle Big Data</div>
<div class="ibody">
<div class="ihdr">Deployment options</div>
<div class="ifield">  	    Appliance 	(<a href="http://www.oracle.com/us/products/database/big-data-appliance/overview/index.html">Oracle Big Data Appliance</a>)  	</div>
<div class="ihdr">Hadoop</div>
<div class="ifield"> 	  Bundled distribution 	    (<a href="http://www.cloudera.com/hadoop/">Cloudera&#039;s Distribution including Apache Hadoop</a>);  	<span class="ifootnote" title="Data warehouse system for Hadoop, supporting querying through an SQL-like language">Hive</span>,  	<span class="ifootnote" title="Workflow and coordination for managing Hadoop jobs">Oozie</span>,  	<span class="ifootnote" title="High level language for driving Hadoop">Pig</span>,  	<span class="ifootnote" title="Distributed coordination service for Hadoop components">Zookeeper</span>,  	<span class="ifootnote" title="Rich data serialization">Avro</span>,  	<span class="ifootnote" title="Log data importer">Flume</span>,  	<span class="ifootnote" title="NoSQL database for interactive random access to data">HBase</span>,  	<span class="ifootnote" title="Transfer data between relational databases and Hadoop">Sqoop</span>,  	<span class="ifootnote" title="Scalable machine-learning and data mining">Mahout</span>,  	<span class="ifootnote" title="Cloud-neutral API for running services">Whirr</span>   	</div>
<div class="ihdr">NoSQL component</div>
<div class="ifield"><a href="http://www.oracle.com/us/products/database/nosql/overview/index.html">Oracle NoSQL Database</a></div>
<div class="ihdr">Links</div>
<div class="ifield">  	    <a href="http://www.oracle.com/us/technologies/big-data/index.html">Home page</a>  	</div>
</p></div>
</p></div>
<p>Announcing their entry into the big data market at the end of     2011, Oracle is taking an appliance-based approach. Their     <a href="http://www.oracle.com/us/products/database/big-data-appliance/overview/index.html">Big     Data Appliance</a> integrates Hadoop, R for analytics, a new     Oracle NoSQL database, and connectors to Oracle&#8217;s     database and Exadata data warehousing product line.</p>
<p>Oracle&#8217;s approach caters to the high-end enterprise market, and particularly leans to the rapid-deployment, high-performance end of the spectrum. Oracle is the only vendor to include the popular R analytical language integrated with Hadoop, and to ship a NoSQL database of its own design as opposed to Hadoop HBase.</p>
<p>Rather than developing their own Hadoop distribution, Oracle     have partnered with Cloudera for Hadoop support, which brings them     a mature and established Hadoop solution. Database connectors     again promote the integration of structured Oracle data with the     unstructured data stored in Hadoop HDFS.</p>
<p>Oracle&#8217;s <a href="http://www.oracle.com/us/products/database/nosql/overview/index.html">NoSQL     Database</a> is a scalable key-value database, built on the     Berkeley DB technology. In that, Oracle owes double gratitude to     Cloudera CEO Mike Olson, as he was previously the CEO of     Sleepycat, the creators of Berkeley DB. Oracle are positioning     their NoSQL database as a means of acquiring big data prior to     analysis.</p>
<p>The <a href="http://www.oracle.com/us/corporate/features/features-oracle-r-enterprise-498732.html">Oracle R Enterprise</a> product offers direct integration into     the Oracle database, as well as Hadoop, enabling R scripts to run     on data without having to round-trip it out of the data stores.</p>
<h3>Availability</h3>
<p>While IBM and Greenplum&#8217;s offerings are available at the time     of writing, the Microsoft and Oracle solutions are expected to be     fully available early in 2012.</p>
<h2>Analytical databases with Hadoop connectivity</h2>
<p>MPP (massively parallel processing) databases are specialized for processing structured big data, as distinct from the unstructured data that is Hadoop&#8217;s specialty. Along with Greenplum, Aster Data and Vertica were early pioneers of big data products, predating the mainstream emergence of Hadoop.</p>
<p>These MPP solutions are databases specialized for analytical workloads and data integration, and provide connectors to Hadoop and data warehouses. A recent spate of acquisitions has seen these products become the analytical play by data warehouse and storage vendors: Teradata acquired Aster Data, EMC acquired Greenplum, and HP acquired Vertica.</p>
<h3>Quick facts</h3>
<p>   <!-- Aster Data infobox -->
<div style="float: left;margin: 0 10px 15px 0;padding: 0.5em;border: 1px solid #888;background-color: #eee;width: 180px;height: 350px">
<div class="ititle">Aster Data</div>
<div class="ibody">
<div class="ihdr"> 	    Database 	  </div>
<div class="ifield">MPP analytical database</div>
<div class="ihdr">Deployment options</div>
<div class="ifield">  	    Appliance 	(<a href="http://www.asterdata.com/product/appliance.php">Aster MapReduce Appliance</a>),  	    Software 	(<a href="http://www.asterdata.com/product/deployment/software.php">Enterprise Linux</a>),  	    Cloud 	(<a href="http://www.asterdata.com/product/deployment/cloud.php">Amazon EC2, Terremark and Dell Clouds</a>)  	</div>
<div class="ihdr">Hadoop</div>
<div class="ifield">  	    Hadoop connector available  	</div>
<div class="ihdr">Links</div>
<div class="ifield">  	    <a href="http://www.asterdata.com/">Home page</a>  	</div>
</p></div>
</p></div>
<p>  <!-- ParAccel infobox -->
<div style="float: left;margin: 0 10px 15px 0;padding: 0.5em;border: 1px solid #888;background-color: #eee;width: 180px;height: 350px">
<div class="ititle">ParAccel</div>
<div class="ibody">
<div class="ihdr"> 	    Database 	  </div>
<div class="ifield">MPP analytical database</div>
<div class="ihdr">Deployment options</div>
<div class="ifield">  	    Software 	(<a href="http://www.paraccel.com/technology/paraccel-products-enterprise-edition.php">Enterprise Linux</a>),  	    Cloud 	(<a href="http://www.paraccel.com/technology/paraccel-products-cloud-edition.php">Cloud Edition</a>)  	</div>
<div class="ihdr">Hadoop</div>
<div class="ifield">  	    Hadoop integration available  	</div>
<div class="ihdr">Links</div>
<div class="ifield">  	    <a href="http://paraccel.com/">Home page</a>,  	    <a href="http://www.paraccel.com/resources/case-studies.php">case study</a>  	</div>
</p></div>
</p></div>
<p>   <!-- Vertica infobox -->
<div style="float: left;margin: 0 0 15px 0;padding: 0.5em;border: 1px solid #888;background-color: #eee;width: 180px;height: 350px">
<div class="ititle">Vertica</div>
<div class="ibody">
<div class="ihdr"> 	    Database 	  </div>
<div class="ifield">MPP analytical database</div>
<div class="ihdr">Deployment options</div>
<div class="ifield">  	    Appliance 	(<a href="http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c02781514">HP Vertica Appliance</a>),  	    Software 	(Enterprise Linux),  	    Cloud 	(<a href="http://www.vertica.com/vertica-for-the-cloud-and-virtualization/"><span class="ifootnote" title="Vertica is deployable on Amazon EC2 and on VMware virtualization infrastructure">Cloud and Virtualized</span></a>)  	</div>
<div class="ihdr">Hadoop</div>
<div class="ifield">  	    Hadoop and Pig connectors available  	</div>
<div class="ihdr">Links</div>
<div class="ifield">  	    <a href="http://vertica.com/">Home page</a>,  	    <a href="http://www.vertica.com/customers/case-studies/">case study</a>  	</div>
</p></div>
</p></div>
<p> <br />
<h2>Hadoop-centered companies</h2>
<p>Directly employing Hadoop is another route to creating a big     data solution, especially where your infrastructure doesn&#8217;t fall     neatly into the product line of major vendors.  Practically every     database now features Hadoop connectivity, and there are multiple     Hadoop distributions to choose from.</p>
<p>Reflecting the developer-driven ethos of the big data world,     Hadoop distributions are frequently offered in a community edition.     Such editions lack enterprise management features, but contain  all     the functionality needed for evaluation and development.</p>
<p>The first iterations of Hadoop distributions, from Cloudera and IBM, focused on usability and administration. We are now seeing the addition of performance-oriented improvements to Hadoop, such as those from MapR and Platform Computing. While maintaining API compatibility, these vendors replace slow or fragile parts of the Apache distribution with better performing or more robust components.</p>
<h3>Cloudera</h3>
<p>The longest-established provider of Hadoop distributions,     <a href="http://www.cloudera.com/">Cloudera</a> provides an     enterprise Hadoop solution, alongside     services, training and support options. Along with     Yahoo, Cloudera have made deep open source contributions to Hadoop, and     through hosting industry conferences have done much to establish     Hadoop in its current position.</p>
<h3>Hortonworks</h3>
<p>Though a recent entrant to the market, <a href="http://www.hortonworks.com/">Hortonworks</a> have a long     history with Hadoop. Spun off from Yahoo, where Hadoop     originated, Hortonworks aims to stick close to and promote the     core Apache Hadoop technology. Hortonworks also have a partnership     with Microsoft to assist and accelerate their Hadoop     integration.</p>
<p>Hortonworks <a href="http://hortonworks.com/technology/hortonworksdataplatform/">Data     Platform</a> is currently in a limited preview phase, with a     public preview expected in early 2012. The company also provides     support and training.</p>
<h3>An overview of Hadoop distributions</h3>
<p><em>(Note: The following table is embedded in an iframe. <a href="http://cdn.oreilly.com/radar/2012/08/hadoop-comparison-table.html">Click here</a> to see the full version.)</em></p>
<h2>Notes</h2>
<ul>
<li>  <strong>Pure cloud solutions</strong>: Both Amazon Web Services 	and Google offer cloud-based big data solutions. These will be 	reviewed separately. </li>
<li> <strong>HPCC</strong>: Though dominant, Hadoop is not the only big data solution. LexisNexis&#8217; <a href="http://hpccsystems.com/">HPCC</a> offers an alternative approach. </li>
<li> <strong>Hadapt</strong>: not yet featured in this survey. 	Taking a different approach from 	both Hadoop-centered and MPP solutions, <a href="http://hadapt.squarespace.com/product-overview/">Hadapt</a> 	integrates unstructured and structured data into one 	product: wrapping rather than exposing Hadoop. It is currently in &#8220;early access&#8221; stage. </li>
<li> <strong>NoSQL</strong>: Solutions built  on databases such as 	Cassandra, MongoDB and Couchbase are not in the scope of this 	survey, though these databases do offer Hadoop integration. </li>
<li>  <strong>Errors and omissions</strong>: 	given the fast-evolving nature of the market and variable 	quality of public information, any feedback about errors and 	omissions from this survey is most welcome. Please send it to 	<a href="mailto:edd+bigdata@oreilly.com?subject=Hadoop%20survey%20feedback">edd+bigdata@oreilly.com</a>. </li>
</ul>
<p><strong>Related:</strong></p>
<ul>
<li> <a href="http://radar.oreilly.com/2011/06/getting-started-with-hadoop.html">Get started with Hadoop: From evaluation to your first production cluster</a></li>
<li> <a href="http://radar.oreilly.com/2011/01/what-is-hadoop.html">Hadoop: What it is, how it works, and what it can do</a></li>
<li> <a href="http://radar.oreilly.com/2010/06/what-is-data-science.html">What is data science?</a></li>
<li> <a href="http://radar.oreilly.com/2012/01/what-is-big-data.html">What is big data?</a></li>
<li> <a href="http://radar.oreilly.com/2011/09/building-data-science-teams.html">Building data science teams</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2012/01/big-data-ecosystem.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
