<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Strata &#187; Ben Lorica</title>
	<atom:link href="http://strata.oreilly.com/ben/feed" rel="self" type="application/rss+xml" />
	<link>http://strata.oreilly.com</link>
	<description>Making Data Work</description>
	<lastBuildDate>Fri, 24 May 2013 20:31:52 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Improving options for unlocking your graph data</title>
		<link>http://strata.oreilly.com/2013/05/improving-options-for-unlocking-your-graph-data.html</link>
		<comments>http://strata.oreilly.com/2013/05/improving-options-for-unlocking-your-graph-data.html#comments</comments>
		<pubDate>Sun, 19 May 2013 16:00:24 +0000</pubDate>
		<dc:creator>Ben Lorica</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[machine]]></category>
		<category><![CDATA[social graph]]></category>
		<category><![CDATA[social network analysis]]></category>
		<category><![CDATA[spark]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=57388</guid>
		<description><![CDATA[The popular open source project GraphLab received a major boost early this week when a new company comprised of its founding developers raised funding to develop analytic tools for graph data sets. GraphLab Inc. will continue to use the open &#8230; ]]></description>
				<content:encoded><![CDATA[<p>The popular open source project <a href="http://graphlab.org/">GraphLab</a> received a major boost early this week when a new company comprised of its founding developers <a href="http://graphlab.com/press/">raised funding</a> to develop analytic tools for graph data sets. <a href="http://graphlab.com/">GraphLab Inc.</a> will continue to use the open source GraphLab to &#8220;push the limits of graph computation and develop new ideas&#8221;, but having a commercial company will accelerate development and allow it to hire people dedicated to improving usability and documentation.</p>
<p>While social media placed graph data on the radar of many companies, similar data sets can be found in many domains including the life and health sciences, security, and financial services. Graph data is different enough that it necessitates special tools and techniques. In the past, those tools were too complex for casual users, so graph data analytics was the province of specialists. Fortunately, graph data is an area that has attracted many enthusiastic entrepreneurs and developers. The tools have improved and I expect things to get much easier for users in the future. A great place to learn more about tools for graph data is the upcoming <a href="http://graphlab.org/graphlab-workshop-2013/">GraphLab Workshop</a> (on July 1st in SF).</p>
<p><b>Data wrangling: creating graphs</b><br />
Before you can take advantage of the other tools mentioned in this post, you&#8217;ll need to turn your data (e.g., web pages) into graphs. <a href="https://01.org/graphbuilder/">GraphBuilder</a> is an open source project from Intel that uses Hadoop MapReduce<sup>1</sup> to build graphs out of large data sets. Another option is the combination of GraphX and Spark <a href="#gx">described below</a>. (A startup called <a href="http://trifacta.com/">Trifacta</a> is building a general-purpose data wrangling tool that could help as well.)</p>
<p><span id="more-57388"></span></p>
<p><b>Data management and search</b><br />
Once you have a graph, there are many options for how to store it. The choice of database largely depends on the amount of data (number of nodes and edges, along with the size of the data associated with them), the types of tasks (pattern-matching and search, analytics), and the workload. In the course of evaluating alternatives to MySQL (for storing social graph data), Facebook&#8217;s engineering team developed and released <a href="https://github.com/facebook/linkbench">LinkBench</a> &#8211; a benchmark that can be used to study how graph databases handle production workloads.</p>
<p>Most <a href="http://en.wikipedia.org/wiki/Graph_database#Graph_database_features">graph databases</a> (such as <a href="http://www.neo4j.org/">Neo4j</a><sup>2</sup>, <a href="http://www.franz.com/agraph/allegrograph/">AllegroGraph</a>, <a href="http://yarcdata.com/Products/">Yarcdata</a>, and <a href="http://www.objectivity.com/infinitegraph">InfiniteGraph</a>) come with tools for facilitating and speeding up search &#8211; Neo4j comes with a simple query language (<a href="http://www.neo4j.org/learn/cypher">Cypher</a>) for search, while other graph databases support <a href="http://en.wikipedia.org/wiki/SPARQL">SPARQL</a>. The <a href="http://thinkaurelius.github.io/titan/">Titan</a> distributed graph database supports different storage engines (including HBase and Cassandra) and comes with tools for search and traversal (based on Lucene and <a href="https://github.com/tinkerpop/gremlin/wiki">Gremlin</a>). Used by Twitter to store graph data, <a href="https://github.com/twitter/flockdb">FlockDB</a> targets operations involving <a href="http://engineering.twitter.com/2010/05/introducing-flockdb.html">adjacency</a> lists.</p>
<p>Among Hadoop users <a href="http://www.cloudera.com/content/cloudera/en/resources/library/hbasecon/video-hbasecon-2012-storing-and-manipulating-graphs-in-hbase.html">HBase is a popular option for storing graph data</a>. Hadapt&#8217;s analytic platform<sup>3</sup> integrates Apache Hadoop and SQL, and now also <a href="http://hadapt.com/product/">supports graph analysis</a>.</p>
<p><b>Graph-parallel frameworks</b>: Pregel, PowerGraph, and <a name="gx">GraphX</a><br />
<a href="http://en.wikipedia.org/wiki/Bulk_Synchronous_Parallel">BSP</a> is a parallel computing model that has inspired many graph analytics tools. Much as Hadoop provides <i>map</i> and <i>reduce</i>, <a href="http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html">Pregel</a><sup>4</sup>, <a href="http://giraph.apache.org/">Giraph</a>, and <a href="http://hyracks.org/projects/pregelix/">Pregelix</a> come with <i>primitives</i> that let neighboring nodes send/receive messages to one another, or change the state of a node (based on the state of its neighboring nodes). Efficient graph algorithms are a sequence of iterations built from such primitives. GraphLab uses similar primitives (called <i><a href="http://www.select.cs.cmu.edu/publications/paperdir/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf">PowerGraph</a></i>) but allows for <a href="http://www.select.cs.cmu.edu/publications/paperdir/vldb2012-low-gonzalez-kyrola-bickson-guestrin-hellerstein.pdf"><i>asynchronous</i> iterative computations</a>, leading to an expanded set of (potentially) faster algorithms.</p>
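<p>To make the idea of message-passing primitives concrete, here is a minimal single-process sketch in Python. It is not the API of Pregel, Giraph, Pregelix, or GraphLab; the function name <i>propagate_max</i> and the dictionary-based graph format are assumptions made purely for illustration. Each vertex sends its value to its neighbors, and in every later superstep a vertex that receives a larger value adopts it and passes it on. The computation stops when no messages are in flight.</p>

```python
# Illustrative sketch of BSP-style supersteps; names and input format are hypothetical.

def propagate_max(neighbors, values):
    """Spread the largest vertex value through an undirected graph.

    neighbors: dict that maps each vertex to a list of adjacent vertices
    values:    dict that maps each vertex to its initial numeric value
    Returns a tuple (final values, number of supersteps executed).
    """
    state = dict(values)

    # Superstep 0: every vertex sends its value to each neighbor.
    inbox = {v: [] for v in neighbors}
    for v, adjacent in neighbors.items():
        for n in adjacent:
            inbox[n].append(state[v])
    supersteps = 1

    # Later supersteps: only vertices with incoming messages do any work.
    while any(inbox.values()):
        outbox = {v: [] for v in neighbors}
        for v, messages in inbox.items():
            if not messages:
                continue  # vertex stays halted until a message arrives
            best = max(messages)
            if max(best, state[v]) != state[v]:
                # A larger value arrived: adopt it and tell the neighbors.
                state[v] = best
                for n in neighbors[v]:
                    outbox[n].append(best)
        # Synchronization barrier: messages become visible in the next superstep.
        inbox = outbox
        supersteps += 1

    return state, supersteps


if __name__ == "__main__":
    path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
    start = {"a": 3, "b": 6, "c": 2, "d": 1}
    print(propagate_max(path, start))
```

<p>The barrier between supersteps is what makes this synchronous; an asynchronous engine would let a vertex act on new values as soon as they arrive.</p>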
<p><a href="https://amplab.cs.berkeley.edu/publication/graphx-grades/">GraphX is a new, fault-tolerant framework</a> that runs within <a href="http://spark-project.org/">Spark</a>. Its core data structure is an immutable <i>graph</i><sup>5</sup> (Resilient Distributed Graph &#8211; or RDG), and GraphX programs are a sequence of <i>transformations</i> on RDGs (with each transformation yielding a new RDG). Transformations on RDGs can affect nodes, edges, or both (depending on the state of neighboring edges and nodes). GraphX greatly enhances productivity by simplifying a range of tasks (graph loading, construction, transformation, and computations). But it does so at the expense of performance: early prototype algorithms written in GraphX were slower<sup>6</sup> than those written in GraphLab/PowerGraph.</p>
<p><b>Machine-learning and analytics</b><br />
Machine-learning tools that target graph data lead to familiar applications such as detecting influential users (PageRank) and communities, fraud detection, and recommendations (<a href="http://strata.oreilly.com/2012/12/graphchi-graph-analytics-over-billions-of-edges-using-your-laptop.html">collaborative filtering is popular among GraphLab users</a>). Moreover techniques developed in one domain are often reused in other settings. Besides GraphLab, distributed analytics have been implemented in <a href="http://engineering.linkedin.com/open-source/apache-giraph-framework-large-scale-graph-processing-hadoop-reaches-01-milestone">Giraph</a>, <a href="https://amplab.cs.berkeley.edu/publication/graphx-grades/">GraphX</a>, <a href="http://hortonworks.com/blog/big-graph-data-on-hortonworks-data-platform/">Faunus</a>, and <a href="http://sampa.cs.washington.edu/grappa/overview.html">Grappa</a>. In addition, graph databases like Neo4j and Yarcdata come with some analytic capabilities. As I noted in a <a href="http://strata.oreilly.com/2013/04/single-server-systems-can-tackle-big-data.html">recent post, open source, single-node systems like Twitter&#8217;s Cassovary</a><sup>7</sup> are being used for computations involving massive graphs.</p>
<p><b>Visualization</b><br />
When you&#8217;re dealing with large graphs, being able to zoom in/out helps with clutter, but so do <a href="http://www2.research.att.com/~yifanhu/GALLERY/GRAPHS/">clever layout algorithms</a>. Popular tools for visualizing nodes and edges include <a href="https://gephi.org/">Gephi</a> and <a href="http://graphviz.org/">GraphViz</a>. Users who want to customize their graphs turn to packages like <a href="https://github.com/mbostock/d3/wiki/Gallery">d3</a>.</p>
<hr />
<p><small><br />
(1) I would love to see a version of GraphBuilder that&#8217;s built on top of <a href="http://strata.oreilly.com/2012/08/seven-reasons-why-i-like-spark.html">Spark</a>.<br />
(2) Many of these systems are quite efficient. For example a single instance of Neo4j <a href="http://blog.neo4j.org/2013/01/2013-whats-coming-next-in-neo4j.html">can handle very large graphs</a> (&#8220;into the tens of billions of nodes/ relationships/ properties&#8221;).<br />
(3) Note that using standard Hadoop for graph <i>processing</i> <a href="http://dbmsmusings.blogspot.com/2011/07/hadoops-tremendous-inefficiency-on.html">may not be the most efficient</a> option. This <a href="http://www.slideshare.net/cloudera/hadoop-and-graph-data-management-challenges-and-opportunities-daniel-abadi-yale-university-hadapt">talk by Hadapt co-founder Daniel Abadi</a> describes an advanced approach to graph analysis using Hadoop.<br />
(4) Related frameworks include <a href="http://wwwrel.ph.utexas.edu/Members/jon/golden_orb/">GoldenOrb</a> and <a href="http://hama.apache.org/">Hama</a>.<br />
(5) Resilient Distributed Graphs (RDG) extend Spark’s Resilient Distributed Dataset (RDD).<br />
(6) <a href="https://amplab.cs.berkeley.edu/publication/graphx-grades/">As the developers of GraphX note</a>: <i>&#8220;We emphasize that it is not our intention to beat PowerGraph in performance. &#8230; We believe that the loss in performance may, in many cases, be ameliorated by the gains in productivity achieved by the GraphX system. &#8230; It is our belief that we can shorten the gap in the near future, while providing a highly usable <u>interactive</u> system for graph data mining and computation&#8221;</i><br />
(7) On the plus side, being single-node means Cassovary doesn&#8217;t have to deal with finding the optimal way to partition a graph. On the other hand, it is limited to graphs that fit in the memory of a server &#8211; a limitation it alleviates through the use of efficient data structures.<br />
</small></p>
<div style="float: left;border-top: thin gray solid;border-bottom: thin gray solid;padding: 20px;margin: 20px 2px;clear: both">
<p><a href="http://strataconf.com/?intcmp=il-strata-stny13-blog-promo"><img style="float: left;border: none;padding-right: 10px" alt="" src="http://cdn.oreilly.com/radar/images/promos/2013-strata-rx-london-ny.gif" /></a><a href="http://strataconf.com/?intcmp=il-strata-stny13-blog-promo"><strong>O&#8217;Reilly Strata Conference</strong></a> — Strata brings together the leading minds in data science and big data — decision makers and practitioners driving the future of their businesses and technologies. Get the skills, tools, and strategies you need to make data work.</p>
<p><a href="http://strataconf.com/rx2013?intcmp=il-strata-strx13-strata-blog-banner-148x178">Strata Rx Health Data Conference</a>: September 25-27 | Boston, MA<br />
<a href="http://strataconf.com/stratany2013?intcmp=il-strata-stny13-blog-promo">Strata + Hadoop World</a>: October 28-30 | New York, NY<br />
<a href="http://strataconf.com/strataeu2013/?intcmp=il-strata-steu13-blog-promo">Strata in London</a>: November 15-17 | London, England</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/05/improving-options-for-unlocking-your-graph-data.html/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>11 Essential Features that Visual Analysis Tools Should Have</title>
		<link>http://strata.oreilly.com/2013/05/11-essential-features-that-visual-analysis-tools-should-have.html</link>
		<comments>http://strata.oreilly.com/2013/05/11-essential-features-that-visual-analysis-tools-should-have.html#comments</comments>
		<pubDate>Sun, 12 May 2013 16:00:16 +0000</pubDate>
		<dc:creator>Ben Lorica</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[data visualization]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=57269</guid>
		<description><![CDATA[After recently playing with SAS Visual Analytics, I&#8217;ve been thinking about tools for visual analysis. By visual analysis I mean the type of analysis most recently popularized by Tableau, QlikView, and Spotfire: you encounter a data set for the first &#8230; ]]></description>
				<content:encoded><![CDATA[<p>After recently <a href="https://twitter.com/bigdata/status/329328698140553217">playing with SAS Visual Analytics</a>, I&#8217;ve been thinking about tools for visual analysis. By <i>visual analysis</i> I mean the type of analysis most recently popularized by <a href="http://www.tableausoftware.com/">Tableau</a>, <a href="http://www.qlikview.com/">QlikView</a>, and <a href="http://spotfire.tibco.com/">Spotfire</a>: you encounter a data set for the first time and conduct <a href="http://en.wikipedia.org/wiki/Exploratory_data_analysis">exploratory data analysis</a> with the goal of discovering interesting patterns and associations. Having used a few visualization tools myself, I&#8217;ve put together a quick <span style="text-decoration: underline">wish-list</span> of features (culled from tools I&#8217;ve used or have seen in action).</p>
<p><b>Requires little (to no) coding</b><br />
The viz tools I currently use require programming skills. Coding means switching back-and-forth between a visual (chart) and text (code). It&#8217;s nice<sup>1</sup> to be able to customize charts via code, but when you&#8217;re in the exploratory phase not having to think about code syntax is ideal. Plus GUI-based tools allow you to collaborate with many more users.</p>
<p><span id="more-57269"></span></p>
<p><b>Includes an expanded set of basic charts</b><br />
Aside from statistical graphics (line, bar, scatter, histogram, bubble, boxplot,&#8230;), these days the ability to visualize hierarchical (<a href="http://en.wikipedia.org/wiki/Treemapping">treemap</a>), financial (stock charts), longitudinal, geospatial (maps), and <a href="http://en.wikipedia.org/wiki/Graph_drawing">network</a> data is essential.</p>
<p><b>Charts are easy to customize</b><br />
It should be easy to tweak labels, colors, and other elements. There are times when default labels need to be resized or repositioned, to make them legible. You should also be able to adjust coloring schemes to your liking (colors are usually assigned based on category or, in the case of heat maps, value).</p>
<p><b>Templates can be created</b><br />
Once you create a chart with your preferred color and labeling scheme, you should be able to templatize it for future projects. [Ideally templates support rule-based formatting ("if negative, color = red"), but this starts to involve some coding.]</p>
<p><b>Visual <i>summaries</i> are easy to generate</b> (histograms, <a href="http://en.wikipedia.org/wiki/Distance_correlation">association matrix</a>)<br />
You&#8217;ll be exploring data sets that contain many observations (rows) and variables (columns). SAS Visual Analytics produces a quick <i>summary</i> (average, min/max, histogram) for <i>each</i> variable and displays the results in a compact, scrollable format. This is done entirely through a GUI and doesn&#8217;t require any coding.</p>
<p><b>Drill-down to source points: identify, isolate, and fix minor data errors</b><br />
Visual summaries<sup>2</sup> alert you to potential problems with your data (outliers or errors). A few tools give you the ability to isolate outliers or fix simple data problems through a GUI. More generally, it&#8217;s nice to be able to drill-down from the chart to <i>examine</i> (via dynamic rollover or other method) the underlying data.</p>
<p><b>In-place filtering</b><br />
While exploring data, you need to be able to quickly filter by value or category &#8211; using checkboxes, drop-downs, sliders, &#8230;</p>
<p><b>Support for <i>visual pivoting</i></b><br />
Many business analysts are heavy users of <a href="http://en.wikipedia.org/wiki/Pivot_table">pivot tables</a> &#8211; a tabular summarization technique found in spreadsheets and reporting tools. Visual pivoting replaces tabular presentation with charts. My first experience using this type of visual exploration was through the <a href="http://stat.bell-labs.com/project/trellis/display.examples.html">Trellis graphs</a> introduced in S/S-Plus. Thanks to <a href="http://www.tableausoftware.com/">Tableau&#8217;s easy-to-use interface</a>, this form of visual analysis has become a popular way to explore data.</p>
<p><b>Support for analytics</b><br />
Many visualization tools lack analytic capabilities. From simple (<a href="http://en.wikipedia.org/wiki/Error_bar">error bar</a>, <a href="http://en.wikipedia.org/wiki/Quantile">quantiles</a>) to advanced (clustering, forecasting, multidimensional scaling<sup>3</sup>), analytic tools expand what users can do. Case in point, SAS Visual Analytics has tools for conducting <i>sensitivity analysis and forecasting</i> (GUI-based, no coding required). An example is to take a given <i>time-series</i> (unit sales), plot a forecast of its behavior for the next six time periods, and study how the forecast varies when other <i>key variables</i> (customer satisfaction) change.</p>
<p><b>Tools for sharing, collaboration, and replication</b><br />
Several tools let you publish<sup>4</sup> your static or interactive charts, and some tools even let you <i>subscribe</i><sup>5</sup> to the work of other users. For sharing, collaboration, and documentation, it should be possible to annotate your work. Being able to collaborate with others would be nice; at a minimum, one should be able to copy (<span style="text-decoration: underline">and</span> modify) the work of another user.</p>
<p><b>Big Data: Volume and <i>Variety</i></b><sup>6</sup><br />
A tool should produce charts <i>quickly</i> even when it&#8217;s hitting massive data sets. Simply put, it should be truly interactive<sup>7</sup>. Several new tools target larger data sets; some are geared specifically for Hadoop users (a partial list includes <a href="http://www.datameer.com">Datameer</a>, <a href="http://www.platfora.com">Platfora</a>, <a href="http://www.sisense.com">SiSense</a>, and <a href="http://www.sas.com/software/visual-analytics/overview.html">SAS Visual Analytics</a>). But there will be occasions when you&#8217;ll be working with small data sets (or be offline). To that end, you should be able to visually explore small data (locally using your laptop) without having to connect to a more powerful environment (such as a cluster or a beefy server).</p>
<p>I haven&#8217;t come across great viz tools for exploring unstructured data, so I&#8217;ll interpret <i>variety</i> in a different way. <i>Co-existence</i> (usually of Hadoop &amp; data warehouses) means data will continue to reside in different systems. Being able to connect to a variety of data sources is essential. (Among startups, Datameer does a <a href="http://www.datameer.com/product/data-integration.html">good job</a> of this.) Some tools include public data sets (e.g., US Census) and use them to generate examples.</p>
<p>Update (5/23/2013): A recent conversation with <a href="http://www.eecs.berkeley.edu/~alspaugh/">Sara Alspaugh</a> inspired the following feature.<br />
<strong>Recommend items worth investigating</strong><br />
When you first encounter a data set with lots of variables, it can be a bit overwhelming. Using simple pattern recognition techniques, tools should surface associations/patterns/anomalies worth investigating. Some tools in finance do this for time-series: trends, new highs/lows, and forecasts are drawn automatically. I&#8217;d love to have suggestions for what visual pivots (trellis charts) to draw.</p>
<hr />
<p><small><br />
(0) Thanks to <a href="http://www.ghostweather.com/bio.html">Lynn Cherny</a> for reviewing an early draft of this post and for suggesting a few features.<br />
(1) Unless of course you have killer programming tools, a la <a href="http://www.youtube.com/watch?v=PUv66718DII">Bret Victor</a>. You can do some of the things described in the post using <a href="http://www.revolutionanalytics.com/products/enterprise-big-data.php">ScaleR from Revolution Analytics</a> &#8211; but it&#8217;s a tool that requires coding in R.<br />
(2) A good example: SAS Visual Analytics displays the number of distinct values of categorical variables. If the number of distinct values is unusually large, you likely have a data quality issue.<br />
(3) Or other tools for handling high-dimensional data sets. Still waiting for a next-gen <a href="http://en.wikipedia.org/wiki/GGobi">ggobi</a>!<br />
(4) Datameer takes this a step further: it has <a href="http://www.datameer.com/apps">an app market</a>.<br />
(5) Some tools even send you realtime alerts when data for charts you&#8217;ve subscribed to have changed.<br />
(6) I omitted <i>Velocity</i> &#8211; the ability to handle streaming data. I consider that a nice, but not a must-have feature for a visual <i>exploration</i> tool. Having said that, I do think the ability to handle realtime updates is essential when you share your work with others. See (5).<br />
(7) When working with truly massive data sets it&#8217;s natural to have some latency. Rather than having users idle while waiting, visual analysis tools should support multiple tabs or workspaces. Most database query tools have this feature: you can work on other queries while a query is still running.<br />
</small></p>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/05/11-essential-features-that-visual-analysis-tools-should-have.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Scalable streaming analytics using a single-server</title>
		<link>http://strata.oreilly.com/2013/05/scalable-streaming-analytics-using-a-single-server.html</link>
		<comments>http://strata.oreilly.com/2013/05/scalable-streaming-analytics-using-a-single-server.html#comments</comments>
		<pubDate>Sun, 05 May 2013 16:00:36 +0000</pubDate>
		<dc:creator>Ben Lorica</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[big data analytics]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[machine learning products]]></category>
		<category><![CDATA[real time]]></category>
		<category><![CDATA[realtime]]></category>
		<category><![CDATA[realtime data]]></category>
		<category><![CDATA[streaming data]]></category>
		<category><![CDATA[streams]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=57056</guid>
		<description><![CDATA[For many organizations, real-time analytics entails complex event processing systems (CEP) or newer distributed stream processing frameworks like Storm, S4, or Spark Streaming. The latter have become more popular because they are able to process massive amounts of data, and &#8230; ]]></description>
				<content:encoded><![CDATA[<p>For many organizations, real-time<sup>1</sup> analytics entails complex event processing systems (CEP) or newer distributed stream processing frameworks like <a href="http://storm-project.net/">Storm</a>, <a href="http://incubator.apache.org/s4/">S4</a>, or <a href="http://spark-project.org/docs/latest/streaming-programming-guide.html">Spark Streaming</a>. The latter have become more popular because they are able to process massive amounts of data, and fit nicely with Hadoop and other cluster computing tools. For these distributed frameworks, peak volume is a function of network topology/bandwidth and the throughput of the individual nodes.</p>
<p>
<b>Scaling up machine-learning: Find efficient algorithms</b><br />
Faced with having to crunch through a massive data set, the first thing a machine-learning expert will try to do is devise a more efficient algorithm. Some popular approaches involve sampling, online learning, and <i>caching</i>. Parallelizing an algorithm tends to be lower on the list of things to try. The key reason is that while there are algorithms that are embarrassingly parallel (e.g., naive Bayes), many others are harder to decouple. But as I highlighted in a <a href="http://strata.oreilly.com/2013/04/single-server-systems-can-tackle-big-data.html">recent post</a>, efficient tools that run on single servers can tackle large data sets. In the machine-learning context, recent examples<sup>2</sup> of efficient algorithms that scale to large data sets can be found in the products of startup <a href="http://www.skytree.net/">SkyTree</a>.</p>
<p><span id="more-57056"></span></p>
<p><b>Use the same approach to scale real-time analytics: consider streamdrill</b><br />
One can use the same strategy for real-time analytics: before opting for distributed systems that need many nodes to pay off, evaluate simpler solutions that implement efficient algorithms for answering common questions. A new system called <a href="https://streamdrill.com/">streamdrill</a> fits this profile (coincidentally it was designed by machine-learning researchers). It&#8217;s a single-server system<sup>3</sup> optimized to answer &#8220;top k questions&#8221; against <i>massive amounts</i> of structured and unstructured data (it&#8217;s capable of ingesting Twitter&#8217;s firehose). Specifically, you can use streamdrill to count and identify the most active entities and events over different time windows<sup>4</sup>. This <a href="http://www.scribd.com/doc/137991394/Online-Learning-with-Stream-Mining">leads to important stream mining applications</a> such as identifying trends (<u>discover</u> &#8220;trending topics&#8221;), anomaly detection (&#8220;alerts&#8221;), correlations, clustering, and classification. Streamdrill also includes<sup>5</sup> a feature (called <i>traces</i>) that lets users easily run queries over extended time windows.</p>
<p>
<b>Unlock your data sooner</b><br />
The simplest and quickest way to mine your event data stream is to deploy efficient algorithms designed to answer key questions at scale. Streamdrill offers a quick path to real-time intelligence through a combination of <em>ease-of-deployment</em> (single server), <em>streaming algorithms</em> (heavy-hitters, anomaly detection, approximate percentiles, and more), and <em>scale</em> (intelligent resource management). As an added bonus streamdrill&#8217;s suite of efficient<sup>6</sup> algorithms for real-time analysis can easily be accessed through a <a href="http://demo.streamdrill.com/docs/?p=api">REST API</a>. </p>
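<p>For readers who want a feel for how a single server can keep &#8220;top k&#8221; answers current, here is a toy Python sketch of a counter whose scores decay exponentially over time and whose memory use is capped by discarding the least active entry. It is only an illustration of the general idea; it is not streamdrill&#8217;s implementation or API, and the class name <i>DecayingTopK</i> and its parameters (<i>half_life</i>, <i>capacity</i>) are invented for this example. It also assumes timestamps never go backwards.</p>

```python
import math


class DecayingTopK:
    """Toy top-k counter: exponentially decayed scores plus a hard size cap.

    Hypothetical illustration only; not streamdrill's API.
    """

    def __init__(self, half_life, capacity):
        self.rate = math.log(2) / half_life  # decay constant derived from the half-life
        self.capacity = capacity             # maximum number of tracked keys
        self.entries = {}                    # key mapped to (score, time of last update)

    def _score_at(self, key, now):
        """Return the decayed score of a tracked key at time `now`."""
        score, last = self.entries[key]
        return score * math.exp(-self.rate * (now - last))

    def add(self, key, now, weight=1.0):
        """Record one event for `key` at time `now` (times must be non-decreasing)."""
        current = self._score_at(key, now) if key in self.entries else 0.0
        self.entries[key] = (current + weight, now)
        if len(self.entries) == self.capacity + 1:
            # Over the cap: drop whichever key is currently least active.
            coldest = min(self.entries, key=lambda k: self._score_at(k, now))
            del self.entries[coldest]

    def top(self, n, now):
        """Return up to n keys, most active first, as of time `now`."""
        ranked = sorted(self.entries,
                        key=lambda k: self._score_at(k, now),
                        reverse=True)
        return ranked[:n]


if __name__ == "__main__":
    counter = DecayingTopK(half_life=10.0, capacity=2)
    counter.add("a", now=0)
    counter.add("a", now=0)
    counter.add("b", now=0)
    counter.add("c", now=100)  # "b" has decayed the most and is evicted
    print(counter.top(2, now=100))
```

<p>Because old activity fades on its own, no explicit window bookkeeping is needed, and the size cap keeps memory bounded no matter how many distinct keys show up in the stream.</p>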
<p>
<b>Related posts</b>:</p>
<ul>
<li><a href="http://strata.oreilly.com/2013/04/single-server-systems-can-tackle-big-data.html">Single server systems can tackle big data</a></li>
<li><a href="http://strata.oreilly.com/2013/04/workflow-tools-enable-the-rapid-deployment-of-models.html">Simpler workflow tools enable the rapid deployment of models</a></li>
</ul>
<hr /><small><br />
(1) For this post, I&#8217;ll use the following <a href="http://events.pentaho.com/Real-Time-Big-Data-Analytics.html">definition from Joe Hellerstein</a>: &#8220;<i>Real-time</i> is for robots. If you have people in the loop, it’s not real time. Most people take a second or two to react, and that’s plenty of time for a traditional transactional system to handle input and output.&#8221;<br />
(2) Other startups, <a href="http://about.wise.io/wiserf.html">Wise.io</a> (single-node) and <a href="http://0xdata.com/">0xdata</a> (distributed), are also starting to make inroads. Note that I&#8217;m also a fan and consumer of distributed algorithms (see previous posts <a href="http://strata.oreilly.com/2013/02/mlbase-scalable-machine-learning-made-accessible.html">here</a> and <a href="http://strata.oreilly.com/2013/03/fast-easy-to-use-scalable-data-science-tools.html">here</a>).<br />
(3) The system as designed can be distributed, and there are plans to do so in the near future.<br />
(4) streamdrill includes &#8220;heavy-hitters&#8221; and other <a href="http://en.wikipedia.org/wiki/Streaming_algorithm#Some_streaming_problems">standard streaming algorithms</a><br />
(5) This feature <a href="http://demo.streamdrill.com">isn&#8217;t available in the demo</a>, and is reminiscent of <a href="http://practicalquant.blogspot.com/2012/11/hokusai-adds-a-temporal-component-to-count-min-sketch.html">hokusai &#8211; the addition of a temporal component to count-min sketch</a> &#8211; introduced by Yahoo! researchers.<br />
(6) <a href="http://blog.mikiobraun.de/2013/01/what-is-streamdrills-trick.html">streamdrill</a> &#8220;uses exponential decay for aggregation&#8221; and &#8220;bounds its resource usage by selectively discarding inactive entries&#8221;.<br />
</small></p>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/05/scalable-streaming-analytics-using-a-single-server.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tachyon: An open source, distributed, fault-tolerant, in-memory file system</title>
		<link>http://strata.oreilly.com/2013/04/tachyon-open-source-distributed-fault-tolerant-in-memory-file-system.html</link>
		<comments>http://strata.oreilly.com/2013/04/tachyon-open-source-distributed-fault-tolerant-in-memory-file-system.html#comments</comments>
		<pubDate>Sun, 28 Apr 2013 16:00:42 +0000</pubDate>
		<dc:creator>Ben Lorica</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[in-memory]]></category>
		<category><![CDATA[spark]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=56628</guid>
		<description><![CDATA[In earlier posts I&#8217;ve written about how Spark and Shark run much faster than Hadoop and Hive by caching data sets in memory. But suppose one wants to share datasets across jobs/frameworks while retaining the speed gains of being in memory? An &#8230; ]]></description>
				<content:encoded><![CDATA[<p>In earlier posts I&#8217;ve written about how <a href="http://strata.oreilly.com/2012/08/seven-reasons-why-i-like-spark.html">Spark</a> and <a href="http://strata.oreilly.com/2012/11/shark-real-time-queries-and-analytics-for-big-data.html">Shark</a> run much faster than Hadoop and Hive by<sup>1</sup> caching data sets in memory. But suppose one wants to share datasets across jobs/frameworks while retaining the speed gains of being in memory? An example would be performing computations using Spark, saving the results, and accessing them from Hadoop MapReduce. An in-memory storage system would speed up sharing across jobs by allowing users to <i>save</i> at near memory speeds. In particular, the main challenge is being able to do memory-speed &#8220;writes&#8221; while maintaining fault-tolerance.</p>
<p>
<b>In-memory storage system from UC Berkeley&#8217;s AMPLab</b><br />
<a href="https://amplab.cs.berkeley.edu/">The team</a> behind the <a href="https://amplab.cs.berkeley.edu/bdas/">BDAS stack</a> recently released a <u>developer preview</u> of <a href="http://tachyon-project.org/">Tachyon</a> &#8211; an in-memory, distributed file system. The current version of Tachyon is written in Java and supports Spark, Shark, and Hadoop MapReduce. Working data sets can be loaded into Tachyon, where they can be accessed at memory speed by many concurrent users. Tachyon implements the HDFS FileSystem interface for standard file operations (such as create, open, read, write, close, and delete).</p>
<p><span id="more-56628"></span></p>
<p>Workloads with working sets that fit into cluster memory derive the most benefit from Tachyon. As I pointed out in <a href="http://strata.oreilly.com/2013/04/single-server-systems-can-tackle-big-data.html">a recent post</a>, in many companies <i>working</i> data sets are in the gigabytes or terabytes. Such data sizes are well within the range of a system like Tachyon.</p>
<p>
<b>High-throughput <i>writes</i> <u>and</u> fault-tolerance: <em>Bounded</em> recovery times using asynchronous checkpointing and lineage</b><br />
A release slated for the summer will include features<sup>2</sup> that enable data sharing (users will be able to do memory-speed <i>writes</i> to Tachyon). With Tachyon, Spark users will have, for the first time, a high-throughput way of reliably sharing files with other users. Moreover, despite being an external storage system, Tachyon performs comparably to Spark&#8217;s internal cache. Throughput tests on a cluster showed that Tachyon can read 200x and write 300x faster than HDFS. (Tachyon can read and write 30x faster than <a href="http://research.microsoft.com/en-us/news/features/minutesort-052112.aspx">FDS&#8217;</a> reported <a href="http://research.microsoft.com/apps/pubs/default.aspx?id=170248">throughput</a>.)</p>
<p>
Similar to the <a href="http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf">resilient distributed datasets</a> (RDDs) that are fundamental to Spark, fault-tolerance in Tachyon also relies<sup>3</sup> on the concept of <i>lineage</i> &#8211; logging the transformations used to build a dataset, and using those logs to rebuild datasets when needed. Additionally, as an <i>external</i> storage system, Tachyon also keeps track of the binary programs used to generate datasets, and the input datasets required by those programs.</p>
<p>
Tachyon achieves higher throughput because it stores a copy of each &#8220;write&#8221; in the memory of a single node, without waiting for it to be written to disk or replicated. (Replicating across a network is much slower than writing to memory.) Checkpointing is done <i>asynchronously</i>: each time a checkpoint finishes being saved, the <i>latest generated</i> dataset is checkpointed next.</p>
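<p>To make the lineage idea concrete, here is a minimal Python sketch of recovery by recomputation. It is not Tachyon&#8217;s API: the <code>LineageStore</code> class and its method names are hypothetical, plain dictionaries stand in for cluster memory and disk, and the checkpoint runs synchronously for simplicity.</p>

```python
class LineageStore:
    """Toy store that recovers lost datasets by replaying their lineage.

    Hypothetical illustration only; none of these names come from Tachyon.
    """

    def __init__(self):
        self.memory = {}       # dataset name -> data held "in memory"
        self.lineage = {}      # dataset name -> (function, input names)
        self.checkpoints = {}  # dataset name -> data persisted "on disk"

    def write(self, name, func, inputs=()):
        # Record how the dataset is built, then keep a single in-memory copy.
        self.lineage[name] = (func, tuple(inputs))
        self.memory[name] = func(*(self.read(i) for i in inputs))

    def checkpoint(self, name):
        # A real system would do this asynchronously, off the write path.
        self.checkpoints[name] = self.read(name)

    def read(self, name):
        if name in self.memory:
            return self.memory[name]
        if name in self.checkpoints:
            data = self.checkpoints[name]
        else:
            # Lost and never checkpointed: replay the recorded transformation.
            func, inputs = self.lineage[name]
            data = func(*(self.read(i) for i in inputs))
        self.memory[name] = data
        return data

    def lose(self, name):
        # Simulate a node failure that wipes the in-memory copy.
        self.memory.pop(name, None)


store = LineageStore()
store.write("raw", lambda: [3, 1, 2])
store.write("sorted", sorted, inputs=["raw"])
store.checkpoint("raw")
store.lose("raw")
store.lose("sorted")
recovered = store.read("sorted")  # rebuilt from the "raw" checkpoint
```

<p>The sketch only works because <code>sorted</code> is deterministic and the inputs never change, which mirrors the requirement that computations be deterministic and data be immutable. Checkpoints bound how far back a recovery has to replay.</p>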
<p>
<b>High-performance data sharing between different data science frameworks</b><br />
Tachyon will let users share data across frameworks and perform read/write operations at memory speed. In particular, a system like Tachyon will appeal to data scientists who rely on workflows that use a variety of tools: their resulting data analytic pipelines will run much faster. To that end, its creators simulated a <i>real-world</i><sup>4</sup> data pipeline composed of 400 steps, and found that Tachyon resulted in &#8220;17x end-to-end latency improvements&#8221;.</p>
<p>
Tachyon uses memory (instead of disk) and recomputation (instead of replication) to produce a distributed, fault-tolerant, and high-throughput file system. While it initially targets data warehousing and analytics frameworks (Shark, Spark, Hadoop MapReduce), I&#8217;m looking forward to seeing other popular data science tools support this interesting new file system.</p>
<p>
<strong>Related posts:</strong></p>
<ul>
<li><a href="http://strata.oreilly.com/2013/02/mlbase-scalable-machine-learning-made-accessible.html">MLbase: Scalable machine-learning made accessible</a></li>
<li><a href="http://strata.oreilly.com/2013/03/data-science-tools-all-in-or-mix-and-match.html">Data Science tools: Are you “all in” or do you “mix and match”?</a></li>
</ul>
<hr />
<p><small><br />
(1) There are other reasons including data co-partitioning and the use of column stores.<br />
(2) To reiterate, for its <u>developer preview</u> Tachyon only has memory bandwidth &#8220;reads&#8221;, supporting Spark/Shark and Hadoop MapReduce. A version due later this year will have memory bandwidth &#8220;writes&#8221;. The current version lets users write to Tachyon, but not at memory speed.<br />
(3) The key insight is that for certain workloads, the overhead of <i>recording and replicating lineage</i> is much less than <i>replicating data</i>. Recovery via recomputation requires that computations be deterministic and data be immutable. For these workloads, tracking lineage is akin to a compression scheme.<br />
(4) They used an example involving the processing of log files (1 TB raw input, and 500 GB output data).<br />
</small></p>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/04/tachyon-open-source-distributed-fault-tolerant-in-memory-file-system.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Simpler workflow tools enable the rapid deployment of models</title>
		<link>http://strata.oreilly.com/2013/04/workflow-tools-enable-the-rapid-deployment-of-models.html</link>
		<comments>http://strata.oreilly.com/2013/04/workflow-tools-enable-the-rapid-deployment-of-models.html#comments</comments>
		<pubDate>Sun, 21 Apr 2013 16:00:26 +0000</pubDate>
		<dc:creator>Ben Lorica</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[algorithm]]></category>
		<category><![CDATA[algorithms]]></category>
		<category><![CDATA[data engineer]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[data scientist]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=56568</guid>
		<description><![CDATA[Data science often depends on data pipelines that involve acquiring, transforming, and loading data. (If you&#8217;re fortunate, most of the data you need is already in usable form.) Data needs to be assembled and wrangled before it can be visualized &#8230; ]]></description>
				<content:encoded><![CDATA[<p>Data science often depends on data pipelines that involve acquiring, transforming, and loading data. (If you&#8217;re fortunate, most of the data you need is already in usable form.) Data needs to be assembled and wrangled before it can be visualized and analyzed. Many companies have data engineers (adept at using workflow tools like <a href="https://github.com/azkaban/azkaban2">Azkaban</a> and <a href="http://oozie.apache.org/">Oozie</a>), who manage<sup>1</sup> pipelines for data scientists and analysts.</p>
<p><b>A workflow tool for data analysts: Chronos from airbnb</b><br />
A raw <a href="https://en.wikipedia.org/wiki/Bash_(Unix_shell)"><em>bash</em></a> scheduler written in Scala, <a href="http://airbnb.github.io/chronos/">Chronos</a> is flexible, fault-tolerant<sup>2</sup>, and distributed (it&#8217;s built on top of <a href="http://incubator.apache.org/mesos/research.html">Mesos</a>). What&#8217;s most interesting is that it makes the creation and maintenance of complex workflows more accessible: at least within <a href="https://www.airbnb.com/">airbnb</a>, it&#8217;s heavily used by analysts.</p>
<p>Job orchestration and scheduling tools contain features that data scientists would appreciate. They make it easy for users to express <i>dependencies</i> (start a job upon the completion of another job), and <i>retries</i> (particularly in cloud computing settings, jobs can fail for a variety of reasons). Chronos comes with a web UI designed to let <span style="text-decoration: underline">business analysts</span><sup>3</sup> define, execute, and monitor workflows: a zoomable <a href="http://en.wikipedia.org/wiki/Directed_acyclic_graph">DAG</a> highlights failed jobs and displays stats that can be used to identify bottlenecks. Chronos lets you include asynchronous jobs &#8211; a nice feature for data science pipelines that involve long-running calculations. It also lets you easily define <i>repeating</i> jobs over a finite time interval, something that comes in handy for short-lived<sup>4</sup> experiments (e.g. A/B tests or <a href="http://shop.oreilly.com/product/0636920027393.do">multi-armed bandits</a>).</p>
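<p>As a rough illustration of what <i>dependencies</i> and <i>retries</i> mean in practice, here is a small Python sketch of a workflow runner. It is not how Chronos is implemented or configured: <code>run_workflow</code>, its arguments, and the example job names are all made up for this example.</p>

```python
def run_workflow(jobs, deps, max_retries=2):
    """Run jobs in dependency order, retrying each job that fails.

    jobs: dict mapping job name -> zero-argument callable
    deps: dict mapping job name -> list of job names it depends on
    Returns the job names in the order they completed.
    """
    done = []
    pending = dict(jobs)
    while pending:
        # A job is ready once every one of its parents has completed.
        ready = [name for name in pending
                 if all(parent in done for parent in deps.get(name, []))]
        if not ready:
            raise ValueError("cycle or missing dependency: %s" % sorted(pending))
        for name in ready:
            for attempt in range(max_retries + 1):
                try:
                    pending[name]()
                    break
                except Exception:
                    # Give up only after the final allowed attempt.
                    if attempt == max_retries:
                        raise
            done.append(name)
            del pending[name]
    return done


attempts = {"load": 0}

def flaky_load():
    # Fails on the first call only, like a transient cloud error.
    attempts["load"] += 1
    if attempts["load"] == 1:
        raise RuntimeError("transient failure")

order = run_workflow(
    jobs={"report": lambda: None, "load": flaky_load, "extract": lambda: None},
    deps={"load": ["extract"], "report": ["load"]},
)
```

<p>Here <code>load</code> fails once, is retried, and <code>report</code> only starts after it succeeds. A production scheduler would add the pieces this sketch leaves out, such as running independent jobs in parallel, backing off between retries, and alerting on final failure.</p>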
<p><span id="more-56568"></span></p>
<p><b>The unreasonable effectiveness of data: model selection &amp; deployment</b><br />
By enabling airbnb analysts to take prototype workflows and easily deploy them to production, Chronos taps into a need that other<sup>5</sup> tools are beginning to address. Startup <a href="http://www.alpinedatalabs.com/product.html">Alpine Data Labs provides a GUI tool</a> that lets business analysts define and manage multi-step <i>analytic</i> workflows.</p>
<p>The landmark <a href="http://acl.ldc.upenn.edu/P/P01/P01-1005.pdf">paper by Banko and Brill</a> hinted that with massive amounts of data, the choice of model becomes less important. Thus tools that let you easily deploy analytic models at scale become just as important as specific algorithms. A noteworthy project out of UW-Madison &#8211; <a href="http://hazy.cs.wisc.edu/hazy/">Hazy</a> &#8211; seeks to simplify the deployment and maintenance of analytic models.</p>
<blockquote><p>&#8220;The next breakthrough in data analysis may not be in individual algorithms, but in the ability to rapidly combine, deploy, and maintain existing algorithms.&#8221;<br />
<a href="http://queue.acm.org/detail.cfm?id=2431055">Hazy: Making it Easier to Build and Maintain Big-data Analytics</a></p></blockquote>
<p><b>Related posts:</b></p>
<ul>
<li><a href="http://strata.oreilly.com/2013/03/data-science-tools-all-in-or-mix-and-match.html">Data Science tools: Are you “all in” or do you “mix and match”?</a></li>
<li><a href="http://strata.oreilly.com/2013/03/fast-easy-to-use-scalable-data-science-tools.html">Data Science tools: Fast, easy to use, and scalable</a></li>
</ul>
<hr />
<p><small><br />
(1) Data scientists may build prototypes, but repeatable pipelines tend to be the domain of data engineers.<br />
(2) As with other workflow tools, Chronos includes alerts (for job deletions and for failures after a specified number of retries).<br />
(3) Chronos jobs are defined via a web GUI; other tools require the creation/maintenance of &#8220;configuration&#8221; files. Chronos also comes with a simple REST API.<br />
(4) Chronos uses <a href="http://en.wikipedia.org/wiki/ISO_8601">ISO 8601</a>, which makes it easy to define <a href="http://en.wikipedia.org/wiki/ISO_8601#Repeating_intervals">repeating intervals</a> and configure jobs that repeat over a time period, after which they get deleted.<br />
(5) Other companies include <a href="http://trifacta.com/about">Trifacta</a>, <a href="http://www.ufora.com/product/platform">Ufora</a>, and BI tools <a href="http://www.datameer.com">Datameer</a> and <a href="http://www.platfora.com">Platfora</a>.<br />
</small></p>
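<p>The repeating intervals mentioned in footnote 4 are compact strings of the form <code>Rn/start/duration</code>. The Python sketch below expands one into concrete run times. It makes several assumptions: it handles only a UTC start time and day/hour/minute/second durations, it treats <code>n</code> as the total number of runs, and the helper names <code>parse_duration</code> and <code>expand_schedule</code> are invented for this example rather than taken from Chronos.</p>

```python
import re
from datetime import datetime, timedelta

# Matches day/time durations such as P1D, PT12H, or PT1H30M (no months or years).
_DURATION = re.compile(r"^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$")


def parse_duration(text):
    match = _DURATION.match(text)
    if not match or text in ("P", "PT"):
        raise ValueError("unsupported duration: %r" % text)
    days, hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def expand_schedule(schedule):
    """Expand 'Rn/start/duration' into a list of run times."""
    repeat, start, duration = schedule.split("/")
    count = int(repeat[1:])  # "R3" -> 3; a bare "R" (repeat forever) is not handled
    first = datetime.strptime(start, "%Y-%m-%dT%H:%M:%SZ")
    step = parse_duration(duration)
    return [first + i * step for i in range(count)]


# Three runs, twelve hours apart, starting 21 April 2013 at 16:00 UTC.
runs = expand_schedule("R3/2013-04-21T16:00:00Z/PT12H")
```

<p>A finite repeat count is what makes this format handy for short-lived experiments: the schedule itself says when the job stops recurring.</p>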
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/04/workflow-tools-enable-the-rapid-deployment-of-models.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Single server systems can tackle big data</title>
		<link>http://strata.oreilly.com/2013/04/single-server-systems-can-tackle-big-data.html</link>
		<comments>http://strata.oreilly.com/2013/04/single-server-systems-can-tackle-big-data.html#comments</comments>
		<pubDate>Sun, 14 Apr 2013 16:00:25 +0000</pubDate>
		<dc:creator>Ben Lorica</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[big data analytics]]></category>
		<category><![CDATA[in-memory]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=56403</guid>
		<description><![CDATA[About a year ago a blog post from SAP posited that when it comes to analytics, most companies are in the multi-terabyte range: data sizes that are well within the scope of distributed in-memory solutions like Spark, SAP HANA, ScaleOut Software, &#8230; ]]></description>
				<content:encoded><![CDATA[<p>About a year ago a <a href="http://www.saphana.com/community/blogs/blog/2012/04/30/what-oracle-wont-tell-you-about-sap-hana">blog post from SAP</a> posited<sup>1</sup> that when it comes to analytics, most companies are in the multi-terabyte range: data sizes that are well within the scope of <i>distributed</i> <a href="http://radar.oreilly.com/2013/02/an-update-on-in-memory-data-management.html">in-memory solutions</a> like <a href="http://strata.oreilly.com/2012/08/seven-reasons-why-i-like-spark.html">Spark</a>, <a href="http://www.saphana.com/welcome">SAP HANA</a>, <a href="http://www.scaleoutsoftware.com/">ScaleOut Software</a>, <a href="http://www.gridgain.com/">GridGain</a>, and <a href="http://terracotta.org/">Terracotta</a>.</p>
<p><span id="more-56403"></span></p>
<p>Around the same time a team of researchers from Microsoft went a step further. They <a href="http://research.microsoft.com/apps/pubs/default.aspx?id=163083">released a study</a> that concluded that for many data processing tasks, scaling by using single machines with very large memories is more efficient than using clusters. They found two clusters devoted to analytics (one at Yahoo and another at Microsoft) had median job input sizes under 14 GB, while 90% of jobs on a Facebook cluster had input sizes under 100 GB. In addition, the researchers noted that</p>
<blockquote><p>&#8230; for workloads that are processing multi-gigabytes rather than terabyte+ scale, a big-memory server may well provide better performance per dollar than a cluster.</p></blockquote>
<p><b>One year later: some single server systems that tackle big data</b><br />
BI company <a href="http://www.sisense.com/">SiSense</a> won the Strata Startup Showcase audience award with Prism &#8211; a 64-bit software system that can handle a terabyte of data on a machine with only 8GB of RAM. Prism<sup>2</sup> relies on <a href="http://3.bp.blogspot.com/-AmqsTlDYZ7I/UWesJoHcyfI/AAAAAAAADK8/gwRqAQEYBQU/sisense.jpg">disk for storage, moves data to memory when needed, and also takes advantage of the CPU</a> (L1/L2/L3 cache). It also comes with a column store and visualization tools that let it easily scale to a hundred terabytes.</p>
<p>Late last year <a href="http://strata.oreilly.com/2012/12/graphchi-graph-analytics-over-billions-of-edges-using-your-laptop.html">I wrote about GraphChi</a>, a graph processing system that can process graphs with billions of edges on a laptop. It uses a technique called parallel sliding windows to process edges efficiently from disk. GraphChi is part of <a href="http://graphlab.org/">GraphLab</a>, an open source project that comes with toolkits for collaborative filtering<sup>3</sup>, topic models, and graph processing.</p>
<p><a href="https://github.com/twitter/cassovary">Cassovary</a> is an open source graph processing system from Twitter. It&#8217;s designed to tackle graphs that fit in the memory of a single machine &#8211; nevertheless its creators believe that the use of space-efficient data structures makes it a viable system for &#8220;most practical graphs&#8221;. In fact it <a href="http://www.stanford.edu/~rezab/papers/wtf_overview.pdf">already powers a system familiar to most Twitter users</a>: <i>WTF</i> (who to follow) is a recommendation service that suggests users with shared interests and common connections.</p>
<p><b>Next-gen SSDs: narrowing the gap between main memory and storage</b><br />
GraphChi and SiSense scale to large data sets by using disk as primary storage. They speed up performance using techniques that rely on hardware optimization (SiSense) or sliding windows (GraphChi). As part of our <a href="http://radar.oreilly.com/2013/02/an-update-on-in-memory-data-management.html">investigation into in-memory data management systems</a>, the potential of <a href="http://snia.org/sites/default/files/NVM13-Beauchamp_EvolvingSSS.pdf">next-generation SSDs</a> has come to our attention. If they live up to the promise of having speeds close to main memory, many more single-server systems for processing and analyzing big data will emerge.</p>
<hr />
<p><small><br />
(1) SAP <a href="http://www.saphana.com/community/blogs/blog/2012/04/30/what-oracle-wont-tell-you-about-sap-hana">blog post</a>: &#8220;Even with rapid growth of data, 95% of enterprises use between 0.5TB &#8211; 40 TB of data today.&#8221;<br />
(2) Prism uses <a href="http://3.bp.blogspot.com/-AmqsTlDYZ7I/UWesJoHcyfI/AAAAAAAADK8/gwRqAQEYBQU/sisense.jpg">a hierarchy</a>: accessing data from the CPU cache is faster than accessing it from main memory, which in turn is faster than accessing it from disk.<br />
(3) GraphLab&#8217;s well-regarded collaborative filtering library <a href="http://bickson.blogspot.co.il/2012/08/collaborative-filtering-with-graphchi.html">has been ported to GraphChi</a>.<br />
</small></p>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/04/single-server-systems-can-tackle-big-data.html/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>The re-emergence of time-series</title>
		<link>http://strata.oreilly.com/2013/04/the-re-emergence-of-time-series.html</link>
		<comments>http://strata.oreilly.com/2013/04/the-re-emergence-of-time-series.html#comments</comments>
		<pubDate>Sun, 07 Apr 2013 16:00:53 +0000</pubDate>
		<dc:creator>Ben Lorica</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[temporal]]></category>
		<category><![CDATA[time-series]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=56248</guid>
		<description><![CDATA[My first job after leaving academia was as a quant for a hedge fund, where I performed (what are now referred to as) data science tasks on financial time-series. I primarily used techniques from probability &#38; statistics, econometrics, and optimization, &#8230; ]]></description>
				<content:encoded><![CDATA[<p>My first job after leaving academia was as a quant<sup>1</sup> for a hedge fund, where I performed (what are now referred to as) data science tasks on financial time-series. I primarily used techniques from probability &amp; statistics, econometrics, and optimization, with occasional forays into machine-learning (clustering, classification, anomalies). More recently, I&#8217;ve been closely following the emergence of tools that target large time series and decided to highlight a few interesting bits.</p>
<p><b>Time-series and big data</b>:<br />
Over the last six months I&#8217;ve been encountering more data scientists (outside of finance) who work with massive amounts of time-series data. The rise of unstructured data has been widely reported; the growing importance of time-series much less so. Sources include data from consumer devices (gesture recognition &amp; user interface design), sensors (apps for &#8220;self-tracking&#8221;), machines (systems in data centers), and health care. In fact some research hospitals have troves of EEG and ECG readings that translate to time-series data collections with billions (even trillions) of points.</p>
<p><span id="more-56248"></span></p>
<p><b>Search and machine-learning at scale</b>:<br />
Before doing anything else, one has to be able to run queries at scale. Last year <a href="http://practicalquant.blogspot.com/2012/10/mining-time-series-with-trillions-of.html">I wrote about a team of researchers at UC Riverside</a> who took an existing search algorithm (<a href="http://web.science.mq.edu.au/~cassidy/comp449/html/ch11s02.html">dynamic time-warping</a><sup>2</sup>) and got it to scale to time-series with trillions of points. There are many potential applications of their research; one I highlighted is from health care:</p>
<blockquote><p>&#8230; a doctor who needs to search through EEG data (with hundreds of billions of points), for a &#8220;prototypical epileptic spike&#8221;, where the input query is a time-series snippet with thousands of points.</p></blockquote>
<p>As the size of data grows, the UCR dynamic time-warping algorithm takes longer to finish (it takes a few hours for time-series with trillions of points). In general, (academic) researchers who&#8217;ve spent weeks or months collecting data are fine waiting a few hours for a pattern recognition algorithm to finish. But users who come from different backgrounds (e.g. web companies) may not be as patient. Fortunately &#8220;search&#8221; is an active research area and faster (distributed) pattern recognition systems will likely emerge soon.</p>
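<p>For readers who haven&#8217;t seen dynamic time-warping before, here is the textbook version in Python, along with a brute-force subsequence search of the kind described in the EEG example. This is a minimal sketch, not the UCR team&#8217;s code: it omits the optimizations that let their approach scale, and the function names <code>dtw_distance</code> and <code>best_match</code> are my own.</p>

```python
def dtw_distance(a, b):
    """Textbook dynamic time warping distance, O(len(a) * len(b))."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # advance in a only
                                    cost[i][j - 1],      # advance in b only
                                    cost[i - 1][j - 1])  # advance in both
    return cost[n][m]


def best_match(series, query):
    """Slide the query over the series; return (offset, distance) of the best window."""
    width = len(query)
    candidates = ((dtw_distance(series[s:s + width], query), s)
                  for s in range(len(series) - width + 1))
    distance, offset = min(candidates)
    return offset, distance


# A short "spike" query searched for inside a longer series.
offset, distance = best_match([0, 0, 1, 3, 1, 0, 0, 0], [1, 3, 1])
```

<p>Each window costs time proportional to the square of the query length, and there is one window per point in the series, which is why a naive search like this one becomes impractical long before you reach trillions of points.</p>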
<p>Once you scale up search, other interesting problems can be tackled. The UCR team is using their dynamic time-warping algorithm in tasks like classification, clustering, and motif<sup>3</sup> discovery. Other teams are investigating techniques from <a href="http://www.giss.nasa.gov/staff/mway/book/">signal-processing</a>, <a href="http://www.fast-lab.org/structuredcomplex.html">pattern recognition</a>, and <a href="http://www.fast-lab.org/">trajectory tracking</a>.</p>
<p><b>Some data management tools that target time-series</b>:<br />
One of the more popular sessions at <a href="http://practicalquant.blogspot.com/2012/05/much-to-like-about-hbasecon.html">last year&#8217;s HBase Conference</a> was on <a href="http://opentsdb.net/index.html">OpenTSDB</a>, a distributed time series database built on top of HBase. It&#8217;s used to store and serve time series metrics, and comes with tools (based on <a href="http://www.gnuplot.info/">GNUPlot</a>) for charting. <a href="https://groups.google.com/forum/?fromgroups=#!topic/opentsdb/3HrW9pTl1cc">Originally named</a> OpenTSDB2, <a href="https://code.google.com/p/kairosdb/">KairosDB</a> was written primarily for Cassandra (but also works with HBase). OpenTSDB emphasizes tools for <a href="https://code.google.com/p/kairosdb/wiki/FAQ">readying data for charts</a> (interpolating to fill in missing values), while KairosDB distinguishes between data and the presentation of data.</p>
<p>Startup <a href="https://tempo-db.com/features/">TempoDB</a> offers a <a href="https://tempo-db.com/pricing/">reasonably priced</a>, cloud-based service for storing, retrieving, and visualizing time-series data. Still a work in progress, <a href="http://www.scidb.org/Documents/SciDB-Summary.pdf">SciDB</a> is an open source database project designed specifically for data-intensive science problems. The designers of the system plan to make time-series analysis easy to express within SciDB.</p>
<hr />
<p><small><br />
(1) I worked on trading strategies for derivatives, portfolio &amp; risk management, and option pricing.<br />
(2) From my <a href="http://practicalquant.blogspot.com/2012/10/mining-time-series-with-trillions-of.html">earlier post</a>: In a recent paper, the UCR team noted that <i>&#8220;&#8230; after an exhaustive literature search of more than 800 papers, we are not aware of any distance measure that has been shown to outperform DTW by a statistically significant amount on reproducible experiments&#8221;</i>.<br />
(3) <i>Motifs</i> are similar subsequences of a long time series; <em>shapelets</em> are time series primitives that can be used to speed up automatic classification (by reducing the number of &#8220;features&#8221;).<br />
</small></p>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/04/the-re-emergence-of-time-series.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Data Science tools: Are you &#8220;all in&#8221; or do you &#8220;mix and match&#8221;?</title>
		<link>http://strata.oreilly.com/2013/03/data-science-tools-all-in-or-mix-and-match.html</link>
		<comments>http://strata.oreilly.com/2013/03/data-science-tools-all-in-or-mix-and-match.html#comments</comments>
		<pubDate>Sun, 31 Mar 2013 16:00:00 +0000</pubDate>
		<dc:creator>Ben Lorica</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[data scientist]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[machine learning products]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[SAS]]></category>
		<category><![CDATA[spark]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=55885</guid>
		<description><![CDATA[An integrated data stack boosts productivity As I noted in my previous post, Python programmers willing to go &#8220;all in&#8221;, have Python tools to cover most of data science. Lest I be accused of oversimplification, a Python programmer still needs &#8230; ]]></description>
				<content:encoded><![CDATA[<p><strong>An integrated data stack boosts productivity</strong><br />
As I noted in my previous post, Python programmers willing to go &#8220;all in&#8221; have <a href="http://strata.oreilly.com/2013/03/python-data-tools-just-keep-getting-better.html"><em>Python</em> tools to cover most of data science</a>. Lest I be accused of oversimplification, a Python programmer still needs to commit to learning a non-trivial set of tools<sup>1</sup>. I suspect that once they invest the time to learn the Python data stack, they tend to stick with it unless they absolutely have to use something else. Being able to stick with the same programming language and environment is a definite productivity boost: it requires less &#8220;setup time&#8221; to explore data using different <i>techniques</i> (visualization, statistics, machine learning).</p>
<p><strong>Multiple tools and languages can impede reproducibility and flow</strong><br />
On the other end of the spectrum are data scientists who <a href="http://practicalquant.blogspot.com/2013/01/how-my-big-data-toolset-evolved-in-2012.html">mix and match tools</a>, and use packages and frameworks from several languages. Depending on the task, data scientists can avail themselves of tools that are scalable, performant, require less<sup>2</sup> code, and contain a lot of features. On the other hand, this approach requires a lot more context-switching, and extra effort is needed to annotate long workflows. Failure to document things properly makes it tough to reproduce<sup>3</sup> analysis projects, and impedes knowledge transfer<sup>4</sup> within a team of data scientists. Frequent context-switching also makes it more difficult to be in a <i><a href="http://en.wikipedia.org/wiki/Flow_(psychology)">state of flow</a></i>, as one has to think about implementation/package details instead of exploring data. It can be harder to discover interesting stories with your data if you&#8217;re constantly having to think about what you&#8217;re doing. (It&#8217;s still possible; you just have to concentrate a bit harder.)</p>
<p><span id="more-55885"></span></p>
<p><strong>Some tools that cover a range of data science tasks</strong><br />
More <a href="http://strata.oreilly.com/2013/03/fast-easy-to-use-scalable-data-science-tools.html">tools that integrate different data science tasks</a> are starting to appear. <a href="http://www.sas.com">SAS</a> has long provided tools for data management and wrangling, business intelligence, visualization, statistics, and machine learning. For massive<sup>5</sup> data sets, a new alternative to SAS is <a href="http://www.revolutionanalytics.com/products/enterprise-big-data.php">ScaleR</a> from Revolution Analytics. Within ScaleR, programmers <span style="text-decoration: underline">use R</span> for data wrangling (<a href="http://www.revolutionanalytics.com/why-revolution-r/whitepapers/Data-Step-White-Paper.pdf">rxDataStep</a>), data visualization (basic <a href="http://www.revolutionanalytics.com/what-is-open-source-r/r-language-features/analytics.php#bigdata">viz functions for big data</a>), and statistical analysis (it comes with a variety of <a href="http://www.revolutionanalytics.com/products/enterprise-big-data.php">scalable statistical algorithms</a>).</p>
<p>Startup <a href="http://www.alpinedatalabs.com/product.html">Alpine Data Labs</a> lets users connect to a variety of data sources, manage their data science workflows, and access a limited set of advanced algorithms. Upstart BI vendors <a href="http://www.datameer.com">Datameer</a> and <a href="http://www.platfora.com">Platfora</a> provide data wrangling and visualization tools. Datameer also provides <a href="http://www.datameer.com/product/data-integration.html">easy data integration</a> with a variety of structured and unstructured data sources, <a href="http://www.datameer.com/product/data-analytics.html">analytic functions</a>, and <a href="http://www.zementis.com/DAS-plugin.htm">PMML to execute predictive analytics</a>. The <a href="http://strata.oreilly.com/2013/02/mlbase-scalable-machine-learning-made-accessible.html">release of MLbase this summer adds machine learning</a> to the <a href="http://strata.oreilly.com/2012/08/seven-reasons-why-i-like-spark.html">BDAS/Spark</a> stack &#8211; which currently covers data processing, interactive (SQL) analysis, and streaming analysis.</p>
<p>What does <i>your</i> data science toolkit look like? Do you mainly use one stack or do you tend to &#8220;mix and match&#8221;?</p>
<hr />
<p><small><br />
(1) This usually includes matplotlib or Bokeh, Scikit-learn, Pandas, SciPy, and NumPy. But because Python is a general-purpose language, you can even use it for data <em>acquisition</em> (e.g. web crawlers or web services).<br />
(2) An example would be using R for viz or stats.<br />
(3) This pertains to all data scientists, but is <em>particularly</em> important to those among us who use a wide variety of tools. Unless you document things properly when you&#8217;re using many different tools, the results of even <i>very recent</i> analysis projects can be hard to reproduce.<br />
(4) Regardless of the tools you use, everything starts with knowing something about the <em>lineage and provenance</em> of your data set &#8211; something <a href="http://www.revelytix.com/?q=content/dataset-management-hadoop">Loom</a> attempts to address.<br />
(5) A quick and fun tool for <em>exploring smaller</em> data sets is the just-released <a href="http://www.skytree.net/adviser-beta/">SkyTree Adviser</a>. After users perform data processing and wrangling in another tool, SkyTree Adviser exposes machine learning, statistics, and statistical graphics through an interface that is accessible to business analysts.<br />
</small></p>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/03/data-science-tools-all-in-or-mix-and-match.html/feed</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Python data tools just keep getting better</title>
		<link>http://strata.oreilly.com/2013/03/python-data-tools-just-keep-getting-better.html</link>
		<comments>http://strata.oreilly.com/2013/03/python-data-tools-just-keep-getting-better.html#comments</comments>
		<pubDate>Sun, 24 Mar 2013 15:55:18 +0000</pubDate>
		<dc:creator>Ben Lorica</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[data scientist]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[spark]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=55692</guid>
		<description><![CDATA[Here are a few observations inspired by conversations I had during the just concluded PyData conference. The Python data community is well-organized: Besides conferences (PyData, SciPy, EuroSciPy), there is a new non-profit (NumFOCUS) dedicated to supporting scientific computing and data &#8230; ]]></description>
				<content:encoded><![CDATA[<p>Here are a few observations inspired by conversations I had during the just concluded <a href="http://pydata.org/sv2013/schedule/">PyData conference</a><sup>1</sup>. </p>
<p>
<b>The Python data community is well-organized:</b><br />
Besides conferences (<a href="http://pydata.org/">PyData</a>, <a href="http://conference.scipy.org/index.html">SciPy, EuroSciPy</a>), there is a new non-profit (<a href="http://numfocus.org/projects-2/projects/">NumFOCUS</a>) dedicated to supporting scientific computing and data analytics projects. The supported projects are currently all Python-based, but in principle NumFOCUS is an entity that can be used to support related efforts from other communities. </p>
<p>
<b>It&#8217;s getting easier to use the Python data stack:</b><br />
There are tools that facilitate the dissemination and sharing of code and programming environments. IPython<sup>2</sup> notebooks allow Python code and markup in the same document. Notebooks are used to record and share complex workflows and are used heavily for (conference) tutorials. As the data stack grows, one of the major pain points is getting all the packages to work properly together (version compatibility is a common issue). In particular, setting up environments where all the pieces work together can be a pain. There are now a few solutions that address this issue: <a href="https://store.continuum.io/cshop/anaconda">Anaconda</a> and cloud-based <a href="https://www.wakari.io/">Wakari</a> from <a href="http://continuum.io/">Continuum Analytics</a>, and cloud computing platform <a href="http://docs.picloud.com/howto/pyscientifictools.html">PiCloud</a>. </p>
<p>
<b>There are many more visualization tools to choose from:</b><br />
The 2D plotting tool <a href="http://matplotlib.org/">matplotlib</a> is the first tool enthusiasts turn to, but as I learned at the conference, there are a number of other options available. <a href="http://continuum.io/">Continuum Analytics</a> recently introduced companion packages <a href="https://github.com/ContinuumIO/Bokeh">Bokeh</a> and <a href="https://github.com/ContinuumIO/bokehjs">Bokeh.js</a> that simplify the creation of static and interactive visualizations using Python. In particular, Bokeh is the Python equivalent of <a href="http://ggplot2.org/">ggplot</a> (it even has an interface that mimics ggplot). With <a href="http://nodebox.net/">Nodebox</a>, programmers use Python code to create sketches and interactive visualizations that are similar to those produced by <a href="http://processing.org/">Processing</a>. <span id="more-55692"></span> </p>
<p>
<b>Large-scale data processing and wrangling tools have improved:</b><br />
<a href="http://pandas.pydata.org/">Pandas</a> and <a href="http://pytables.github.com/">PyTables</a> are already popular, and there was very strong interest in the forthcoming <a href="http://continuum.io/blog/blz-format">Blaze</a> project at the conference. Other options include the <a href="http://discoproject.org/about">Disco Project</a>, a data processing platform that includes an implementation of Map/Reduce, and <a href="http://spark-project.org/docs/latest/api/pyspark/pyspark-module.html">PySpark</a>, the Python API for the <a href="http://strata.oreilly.com/2012/08/seven-reasons-why-i-like-spark.html">Spark</a> data analytics framework. </p>
<p>
<b>There are viable tools for large-scale data analytics:</b><br />
<a href="http://scikit-learn.org/stable/">Scikit-learn</a> (machine-learning library) and <a href="http://scikit-image.org/">scikit-image</a> (image processing) are used by many academic research groups and companies. Both have extensive libraries of algorithms, and come with lots of examples to help users get started<sup>3</sup>. Another tool written in Python focuses on deployment<sup>4</sup>: <a href="https://code.google.com/p/augustus/">Augustus</a> is an open source system for <a href="http://opendatagroup.com/2012/04/09/adready-chooses-open-data-group/">building and scoring scalable data mining and statistical algorithms</a>. Augustus produces and consumes <a href="http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language">PMML</a>, and includes components for simple <em>data wrangling</em> (<a href="http://opendatagroup.com/2012/05/14/scores-models-and-rules/">users can embed Python code for data processing</a> in their PMML files). </p>
<p>
In addition, new tools like <a href="http://0xdata.github.com/h2o/">H2O</a> and <a href="http://about.wise.io/">wise.io</a> plan to make their massively scalable algorithms accessible via Python. Frameworks that expose distributed algorithms to Python programmers include GraphLab (<a href="http://select.cs.cmu.edu/code/graphlab/java_jython.html">Python/Jython interface</a>) and Spark (algorithms<sup>5</sup> in Scala that are accessed via PySpark). Finally, there are also tools that let Python programmers target GPUs for parallel programming: <a href="http://www.anandtech.com/show/6839/nvidia-and-continuum-analytics-announce-numbapro-a--cuda-compiler">NumbaPro</a> and <a href="https://developer.nvidia.com/pycuda">PyCUDA</a>.</p>
<hr />
<p><small> (1) The event drew about 300 attendees and is one of three PyData conferences scheduled this year (Boston in the summer, NYC in the fall).<br />
(2) The new language <a href="https://twitter.com/bigdata/status/312956469551173632">Julia and IPython</a> are starting to work well together.<br />
(3) In practice these tools let Python programmers efficiently develop prototypes that are later re-implemented (in another language) and optimized before being deployed to production.<br />
(4) Data scientists tend not to focus on the deployment and maintenance of &#8220;models&#8221;. The <a href="http://hazy.cs.wisc.edu/hazy/">Hazy project</a> may change this mindset.<br />
(5) A suite of distributed algorithms will be available upon the release of <a href="http://strata.oreilly.com/2013/02/mlbase-scalable-machine-learning-made-accessible.html">MLbase on Spark</a>. </small></p>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/03/python-data-tools-just-keep-getting-better.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Data Science Tools: Fast, easy to use, and scalable</title>
		<link>http://strata.oreilly.com/2013/03/fast-easy-to-use-scalable-data-science-tools.html</link>
		<comments>http://strata.oreilly.com/2013/03/fast-easy-to-use-scalable-data-science-tools.html#comments</comments>
		<pubDate>Sun, 03 Mar 2013 17:00:25 +0000</pubDate>
		<dc:creator>Ben Lorica</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[machine learning products]]></category>
		<category><![CDATA[spark]]></category>
		<category><![CDATA[statistical analysis]]></category>

		<guid isPermaLink="false">http://strata.oreilly.com/?p=55443</guid>
		<description><![CDATA[Here are a few observations based on conversations I had during the just concluded Strata Santa Clara conference. Spark is attracting attention I&#8217;ve written numerous times about components of the Berkeley Data Analytics Stack (Spark, Shark, MLbase). Two Spark-related sessions &#8230; ]]></description>
				<content:encoded><![CDATA[<p>Here are a few observations based on conversations I had during the just concluded <a href="http://strataconf.com/strata2013/">Strata Santa Clara conference</a>.</p>
<p><b>Spark is attracting attention</b><br />
I&#8217;ve written numerous times about components of the <span style="text-decoration: underline">B</span>erkeley <span style="text-decoration: underline">D</span>ata <span style="text-decoration: underline">A</span>nalytics <span style="text-decoration: underline">S</span>tack (<a href="http://strata.oreilly.com/2012/08/seven-reasons-why-i-like-spark.html">Spark</a>, <a href="http://strata.oreilly.com/2012/11/shark-real-time-queries-and-analytics-for-big-data.html">Shark</a>, <a href="http://strata.oreilly.com/2013/02/mlbase-scalable-machine-learning-made-accessible.html">MLbase</a>). Two Spark-related sessions at Strata were packed (slides <a href="http://strataconf.com/strata2013/public/schedule/detail/27438">here</a> and <a href="http://strataconf.com/strata2013/public/schedule/detail/26743">here</a>) and I talked to many people who were itching to try <a href="https://amplab.cs.berkeley.edu/bdas/">the BDAS stack</a>. Being able to combine batch, real-time, and interactive analytics in a framework that uses a simple programming model is very attractive. The <a href="http://spark-project.org/spark-release-0-7-0/">release of version 0.7</a> adds a Python API to Spark&#8217;s native Scala interface and Java API.</p>
<p><a href="http://3.bp.blogspot.com/-NhOQ6DV66Vk/UTEAMiMIHAI/AAAAAAAACqw/7fgJFGXNn5w/s1600/bdas1.png"><img alt="" src="http://3.bp.blogspot.com/-NhOQ6DV66Vk/UTEAMiMIHAI/AAAAAAAACqw/7fgJFGXNn5w/s320/bdas1.png" width="200" height="190" border="0" /></a></p>
<p><span id="more-55443"></span></p>
<p><b>SQL is alive and well</b><br />
Impala&#8217;s well-received launch at Strata NYC last fall confirmed the strong interest in <i>interactive analytics</i> (ad hoc query-and-response) within the Hadoop ecosystem. The <a href="https://twitter.com/joe_hellerstein/status/306109137685729281">list of solutions</a> for querying big data (stored in HDFS) continues to grow, with <a href="http://www.citusdata.com/">CitusDB</a> and <a href="http://www.greenplum.com/blog/topics/hadoop/introducing-pivotal-hd">Pivotal HD</a> joining <a href="http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/">Impala</a>, <a href="http://shark.cs.berkeley.edu/">Shark</a>, <a href="http://www.hadapt.com">Hadapt</a>, and cloud-based alternatives <a href="https://cloud.google.com/products/big-query">BigQuery</a>, <a href="http://aws.amazon.com/redshift/">Redshift</a>, and <a href="http://www.qubole.com/">Qubole</a>. I use Shark and have been impressed by its speed and ease of use. (I&#8217;ve heard similar things about Impala&#8217;s speed recently.)</p>
<p><b>Business Intelligence reboot (again)</b><br />
<a href="http://practicalquant.blogspot.com/2013/02/2012-revenue-of-big-data-companies.html">QlikTech and Tableau had combined 2012 revenues of more than $450M</a>. Their products are easy-to-use analysis tools that let users visually explore data, and share charts and dashboards. Both use in-memory technologies to speed up query response and visualization rendering times. Both run only on MS-Windows.</p>
<p>Startups that draw inspiration from these two successful companies are targeting much larger data sets &#8211; in the case of <a href="http://www.datameer.com">Datameer</a>, <a href="http://www.platfora.com">Platfora</a>, and <a href="http://www.karmasphere.com/">Karmasphere</a>, massive data sets stored in HDFS. Platfora has been generating buzz with its fast in-memory, columnar data store, custom HTML5 visualization package, and emphasis on tools that let users interact with massive data. Datameer continues to quietly rack up sales &#8211; it closed 2012 with <a href="http://practicalquant.blogspot.com/2013/02/2012-revenue-of-big-data-companies.html">more than $10M in revenues</a>. Strata Startup showcase (audience choice) winner <a href="http://www.sisense.com/">SiSense</a> offers a hardware-optimized business analytics platform that delivers fast processing times by efficiently utilizing disk, RAM, and CPU.</p>
<p><b>Scalable machine-learning and analytics are going to get simpler</b><sup>1</sup><br />
<a href="http://0xdata.github.com/h2o/">H2O</a> is a new, <span style="text-decoration: underline">open source</span>, machine-learning platform from <a href="http://0xdata.com/">0xdata</a>. It can use data stored in HDFS or flat files and comes with a few <i>distributed</i> algorithms (random forests, GLM, and a few others). H2O also has tools for rudimentary exploratory data analysis and wrangling. Users can navigate the system using a web browser or a command-line interface. As with Revolution Analytics&#8217; <a href="http://www.revolutionanalytics.com/products/enterprise-big-data.php">ScaleR</a>, users can interact with H2O <i>using R code</i> (limited to the subset of models and algorithms available). H2O is also available via REST/JSON interfaces.</p>
<p>What I found intriguing<sup>2</sup> was SkyTree&#8217;s acquisition of <a href="https://adviseanalytics.com/">AdviseAnalytics</a> &#8211; a desktop software product designed to make statistical data analysis accessible. (AdviseAnalytics was founded by <a href="http://www.cs.uic.edu/~wilkinson/">Leland Wilkinson</a>, creator of the popular <a href="http://www.systat.com/">Systat</a> software package and author of the <a href="http://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448">Grammar of Graphics</a>.) The system, now called <a href="http://www.skytree.net/adviser-beta/">SkyTree Adviser</a>, provides a GUI that emphasizes <i>tasks</i> (cluster, classify, compare, etc.) over algorithms. In addition, it produces results that include short explanations of the underlying statistical methods (power users can opt for concise results similar to those produced by standard statistical packages). Finally, SkyTree Adviser users benefit from the vast number of algorithms available &#8211; the system uses ensembles, or finds optimal algorithms. (<a href="http://strata.oreilly.com/2013/02/mlbase-scalable-machine-learning-made-accessible.html">The MLbase optimizer</a> will perform the same type of automatic &#8220;model selection&#8221; for <i>distributed</i> algorithms.)</p>
<p>SkyTree now offers users an easy-to-use tool for analytic explorations over medium-sized data sets (<a href="http://www.skytree.net/adviser-beta/">SkyTree Adviser</a>), and <a href="http://www.skytree.net/products-services/skytree-server/">a server product</a> for building and deploying algorithms against massive amounts of data. Throw in <a href="http://strata.oreilly.com/2013/02/mlbase-scalable-machine-learning-made-accessible.html">MLbase</a> and <a href="http://hazy.cs.wisc.edu/hazy/">Hazy</a>, and I can see the emergence of several large-scale machine-learning tools<sup>3</sup> for non-technical users.</p>
<p><b>Reproducibility of Data Science Workflows</b></p>
<div class="wp-caption alignright" style="width: 410px"><img alt="" src="http://2.bp.blogspot.com/-UgIk3xSJNdE/UOH9w4-f_AI/AAAAAAAACnM/RdgAoZVSAhM/s400/ben-fry-data-science.jpg" width="400" height="72" /><p class="wp-caption-text">Source: &#8220;Computational Information Design&#8221; by Ben Fry</p></div>
<p>Data scientists tend to use many tools and the frequent context-switching is a drag on their productivity. An important side-effect is that it&#8217;s often challenging to document and reproduce analysis projects that involve many steps and tools.</p>
<p>Data scientists who rely on the Python data stack (<a href="http://www.numpy.org/">NumPy</a>, <a href="http://www.scipy.org/">SciPy</a>, <a href="http://pandas.pydata.org/">Pandas</a>, NLTK, etc.) should check out <a href="https://www.wakari.io/">Wakari</a> from <a href="http://www.continuum.io">Continuum Analytics</a>. It&#8217;s a cloud-based service that takes care of many details, including data management and package and version management, while insulating the user from the intricacies of Amazon Web Services.</p>
<p><a href="http://www.revelytix.com/?q=content/dataset-management-hadoop">Loom</a> is a just-released data management system that initially targets users of Hadoop (and R). By letting users <a href="http://strataconf.com/strata2013/public/schedule/detail/28546">track lineage and data provenance</a>, Loom makes it easier to recreate multi-step data analysis projects.</p>
<hr />
<p><small><br />
(1) In previous posts I detailed why <a href="http://strata.oreilly.com/2012/12/graphchi-graph-analytics-over-billions-of-edges-using-your-laptop.html">I like GraphChi/GraphLab</a> and why <a href="http://strata.oreilly.com/2013/02/mlbase-scalable-machine-learning-made-accessible.html">I&#8217;m excited about MLbase</a>. Two other open source projects are worth highlighting: <a href="http://mahout.apache.org/">Mahout</a> has many more <a href="https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms">algorithms</a>, but <a href="https://github.com/JohnLangford/vowpal_wabbit/wiki">VW</a> generates more enthusiastic endorsements from users I&#8217;ve spoken with. However, the sparse documentation and the many command-line options make it tough to get going in VW. (A forthcoming O&#8217;Reilly book should make VW more accessible.) For users who want to roll their own, I&#8217;ve written a few simple distributed machine-learning algorithms in Spark, and found it quite fast for batch training and scoring.<br />
(2) <strong>Update (3/18/2013)</strong>: I removed this from the original version of this post, and re-inserted it following the official <a href="http://www.skytree.net/adviser-beta/">launch of SkyTree Adviser</a>.<br />
(3) BI tools like Datameer already come with <a href="http://www.datameer.com/product/data-analytics.html">simple analytic functions</a> available through a GUI.<br />
</small></p>
]]></content:encoded>
			<wfw:commentRss>http://strata.oreilly.com/2013/03/fast-easy-to-use-scalable-data-science-tools.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
