ENTRIES TAGGED "data analysis"

MATLAB, R, and Julia: Languages for data analysis

Inside the core features of specialized data analysis languages.

Big data frameworks like Hadoop have received a lot of attention recently, and with good reason: when you have terabytes of data to work with — and these days, who doesn’t? — it’s amazing to have affordable, reliable and ubiquitous tools that allow you to spread a computation over tens or hundreds of CPUs on commodity hardware. The dirty truth is, though, that many analysts and scientists spend as much time or more working with mere megabytes or gigabytes of data: a small sample pulled from a larger set, or the aggregated results of a Hadoop job, or just a dataset that isn’t all that big (like, say, all of Wikipedia, which can be squeezed into a few gigs without too much trouble).

At this scale, you don’t need a fancy distributed framework. You can just load the data into memory and explore it interactively in your favorite scripting language. Or, maybe, a different scripting language: data analysis is one of the few domains where special-purpose languages are very commonly used. Although similar in many respects to other dynamic languages like Ruby or JavaScript, these languages have syntax and built-in data structures that make common data analysis tasks both faster and more concise. This article will briefly cover some of these core features for two languages that have been popular for decades — MATLAB and R — and another, Julia, that was just announced this year.
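As a minimal, hypothetical sketch of that concision (here in MATLAB, one of the languages covered below, though the same spirit applies to R and Julia), filtering and summarizing an in-memory dataset takes a couple of vectorized lines rather than an explicit loop; the data below is a random stand-in, not a real dataset:

    % Stand-in for a small dataset already loaded into memory.
    data = rand(1000, 3);
    % Keep only the rows where the first column exceeds 0.5 (logical indexing).
    subset = data(data(:, 1) > 0.5, :);
    % Column means of the filtered rows, computed without a loop.
    disp(mean(subset))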

MATLAB

MATLAB is one of the oldest programming languages designed specifically for data analysis, and it is still extremely popular today. MATLAB was conceived in the late ’70s as a simple scripting language wrapped around the FORTRAN libraries LINPACK and EISPACK, which at the time were the best way to efficiently work with large matrices of data — as they arguably still are, through their successor LAPACK. These libraries, and thus MATLAB, were solely concerned with one data type: the matrix, a two-dimensional array of numbers.

This may seem very limiting, but in fact, a very wide range of scientific and data-analysis problems can be represented as matrix problems, and often very efficiently. Image processing, for example, is an obvious fit for the 2D data structure; less obvious, perhaps, is that a directed graph (like Twitter’s follow graph, or the graph of all links on the web) can be expressed as an adjacency matrix, and that graph algorithms like Google’s PageRank can be easily implemented as a series of additions and multiplications of these matrices. Similarly, the winning entry to the Netflix Prize recommendation challenge relied, in part, on a matrix representation of everyone’s movie ratings (you can imagine every row representing a Netflix user, every column a movie, and every entry in the matrix a rating), and in particular on an operation called Singular Value Decomposition, one of those original LINPACK matrix routines that MATLAB was designed to make easy to use.
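To make that concrete, here is a minimal sketch in MATLAB/Octave syntax, using a made-up four-node graph and a made-up four-user, four-movie ratings matrix (both purely hypothetical): the graph lives in an adjacency matrix, a PageRank-style score is approximated by repeated matrix-vector products, and the ratings matrix is factored with the built-in svd routine.

    % Hypothetical four-node directed graph as an adjacency matrix:
    % A(i,j) = 1 means node i links to node j.
    A = [0 1 1 0;
         0 0 1 0;
         1 0 0 1;
         0 0 1 0];

    % Transition matrix: scale each node's outgoing links by its out-degree,
    % so column i holds the probabilities of moving from node i to each node.
    M = A' * diag(1 ./ sum(A, 2));

    % PageRank-style score by power iteration with the usual 0.85 damping.
    n = size(A, 1);
    r = ones(n, 1) / n;
    for k = 1:50
        r = 0.85 * M * r + 0.15 / n;
    end
    disp(r')            % approximate importance score for each node

    % A tiny made-up users-by-movies ratings matrix, factored with the SVD.
    R = [5 4 0 1;
         4 5 1 0;
         1 0 5 4;
         0 1 4 5];
    [U, S, V] = svd(R);
    disp(diag(S)')      % singular values: strength of each latent "taste" factor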


Statwing simplifies data analysis

Quickly perform and interpret the results of routine Small Data analysis

With so much focus on Big Data, the needs of many analysts who work with Small Data tend to get ignored. The default tools for many of these users remain spreadsheets(1) and/or statistical packages that come with a lot of features and options. However, many analysts need only a very small subset of what these tools have to offer.

Enter Statwing, a software-as-a-service provider for routine statistical analysis. While the tool is still in the early stages, it can already do many basic “data analysis” tasks.

Consider the following example of a pivot table constructed in Excel: it required 8 mouse-clicks, if you do everything perfectly, and about 5 decisions (which variables to include, which metric to use, …).

The same task in Statwing required 4 mouse-clicks and 0 decisions! Plus, it comes with visuals.

The lack of clutter and the addition of a simple “headline” (“Female tends to have much higher values for satisfaction than Male”) make the result much easier to interpret. The advanced tab contains detailed statistical analysis (in this case, the p-value, counts, and values). Many users get confused by the output produced by traditional statistical software. Let’s face it: many analysts have had little training in statistics. I welcome a tool that produces readily interpretable results.

The company hopes to replicate the above example across a wide variety of routine data analysis tasks. Their initial focus is on tools for (consumer) survey analysis, a potentially huge market given that online companies have made surveys so much easier to conduct. Users of Statwing pay a small monthly subscription, making it cheaper than most(2) statistical packages. Its intuitive UI lets analysts get their tasks done quickly; more importantly, Statwing may nurture aspiring data scientists in your organization.


(1) As this recent Strata presentation points out, spreadsheets are the glue that keeps many organizations together.

(2) Open source tools like OpenOffice, R and Octave are free. So is the use of Google spreadsheets.

Data as seeds of content

A look at lesser-known ways to extract insight from data.

Visualizations are one way to make sense of data, but they aren't the only way. Robbie Allen reveals six additional outputs that help users derive meaningful insights from data.

Automated science, deep data and the paradox of information

Be aware of the just-so data stories that sound reasonable but cannot be conclusively proven.

Bradley Voytek: "Our goal as data scientists should be to distill the essence of the data into something that tells as true a story as possible while being as simple as possible to understand."

Profile of the Data Journalist: The Homicide Watch

Chris Amico and Laura Norton Amico's project started as a spreadsheet. Now it's a community news platform.

To learn more about the people who are redefining the practice of computer-assisted reporting, and in some cases building the newsroom stack for the 21st century, Radar conducted a series of email interviews with data journalists during the 2012 NICAR Conference. "It’s not just about the data, and it’s not just about the journalism, but it’s about meeting a community need in an innovative way," said Laura Norton Amico.

Unstructured data is worth the effort when you’ve got the right tools

Alyona Medelyan and Anna Divoli on the opportunities in chaotic data.

Alyona Medelyan and Anna Divoli are inventing tools to help companies contend with vast quantities of fuzzy data. They discuss their work and what lies ahead for big data in this interview.

Strata Week: Why ThinkUp matters

ThinkUp and data ownership, DataSift turns on its Twitter firehose, and Google cracks open the door to BigQuery.

Data democratization gets an important new tool with the release of ThinkUp 1.0. Also, DataSift offers another way to get the Twitter firehose, and Google offers a little more access to its BigQuery data analytics service.

Social network analysis isn’t just for social networks

Social network analysis (SNA) finds meaningful patterns in relationship data.

The scientific methodology of social network analysis (SNA) helps explain not just how people connect, but why they come together as well. Here, "Social Network Analysis for Startups" co-author Maksim Tsvetovat offers a primer on SNA.

Global Adaptation Index enables better data-driven decisions

The Global Adaptation Index combines development indicators from 161 countries.

Speed, accessibility and open data have come together in the Global Adaptation Index, a new data browser that rates a given country's vulnerability to environmental shifts.

Look at Cook sets a high bar for open government data visualizations

Open source tools and a focus on user experience elevate Cook County's "Look at Cook" data website.

One of the best recent efforts at visualizing open government data can be found at LookatCook.com, which tracks government budgets and expenditures from 1993 through 2011 in Cook County, Illinois.
