ENTRIES TAGGED "data"
According to the Committee to Protect Journalists, 2013 was the second worst year on record for imprisoning journalists around the world for doing their work.
Which makes this story from PBS Idea Lab all the more important: How Journalists Can Stay Secure Reporting from Android Devices. There are tips here on how to anonymize data flowing through your phone using Tor, an open network that helps protect against traffic analysis and network surveillance. Also, there is information about video publishing software that facilitates YouTube posting, even if the site is blocked in your country. Very cool.
The Nieman Lab is publishing an ongoing series of Predictions for Journalism in 2014, and, predictably, the idea of harnessing data looms large. Hassan Hodges, director of innovation for the MLive Media Group, says that in this new journalism landscape, content will start to look more like data and data will look more like content. Poderopedia founder Miguel Paz says that news organizations should fire the consultants and hire more nerds. There are 51 contributions so far, and counting. It’s good reading.
Human judgment is at the center of successful data analysis. This statement might initially seem at odds with the current Big Data frenzy and its focus on data management and machine learning methods. But while these tools provide immense value, it is important to remember that they are just that: tools. A hammer does not a carpenter make — though it certainly helps.
Consider the words of John Tukey, possibly the greatest statistician of the last half-century: “Nothing — not the careful logic of mathematics, not statistical models and theories, not the awesome arithmetic power of modern computers — nothing can substitute here for the flexibility of the informed human mind. Accordingly, both approaches and techniques need to be structured so as to facilitate human involvement and intervention.” Tukey goes on to write: “Some implications for effective data analysis are: (1) that it is essential to have convenience of interaction of people and intermediate results and (2) that at all stages of data analysis the nature and detail of output need to be matched to the capabilities of the people who use it and want it.” Though Tukey and colleagues voiced these sentiments nearly 50 years ago, they ring even more true today. The interested analyst is at the heart of the Big Data question: how well do our tools help users ask better questions, formulate hypotheses, spot anomalies, correct errors and create improved models and visualizations? To “facilitate human involvement” across “all stages of data analysis” is a grand challenge for our age.
Making Machine Learning Accessible & Usable
Big Data may seem like a familiar concept to those working in IT, but for most executives it’s difficult to imagine just how much Big Data impacts business on a daily basis. Most companies already collect customer data, ranging from purchase habits to social media interactions, but few translate their data into actionable business insights. By applying advanced analytics to Big Data, companies can identify patterns and make predictions from huge amounts of information that a single human analyst could never see, let alone understand.
Machine Learning – the core technology behind this type of Big Data analytics – involves a collection of algorithms that are designed to uncover patterns that classical statistical algorithms often fail to detect. Procedures like k-means clustering, support vector machines, Bayes nets, and decision trees are flexible and adapt themselves to nonlinear and high-dimensional data structures. This flexibility comes with a price, however. Expert users must decide in advance on a host of parameter settings – kernel types, cluster numbers, prior probabilities, and so on. The complexity of these decisions necessarily eludes the average analyst. Furthermore, Machine Learning algorithms rest on certain assumptions that are similar to those required for classical statistical analysis. Outliers, missing values, and unusual distributions can invalidate the conclusions drawn from Machine Learning applications.
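To make the point about parameter settings concrete, here is a minimal sketch of k-means on one-dimensional data, written in plain Python. It is an illustration only, not code from the original post: the function name `kmeans_1d`, the starting-center rule, and the toy data are all assumptions chosen for clarity. The thing to notice is that the number of clusters `k` has to be supplied by the analyst before the algorithm runs.

```python
# Minimal 1-D k-means sketch (illustrative; names and data are hypothetical).
def kmeans_1d(points, k, iterations=20):
    """Group 1-D points into k clusters using plain Lloyd iterations."""
    ordered = sorted(points)
    # Assumed initialisation: spread the starting centers across the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: each point joins its nearest center.
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Toy data with three visible groups, near 1, 4 and 9.
data = [1.0, 1.2, 0.8, 4.0, 4.3, 3.9, 9.0, 9.2, 8.8]

centers_2, clusters_2 = kmeans_1d(data, k=2)
centers_3, clusters_3 = kmeans_1d(data, k=3)

print(sorted(len(c) for c in clusters_2))
print(sorted(len(c) for c in clusters_3))
```

With `k=2` the two lower groups are merged into a single cluster, while `k=3` separates all three. The algorithm itself gives no signal about which choice is right; that judgment stays with the analyst.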
Lessons from the design community for developing data-driven applications
When someone says “that is a nice infographic” or “check out this sweet dashboard,” many people infer that the work is “well-designed.” But creating accessible (or for the cynical, “pretty”) content is only part of what makes good design powerful. The design process is geared toward solving specific problems. This process has been formalized in many ways (e.g., IDEO’s Human Centered Design, Marc Hassenzahl’s User Experience Design, or Braden Kowitz’s Story-Centered Design), but the basic idea is that you have to explore the breadth of the possible before you can isolate truly innovative ideas. We, at Datascope Analytics, argue that the same is true of designing effective data science tools, dashboards, engines, etc.: to design effective dashboards, you must know what is possible.
We must go beyond hype for incentives to provide data to researchers
The FDA order stopping 23andMe from offering its genetic test kit strikes right into the heart of the major issue in health care reform: the tension between individual care and collective benefit. Health is not an individual matter. As I will show, we need each other. And beyond narrow regulatory questions, the 23andMe issue opens up the whole goal of information sharing and the funding of health care reform.
By Dean Malmgren and Jon Wettersten
There’s a lot of hype around “Big Data” these days. Don’t believe us? None other than the venerable Harvard Business Review named “data scientist” the “Sexiest Job of the 21st Century” only 13 years into it. Seriously. Some of these accolades are deserved. It’s decidedly cheaper to store data now than it is to analyze it, which is considerably different than 10 or 20 years ago. Other aspects, however, are less deserved. In isolation, big data and data scientists don’t hold some magic formula that’s going to save the world, radically transform businesses, or eliminate poverty. The act of solving problems is decidedly different than amassing a data set the size of 200 trillion Moby Dicks or setting a team of nerds loose on the data. Problem solving not only requires a high-level conceptual understanding of the challenge, but also a deep understanding of the nuances of a challenge, how those nuances affect businesses, governments, and societies, and—don’t forget—the creativity to address these challenges.
In our experience, solving problems with data necessitates a diversity of thought and an approach that balances number crunching with thoughtful design to solve targeted problems. Ironically, we don’t believe this means that it’s important to have an army of PhDs with deep knowledge on every topic under the sun. Rather, we find it’s important to have multi-disciplinary teams of curious, thoughtful, and motivated learners with a broad range of interests who aren’t afraid to immerse themselves in a totally ambiguous topic. With this common vision, IDEO and Datascope Analytics decided to embark on an experiment and integrate our teams to collaborate on a few big data projects over the last year. We thought we’d share a few things here we’ve learned along the way.
Myths and Realities
Since its first public release in February 2012, the Julia programming language has received a lot of hype. This has led to some confusion about the language’s current status. In this post, I’d like to make clear where Julia stands and where Julia is going, especially in regard to Julia’s role in data science, where the dominant languages are R and Python. We’re working hard to make Julia a viable alternative to those languages, but it’s important to separate out myth from reality.
Where Julia Stands
In order to dispel some of the confusion about Julia, I want to discuss the two main types of misunderstandings that I come across:
- Confusion 1: Julia already possesses a mature package ecosystem and can be used as a feature-complete replacement for R or Python.
- Confusion 2: Julia’s compiler is so good that it will make any piece of code fast – even bad code.
By John Russell
When I came to work on the Cloudera Impala project, I found many things that were familiar from my previous experience with relational databases, UNIX systems, and the open source world. Yet other aspects were all new to me. I know from documenting both enterprise software and open source projects that it’s a special challenge when those two aspects converge. A lot of new users come in with 95% of the information they need, but they don’t know where the missing or outdated 5% is. One mistaken assumption or unfamiliar buzzword can make someone feel like a complete beginner. That’s why I was happy to have the opportunity to write this overview article, with room to explore how users from all kinds of backgrounds can understand and start using the Cloudera Impala product.
For database users, the Apache Hadoop ecosystem can feel like a new world:
- Sysadmins don’t bat an eye when you say you want to work on terabytes or petabytes of data.
- A networked cluster of machines isn’t a complicated or scary proposition. Instead, it’s the standard environment you ask an intern to set up on their first day as a training exercise.
- All the related open source projects aren’t an either-or proposition. You work with a dozen components that all interoperate, stringing them together like a UNIX toolchain.
Behind the scenes with Datascope Analytics
During a trip to Chicago for a conference on R, I had a chance to cowork at the Datascope Analytics (DsA) office. While I had worked with co-founders Mike and Dean before, this was my first time coworking at their office. It was an eye-opening experience. Why? The culture. I saw how this team of data scientists with different backgrounds connected with each other as they worked, collaborated, and joked around. I also observed how intensely present everyone was…whether they were joking or working. I completely understand how much work and commitment it takes to facilitate such a creative and collaborative environment.
Over the next few months, this initial coworking experience led to many conversations with Dean and Mike about building data science teams, Strata, design, and data both in Chicago and the SF Bay Area. I also got to know a few of the other team members such as Aaron, Bo, Gabe, and Irmak. Admittedly, the more I got to know the team, the more intensely curious I became about the human-centered design “ideation” workshops that they hold for clients. According to Aaron, the workshops “combine elements from human-centered design to diverge and converge on valuable and viable ideas, solutions, strategies for our clients. We start by creating an environment that spurs creativity and encourages wild ideas. After developing many different ideas, we cull them down and focus on the ones that are viable to add life and meaning.”
Esri conference highlights uses of GIS data
We’ve all seen cool maps of health data, such as these representations of diabetes prevalence by US county. But few people think about how thoroughly geospatial data is transforming public health and changing the allocation of resources at individual hospitals. I got a peek into this world at the Esri Health GIS Conference this week in Cambridge, Mass.