ENTRIES TAGGED "data"
Insights from a business executive and law professor
If you develop software or manage databases, you’re probably at the point now where the phrase “Big Data” makes you roll your eyes. Yes, it’s hyped quite a lot these days. But, overexposed or not, the Big Data revolution raises a bunch of ethical issues related to privacy, confidentiality, transparency and identity. Who owns all that data that you’re analyzing? Are there limits to what kinds of inferences you can make, or what decisions can be made about people based on those inferences? Perhaps you’ve wondered about this yourself.
We’re obsessed by these questions. We’re a business executive and a law professor who’ve written about them a lot, though our audience is usually lawyers. Because engineers are the ones who confront these questions on a daily basis, we think it’s essential to talk about these issues in the context of software development.
While there’s nothing particularly new about the analytics conducted in big data, the scale and ease with which it can all be done today changes the ethical framework of data analysis. Developers today can tap into remarkably varied and far-flung data sources. Just a few years ago, this kind of access would have been hard to imagine. The problem is that our ability to reveal patterns and new knowledge from previously unexamined troves of data is moving faster than our current legal and ethical guidelines can manage. We can now do things that were impossible a few years ago, and we’ve driven off the existing ethical and legal maps. If we fail to preserve the values we care about in our new digital society, then our big data capabilities risk abandoning these values for the sake of innovation and expediency.
Collecting actionable data is a challenge for today's data tools
One of the problems dragging down the US health care system is that nobody trusts one another. Most of us, as individuals, place faith in our personal health care providers, which may or may not be warranted. But on a larger scale we’re all suspicious of each other:
- Doctors don’t trust patients, who aren’t forthcoming with all the bad habits they indulge in and often fail to follow the most basic instructions, such as to take their medications.
- The payers–which include insurers, many government agencies, and increasingly the whole patient population as our deductibles and other out-of-pocket expenses ascend–don’t trust the doctors, who waste an estimated 20% or more of all health expenditures, including some thirty or more billion dollars of fraud each year.
- The public distrusts the pharmaceutical companies (although we still follow their advice on advertisements and ask our doctors for the latest pill) and is starting to distrust clinical researchers as we hear about conflicts of interest and difficulties replicating results.
- Nobody trusts the federal government, which pursues two (contradictory) goals of lowering health care costs and stimulating employment.
Yet everyone has beneficent goals and good ideas for improving health care. Doctors want to feel effective, patients want to stay well (even if that desire doesn’t always translate into action), the Department of Health and Human Services champions very lofty goals for data exchange and quality improvement, clinical researchers put their work above family and comfort, and even private insurance companies are trying to move to “fee for value” programs that ensure coordinated patient care.
In order to make an effective decision, I need to understand key issues about the design, performance, and cost of cars, regardless of whether or not I actually know how to build one myself. The same is true for people deciding if machine learning is a good choice for their business goals or project. Will the payoff be worth the effort? What machine learning approach is most likely to produce valuable results for your particular situation? What size team with what expertise is necessary to be able to develop, deploy, and maintain your machine learning system?
Given the complex and previously esoteric nature of machine learning as a field – the sometimes daunting array of learning algorithms and the math needed to understand and employ them – many people feel the topic is one best left only to the few.
Other industries can show health care the way
This article was written with Ellen M. Martin.
Most healthcare clinicians don’t often think about donating or sharing data. Yet, after hearing Stephen Friend of Sage Bionetworks talk about involving citizens and patients in the field of genetic research at StrataRx 2012, I was curious to learn more.
McKinsey points out the 300 billion dollars in potential savings from using open data in healthcare, while a recent IBM Institute of Business Value study showed the need for corporate data collaboration.
Also, during my own research for Big Data in Healthcare: Hype and Hope, the resounding request from all the participants I interviewed was to “find more data streams to analyze.”
The 30,000-foot view and the nitty gritty details of working with electronic health data
Ever wonder what the heck “meaningful use” really means? By now, you’ve probably heard it come up in discussions of healthcare data. You might even know that it specifically pertains to electronic health records (EHRs). But what is it really about, and why should you care?
If you’ve ever had to carry a large folder of paper between specialists, or fill out the same medical history form in different offices over and over—with whatever details you happen to remember off the top of your head that day—then you already have some idea of why EHRs are a desirable thing. The idea is that EHRs will lead to better care—and better research data—through more complete and accurate record-keeping, and will eventually become part of health information exchanges (HIEs) with features like trend analysis and push-notifications. However, the mere installation of EHR software isn’t enough; we need not just cursory use but meaningful use of EHRs, and we need to ensure that the software being used meets certain standards of efficiency and security.
Today, it’s shocking (and honestly exciting) how much of my daily experience is determined by a recommender system. These systems drive amazing experiences everywhere, telling me where to eat, what to listen to, what to watch, what to read, and even who I should be friends with. Furthermore, information overload is making recommender systems indispensable, since I can’t find what I want on the web simply using keyword search tools. Recommenders are behind the success of industry leaders like Netflix, Google, Pandora, eHarmony, Facebook, and Amazon. It’s no surprise companies want to integrate recommender systems with their own online experiences. However, as I talk to team after team of smart industry engineers, it has become clear that building and managing these systems is usually a bit out of reach, especially given all the other demands on the team’s time.
In the summer of 2012, Accel Partners hosted an invitation-only Big Data conference at Stanford. Ping Li stood near the exit with a checkbook, ready to invest $1MM in pitches for real-time analytics on clusters. However, real-time means many different things. For MetaScale working on the Sears turnaround, real-time means shrinking a 6-hour window on a mainframe to 6 minutes on Hadoop. For a hedge fund, real-time means compiling Python to run on GPUs where milliseconds matter, or running on FPGA hardware for microsecond response.
With much emphasis on Hadoop circa 2012, one might think that no other clusters existed. Nothing could be further from the truth: Memcached, Ruby on Rails, Cassandra, Anaconda, Redis, Node.js, etc. – all in large-scale production use for mission critical apps, much closer to revenue than the batch jobs. Google emphasizes a related point in their Omega paper: scheduling batch jobs is not difficult, while scheduling services on a cluster is a hard problem, and that translates to lots of money.
Tools, Trends, What Pays (and What Doesn't) for Data Professionals
There is no shortage of news about the importance of data or the career opportunities within data. Yet a discussion of modern data tools can help us understand what the current data evolution is all about, and it can also be used as a guide for those considering stepping into the data space or progressing within it.
In our report, 2013 Data Science Salary Survey, we make our own data-driven contribution to the conversation. We surveyed attendees of the Strata Conference in New York and Santa Clara, California, about their tool usage and salary.
Strata attendees span a wide spectrum within the data world: Hadoop experts and business leaders, software developers and analysts. By no means does everyone use data on a “Big” scale, but almost all attendees have some technical aspect to their role. Strata attendees may not represent a random sample of all professionals working with data, but they do represent a broad slice of the population. If there is a bias, it is likely toward the forefront of the data space, with attendees using the newest tools (or being very interested in learning about them).
Skills of the Agile Data Wrangler
As data processing has become more sophisticated, there has been little progress on improving the most time-consuming and tedious parts of the pipeline: Data Transformation tasks including discovery, structuring, and content cleaning. In standard practice, this kind of “data wrangling” requires writing idiosyncratic scripts in programming languages such as Python or R, or extensive manual editing using interactive tools such as Microsoft Excel. This has two significant negative consequences. First, people with highly specialized skills (e.g., statistics, molecular biology, micro-economics) spend far more time in tedious data wrangling tasks than they do in exercising their specialty. Second, less technical users are often unable to wrangle their own data. The result in both cases is that significant data is often left unused due to the hurdle of transforming it into shape. Sadly, when it comes to standard practice in modern data analysis, “the tedium is the message.” In our upcoming tutorial at Strata, we will survey both the sources of, and the solutions to, the problems of Data Transformation.
Analysts must regularly transform data to make it palatable to databases, statistics packages, and visualization tools. Data sets also regularly contain missing, extreme, duplicate or erroneous values that can undermine the results of analysis. These anomalies come from various sources, including human data entry error, inconsistencies between integrated data sets, and sensor interference. Our own interviews with data analysts have found that these types of transforms constitute the most tedious component of their analytic process. Flawed analyses due to dirty data are estimated to cost billions of dollars each year. Discovering and correcting data quality issues can also be costly: some estimate cleaning dirty data to account for 80 percent of the cost of data warehousing projects.
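To make the three kinds of anomaly above concrete, here is a minimal sketch in Python of an audit pass that flags missing, duplicate, and extreme values in a list of records. Nothing here comes from the tutorial itself: the function name audit_records, the record layout, and the "temp" field are hypothetical, and the outlier rule (a modified z-score based on the median absolute deviation, with a cutoff of 3.5) is just one common heuristic picked for illustration.

```python
from statistics import median


def audit_records(records, field, threshold=3.5):
    """Return the indices of missing, duplicate, and extreme entries.

    records   -- list of dicts with hashable values (hypothetical layout)
    field     -- name of the numeric field to check
    threshold -- cutoff for the modified z-score (assumed default: 3.5)
    """
    missing, duplicates = [], []
    seen = set()
    for i, rec in enumerate(records):
        # A record is a duplicate if an identical one appeared earlier.
        key = tuple(sorted(rec.items()))
        if key in seen:
            duplicates.append(i)
        seen.add(key)
        if rec.get(field) is None:
            missing.append(i)

    # Judge extremes only on values that are present and not repeats.
    present = [
        (i, rec[field])
        for i, rec in enumerate(records)
        if rec.get(field) is not None and i not in duplicates
    ]
    extreme = []
    if present:
        values = [v for _, v in present]
        med = median(values)
        mad = median([abs(v - med) for v in values])
        if mad > 0:
            # Modified z-score: robust to the very outliers it looks for.
            extreme = [
                i for i, v in present
                if 0.6745 * abs(v - med) / mad > threshold
            ]
    return {"missing": missing, "duplicates": duplicates, "extreme": extreme}


if __name__ == "__main__":
    readings = [
        {"id": 1, "temp": 20.1},
        {"id": 2, "temp": 19.8},
        {"id": 3, "temp": None},   # missing value
        {"id": 4, "temp": 20.4},
        {"id": 4, "temp": 20.4},   # exact duplicate of the previous record
        {"id": 5, "temp": 250.0},  # implausible reading
        {"id": 6, "temp": 20.0},
    ]
    print(audit_records(readings, "temp"))
```

A pass like this only reports where the problems are. Deciding whether to drop, impute, or correct each flagged record still depends on where the data came from, which is a large part of why wrangling stays so labor-intensive.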
According to the Committee to Protect Journalists, 2013 was the second-worst year on record for imprisoning journalists around the world for doing their work.
Which makes this story from PBS Idea Lab all the more important: How Journalists Can Stay Secure Reporting from Android Devices. There are tips here on how to anonymize data flowing through your phone using Tor, an open network that helps protect against traffic analysis and network surveillance. Also, there is information about video publishing software that facilitates YouTube posting, even if the site is blocked in your country. Very cool.
The Nieman Lab is publishing an ongoing series of Predictions for Journalism in 2014, and, predictably, the idea of harnessing data looms large. Hassan Hodges, director of innovation for the MLive Media Group, says that in this new journalism landscape, content will start to look more like data and data will look more like content. Poderopedia founder Miguel Paz says that news organizations should fire the consultants and hire more nerds. There are 51 contributions so far, and counting. It’s good reading.