ENTRIES TAGGED "Big Data"
MIT workshop kicks off Obama campaign on privacy
Thrust into controversy by Edward Snowden’s first revelations last year, President Obama belatedly welcomed a “conversation” about privacy. As cynical as you may feel about US spying, that conversation with the federal government has now begun. In particular, the first of three public workshops took place Monday at MIT.
Given the locale, a focus on the technical aspects of privacy was appropriate for this discussion. Speakers extolled the value of data (often invoking the “big data” buzzword), delineated the trade-offs between accumulating useful data and preserving privacy, and introduced technologies that can analyze encrypted data without revealing facts about individuals. Two more workshops will be held in other cities, one focusing on ethics and the other on law.
Other industries can show health care the way
This article was written with Ellen M. Martin.
Most healthcare clinicians rarely think about donating or sharing data. Yet after hearing Stephen Friend of Sage Bionetworks talk about involving citizens and patients in genetic research at StrataRx 2012, I was curious to learn more.
McKinsey estimates $300 billion in potential savings from using open data in healthcare, and a recent IBM Institute for Business Value study pointed to the need for corporate data collaboration.
Also, during my own research for Big Data in Healthcare: Hype and Hope, the resounding request from all the participants I interviewed was to “find more data streams to analyze.”
Applications get easier to build as packaged combinations of open source tools become available
As a user who tends to mix and match many different tools, not having to configure and assemble a suite of tools is a big win, so I’m really liking the recent trend toward more integrated and packaged solutions. A recent example is the relaunch of Cloudera’s Enterprise Data Hub to include Spark and Spark Streaming. Users benefit by gaining automatic access to the analytic engines that come with Spark. Besides simplifying things for data scientists and data engineers, easy access to analytic engines is critical for streamlining the creation of big data applications.
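To give a feel for what that access looks like in practice, here is a minimal Spark sketch in Scala that counts events by type in a log file. It is a generic illustration rather than anything specific to Cloudera’s distribution, and the HDFS paths and record layout are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // implicits for reduceByKey on pair RDDs

object EventCounts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("EventCounts"))

    // Hypothetical input: one tab-separated event per line, event type in field 0.
    val events = sc.textFile("hdfs:///data/events.tsv")

    val countsByType = events
      .map(_.split("\t")(0)) // extract the event type
      .map(t => (t, 1L))
      .reduceByKey(_ + _)    // aggregate a count per type

    countsByType.saveAsTextFile("hdfs:///data/event-counts")
    sc.stop()
  }
}
```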
Another recent example is Dendrite, an interesting new graph analysis solution from Lab41. It combines Titan (a distributed graph database), GraphLab (for graph analytics), and a front end built with AngularJS into a graph exploration and analysis tool for business analysts.
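To give a rough sense of the kind of workload such a stack serves, here is a small sketch that talks to Titan through its Blueprints API from Scala. The configuration file and the toy “knows” schema are hypothetical, and Dendrite’s own internals may well differ.

```scala
import com.thinkaurelius.titan.core.TitanFactory
import com.tinkerpop.blueprints.Direction
import scala.collection.JavaConverters._

object TitanSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical config file naming the storage backend (e.g., Cassandra).
    val graph = TitanFactory.open("titan-cassandra.properties")

    val alice = graph.addVertex(null)
    alice.setProperty("name", "alice")
    val bob = graph.addVertex(null)
    bob.setProperty("name", "bob")
    graph.addEdge(null, alice, bob, "knows")
    graph.commit()

    // Traverse: who does alice know?
    for (v <- alice.getVertices(Direction.OUT, "knows").asScala)
      println(v.getProperty[String]("name"))

    graph.shutdown()
  }
}
```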
When the death of trust meets the birth of BYOD
Dr. Andrew Litt, Chief Medical Officer at Dell, wrote a thoughtful blog post last week about the trade-offs inherent in designing for both the security and accessibility of medical data, especially in an era of BYOD (bring your own device) and the IoT (internet of things). As we begin to see more internet-enabled diagnostic and monitoring devices, Litt writes, “The Internet of Things (no matter what you think of the moniker), is related to BYOD in that it could, depending on how hospitals set up their systems, introduce a vast array of new access points to the network. … a very scary thought when you consider the sensitivity of the data that is being transmitted.”
As he went on to describe possible security solutions (e.g., store all data in central servers rather than on local devices), I was reminded of a post my colleague Simon St.Laurent wrote last fall about “security after the death of trust.” In the wake of some high-profile security breaches, including news of NSA activities, St.Laurent says, we have a handful of options when it comes to data security—and you’re not going to like any of them.
We must go beyond hype to create incentives for providing data to researchers
The FDA order stopping 23andMe from offering its genetic test kit strikes at the heart of the major issue in health care reform: the tension between individual care and collective benefit. Health is not an individual matter. As I will show, we need each other. And beyond narrow regulatory questions, the 23andMe issue opens up the whole goal of information sharing and the funding of health care reform.
Tutorials for designers, data scientists, data engineers, and managers
As the Program Development Director for Strata Santa Clara 2014, I am pleased to announce that the tutorial session descriptions are now live. We’re offering several day-long immersions, including the popular Data Driven Business Day and Hardcore Data Science tracks. We curated these topics to appeal to a broad range of attendees, including business users and managers, designers, data analysts/scientists, and data engineers. In the coming months, we’ll have a series of guest posts from many of the instructors and communities behind the tutorials.
Analytics for Business Users
We’re offering a series of data-intensive tutorials for non-programmers. John Foreman will use spreadsheets to demonstrate, step by step, how data science techniques work – a topic that should appeal to those tasked with advanced business analysis. Grammar of Graphics author, SYSTAT creator, and noted statistician Leland Wilkinson will teach an introductory course on analytics using an innovative expert system he helped build.
Data Science essentials
Scalding – a Scala API for Cascading – is one of the most popular open source projects in the Hadoop ecosystem. Vitaly Gordon will lead a hands-on tutorial on how to use Scalding to put together effective data processing workflows. Data analysts have long lamented the amount of time they spend on data wrangling. But what if you had access to tools and best practices that would make data wrangling less tedious? That’s exactly the tutorial that distinguished professors and Trifacta co-founders Joe Hellerstein and Jeff Heer are offering.
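For a flavor of what Scalding code looks like, here is the canonical word-count job written against its fields-based API. This is a generic sketch, not material from the tutorial, and the input/output paths are supplied as arguments at run time.

```scala
import com.twitter.scalding._

// Run on Hadoop with something like:
//   hadoop jar myjob.jar com.twitter.scalding.Tool WordCountJob --hdfs \
//     --input /data/docs.txt --output /data/word-counts
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.toLowerCase.split("\\s+") }
    .groupBy('word) { _.size } // emits a count per distinct word
    .write(Tsv(args("output")))
}
```

The ('line -> 'word) pair maps input fields to output fields, which is much of what keeps these pipelines terse.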
The co-founders of Datascope Analytics are offering a glimpse into how they use design thinking to help clients identify the appropriate problem or opportunity to focus on (see the recent Datascope/IDEO post on Design Thinking and Data Science). We’re also happy to reprise Scott Murray’s popular d3.js tutorial from Strata Santa Clara 2013.
By John Russell
When I came to work on the Cloudera Impala project, I found many things that were familiar from my previous experience with relational databases, UNIX systems, and the open source world. Yet other aspects were all new to me. I know from documenting both enterprise software and open source projects that it’s a special challenge when those two aspects converge. A lot of new users come in with 95% of the information they need, but they don’t know where the missing or outdated 5% is. One mistaken assumption or unfamiliar buzzword can make someone feel like a complete beginner. That’s why I was happy to have the opportunity to write this overview article, with room to explore how users from all kinds of backgrounds can understand and start using the Cloudera Impala product.
For database users, the Apache Hadoop ecosystem can feel like a new world:
- Sysadmins don’t bat an eye when you say you want to work on terabytes or petabytes of data.
- A networked cluster of machines isn’t a complicated or scary proposition. Instead, it’s the standard environment you ask an intern to set up on their first day as a training exercise.
- All the related open source projects aren’t an either-or proposition. You work with a dozen components that all interoperate, stringing them together like a UNIX toolchain (see the sketch after this list).
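To make that new world slightly less new, here is a hedged sketch of a first Impala query from Scala over JDBC. It assumes the cluster exposes Impala through the standard HiveServer2 JDBC driver on an unsecured port; the host, port, and web_logs table are hypothetical.

```scala
import java.sql.DriverManager

object FirstImpalaQuery {
  def main(args: Array[String]): Unit = {
    // Impala speaks the HiveServer2 wire protocol; 21050 is its usual JDBC port.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://impala-host:21050/;auth=noSasl")
    try {
      val rs = conn.createStatement().executeQuery(
        "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page ORDER BY hits DESC LIMIT 10")
      while (rs.next())
        println(s"${rs.getString("page")}\t${rs.getLong("hits")}")
    } finally conn.close()
  }
}
```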
The trend is clear: The CIO’s IT budget is getting smaller and the CMO’s IT budget is getting larger. As a result, the CIO’s role is diminishing and the CMO’s role is expanding. From a business perspective, the shift feels inevitable. Despite talk about transforming corporate IT organizations from cost centers into profit centers, the role of the CIO has remained largely administrative.
In their hearts, most CIOs know the score. They’ve won their battle to earn “a seat at the table,” but the table has gotten smaller. The main challenges ahead of them are technical, not strategic. Their key areas of focus today are mobility, cloud, and security. They are aware of big data, but it’s just not a survival issue for them.
At the most basic level, stream mining is about generating summaries that can be used to answer fundamental questions
A number of open source, distributed stream-processing frameworks have become essential components in many big data technology stacks. Apache Storm remains the most popular, but promising new tools like Spark Streaming and Apache Samza are going to have their share of users. These tools excel at data processing and are also used for data mining – in many cases, users have to write a bit of code to do stream mining. The good news is that easy-to-use stream mining libraries will likely emerge in the near future.
High-volume data streams (data that arrives continuously) arise in many settings, including IT operations, sensors, and social media. What can one learn by looking at data one piece (or a few pieces) at a time? Can techniques that look at smaller representations of data streams be used to unlock their value? In this post, I’ll briefly summarize a recent overview given by stream mining pioneer Graham Cormode.
Massive amounts of data arriving at high velocity pose a challenge to data miners. At the most basic level, stream mining is about generating summaries that can be used to answer fundamental questions about the data.
Properly constructed summaries are useful for highlighting emerging patterns, trends, and anomalies. Common summaries (frequency moments in stream mining parlance) include a list of distinct items, recently trending items, heavy hitters (items that have appeared frequently), and the top k (most popular) items.
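As a concrete example of such a summary, here is a minimal sketch of the Misra–Gries algorithm, one classic way to find heavy-hitter candidates in a single pass with bounded memory. The toy stream and the choice of k are hypothetical, and a production system would reach for a hardened stream mining library instead.

```scala
object MisraGries {
  /** One-pass heavy-hitter summary: any item occurring more than n/k times
    * in a stream of n items is guaranteed to survive among the counters. */
  def heavyHitters[T](stream: Iterator[T], k: Int): Map[T, Long] = {
    var counters = Map.empty[T, Long]
    for (item <- stream) {
      if (counters.contains(item))
        counters += item -> (counters(item) + 1)
      else if (counters.size < k - 1)
        counters += item -> 1L
      else
        // Counters full: decrement every count and drop any that reach zero.
        counters = counters.map { case (i, c) => i -> (c - 1) }.filter(_._2 > 0)
    }
    counters // candidate heavy hitters; counts are lower-bound estimates
  }

  def main(args: Array[String]): Unit = {
    val stream = Iterator("a", "b", "a", "c", "a", "b", "a", "d", "a")
    println(heavyHitters(stream, k = 3)) // "a" (5 of 9 items) survives as a candidate
  }
}
```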
"Modelers have a bigger responsibility now than ever before."
People come to data science in all sorts of ways. I happen to be someone who came via finance. Trained as a mathematician, I worked first at a hedge fund and then a financial risk software company, each for about two years, starting in June 2007 and ending in February 2011. If you look at those dates again, you’ll realize I had a front row seat for the financial crisis.
I worked on a few projects in algorithmic trading with Larry Summers at the hedge fund and was invited, along with the other quants at D.E. Shaw, to see him discuss the impending doom one evening with Alan Greenspan and Robert Rubin. It honestly surprised and shocked me to see how little they seemed to know, or at least admitted to knowing, about the true situation in the markets. These guys were supposed to be the experts, after all.