ENTRIES TAGGED "Big Data"
Yawn. Yet another article trashing “big data,” this time an op-ed in the Times. This one is better than most, and ends with the truism that data isn’t a silver bullet. It certainly isn’t.
I’ll spare you all the links (most of which are much less insightful than the Times piece), but the backlash against “big data” is clearly in full swing. I wrote about this more than a year ago, in my piece on data skepticism: data is heading into the trough of a hype curve, driven by overly aggressive marketing, promises that can’t be kept, and spurious claims that, if you have enough data, correlation is as good as causation. It isn’t; it never was; it never will be. The paradox of data is that the more data you have, the more spurious correlations will show up. Good data scientists understand that. Poor ones don’t.
It’s very easy to say that “big data is dead” while you’re using Google Maps to navigate downtown Boston. It’s easy to say that “big data is dead” while Google Now or Siri is telling you that you need to leave 20 minutes early for an appointment because of traffic. And it’s easy to say that “big data is dead” while you’re using Google, or Bing, or DuckDuckGo to find material to help you write an article claiming that big data is dead.
HBase has made inroads in companies across many industries and countries
With HBaseCon right around the corner, I wanted to take stock of one of the more popular1 components in the Hadoop ecosystem. Over the last few years, many more companies have come to rely on HBase to run key products and services. The conference will showcase a wide variety of such examples, and highlight some of the new features that HBase developers have added over the past year. In the meantime here are some things2 you may not have known about HBase:
Many companies have had HBase in production for 3+ years: Large technology companies including Trend Micro, EBay, Yahoo! and Facebook, and analytics companies RocketFuel and Flurry depend on HBase for many mission-critical services.
There are many use cases beyond advertising: Examples include communications (Facebook messages, Xiaomi), security (Trend Micro), measurement (Nielsen), enterprise collaboration (Jive Software), digital media (OCLC), DNA matching (Ancestry.com), and machine data analysis (Box.com). In particular Nielsen uses HBase to track media consumption patterns and trends, mobile handset company Xiaomi uses Hbase for messaging and other consumer mobile services, and OCLC runs the world’s largest online database of library resources on HBase.
Flurry has the largest contiguous HBase cluster: Mobile analytics company Flurry has an HBase cluster with 1,200 nodes (replicating into another 1,200 node cluster). Flurry is planning to significantly expand their large HBase cluster in the near future.
MIT workshop kicks off Obama campaign on privacy
Thrust into controversy by Edward Snowden’s first revelations last year, President Obama belatedly welcomed a “conversation” about privacy. As cynical as you may feel about US spying, that conversation with the federal government has now begun. In particular, the first of three public workshops took place Monday at MIT.
Given the locale, a focus on the technical aspects of privacy was appropriate for this discussion. Speakers cheered about the value of data (invoking the “big data” buzzword often), delineated the trade-offs between accumulating useful data and preserving privacy, and introduced technologies that could analyze encrypted data without revealing facts about individuals. Two more workshops will be held in other cities, one focusing on ethics and the other on law.
Other industries can show health care the way
This article was written with Ellen M. Martin.
Most healthcare clinicians don’t often think about donating or sharing data. Yet, after hearing Stephen Friend of Sage Bionetworks talk about involving citizens and patients in the field of genetic research at StrataRx 2012, I was curious to learn more.
McKinsey points out the 300 billion dollars in potential savings from using open data in healthcare, while a recent IBM Institute of Business Value study showed the need for corporate data collaboration.
Also, during my own research for Big Data in Healthcare: Hype and Hope, the resounding request from all the participants I interviewed was to “find more data streams to analyze.”
Applications get easier to build as packaged combinations of open source tools become available
As a user who tends to mix-and-match many different tools, not having to deal with configuring and assembling a suite of tools is a big win. So I’m really liking the recent trend towards more integrated and packaged solutions. A recent example is the relaunch of Cloudera’s Enterprise Data hub, to include Spark1 and Spark Streaming. Users benefit by gaining automatic access to analytic engines that come with Spark2. Besides simplifying things for data scientists and data engineers, easy access to analytic engines is critical for streamlining the creation of big data applications.
Another recent example is Dendrite3 – an interesting new graph analysis solution from Lab41. It combines Titan (a distributed graph database), GraphLab (for graph analytics), and a front-end that leverages AngularJS, into a Graph exploration and analysis tool for business analysts:
When the death of trust meets the birth of BYOD
Dr. Andrew Litt, Chief Medical Officer at Dell, made a thoughtful blog post last week about the trade-offs inherent in designing for both the security and accessibility of medical data, especially in an era of BYOD (bring your own device) and the IoT (internet of things). As we begin to see more internet-enabled diagnostic and monitoring devices, Litt writes, “The Internet of Things (no matter what you think of the moniker), is related to BYOD in that it could, depending on how hospitals set up their systems, introduce a vast array of new access points to the network. … a very scary thought when you consider the sensitivity of the data that is being transmitted.”
As he went on to describe possible security solutions (e.g., store all data in central servers rather than on local devices), I was reminded of a post my colleague Simon St.Laurent wrote last fall about “security after the death of trust.” In the wake of some high-profile security breaches, including news of NSA activities, St.Laurent says, we have a handful of options when it comes to data security—and you’re not going to like any of them.
We must go beyond hype for incentives to provide data to researchers
The FDA order stopping 23andM3 from offering its genetic test kit strikes right into the heart of the major issue in health care reform: the tension between individual care and collective benefit. Health is not an individual matter. As I will show, we need each other. And beyond narrow regulatory questions, the 23andMe issue opens up the whole goal of information sharing and the funding of health care reform.
Tutorials for designers, data scientists, data engineers, and managers
As the Program Development Director for Strata Santa Clara 2014, I am pleased to announce that the tutorial session descriptions are now live. We’re pleased to offer several day-long immersions including the popular Data Driven Business Day and Hardcore Data Science tracks. We curated these topics as we wanted to appeal to a broad range of attendees including business users and managers, designers, data analysts/scientists, and data engineers. In the coming months we’ll have a series of guest posts from many of the instructors and communities behind the tutorials.
Analytics for Business Users
We’re offering a series of data intensive tutorials for non-programmers. John Foreman will use spreadsheets to demonstrate how data science techniques work step-by-step – a topic that should appeal to those tasked with advanced business analysis. Grammar of Graphics author, SYSTAT creator, and noted Statistician Leland Wilkinson, will teach an introductory course on analytics using an innovative expert system he helped build.
Data Science essentials
Scalding – a Scala API for Cascading – is one of the most popular open source projects in the Hadoop ecosystem. Vitaly Gordon will lead a hands-on tutorial on how to use Scalding to put together effective data processing workflows. Data analysts have long lamented the amount of time they spend on data wrangling. But what if you had access to tools and best practices that would make data wrangling less tedious? That’s exactly the tutorial that distinguished Professors and Trifacta co-founders, Joe Hellerstein and Jeff Heer, are offering.
The co-founders of Datascope Analytics are offering a glimpse into how they help clients identify the appropriate problem or opportunity to focus on by using design thinking (see the recent Datascope/IDEO post on Design Thinking and Data Science). We’re also happy to reprise the popular (Strata Santa Clara 2013) d3.js tutorial by Scott Murray.
By John Russell
When I came to work on the Cloudera Impala project, I found many things that were familiar from my previous experience with relational databases, UNIX systems, and the open source world. Yet other aspects were all new to me. I know from documenting both enterprise software and open source projects that it’s a special challenge when those two aspects converge. A lot of new users come in with 95% of the information they need, but they don’t know where the missing or outdated 5% is. One mistaken assumption or unfamiliar buzzword can make someone feel like a complete beginner. That’s why I was happy to have the opportunity to write this overview article, with room to explore how users from all kinds of backgrounds can understand and start using the Cloudera Impala product.
For database users, the Apache Hadoop ecosystem can feel like a new world:
- Sysadmins don’t bat an eye when you say you want to work on terabytes or petabytes of data.
- A networked cluster of machines isn’t a complicated or scary proposition. Instead, it’s the standard environment you ask an intern to set up on their first day as a training exercise.
- All the related open source projects aren’t an either-or proposition. You work with a dozen components that all interoperate, stringing them together like a UNIX toolchain.
The trend is clear: The CIO’s IT budget is getting smaller and the CMO’s IT budget is getting larger. As a result, the CIO’s role is diminishing and the CMO’s role is expanding. From a business perspective, the shift feels inevitable. Despite talk about transforming corporate IT organizations from cost centers into profit centers, the role of the CIO has remained largely administrative.
In their hearts, most CIOs know the score. They’ve won their battle to earn “a seat at the table,” but the table has gotten smaller. The main challenges ahead of them are technical, not strategic. Their key areas of focus today are mobility, cloud, and security. They are aware of big data, but it’s just not a survival issue for them.