A much-needed break from data transparency and privacy issues
I could have focused on the Governments’ Search for Google Data visualization from Chris Canipe and Madeline Farbman of the Wall Street Journal. Or I could have focused on Neal Ungerleider’s piece covering Eric Fisher and MapBox’s visualizations of Gnip’s Twitter metadata. Yet my curiosity took over once I came across The Economist’s High Spirits graphic. Not only do I make my own bitters, which qualifies me for preliminary booze nerd status, I also needed a brief break from the transparency issues currently dominating data-oriented conversations. Following my booze nerd curiosity led me to this interactive data visualization of common cocktail ingredients:
New open source tools for interactive SQL analysis, model development, and deployment
When Hadoop users need to develop apps that are “latency sensitive,” many of them turn to HBase. Its tight integration with Hadoop makes it a popular data store for real-time applications. When I attended the first HBase conference last year, I was pleasantly surprised by the diversity of companies and applications that rely on HBase. This year’s conference was even bigger, and I ran into attendees from a wide range of companies. Another set of interesting real-world case studies was showcased, along with sessions highlighting the HBase team’s work on improving usability, reliability, and availability (bringing down mean time to recovery has been a recent area of focus).
HBase has a reputation for being a bit difficult to use – its core users have been data engineers, not data scientists. The good news is that as HBase gets adopted by more companies, tools are being developed to open it up to a wider audience. Let me highlight some tools that will appeal to data scientists.
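To give a taste of how approachable HBase can be from a scripting environment, here is a minimal sketch using happybase, an open source Python client that talks to HBase through its Thrift gateway. The host, table, and column names are hypothetical:

```python
# A minimal sketch of reading and writing HBase from Python with the
# open source happybase client (which connects through HBase's Thrift
# gateway). The host, table, and column names here are hypothetical.
import happybase

connection = happybase.Connection('hbase-thrift-host')
table = connection.table('user_events')

# HBase stores values as raw bytes under column-family:qualifier keys.
table.put(b'user42|2013-06-14', {b'event:type': b'login',
                                 b'event:source': b'web'})

# Scan every row for one user by row-key prefix -- no SQL required.
for row_key, columns in table.scan(row_prefix=b'user42|'):
    print(row_key, columns[b'event:type'])

connection.close()
```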
Response to NSA data mining and the troubling lack of technical details, Facebook's Open Compute data center, and local police growing their own DNA databases.
It’s a question of power, not privacy — and what is the NSA really doing?

In the wake of the leaked NSA data-collection programs, the Pew Research Center conducted a national survey to measure Americans’ response. The survey found that 56% of respondents think the NSA’s telephone record tracking program is an acceptable method to investigate terrorism, and 62% said the government’s investigations into possible terrorist threats are more important than personal privacy.
Rebecca J. Rosen at The Atlantic took a look at legal scholar Daniel J. Solove’s argument that we should care about the government’s collection of our data, but not for the reasons one might think — the collection itself, he argues, isn’t as troubling as the fact that they’re holding the data in perpetuity and that we don’t have access to it. Rosen quotes Solove:
“The NSA program involves a massive database of information that individuals cannot access. … This kind of information processing, which forbids people’s knowledge or involvement, resembles in some ways a kind of due process problem. It is a structural problem involving the way people are treated by government institutions. Moreover, it creates a power imbalance between individuals and the government. … This issue is not about whether the information gathered is something people want to hide, but rather about the power and the structure of government.”
Notes and links from the data journalism beat
Data journalism is becoming a truly global practice. Data journalists from the UK, China, and the US are sharing data-oriented best practices, insights, and tools. Journalists in Latin America are meeting this week to push for more transparency and access to data in the region. At the same time, recent revelations about NSA domestic surveillance programs have pushed big data stories to the front pages of US papers. Here are a few links from the past week:
Transparency…or Lack Thereof
- OpenData Latinoamérica: Driving the demand side of data and scraping towards transparency (Nieman Journalism Lab)
“There’s a saying here, and I’ll translate, because it’s very much how we work,” Miguel Paz said to me over a Skype call from Chile. “But that doesn’t mean that it’s illegal. Here, it’s ‘It’s better to ask forgiveness than to ask permission.’” Paz is a veteran of the digital news business. The saying has to do with his approach to scraping public data from governments that may be slow to share it.
- The real story in the NSA scandal is the collapse of journalism (zdnet.com)
On Thursday, June 6, the Washington Post published a bombshell of a story, alleging that nine giants of the tech industry had “knowingly participated” in a widespread program by the United States National Security Agency (NSA). One day later, with no acknowledgment except for a change in the timestamp, the Post revised the story, backing down from the sensational claims it originally made. But the damage was already done.
- We are shocked, shocked… (davidsimon.com)
Having labored as a police reporter in the days before the Patriot Act, I can assure all there has always been a stage before the wiretap, a preliminary process involving the capture, retention and analysis of raw data. It has been so for decades now in this country. The only thing new here, from a legal standpoint, is the scale on which the FBI and NSA are apparently attempting to cull anti-terrorism leads from that data. But the legal and moral principles? Same old stuff.
- Big Data Has Big Stage at Personal Democracy Forum (pbs.org)
Engaging News Project’s Talia Stroud tackled the issue of public engagement in news organizations. Polls on websites don’t yield scientifically accurate results, nor do they get people to address difficult issues, she said. “These data are junk. We know they’re junk,” Stroud said. “City council representatives know they’re junk. Even news organizations know that the results of these data are junk. The only reason that this poll is being included on the news organization’s site is to increase interactivity and increase your time on page.”
Oliver O'Brien has visualized real-time bike share use not only in NYC, but in cities around the world.
New York City’s new bike-share program, Citi Bike, has been underway for a couple of weeks now. Its level of success is still up for debate, but the stats are impressive: as of June 10, there had been 173,516 trips covering 510,782 miles since the launch. Oliver O’Brien, a researcher and software developer at the Centre for Advanced Spatial Analysis (CASA) and a contributor to OpenStreetMap, has developed a visualization of bike share use in real time.
A Conversation with the Founder of Neo4j, Emil Eifrem
Emil Eifrem (@emileifrem) is the founder of Neo4j and CEO of Neo Technology. He is also one of the authors of Graph Databases. Recently, I had the opportunity to sit down with Emil to talk about the current and future opportunities for graph databases.
Key highlights include:
- Emil explains graph databases [Discussed at 0:29] (see the query sketch after this list)
- Facebook Graph Search is a well-known example of a graph database [Discussed at 3:28]
- But really, graph databases can be used for much more than social search [Discussed at 4:50]
- Neo4j, the original graph database [Discussed at 5:25]
- Graph databases ‘shape’ data [Discussed at 6:20]
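To make the data model concrete, here is a minimal sketch of creating and querying a small graph with Neo4j’s Cypher query language, using the official neo4j Python driver. The connection URI, credentials, and sample data are hypothetical:

```python
# A minimal sketch of the graph model Emil describes: data stored as
# nodes and relationships, queried by pattern matching in Cypher. Uses
# the official neo4j Python driver; the URI, credentials, and sample
# data are hypothetical.
from neo4j import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687',
                              auth=('neo4j', 'password'))

with driver.session() as session:
    # Create two people and a relationship between them.
    session.run(
        "CREATE (a:Person {name: $a})-[:KNOWS]->(b:Person {name: $b})",
        a='Alice', b='Bob')

    # Pattern-match the graph: whom does Alice know?
    result = session.run(
        "MATCH (:Person {name: 'Alice'})-[:KNOWS]->(friend:Person) "
        "RETURN friend.name AS name")
    for record in result:
        print(record['name'])

driver.close()
```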
You can view the full interview here:
Report from 2013 Health Privacy Summit
The timing was superb for last week’s Health Privacy Summit, held on June 5 and 6 in Washington, DC. First, it immediately followed the 2,000-strong Health Data Forum (Health Datapalooza), where concern for patients’ rights came up repeatedly. Second, scandals about US government spying were breaking out, providing a good backdrop for talking about protecting our most sensitive personal information: our health data.
The health privacy summit, now in its third year, provides a crucial spotlight on the worries patients and their doctors have about their data. Did you know that two out of three doctors (and probably more; this statistic counts just the ones who admit to it on a survey) have left data out of a patient’s record upon the patient’s request? I have found that the summit offers the most sophisticated and realistic assessment of data protection in health care available, which is why I look forward to it each year. (I’m also on the planning committee for the summit.) For instance, it took a harder look than most observers at how health care would be affected by patient access to data, and at the practice of sharing selected subsets of data, called segmentation.
What effect would patient access have?
An odd perceptual discontinuity exists around patient access to health records. If you go to your doctor and ask to see your records, chances are you will be turned down outright or forced to go through expensive and frustrating hurdles. One wouldn’t know that HIPAA has long explicitly required doctors to give patients their data, or that the most recent meaningful use rules from the Department of Health and Human Services require doctors to let patients view, download, and transmit their information within four business days of its addition to the record.
It's not the data itself but what you do with it that counts.
This post originally appeared on Cumulus Partners. It’s republished with permission.
Quentin Hardy’s recent post in the Bits blog of The New York Times touched on the gap between representation and reality that is a core element of practically every human enterprise. His post is titled “Why Big Data is Not Truth,” and I recommend it for anyone who feels like joining the phony argument over whether “big data” represents reality better than traditional data.
In a nutshell, this “us” versus “them” approach is like trying to pick a fight between oil painters and watercolorists. Neither oil painting nor watercolor is “truth”; both are forms of representation. And here’s the important part: representation is exactly that — a representation or interpretation of someone’s perceived reality. Pitting “big data” against traditional data is like asking whether Rembrandt is more “real” than Gainsborough. Both were artists, and both painted representations of the world they perceived around them.
Analytic engines on top of Hadoop simplify the creation of interesting, low-cost, scalable applications
Hadoop’s low-cost, scale-out architecture has made it a new platform for data storage. With a storage system in place, the Hadoop community is slowly building a collection of open source, analytic engines. Beginning with batch processing (MapReduce, Pig, Hive), Cloudera has added interactive SQL (Impala), analytics (Cloudera ML + a partnership with SAS), and as of early this week, real-time search. The economics that let Hadoop dominate batch processing are permeating other types of analytics.
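To illustrate what interactive SQL on Hadoop looks like in practice, here is a minimal sketch using impyla, an open source Python client for Impala. The host and table name are hypothetical:

```python
# A minimal sketch of interactive SQL over data in Hadoop via Impala,
# using the open source impyla client. The host and the web_logs table
# are hypothetical; port 21050 is Impala's default query port.
from impala.dbapi import connect

conn = connect(host='impala-daemon-host', port=21050)
cursor = conn.cursor()

# A query like this would run as a batch MapReduce job under Hive;
# Impala executes it interactively against the same HDFS data.
cursor.execute("""
    SELECT referrer, COUNT(*) AS hits
    FROM web_logs
    GROUP BY referrer
    ORDER BY hits DESC
    LIMIT 10
""")
for referrer, hits in cursor.fetchall():
    print(referrer, hits)

conn.close()
```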
Another collection of open source, Hadoop-compatible analytic engines, the Berkeley Data Analytics Stack (BDAS), is being built just across the San Francisco Bay. Starting with a batch-processing framework that’s faster than MapReduce (Spark), it now includes interactive SQL (Shark) and real-time analytics (Spark Streaming). Sometime this summer, frameworks for machine learning (MLbase) and graph analytics (GraphX) will be released. A cluster manager (Mesos) and an in-memory file system (Tachyon) allow users of other analytic frameworks to leverage the BDAS platform. (The Python data community is looking closely at Tachyon.)
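A large part of Spark’s speed advantage over MapReduce comes from keeping working data in cluster memory across operations. Here is a minimal PySpark sketch of that pattern; the HDFS path is hypothetical:

```python
# A minimal PySpark sketch of Spark's core advantage over MapReduce for
# iterative and interactive work: cache a dataset in memory once, then
# run repeated queries without re-reading it from disk. The HDFS path
# is hypothetical.
from pyspark import SparkContext

sc = SparkContext('local[*]', 'log-exploration')
logs = sc.textFile('hdfs:///data/web_logs.txt').cache()  # keep in memory

# Each action below reuses the cached RDD instead of rescanning HDFS.
errors = logs.filter(lambda line: 'ERROR' in line)
print(errors.count())
print(errors.filter(lambda line: 'timeout' in line).count())

sc.stop()
```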
Humans as nodes, pills and electronic tattoo password authenticators, NSA surveillance leaks, and hiding data in temporal cloaks.
Collaborative sensor networks of humans, and your body may be the next two-factor authenticator
There has been much coverage recently of the Internet of Things, which connects everything from washers and dryers to thermostats and cars to the Internet. Wearable sensors — things like Fitbit and health-care-related sensors that can be printed onto fabric or even onto human skin — are also in the spotlight.
Kevin Fitchard reports at GigaOm that researchers at CEA-Leti and three French universities believe these areas are not mutually exclusive and have launched a project around wireless body area networks called CORMORAN. The group believes that one day soon our bodies will be constantly connected to the Internet via sensors and transmitters that “can be used to form cooperative ad hoc networks that could be used for group indoor navigation, crowd-motion capture, health monitoring on a massive scale and especially collaborative communications,” Fitchard writes. He takes a look at some of the benefits and potential applications of such a collaborative network — location-based services would be able to direct users to proper gates or trains in busy airports and train stations, for instance — and some of the pitfalls, such as potential security and privacy issues. You can read his full report at GigaOm.
In related news, wearable sensors — and even our bodies — may be used not only to connect us to a network, but also to identify us.