ENTRIES TAGGED "Google"
Two views on new Google Maps; a look at predictive, intelligent apps; and Aaron Swartz's and Kevin Poulsen's anonymous inbox launches.
Google aims for a new level of map customization
Google introduced a new version of Google Maps at Google I/O this week that adapts to individual users, customizing itself based on their clicks and searches. A post on the Google blog outlines the updates, which include recommendations for places you might enjoy (based on your map activity), ratings and reviews, integrated Google Earth, and tours generated from user photos, to name a few.
Big data aids HR, DataKind heads to the U.K., and German regulators fine Google a "paltry" 145,000 euros.
Big data replaces gut instinct in HR management
In a post at the New York Times, Steve Lohr took a look this week at a new data discipline: work-force science. The field pairs big data with human resources to help remove subjectivity and gut instinct from the hiring process and HR management. Lohr notes that in the past, studies conducted to understand worker behavior included a few hundred test subjects at most. Today, they can include thousands of subjects and far more data points. Lohr writes:
“Today, every e-mail, instant message, phone call, line of written code and mouse-click leaves a digital signal. These patterns can now be inexpensively collected and mined for insights into how people work and communicate, potentially opening doors to more efficiency and innovation within companies. Digital technology also makes it possible to conduct and aggregate personality-based assessments, often using online quizzes or games, in far greater detail and numbers than ever before.”
Lohr looks at several companies applying data-driven decision making to HR management. Read more…
Ilya Grigorik's GitHub project shows what happens when questions, data, and tools converge.
1. Ask the question, “I wonder what happens if I do this?” and then follow it all the way through.
2. Start a project on a whim and open it up so anyone can participate.
By day, Grigorik is a developer advocate on Google’s Make the Web Fast team (he’s a perfect candidate for a future Velocity interview). On the side, he likes to track open source projects on GitHub. As he explained during our chat, this can be a time-intensive hobby:
“I follow about 3,000 open source projects, and I try to keep up with what’s going on, what are people contributing to, what are the new interesting sub-branches of work being done … The problem I ran into about six months ago was that, frankly, it was just too much to keep up with. The GitHub timeline was actually overflowing. In order to keep up, I would have to go in every four hours and scan through everything, and then repeat it. That doesn’t give you much time for sleep.” [Discussed 15 seconds into the interview.]
Grigorik built a system — including a newsletter — that lets him stay in the loop efficiently. He worked with GitHub to archive public GitHub activity, then made that data available in raw form and through Google BigQuery (the data is updated hourly).
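Under the hood, the archive is timestamped JSON event records, one event per line, so a first pass at "keeping up" can be a very small script. This is a sketch, not Grigorik's actual pipeline; the sample data is invented, and the field names are assumptions based on the public GitHub event format:

```python
import json

# Hypothetical sample of archived GitHub events (the real archive is
# hourly gzipped JSON, one event object per line).
raw_events = """
{"type": "WatchEvent", "repo": {"name": "igrigorik/ga-beacon"}}
{"type": "PushEvent", "repo": {"name": "torvalds/linux"}}
{"type": "WatchEvent", "repo": {"name": "torvalds/linux"}}
""".strip().splitlines()

def count_events(lines, event_type):
    """Tally events of one type per repository."""
    counts = {}
    for line in lines:
        event = json.loads(line)
        if event["type"] == event_type:
            name = event["repo"]["name"]
            counts[name] = counts.get(name, 0) + 1
    return counts

print(count_events(raw_events, "WatchEvent"))
```

The same tally expressed as a GROUP BY over the hosted dataset is what BigQuery handles at full-archive scale.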
This is a fun project, no doubt, but it’s also a big deal. Here’s why: When you shorten the distance between questions and answers, you empower people to ask more questions. It’s the liberation of curiosity, and that’s exactly what happened here. Read more…
Here are a few stories from the data space that caught my attention this week.
Presidential candidates are mining your data
Data is playing an unprecedented role in the US presidential election this year. The two presidential campaigns have access to personal voter data “at a scale never before imagined,” reports Charles Duhigg at the New York Times. The candidate camps are using personal data in polling calls, accessing such details as “whether voters may have visited pornography Web sites, have homes in foreclosure, are more prone to drink Michelob Ultra than Corona or have gay friends or enjoy expensive vacations,” Duhigg writes. He reports that both campaigns emphasized they were committed to protecting voter privacy, but notes:
“Officials for both campaigns acknowledge that many of their consultants and vendors draw data from an array of sources — including some the campaigns themselves have not fully scrutinized.”
A Romney campaign official told Duhigg: “You don’t want your analytical efforts to be obvious because voters get creeped out. A lot of what we’re doing is behind the scenes.”
The “behind the scenes” may be enough in itself to creep people out. These sorts of situations are starting to tarnish the image of the consumer data-mining industry, and a Manhattan trade group, the Direct Marketing Association, is launching a public relations campaign — the “Data-Driven Marketing Institute” — to smooth things over before government regulators get involved. Natasha Singer reports at the New York Times:
“According to a statement, the trade group intends to promote such targeted marketing to lawmakers and the public ‘with the goal of preventing needless regulation or enforcement that could severely hamper consumer marketing and stifle innovation’ as well as ‘tamping down unfavorable media attention.’ As part of the campaign, the group plans to finance academic research into the industry’s economic impact, said Linda A. Woolley, the acting chief executive of the Direct Marketing Association.”
One of the biggest issues, Singer notes, is that people want control over their data. Chuck Teller, founder of Catalog Choice, told Singer that in a recent survey conducted by his company, 67% of people responded that they wanted to see the data collected about them by data brokers and 78% said they wanted the ability to opt out of the sale and distribution of that data.
Obstacles for big data, big data intelligence, and a privacy plugin puts Google and Facebook settings in the spotlight.
Here are a few stories from the data space that caught my attention this week.
Big obstacles for big data
For the latest issue of Foreign Policy, Uri Friedman put together a brief history of big data to show “[h]ow we arrived at a term to describe the potential and peril of today’s data deluge.” A couple of months ago, MIT’s Alex “Sandy” Pentland took a look at some of that big data potential for Harvard Business Review; this week, he looked at some of the perilous aspects. Pentland writes that to be realistic about big data, it’s important to look not only at its promise, but also its obstacles. He identifies the problem of finding meaningful correlations as one of big data’s biggest obstacles:
“When your volume of data is massive, virtually any problem you tackle will generate a wealth of ‘statistically significant’ answers. Correlations abound with Big Data, but inevitably most of these are not useful connections. For instance, your Big Data set may tell you that on Mondays, people who drive to work rather than take public transportation are more likely to get the flu. Sounds interesting, and traditional research methods show that it’s factually true. Jackpot!
“But why is it true? Is it causal? Is it just an accident? You don’t know. This means, strangely, that the scientific method as we normally use it no longer works, because there are so many possible relationships to consider that many are bound to be ‘statistically significant’. As a consequence, the standard laboratory-based question-and-answering process — the method that we have used to build systems for centuries — begins to fall apart.”
Pentland says that big data is going to push us out of our comfort zone, requiring us to conduct experiments in the real world — outside our familiar laboratories — and change the way we test the causality of connections. He also addresses the issues of understanding those correlations well enough to put them to use, knowing who owns the data and forging new types of collaborations to use it, and how putting individuals in charge of their own data helps address big data privacy concerns. This piece and Pentland’s earlier post on big data’s potential are this week’s recommended reads.
Did Google just prove the industry wrong? Early thoughts on the Spanner database.
In case you missed it, Google Research published another one of “those” significant research papers — a paper like the BigTable paper from 2006 that had ramifications for the entire industry (that paper was one of the opening volleys in the NoSQL movement).
Google’s new paper describes a distributed relational database called Spanner, following up on a presentation from earlier in the year about F1, a new database for AdWords. If you recall, that presentation revealed Google’s migration of AdWords from MySQL to a new database that supported SQL and hierarchical schemas — two ideas that buck the industry’s drift away from relational databases.
This new database, Spanner, is unlike anything we’ve seen: it embraces ACID, SQL, and transactions, and it can be distributed across thousands of nodes spanning multiple data centers in multiple regions. The paper dwells on two main features that define this database:
- Schematized Semi-relational Tables — A hierarchical approach to grouping tables that allows Spanner to co-locate related data into directories that can be easily stored, replicated, locked, and managed on what Google calls spanservers. They have a modified SQL syntax that allows for the data to be interleaved, and the paper mentions some changes to support columns encoded with Protobufs.
- “Reification of Clock Uncertainty” — This is the real emphasis of the paper. The missing link in relational database scalability was strong coordination backed by a serious effort to minimize time uncertainty, and in Google’s new global-scale database, the variable that matters is epsilon: time uncertainty. By keeping epsilon very low, Google has built a system whose distributed transactions are bound only by network distance (measured in milliseconds) and time uncertainty. The paper reports just 14ms of overhead introduced by Spanner for data centers at 1ms network distance, for read-write (RW) transactions that span the U.S. East Coast and U.S. West Coast (data centers separated by around 2ms of network time).
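The mechanism behind those numbers can be sketched in a few lines: a transaction picks the latest possible current time as its commit timestamp, then waits until the earliest possible current time has definitely passed it — roughly 2 × epsilon. This is a toy illustration of the "commit wait" idea, not Spanner's TrueTime API, and the epsilon value here is an assumption (chosen so 2 × epsilon matches the 14ms figure above):

```python
import time

EPSILON = 0.007  # assumed clock-uncertainty bound, in seconds (7 ms)

def tt_now():
    """TrueTime-style interval: real time lies within [earliest, latest]."""
    t = time.monotonic()
    return (t - EPSILON, t + EPSILON)

def commit_wait():
    """Assign a commit timestamp, then wait out the uncertainty.

    After the wait, every correct clock reads later than the chosen
    timestamp, so timestamp order is guaranteed to match real order.
    """
    s = tt_now()[1]            # latest possible current time
    while tt_now()[0] <= s:    # spin until earliest possible time passes s
        time.sleep(0.001)
    return s

start = time.monotonic()
commit_wait()
elapsed = time.monotonic() - start
print(f"waited {elapsed * 1000:.1f} ms")  # on the order of 2 * EPSILON
```

The design trade is explicit: the better your clocks (smaller epsilon), the shorter every read-write transaction has to wait.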
Google shows off its Knowledge, Yahoo stumbles, and a bill cuts some census funding.
In this week's data news, Google updates its search features with a Knowledge Graph, while the U.S. House of Representatives de-funds surveys that helped businesses construct theirs.
BigQuery for all, a new resource for data journalists, open data is challenged.
In this week's data news, Google's BigQuery opens up to everyone, the Data Journalism Handbook is released, and the open data movement is taken to task.
Google's Kathryn Dekas on how a data-driven mindset applies to human resources.
Google's people analytics manager Kathryn Dekas discusses the ways in which human resources departments can use data for the benefit of both employers and employees.