ENTRIES TAGGED "data mining"

The 0th Law of Data Mining

Preview of The Laws of Data Mining Session at Strata Santa Clara 2013

Many years ago I was taught about the three laws of thermodynamics. When that didn’t stick, I was taught a quick way to remember them, originally identified by C.P. Snow:

  • 1st Law: you can’t win
  • 2nd Law: you can’t draw
  • 3rd Law: you can’t get out of the game

These laws (well, the real ones) were firmly established by the mid-19th century. Yet, it wasn’t until the 1930s that the value of the 0th law was identified.

At Strata I’m going to be talking about the 9 Laws of Data Mining – a set of principles identified by Tom Khabaza and very closely related to the CRISP-DM data mining methodology.

They may possibly, just possibly, not be as important as the laws of thermodynamics, but at Strata they will be supported by an equally important 0th Law.

Read more…

Comment |

Strata Week: Political data mining “bait-and-switch”

Inaugural 2013 app has plans for your data, the "unprecedented" security issues of the Internet of Things, and optical switches speed up data centers.

Here are a few stories from the data space that caught my attention this week.

Inaugural 2013 app takes as much as it gives

The Presidential Inaugural Committee (PIC) launched the first official inaugural smartphone app, Inaugural 2013 (for iOS and for Android), Monday. Daniel Strauss reports in a post at The Hill that inauguration attendees can use the app to locate and RSVP to events, watch events via livestream, and navigate the event with an interactive map.

What isn’t front and center in the pomp and circumstance of the shiny new app are the terms of service and the privacy statement. Steve Friess at Politico points out that in the fine print, users are giving the PIC permission to share their data — phone numbers, email, home addresses, and GPS location data, for instance — “with candidates, organizations, groups or causes that [the PIC] believe have similar political viewpoints, principles or objectives.”

Gregory Ferenstein reports at TechCrunch that “privacy advocates find it troubling that the fine-print on the PIC’s website says it can use activity data ‘without limitation in advertising, fundraising and other communications in support of PIC and the principles of the Democratic party, without any right of compensation or attribution.’”

Read more…

Comments: 2 |

Strata Week: Big data’s big future

Big data in 2013, and beyond; the Sunlight Foundation's new data mining app; and the growth of our planet's central nervous system.

Here are a few stories from the data space that caught my attention this week.

Big data will continue to be a big deal

“Big data” became something of a buzz phrase in 2012, with its role in the US Presidential election, and businesses large and small starting to realize the benefits and challenges of mountains upon zettabytes of data — so much so that NPR’s linguist contributor Geoff Nunberg thinks it should have been the phrase of the year.

Nunberg says that though “it didn’t get the wide public exposure given to items like ‘frankenstorm,’ ‘fiscal cliff’ and YOLO,” and might not have been “as familiar to many people as ‘Etch A Sketch’ and ‘47 percent’” were during the election, big data has become a phenomenon affecting our lives: “It’s responsible for a lot of our anxieties about intrusions on our privacy, whether from the government’s anti-terrorist data sweeps or the ads that track us as we wander around the Web.” He also notes that big data has transformed statistics into “a sexy major” and predicts the term will long outlast “Gangnam Style.” (You can read Nunberg’s full case for big data at NPR.)

Read more…

Comment |

Strata Week: Big data’s daily influence

Big data's broad effect on our world, myriad uses for traffic data, and Obama's big data practice vs. policy.

Here are a few stories from the data space that caught my attention this week.

How big data is transforming just about everything

Professor John Naughton took a look this week at how big data is transforming various industries that affect our daily lives.

He highlights finance, of course, which he says has been “pathologically mathematised;” marketing, for which there is more data about human behavior than we’ve ever had; and the very broad category of science. Naughton notes that researchers used to conjure up theories and look to data to support or refute them; now, researchers turn to data to find patterns and connections that might inspire new theories. Naughton also looks at medicine, which is just on the brink of delving into the big data realm. He writes:

“Last week’s news about how Cambridge researchers stopped an MRSA outbreak affecting 12 babies in the Rosie Hospital by rapidly sequencing the genome of the bacteria illustrates how medicine has become a data-intensive field. Even a few years ago, the resources required to achieve this would have involved a roomful of computers and upwards of a week.”

Naughton addresses the use of big data in sports as well, speculating that baseball has been the sport most transformed by data. He’ll likely find agreement there. Barry Eggers goes into depth on the dramatic effect big data is having on baseball over at TechCrunch. He notes that simple data analysis of statistics, which baseball has embraced since its beginnings, has evolved into gathering mountains of unstructured data and employing Hadoop to gain new and better insights from data that isn’t part of the structured game information. Eggers writes:

“By having his data scientist run a Hadoop job before every game, [San Francisco Giants manager] Bruce Bochy can not only make an informed decision about where to locate a 3-1 Matt Cain pitch to Prince Fielder, but he can also predict how and where the ball might be hit, how much ground his infielders and outfielders can cover on such a hit, and thus determine where to shift his defense. Taken one step further, it’s not hard to imagine a day where managers like Bochy have their locker room data scientist run real-time, in-game analytics using technologies like Cassandra, Hbase, Drill, and Impala.”

Read more…

Comment |

Strata Week: Data mining for votes

Candidates are data mining behind the scenes, data mining gets a PR campaign, Google faces privacy policy issues, and Hadoop and BI.

Here are a few stories from the data space that caught my attention this week.

Presidential candidates are mining your data

Data is playing an unprecedented role in the US presidential election this year. The two presidential campaigns have access to personal voter data “at a scale never before imagined,” reports Charles Duhigg at the New York Times. The candidate camps are using personal data in polling calls, accessing such details as “whether voters may have visited pornography Web sites, have homes in foreclosure, are more prone to drink Michelob Ultra than Corona or have gay friends or enjoy expensive vacations,” Duhigg writes. He reports that both campaigns emphasized they were committed to protecting voter privacy, but notes:

“Officials for both campaigns acknowledge that many of their consultants and vendors draw data from an array of sources — including some the campaigns themselves have not fully scrutinized.”

A Romney campaign official told Duhigg: “You don’t want your analytical efforts to be obvious because voters get creeped out. A lot of what we’re doing is behind the scenes.”

The “behind the scenes” may be enough in itself to creep people out. These sorts of situations are starting to tarnish the image of the consumer data-mining industry, and a Manhattan trade group, the Direct Marketing Association, is launching a public relations campaign — the “Data-Driven Marketing Institute” — to smooth things over before government regulators get involved. Natasha Singer reports at the New York Times:

“According to a statement, the trade group intends to promote such targeted marketing to lawmakers and the public ‘with the goal of preventing needless regulation or enforcement that could severely hamper consumer marketing and stifle innovation’ as well as ‘tamping down unfavorable media attention.’ As part of the campaign, the group plans to finance academic research into the industry’s economic impact, said Linda A. Woolley, the acting chief executive of the Direct Marketing Association.”

One of the biggest issues, Singer notes, is that people want control over their data. Chuck Teller, founder of Catalog Choice, told Singer that in a recent survey conducted by his company, 67% of people responded that they wanted to see the data collected about them by data brokers and 78% said they wanted the ability to opt out of the sale and distribution of that data.

Read more…

Comment |

Unstructured data is worth the effort when you’ve got the right tools

Alyona Medelyan and Anna Divoli on the opportunities in chaotic data.

Alyona Medelyan and Anna Divoli are inventing tools to help companies contend with vast quantities of fuzzy data. They discuss their work and what lies ahead for big data in this interview.

Comment |

Demoting Halder: A wild look at social tracking and sentiment analysis

You no longer have control over where a first impression occurs.

My short story, "Demoting Halder," was supposed to lay out an alternative reality where social tracking and sentiment analysis had taken over society. As the story evolved, I wondered if the reality in the story is something we're living right now.

Comments: 2 |

If your data practices were made public, would you be nervous?

Solon Barocas on data mining's reputation and the ethics of data collection.

Solon Barocas, a doctoral student at New York University, discusses consumer perceptions of data mining and how companies and data scientists can shape data mining's reputation.

Comment |

Strata Week: Overcharging algorithms

Algorithms go awry on Amazon, the future of Hadoop at Yahoo, and the Supreme Court mulls data mining.

In this Strata Week: Algorithm pricing on Amazon pushes the price of a biology book to astronomical levels, Yahoo weighs the future of Hadoop, and the Supreme Court hears arguments about a Vermont law restricting the data mining of prescription records.

Comment: 1 |

Strata gems: What your inbox knows

Mining implicit data trails makes CRM more effective.

One of the richest sources of data exhaust, email logs contain valuable information. When added to data from a traditional CRM, email analytics can provide a much fuller picture of your company's relationships and activity.

Comment: 1 |