ENTRIES TAGGED "Strata SC 2013"
Strata Community Profile on Amy Heineike, Director of Mathematics
According to Amy Heineike, the Director of Mathematics at Quid, there’s nothing like having a fresh dataset in R and knowing how to use it. “You can add a few lines of code and discover all kinds of interesting information,” Heineike says. “One question leads to another, you get into a flow, and you can have an amazing exploration.”
Heineike started working with data several years ago at a consultancy in London, where “playing around” with data shed light on the impact of social networks on government policies. Part of her job was figuring out what types of data to use in order to find solutions to crucial problems, from public transportation to obesity. Her day-to-day work at Quid entails working with new data sets, prototyping analytics, and collaborating with an engineering team to improve data analysis and bring products into production.
Sneak Peek at Upcoming Session at Strata Santa Clara 2013
By Robert Munro
Plain text is the world’s largest source of digital information. As the amount of unstructured text grows, so does the percentage of text that is not in English. The majority of the world’s data is now unstructured text outside of English. So unless you’re an exceptional polyglot, you can’t understand most of what’s out there, even if you want to.
Language technologies underlie many of our daily activities. Search engines, spam filtering, and news personalization (including your social media feeds) all employ smart, adaptive knowledge of how we communicate. We can automate many of these tasks well, but there are places where we fall short. For example, the world’s most spoken language, Mandarin Chinese, is typically written without spaces. “解放大道” can mean “Liberation Avenue” or “Solution Enlarged Road” depending on where you interpret the gaps. It’s a kind of ambiguity that we only need to worry about in English when we’re registering domain names and inventing hashtags (something the folk at “Who Represents” didn’t worry about enough). For Chinese, we still don’t get it right with automated systems: the best systems get an error every 20 words or so. We face similar problems for about a quarter of the world’s data. We can’t even reliably tell you what the words are, let alone extract complex information at scale.
Preview of upcoming session at Strata Santa Clara
Is your organization considering embracing data science? If so, we would like to give you some helpful advice on organizational and technical issues to consider before you embark on any initiatives or consider hiring data scientists. Join us, Sean Murphy and Marck Vaisman, two Washington, D.C. based data scientists and founding members of Data Community DC, as we walk you through the trials and tribulations of practicing data scientists at our upcoming talk at Strata.
We will discuss anecdotes and best practices, and finish by presenting the results of a survey we conducted last year to help understand the varieties of people, skills, and experiences that fall under the broad term of “Data Scientist”. We analyzed data from over 250 survey respondents, and are excited to share our findings, which will also be published soon by O’Reilly.
Preview of upcoming Strata session on data exploration
Amy Heineike is Director of Mathematics for Quid Inc, where she has been since its inception, prototyping and launching the company’s technology for analyzing document sets. Below is the teaser for her upcoming talk at Strata Santa Clara.
I recently discovered that my favorite map is online. It used to hang on my housemate’s wall in our little house in London back in 2005. At the time I was working to understand how London was evolving and changing, and how different policy or infrastructure changes (a new tube line, land use policy changes) would impact that.
The map was originally published as a center-page pull out from the Guardian, showing the ethnic groups that dominate different neighborhoods across the city. The legend was as long as the image, and the small print labels necessitated standing up close, peering and reading, tracing your finger to discover the Congolese on the West Green Road, our neighbors the Portuguese on the Stockwell Road, or the Tamils in Chessington in the distant south west.
We are simply not good at playing with others when it comes to data
Russia’s railway gauge is different from Western Europe’s. At the border of the former Soviet states, the Russian gauge of 1.524m meets the European & American ‘Standard’ gauge of 1.435m. The reasons for this literal disconnect arise from discussions between the Tsar and his War Minister. When asked the most effective way to prevent Russia’s own rail lines being used against them in times of invasion, the Minister suggested a different gauge to prevent supply trains rolling through the border. The artifact of this decision remains visible today at all rail crossings between Poland and Belarus or Slovakia and Ukraine. The rail cars are jacked up at the border, new wheels inserted underneath, and the car lowered again. It is about a 2-4 hour time burn for each crossing.
Per head, per crossing, over 170 years, is a heck of a lot of resource wasted. But to change it would entail changing the rail stock of the entire country and realigning about 225,000 km (140,000 mi) of track.
Talk about technical debt.
Data suffers from a similar disconnect. It really wasn’t until the advent of XML 15 years ago that we had an agreed (but not entirely satisfactory) mechanism for storing arbitrary data structures outside the application layer. This is as much a commentary on our technical priorities as it is a social indictment. We are simply not good at playing with others when it comes to data.
A Call for Industry-Standard Benchmarks for Big Data Platforms at Strata SC 2013
Big data systems are characterized by their flexibility in processing diverse data genres, such as transaction logs, connection graphs, and natural language text, with algorithms characterized by multiple communication patterns, e.g. scatter-gather, broadcast, multicast, pipelines, and bulk-synchronous. A single benchmark that characterizes a single workload could not be representative of such a multitude of use-cases. However, our systematic study of several use-cases of current big data platforms indicates that most workloads are composed of a common set of stages, which capture the variety of data genres and algorithms commonly used to implement most data-intensive end-to-end workloads. Our upcoming session at Strata SC discusses the BigData Top 100 List, a new community-based initiative for benchmarking big data systems.
Tips for interacting with analytics colleagues
To quote Pride and Prejudice, businesses have for many years “labored under the misapprehension” that their analytics talent was made up of misanthropes with neither the will nor the ability to communicate or work with others on strategic or creative business problems. These employees were meant to be kept in the basement out of sight, fed bad pizza, and pumped for spreadsheets to be interpreted in the sunny offices aboveground.
This perception is changing in industry as the big data phenomenon has elevated data science to a C-level priority. Suddenly folks once stereotyped by characters like Milton in Office Space are now “sexy.” The truth is there have always been well-rounded, articulate, friendly analytics professionals (they may just like Battlestar more than you), and now that analytics is an essential business function, personalities of all types are being attracted to practice the discipline.
Preview of Strata Santa Clara 2013 Session
The 2013 Strata Conference in Santa Clara, CA will be my fifth Strata conference. As always, I’m excited to join so many leaders in the data and data viz communities, and I’m honored that I’ll be speaking there.
I will be presenting my tutorial “Communicating Data Clearly” at 9AM on Tuesday, February 26. This talk will cover methods and principles of creating effective graphs, to ensure they are clear, accurate, and make it easier to understand the data. It will also emphasize how to avoid common graphical mistakes. To give you a preview of a few of the topics I will be covering as well as to provide some information to those who cannot attend, I will now link to some of the blog posts I‘ve written for Forbes. I was invited to blog for Forbes at a New York Strata Conference in 2011 so that my relationships with Forbes and Strata are intertwined.
Preview of an upcoming tutorial at Strata Santa Clara 2013
This month at Strata, the U.C. Berkeley AMPLab will be running a full day of big data tutorials.In this post, we present the motivation and vision for the Berkeley Data Analytics Stack (BDAS), and an overview of several BDAS components that we released over the past two years, including Mesos, Spark, Spark Streaming, and Shark.
While batch processing systems like Hadoop MapReduce paved the way for organizations to ask questions about big datasets, they represent only the beginning of what users need to do with big data. More and more, users wish to move from periodically building reports about datasets to continuously using new data to make informed business decisions in real-time. Achieving these goals imposes three key requirements on big data processing:
- Low latency queries: Interactive ad-hoc queries allows data scientists to find valuable inferences faster, or explore a larger solution space to make better decisions. Furthermore, there is an increasing need for stream processing, as this allows organizations to make decisions in real-time, such as detecting an SLA violation and fixing the problem before the users notice, or deciding what ads to show based on user’s live tweets.
- Sophisticated analysis: People are increasingly looking to use new state of art algorithms, such as predictive machine learning algorithms, to make better forecasts and decisions.
- Unification of existing data computation models: Users want to integrate interactive queries, batch, and streaming processing to handle the ever increasing requirements of their processing pipelines. For example, detecting anomalies in user behavior may require (1) stream processing to compare the behavior of users in real-time across different segments (e.g., genre, ages, location, device), (2) interactive queries to detect differences in user’s daily (or weekly) behavior, and (3) batch processing to build sophisticated predictive models.
In response to the above requirements, more than three years ago we began building BDAS.