ENTRIES TAGGED "strataconf"
Sneak peek at my upcoming session at the Strata Conference in Santa Clara
Visualizing data and extracting it from its data store are two activities that go hand in hand. Typically, when you try to use a data visualization toolkit such as Raphael, Protovis, or D3 to create a non-trivial visualization, you spend a significant portion of your time writing code to extract the data. The process may involve querying an external database and then transforming the resulting data into the correct structure for your visualization.
In his paper introducing plyr, a data manipulation toolkit for R, Hadley Wickham describes a framework, split-apply-combine, for expressing common data operations. The idea is that most data operations can be expressed as splitting the data into a series of buckets, applying an aggregation to each bucket, and then combining the per-bucket results, for example by sorting and limiting them. Wickham argues that most data query languages already rely on an equivalent framework, whether explicitly or implicitly.
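To make the framework concrete, here is a minimal sketch of split-apply-combine in plain Python (not Wickham's R code; the sample records and the helper name `split_apply_combine` are ours, for illustration only):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sales records, used only to illustrate the framework.
records = [
    {"region": "west", "amount": 120},
    {"region": "east", "amount": 75},
    {"region": "west", "amount": 30},
    {"region": "east", "amount": 60},
    {"region": "north", "amount": 200},
]

def split_apply_combine(rows, key, aggregate):
    """Split rows into buckets by `key`, apply `aggregate` to each
    bucket, then combine the per-bucket results into one sorted list."""
    rows = sorted(rows, key=itemgetter(key))      # split: groupby needs sorted input
    buckets = groupby(rows, key=itemgetter(key))
    results = [(k, aggregate(list(group))) for k, group in buckets]  # apply
    return sorted(results, key=itemgetter(1), reverse=True)          # combine

totals = split_apply_combine(records, "region",
                             lambda rows: sum(r["amount"] for r in rows))
print(totals)  # [('north', 200), ('west', 150), ('east', 135)]
```

The same shape maps directly onto SQL's `GROUP BY ... ORDER BY ... LIMIT`, which is Wickham's point: the framework is already latent in most query languages.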
A Call for Industry-Standard Benchmarks for Big Data Platforms at Strata SC 2013
Big data systems are characterized by their flexibility in processing diverse data genres, such as transaction logs, connection graphs, and natural language text, with algorithms that span multiple communication patterns, e.g., scatter-gather, broadcast, multicast, pipelines, and bulk-synchronous. No single benchmark built around a single workload can represent such a multitude of use cases. However, our systematic study of several use cases of current big data platforms indicates that most workloads are composed of a common set of stages, which capture the variety of data genres and algorithms commonly used to implement most data-intensive end-to-end workloads. Our upcoming session at Strata SC discusses the BigData Top 100 List, a new community-based initiative for benchmarking big data systems.
Tips for interacting with analytics colleagues
To quote Pride and Prejudice, businesses have for many years “labored under the misapprehension” that their analytics talent was made up of misanthropes with neither the will nor the ability to communicate or work with others on strategic or creative business problems. These employees were meant to be kept in the basement out of sight, fed bad pizza, and pumped for spreadsheets to be interpreted in the sunny offices aboveground.
This perception is changing in industry as the big data phenomenon has elevated data science to a C-level priority. Suddenly folks once stereotyped by characters like Milton in Office Space are now “sexy.” The truth is there have always been well-rounded, articulate, friendly analytics professionals (they may just like Battlestar more than you), and now that analytics is an essential business function, personalities of all types are being attracted to practice the discipline.
Preview of Strata Santa Clara 2013 Session
The 2013 Strata Conference in Santa Clara, CA will be my fifth Strata conference. As always, I’m excited to join so many leaders in the data and data viz communities, and I’m honored that I’ll be speaking there.
I will be presenting my tutorial “Communicating Data Clearly” at 9 AM on Tuesday, February 26. This talk will cover methods and principles for creating effective graphs that are clear, accurate, and easy to understand, and it will also emphasize how to avoid common graphical mistakes. To give you a preview of a few of the topics I will be covering, and to provide some information to those who cannot attend, I will now link to some of the blog posts I’ve written for Forbes. I was invited to blog for Forbes at a New York Strata Conference in 2011, so my relationships with Forbes and Strata are intertwined.
Preview of an upcoming tutorial at Strata Santa Clara 2013
This month at Strata, the U.C. Berkeley AMPLab will be running a full day of big data tutorials. In this post, we present the motivation and vision for the Berkeley Data Analytics Stack (BDAS), and an overview of several BDAS components we have released over the past two years, including Mesos, Spark, Spark Streaming, and Shark.
While batch processing systems like Hadoop MapReduce paved the way for organizations to ask questions about big datasets, they represent only the beginning of what users need to do with big data. More and more, users wish to move from periodically building reports about datasets to continuously using new data to make informed business decisions in real-time. Achieving these goals imposes three key requirements on big data processing:
- Low latency queries: Interactive ad-hoc queries allow data scientists to draw valuable inferences faster, or explore a larger solution space to make better decisions. Furthermore, there is an increasing need for stream processing, as this allows organizations to make decisions in real-time, such as detecting an SLA violation and fixing the problem before users notice, or deciding what ads to show based on users’ live tweets.
- Sophisticated analysis: People are increasingly looking to use new state-of-the-art algorithms, such as predictive machine learning algorithms, to make better forecasts and decisions.
- Unification of existing data computation models: Users want to integrate interactive queries, batch processing, and stream processing to handle the ever-increasing requirements of their processing pipelines. For example, detecting anomalies in user behavior may require (1) stream processing to compare the behavior of users in real-time across different segments (e.g., genre, age, location, device), (2) interactive queries to detect differences in users’ daily (or weekly) behavior, and (3) batch processing to build sophisticated predictive models.
In response to the above requirements, more than three years ago we began building BDAS.
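The anomaly-detection example above can be sketched end to end. The toy below is plain Python rather than actual BDAS/Spark code, and its data and names (`baseline`, `is_anomalous`) are our own invention; it only illustrates how a batch-built model and a streaming check fit together:

```python
import statistics

# Hypothetical per-user click counts produced by a historical batch job.
history = {
    "alice": [10, 12, 11, 9, 13, 10],
    "bob":   [3, 4, 2, 3, 5, 3],
}

# Batch step: build a per-user baseline (mean, sample stdev) offline.
baseline = {
    user: (statistics.mean(xs), statistics.stdev(xs))
    for user, xs in history.items()
}

def is_anomalous(user, count, threshold=3.0):
    """Streaming step: flag a new observation that deviates from the
    batch-built baseline by more than `threshold` standard deviations."""
    mean, stdev = baseline[user]
    return abs(count - mean) > threshold * stdev

# Interactive step: ad-hoc checks against live events.
print(is_anomalous("alice", 11))  # within alice's normal range -> False
print(is_anomalous("bob", 40))    # far outside bob's range -> True
```

In a real deployment the batch step would run over far larger histories (e.g., in Spark), and the streaming step would run continuously (e.g., in Spark Streaming), but the division of labor is the same.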
Preview of upcoming session at Strata Santa Clara
At the end of 2012, the Federal Trade Commission (“FTC”) hosted the public workshop, “The Big Picture – Comprehensive Online Data Collection,” which focused on privacy concerns relating to the comprehensive collection of consumer online data by Internet service providers (“ISPs”), operating systems, browsers, search engines, and social media. During the workshop, panelists debated the impact of service providers’ ability to collect data about computer and device users across unaffiliated websites, including when some entities have no direct relationship with such users.
As one example of the issues raised by the panelists, Professor Neil Richards, from the Washington University in St. Louis School of Law, stated that, despite its benefits, comprehensive data collection infringes on the concept of “intellectual privacy,” which is predicated on consumers’ ability to freely search, interact, and express themselves online. Professor Richards also stated that comprehensive data collection is creating a transformational power shift in which businesses can effectively persuade consumers based on their knowledge of consumer preferences. Yet, according to Professor Richards, few consumers actually understand “the basis of the bargain,” or the extent to which their information is being collected.
Preview of upcoming session at the Strata Conference
Recommendations are making their way into more and more products, and using larger datasets significantly improves them. Hadoop is increasingly used for building out recommendation platforms. Examples of recommendations include product recommendations, merchant recommendations, content recommendations, social recommendations, query recommendations, and display and search ads.
With the number of options available to users ever increasing, customers’ attention spans are shrinking fast, and at any given moment customers expect to see their best choices right in front of them. In such a scenario, we see recommendations powering more and more product features and driving user interaction, so companies are looking for ways to target customers precisely at the right time. This brings big data into the picture. Succeeding with data, and building new markets or changing existing ones, is the game being played in many high-stakes scenarios. Some companies have found ways to build a big data recommendation/machine learning platform that gives them an edge in bringing better products to market ever faster. Hence, there is a strong case for treating recommendations/machine learning on big data as a platform within a company, rather than as a black box that magically produces the right results. Such a platform also supports features like fraud detection, spam detection, and content enrichment and serving, making it viable in the long run. It is not just about recommendations.
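As one minimal illustration of the kind of computation such a platform runs, here is a toy co-occurrence recommender in plain Python; the purchase data and the `recommend` helper are invented for illustration and are not the speaker's code:

```python
from collections import Counter

# Hypothetical purchase histories: user -> set of product ids.
purchases = {
    "u1": {"laptop", "mouse", "keyboard"},
    "u2": {"laptop", "mouse"},
    "u3": {"laptop", "monitor"},
    "u4": {"mouse", "keyboard"},
}

def recommend(user, k=2):
    """Recommend the k items most strongly co-purchased with the
    user's own items by similar users, excluding items already owned."""
    owned = purchases[user]
    scores = Counter()
    for other, items in purchases.items():
        if other == user:
            continue
        overlap = len(owned & items)       # similarity = number of shared items
        for item in items - owned:
            scores[item] += overlap        # weight candidates by similarity
    return [item for item, _ in scores.most_common(k)]

print(recommend("u3"))  # ['mouse', 'keyboard']
```

At platform scale the same co-occurrence counting is what a Hadoop job would distribute across millions of users, which is why the "platform, not black box" framing matters: the same machinery can score candidates for spam or fraud instead of products.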
Preview of The Laws of Data Mining Session at Strata Santa Clara 2013
Many years ago I was taught about the three laws of thermodynamics. When that didn’t stick, I was taught a quick way to remember them, originally attributed to C.P. Snow:
- 1st Law: you can’t win
- 2nd Law: you can’t draw
- 3rd Law: you can’t get out of the game
These laws (well, the real ones) were firmly established by the mid-19th century. Yet it wasn’t until the 1930s that the value of the 0th law was recognized.
The laws of data mining may possibly, just possibly, not be as important as the laws of thermodynamics, but at Strata they will be supported by an equally important 0th Law.
Strata Santa Clara session preview on core data science skills
The McKinsey Global Institute forecasts a shortage of over 140,000 data scientists in the U.S. by 2018. I forecast a shortage of 140,000 people to explain to their respective hiring managers that “make it Hadoop” is not an appropriate articulation of what these people can or should do. If big data is the new bubble, then here’s to the prolonged correct data recession that hopefully follows.
Correct data? Such skills used to be called unsexy names like statistics or scientific experiments, but we now prefer to spice up the job titles (and salaries!) a bit and brand ourselves as data scientists, data storytellers, data prophets, or—if my next promotion comes through—Lord High Chancellor of Data, appointed by the Sovereign on the advice of the Prime Minister to oversee Her Majesty’s Terabytes. Modesty, it sometimes feels, is low on the burgeoning list of big data skills.