ENTRIES TAGGED "MapReduce"
Spark, Storm, HBase, and YARN power large-scale, real-time models.
My favorite session at the recent Hadoop Summit was a keynote by Bruno Fernandez-Ruiz, Senior Fellow & VP Platforms at Yahoo! He gave a nice overview of their analytic and data processing stack, and shared some interesting facts about the scale of their big data systems. Notably, many of their production systems now run on MapReduce 2.0 (MRv2), also known as YARN – a resource manager that lets multiple frameworks share the same cluster.
Yahoo! was the first company to embrace Hadoop in a big way, and it remains a trendsetter within the Hadoop ecosystem. In the early days the company used Hadoop for large-scale batch processing (the key example being computing its web index for search). More recently, many of its big data models require low-latency alternatives to Hadoop MapReduce. In particular, Yahoo! leverages user and event data to power its targeting, personalization, and other “real-time” analytic systems. Continuous Computing is a term Yahoo! uses to refer to systems that perform computations over small batches of data (over short time windows), in between traditional batch computations that still use Hadoop MapReduce. The goal is to move quickly from raw data, to information, to knowledge.
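To make the "continuous computing" idea concrete, here is a minimal sketch (my own toy illustration, not Yahoo!'s actual system) of micro-batch processing: rather than waiting for one large batch job to finish, a running result is refreshed after every small window of events.

```python
from collections import Counter

def micro_batch_counts(events, window_size):
    """Aggregate a stream of events in small fixed-size batches.

    After each small window, fold it into the running totals and
    emit a fresh snapshot -- a low-latency view of the data that
    sits between per-event streaming and full batch recomputation.
    """
    totals = Counter()
    window = []
    for event in events:
        window.append(event)
        if len(window) == window_size:
            totals.update(window)   # fold the small batch into running totals
            window = []
            yield dict(totals)      # snapshot available without waiting for all data
    if window:                      # flush any trailing partial window
        totals.update(window)
        yield dict(totals)

# Hypothetical click events arriving over time
clicks = ["news", "mail", "news", "sports", "news", "mail"]
for snapshot in micro_batch_counts(clicks, window_size=3):
    print(snapshot)
```

The window here is counted in events for simplicity; real systems typically cut windows by wall-clock time instead.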
On a side note: many organizations are beginning to use cluster managers that let multiple frameworks share the same cluster. In particular, I’m seeing many companies – notably Twitter – use Mesos (instead of YARN) to run similar services (Storm, Spark, Hadoop MapReduce, HBase) on the same cluster.
Going back to Bruno’s presentation, he also shared some interesting figures on the scale of Yahoo!’s current big data systems.
MapReduce gets easier, a new search engine for data, and now you can monitor the universe's forces on your phone.
Cloudera's Crunch hopes to make MapReduce easier, Datafiniti launches a search engine for data, and the University of Oxford releases an Android app for monitoring CERN data.
MapReduce crunches a million-song dataset, GPS and accident reconstruction, and WWI crowdsourcing.
This week's data stories include a guide to using MapReduce to process the Million Song Dataset, a story about how GPS data can help reconstruct lost memories (and accidents), and evidence that emergency crowdsourcing goes back further than many realize.
Cloudera CEO Mike Olson on Hadoop's architecture and its data applications.
Hadoop gets a lot of buzz in database circles, but some folks are still hazy about what it is and how it works. In this interview, Cloudera CEO and Strata speaker Mike Olson discusses Hadoop's background and its current utility.
The founder of Drawn to Scale explains how his database platform does simple things quickly.
Bradford Stephens, founder of Drawn to Scale, discusses big data systems that work in "user time."
Digits of pi, extruding images with iPads, and mapping the past on top of the present.
In this edition of Strata Week: The 2,000,000,000,000,000th digit of pi is calculated with an assist from Hadoop and MapReduce; a new technique uses iPads to extrude light paintings across a long exposure shot; Historypin links historical photos to Google Street View shots; and this is the last week for Strata Conference proposal submissions.
Storage, MapReduce and Query are ushering in data-driven products and services.
We're at the beginning of a revolution in data-driven products and services, driven by a software stack that enables big data processing on commodity hardware. Learn about the SMAQ stack, and where today's big data tools fit in.
Blue is the color, getting help with email overload.
In the latest edition of Strata Week: Google's introduction of a new search-indexing system highlights an important limitation of MapReduce and Hadoop. Can MapReduce adapt to real-time needs or will others follow Google in creating new architectures for real-time analytics?
Some organizations create their own real-time analysis tools, while others turn to specialized solutions. In a previous post, I highlighted SQL-based real-time analytic tools that can handle large amounts of data. I noted that other big data management systems, such as MPP databases and MapReduce/Hadoop, were too batch-oriented to deliver analysis in near real-time. At least for MapReduce/Hadoop systems, things may have changed slightly: a group of researchers from UC Berkeley and Yahoo recently modified MapReduce to allow for pipelining between operators.
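The difference pipelining makes can be sketched in a few lines of Python (an assumed toy example, not the researchers' actual code). In batch-style MapReduce, the reduce operator waits behind a barrier until all map output is materialized; in a pipelined variant, map output streams to the reducer as it is produced, so partial results are available early.

```python
def map_words(lines):
    # A word-count mapper: emit (word, 1) pairs as they are found.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def batch_reduce(pairs):
    # Batch style: materialize ALL map output before reducing.
    all_pairs = list(pairs)           # the barrier -- reduce starts only when map finishes
    counts = {}
    for word, n in all_pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

def pipelined_reduce(pairs):
    # Pipelined style: consume each pair as the mapper emits it,
    # yielding a running snapshot so results surface before the job ends.
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
        yield dict(counts)            # early, partial view of the final answer

lines = ["big data", "big deal"]
final = batch_reduce(map_words(lines))            # one answer, at the end
snapshots = list(pipelined_reduce(map_words(lines)))  # answers along the way
```

Both versions converge on the same final counts; the pipelined one simply exposes intermediate state, which is what makes near-real-time monitoring of a running job possible.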
The growing need to manage and make sense of Big Data has led to a surge in demand for analytic databases that many companies are attempting to fill. As an alternative to current shared-nothing analytic databases, HadoopDB is a hybrid that combines parallel databases with scalable, fault-tolerant Hadoop/MapReduce systems.