Big data systems are characterized by their flexibility in processing diverse data genres, such as transaction logs, connection graphs, and natural language text, using algorithms that exhibit multiple communication patterns, e.g. scatter-gather, broadcast, multicast, pipelines, and bulk-synchronous. A benchmark built around a single workload cannot be representative of such a multitude of use cases. However, our systematic study of several use cases of current big data platforms indicates that most workloads are composed of a common set of stages, which capture the variety of data genres and algorithms commonly used to implement most data-intensive end-to-end workloads. Our upcoming session at Strata SC discusses the BigData Top100 List, a new community-based initiative for benchmarking big data systems.
BigData Top100 List Goals and Intentions
The BigData Top100 List would pursue a concurrent benchmarking model. Our goal is to pursue an open benchmark development process by soliciting input from the community at large, to be evaluated by a benchmark steering committee with representation from industry, academia, and other sectors. We are proposing the BigData Top100 List as an application-level benchmarking exercise to provide an “end-to-end” view of big data applications. While existing industry-standard benchmarks are also application-level benchmarks, they focus on highly structured (relational) data and are restricted to the functionality provided by SQL. We have developed guidelines for defining a big data benchmark that include:
- Simplicity: Following the dictum that “Everything should be made as simple as possible, but no simpler”, the benchmark should be technically simple to implement and execute. This is challenging, given the tendency of any software project to overload the specification and functionality, often straying from the most critical and relevant aspects.
- Ease of benchmarking: The costs of benchmark implementation/execution and any audits should be kept relatively low. The benefits of executing the benchmark should justify its expense—a criterion that is often underestimated during benchmark design.
- Time to market: Benchmark versions should be released in a timely fashion in order to keep pace with rapid market changes in the big data area. A development time of 3-4 years, common for industry consortia, would be unacceptable in the big data application space. The benchmark would be obsolete before it is released!
- Verifiability of results: Verification of results is important, but the verification process must not be prohibitively expensive. Thus, to ensure correctness of results while controlling audit costs, the BigData Top100 List will provide automatic verification procedures along with a peer review process via a benchmark steering committee.
A formal specification of this benchmarking suite is underway and will be announced at the Strata Conference on February 28, 2013, in a session unveiling the BigData Top100 List. We will present the current status and call for your participation in this initiative, so be there.