Search CORE

100,023 research outputs found

BigDataBench: a Big Data Benchmark Suite from Internet Services

Author: Gao Wanling
He Yongqiang
Jia Zhen
Li Xiaona
Lu Gang
Luo Chunjie
Qiu Bizhu
Shi Yingjie
Wang Lei
Yang Qiang
Zhan Jianfeng
Zhan Kent
Zhang Shujie
Zheng Chen
Zhu Yuqing
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 22/02/2014
Field of study

As architecture, systems, and data management communities pay greater attention to innovative big data systems and architectures, the pressure of benchmarking and evaluating these systems rises. Considering the broad use of big data systems, big data benchmarks must include diversity of data and workloads. Most of the state-of-the-art big data benchmarking efforts target evaluating specific types of applications or system software stacks, and hence they are not qualified for serving the purposes mentioned above. This paper presents our joint research efforts on this issue with several industrial partners. Our big data benchmark suite BigDataBench not only covers broad application scenarios, but also includes diverse and representative data sets. BigDataBench is publicly available from http://prof.ict.ac.cn/BigDataBench . Also, we comprehensively characterize 19 big data workloads included in BigDataBench with varying data inputs. On a typical state-of-practice processor, Intel Xeon E5645, we have the following observations: First, in comparison with the traditional benchmarks: including PARSEC, HPCC, and SPECCPU, big data applications have very low operation intensity; Second, the volume of data input has non-negligible impact on micro-architecture characteristics, which may impose challenges for simulation-based big data architecture research; Last but not least, corroborating the observations in CloudSuite and DCBench (which use smaller data inputs), we find that the numbers of L1 instruction cache misses per 1000 instructions of the big data applications are higher than in the traditional benchmarks; also, we find that L3 caches are effective for the big data applications, corroborating the observation in DCBench.Comment: 12 pages, 6 figures, The 20th IEEE International Symposium On High Performance Computer Architecture (HPCA-2014), February 15-19, 2014, Orlando, Florida, US

arXiv.org e-Print Archive

Crossref

The Implications of Diverse Applications and Scalable Data Sets in Benchmarking Big Data Systems

Author: Gao Wanling
Jia Zhen
Shi Yingjie
Wang Lei
Zhan Jianfeng
Zhang Lixin
Zhou Runlin
Zhu Chunge
Publication venue
Publication date: 30/07/2013
Field of study

Now we live in an era of big data, and big data applications are becoming more and more pervasive. How to benchmark data center computer systems running big data applications (in short big data systems) is a hot topic. In this paper, we focus on measuring the performance impacts of diverse applications and scalable volumes of data sets on big data systems. For four typical data analysis applications---an important class of big data applications, we find two major results through experiments: first, the data scale has a significant impact on the performance of big data systems, so we must provide scalable volumes of data sets in big data benchmarks. Second, for the four applications, even all of them use the simple algorithms, the performance trends are different with increasing data scales, and hence we must consider not only variety of data sets but also variety of applications in benchmarking big data systems.Comment: 16 pages, 3 figure

arXiv.org e-Print Archive

CiteSeerX

Characterizing and Subsetting Big Data Workloads

Author: Han Rui
Jia Zhen
Li Jingwei
Luo Chunjie
McKee Sally A.
Wang Lei
Yang Qiang
Zhan Jianfeng
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2014
Field of study

Big data benchmark suites must include a diversity of data and workloads to be useful in fairly evaluating big data systems and architectures. However, using truly comprehensive benchmarks poses great challenges for the architecture community. First, we need to thoroughly understand the behaviors of a variety of workloads. Second, our usual simulation-based research methods become prohibitively expensive for big data. As big data is an emerging field, more and more software stacks are being proposed to facilitate the development of big data applications, which aggravates hese challenges. In this paper, we first use Principle Component Analysis (PCA) to identify the most important characteristics from 45 metrics to characterize big data workloads from BigDataBench, a comprehensive big data benchmark suite. Second, we apply a clustering technique to the principle components obtained from the PCA to investigate the similarity among big data workloads, and we verify the importance of including different software stacks for big data benchmarking. Third, we select seven representative big data workloads by removing redundant ones and release the BigDataBench simulation version, which is publicly available from http://prof.ict.ac.cn/BigDataBench/simulatorversion/.Comment: 11 pages, 6 figures, 2014 IEEE International Symposium on Workload Characterizatio

arXiv.org e-Print Archive

Crossref

Chalmers Research