Big data analytics workloads are among the most significant workloads in modern data centers, and it is increasingly important to characterize representative workloads and understand their behaviors in order to improve the performance of data center computer systems. In this paper, we embark on a comprehensive study to understand the impacts and performance implications of big data analytics workloads on systems equipped with modern superscalar out-of-order processors. After investigating the three most important application domains in Internet services in terms of page views and daily visitors, we choose 11 representative data analytics workloads and characterize their micro-architectural behaviors by using hardware performance counters. Our study reveals that the big data analytics workloads share many inherent characteristics, which place them in a different class from the traditional workloads and the scale-out services. To further understand the characteristics of big data analytics workloads, we perform correlation analysis to identify the key factors that affect cycles per instruction (CPI). We also reveal that the increasing complexity of big data software stacks puts higher pressure on modern processor pipelines.
1 INTRODUCTION
In the context of the digital information explosion, more and more businesses are analyzing massive amounts of data (so-called big data) with the goal of converting big data into big value. Typically, data center workloads fall into two categories, services and big data analytics, as mentioned in [1] and [2]. Typical big data analytics workloads include business intelligence, machine learning, bio-informatics, and ad hoc analysis [3], [4].
The business potential of big data analytics applications is a driving force behind the design of innovative data center computer systems, including both hardware and software [5], [6], [7], [8]. For example, the recommendation system is a typical application with huge business implications: it aims to recommend suitable products to interested buyers by mining user behaviors and system logs. Given that big data analytics is a very important application area, there is an urgent need to identify the representative data analytics algorithms or applications and understand their characteristics. It is therefore meaningful for both system designers and researchers to characterize big data analytics workloads and understand the interactions between those workloads and the underlying micro-architectures so as to optimize data center computer systems.
In this paper, we first single out the three most important application domains in Internet services: search engine, social network, and electronic commerce (listed in Fig. 1), according to widely accepted metrics, namely the number of page views and daily visitors. We then choose eleven representative big data analytics workloads (especially those shared by several domains) among the three application domains. Considering that our community may be interested in using those workloads to evaluate the benefits of new computer system designs and implementations, we release those workloads and the corresponding data sets in an open-source big data benchmark suite, BigDataBench [9], which is publicly available from [10].
The whole processor architecture can be roughly divided into two parts: an in-order frontend, which fetches, decodes, and issues micro operations, and an out-of-order backend, which executes micro operations and writes data back to the register file. Based on the selected representative big data analytics workloads, we embark on a comprehensive study to understand their behaviors on modern processors using the Top-Down method [11]. The Top-Down method chooses the issue point as the dividing point and categorizes the processor execution time into four basic parts: Retiring (the issued micro operations are eventually retired), Bad Speculation (cycles wasted because of incorrect predictions), Frontend Bound (the processor frontend undersupplies the backend while the backend is willing to accept new micro operations), and Backend Bound (no micro operation is issued because the required backend resources are lacking).
We characterize the big data analytics workloads and compare them with the traditional workloads, including desktop (SPEC CPU2006), HPC (HPCC), traditional service (SPECweb2005 and TPC-W), and chip multiprocessor (PARSEC) workloads, as well as the scale-out service workloads (four of the six benchmarks in CloudSuite [12]). Since we find that the service workloads in data centers (the scale-out service workloads) share many micro-architectural characteristics with the traditional service workloads, in the rest of this paper we use the term service workloads to describe both the scale-out service workloads and the traditional service workloads. In order to identify potential optimization methods for the big data analytics workloads, we perform a correlation analysis between cycles per instruction (CPI) and other micro-architectural characteristics. Finally, we investigate the impacts of the modern big data software stack from the perspective of micro-architecture. We have the following observations:
The Top-Down analysis shows that the first bottleneck for the big data analytics workloads is the backend and the second bottleneck is the frontend. The service workloads behave similarly but suffer more stalls than the big data analytics workloads. The correlation analysis shows that most of the big data analytics workloads have common performance bottlenecks, which are L2 cache misses and TLB misses. These characteristics differ from those of the service workloads, whose performance bottlenecks vary. Simplifying the software stacks for the big data analytics workloads improves the performance. Specifically, Table 1 summarizes our main observations and performance implications for the big data analytics workloads.
The remainder of the paper is organized as follows. Section 2 summarizes the related work. Section 3 states our experiment methodology. Section 4 presents the microarchitectural characteristics of the big data analytics workloads in comparison with other benchmark suites. Section 5 analyzes the correlation of the measured characteristics with CPI. Section 6 investigates a typical big data software stack's impacts on application behaviors. Section 7 draws conclusions of the full paper.
Table 1 (excerpt). Implication: simplifying the branch predictor appropriately will not cause much performance loss but will save power and die area.
There has been much work evaluating data mining algorithms or clusters using data analytics workloads from different aspects, such as [12], [13], [14], [15], [16]. Narayanan et al. [13] characterize traditional data analytics workloads on a single node rather than at data center scale. Huang et al. [14] characterize the MapReduce framework performance from a system perspective. Kanev et al. [18] present a detailed analysis of Google's data center workloads. They focus on service workloads instead of big data analytics workloads. Yasin et al. [19] characterize a big data analytics workload (i.e., Naive Bayes) from system, application and architectural perspectives. They focus on a single workload only, whereas we consider more workloads so as to find the shared characteristics. Ferdman et al. [12] characterize scale-out (data center) workloads from the micro-architecture perspective. Their work mainly focuses on online service workloads: among the six benchmarks, there are four scale-out service workloads (Data Serving, Media Streaming, Web Search, and Web Serving). Our work also shows that the data analytics workloads are significantly diverse in terms of micro-architectural characteristics (Section 4) on modern processors. Previous work also found that big data analytics applications show varying performance, energy behaviors and preferable system configuration parameters [20], [21], [22]. In short, for comprehensiveness and fairness, a diverse set of workloads must be used to represent the various categories of big data analytics workloads.
EXPERIMENTAL SETUP
This section first describes the experimental environments on which we conduct our study, and then explains our experiment methodology.
Workloads Selection
In order to find representative big data analytics workloads, we first identify and rank the main application domains according to the number of page views and daily visitors, and then single out the main applications from the most important application domains. We investigate the top sites listed in Alexa [25], which ranks sites using a combination of average daily visitors and page views. We classify the top 20 sites into five categories: search engine, social network, electronic commerce, media streaming and others. Fig. 1 shows the categories and their respective shares. For conciseness, we focus on the top three application domains: search engine, social network and electronic commerce.
We choose the 11 most popular workloads in those three application domains and show them in Table 2. Table 3 shows the scenarios of those workloads, indicating that most of our chosen workloads are shared among the three domains.
Hardware Configurations
We use a four-node Hadoop cluster (one master and three slaves) to run all big data analytics workloads. The nodes in our Hadoop cluster are connected through a 1 Gb Ethernet network. Each node has two Intel Xeon E5-2680 v3 (Haswell) processors and 64 GB of memory. A Xeon E5-2680 v3 processor includes 12 physical out-of-order cores with speculative pipelines. Each core has private L1 and L2 caches, and all cores share the L3 cache. Table 4 lists the important hardware configurations of our platforms.
Big Data Analytics Workloads Setups
All the big data analytics applications are implemented on the Hadoop [26] system. The Hadoop and JDK versions are 1.2.1 and 1.6.0, respectively. We use Hive version 0.6. Each node runs Linux CentOS 6.5 with the kernel 4.1.13. Each slave node is configured with 24 map task slots and 12 reduce task slots. For each map and reduce task, we assign a 1 GB Java heap in order to achieve better performance. Table 2 presents the input data size and the retired instructions of each big data analytics workload. In comparison with that of CloudSuite described in [12], our experimental approach is more pragmatic. We adopt larger input data sets that are stored in both memory and disk instead of storing the whole data set (only 4.5 GB for Naive Bayes in [12]) in memory. The size of the input data is around 100 GB for each workload. The retired instructions of the big data analytics workloads range from thousands of billions to tens of thousands of billions, indicating that those workloads are not trivial.
Compared Benchmarks Setups
In addition to the big data analytics workloads, we deploy several benchmark suites, including SPEC CPU2006, HPCC, PARSEC, TPC-W, SPECweb 2005, and CloudSuite, a scale-out benchmark suite for cloud computing [12]. We compare them with the big data analytics workloads.
Traditional Benchmarks Setups
SPEC CPU2006: we run the official applications with the first reference input, reporting and averaging the results in two groups, integer benchmarks (SPECINT) and floating point benchmarks (SPECFP). We compile SPEC CPU2006 with gcc version 4.4.7.
HPCC: we deploy a representative HPC benchmark suite, HPCC, on our platform. The HPCC version is 1.4. It has seven benchmarks: HPL, STREAM, PTRANS, RandomAccess, DGEMM, FFT, and COMM. We run each benchmark separately.
SPECweb 2005: we run the bank application on a single node with a 24 GB data set. We use distributed clients to generate the workloads, and the total number of simultaneous sessions is 3,000.
PARSEC: we deploy the PARSEC 2.0 release. We run all benchmarks with native input data sets and use gcc version 4.4.7 to compile them.
TPC-W: we deploy a Java-version TPC-W distribution from University of Wisconsin-Madison [27] with MySQL version 5.1.73 and JDK version 1.6.0.
CloudSuite Setups
CloudSuite 1.0 has six benchmarks, including one big data analytics workload-Naive Bayes. We also choose the Naive Bayes as one of the representative big data analytics workloads with a larger input data set (100 GB). In [12] , the input data size is only 4.5 GB.
We set up the other five benchmarks following the instructions on the CloudSuite web site [28].
Data Serving: we benchmark Cassandra 0.7.3 database with 30 million records. The request is generated by a YCSB [29] client with a 50:50 ratio of read to update.
Media Streaming: we use Darwin streaming server 6.0.6. We generate 20 Java processes and issue 20 client threads by using the Faban driver [30] with GetMediumLow 70 and GetshortHi 30.
Software Testing: we use the Cloud9 execution engine, and run the printf.bc coreutils binary file.
Web Search: we benchmark a distributed Nutch 1.2 index server. The index and data segment sizes are 17 and 35 GB, respectively.
Web Serving: we characterize a front end of Olio server. We simulate 500 concurrent users to send requests with 30 seconds ramp-up time and 300 seconds steady-state time.
Experimental Methodology
In this paper, we use the Top-Down [11] method to analyze processor pipeline behaviors. Modern processors adopt dynamic execution with out-of-order and speculative engines. The Top-Down method chooses the issue point as the dividing point. If a micro operation is issued in a certain cycle, it is classified as Retiring or Bad Speculation depending on whether the micro operation is eventually canceled. If no micro operation is issued, either the frontend is not ready to supply more micro operations (Frontend Bound) or the backend is not ready to accept a new micro operation (Backend Bound). That is to say, the Top-Down method categorizes the processor execution time into four basic parts: Retiring, Bad Speculation, Frontend Bound and Backend Bound. Among the four categories above, only Retiring represents "useful work"; the others are all stalls that prevent the processor pipelines from being fully used. In order to apply the Top-Down method, we use pmu-tools [31], which is a collection of tools for profile collection and performance analysis on Intel CPUs.
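The four-way split described above can be sketched from raw counter values. This is a minimal illustration, not the pmu-tools implementation: it assumes a 4-wide issue pipeline and the level-1 formulas and Intel event names commonly given for the Top-Down method, and the `Counters` class and its field names are hypothetical.

```python
from dataclasses import dataclass

# Assumption: 4 issue slots per cycle, as commonly used for Top-Down on Intel cores.
PIPELINE_WIDTH = 4


@dataclass
class Counters:
    """Hypothetical container for raw counts read over one measurement window."""
    cycles: int              # CPU_CLK_UNHALTED.THREAD
    uops_issued: int         # UOPS_ISSUED.ANY
    uops_retired_slots: int  # UOPS_RETIRED.RETIRE_SLOTS
    uops_not_delivered: int  # IDQ_UOPS_NOT_DELIVERED.CORE
    recovery_cycles: int     # INT_MISC.RECOVERY_CYCLES


def top_down_level1(c: Counters) -> dict:
    """Split all issue slots into the four top-level Top-Down categories."""
    slots = PIPELINE_WIDTH * c.cycles
    frontend = c.uops_not_delivered / slots
    # Issued-but-never-retired uops plus the slots lost while recovering.
    bad_spec = (c.uops_issued - c.uops_retired_slots
                + PIPELINE_WIDTH * c.recovery_cycles) / slots
    retiring = c.uops_retired_slots / slots
    # Whatever is left is attributed to the backend.
    backend = 1.0 - (frontend + bad_spec + retiring)
    return {
        "Retiring": retiring,
        "Bad Speculation": bad_spec,
        "Frontend Bound": frontend,
        "Backend Bound": backend,
    }


if __name__ == "__main__":
    # Made-up counter values, for illustration only.
    sample = Counters(cycles=1000, uops_issued=2100, uops_retired_slots=2000,
                      uops_not_delivered=800, recovery_cycles=25)
    for name, share in top_down_level1(sample).items():
        print(f"{name}: {share:.1%}")
```

Backend Bound is computed as the remainder, so the four fractions always sum to one by construction.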
We perform a ramp-up period and then start collecting the performance data. Unlike the experiment methodology of CloudSuite [12], which only performs a 180-second measurement, the performance data we collect cover the whole lifetime of each workload, including the map, shuffle, and reduce stages. We collect the data of all three slave nodes and report the mean value.
CHARACTERIZATION RESULTS

Fig. 2 presents the top level breakdown for all workloads.
CloudSuite has six benchmarks, among which we report Naive Bayes on the leftmost side, separated from the other five workloads (in the middle), since Naive Bayes is also included in our eleven workloads. We find from Fig. 2 that the big data analytics workloads have a smaller percentage of Retiring cycles than the traditional high performance computing workloads, i.e., HPCC. However, they have a larger percentage of Retiring cycles than the service workloads, including four of CloudSuite, TPC-W and SPECweb. This explains why the big data analytics workloads have intermediate CPIs, lower than the service workloads but higher than the traditional workloads, which will be further discussed in Section 4.1.
The HPCC workloads consist of micro benchmarks and kernel programs, and different programs are used to stress different aspects of the system. So their Top-Down breakdown data vary dramatically from each other in Fig. 2 .
For both the big data analytics and the service workloads, the first bottleneck is backend bound and the second bottleneck is frontend bound. The service workloads have more stalls than the big data analytics workloads. The Bad Speculation cycles occupy a small fraction for all workloads, indicating that bad speculation should have the lowest optimization priority.
In the following sections, we will perform a deep analysis for each part shown in Fig. 2 so as to identify the potential bottlenecks.
Retired Instructions
Cycles Per Instruction refers to the average number of processor cycles an instruction consumes in the pipeline. Fig. 3 shows the CPI of each workload. From Fig. 3, we observe that the service workloads have higher CPIs in comparison with the other workloads, including most of the big data analytics workloads and the PARSEC, SPECFP, SPECINT, and HPCC workloads. Most of the big data analytics workloads have intermediate CPI values, lower than those of the service workloads but higher than those of the HPCC workloads. The CPIs of the eleven big data analytics workloads range from 0.61 to 1.46. The avg bar in Fig. 3 shows the average CPI of the eleven big data analytics workloads, which is 0.77. The CPIs of the HPCC workloads differ widely from one another since they are all micro benchmarks designed for measuring different aspects of the system. For example, HPCC-HPL and HPCC-DGEMM are computation-intensive, and hence they have low CPIs (below 0.5). In contrast, HPCC-COMM is designed to measure latency and bandwidth, which requires generating long-latency memory accesses, and hence it has a higher CPI (greater than 1). Fig. 4 illustrates the retired instruction breakdown of each workload. We notice that the service workloads execute a large percentage of kernel-mode instructions, while most of the big data analytics workloads execute a small percentage of kernel-mode instructions. The service workloads have higher percentages of kernel-mode instructions because serving a large number of requests results in a large number of network and disk activities.
Among the big data analytics workloads, only Sort has a high proportion (about 23 percent) of kernel-mode instructions, whereas, on average, the 11 big data analytics workloads execute only about 4 percent of their instructions in kernel mode. This phenomenon can be explained as follows: unlike most of the big data analytics workloads, the input data size of Sort is equal to the output data size. So in each stage of the MapReduce job, the system writes a large amount of intermediate data to local disks or transfers a large amount of data over the network. This characteristic makes Sort perform more I/O or network operations than the other workloads, so it is much more OS-intensive.
Among the HPCC workloads, RandomAccess has a large percentage of kernel-mode instructions (about 45 percent). RandomAccess measures the rate of integer random updates of (remote) memory. An update is a read-modify-write operation on a table of 64-bit words, and it involves a large number of copy_user_generic_string calls in the kernel. The other factors contributing to a large percentage of kernel-mode instructions need further investigation.
Observations. The big data analytics workloads have lower CPIs than the service workloads, but higher CPIs than the traditional workloads, e.g., HPCC-HPL and HPCC-DGEMM. Meanwhile, we also observe that the big data analytics workloads involve fewer kernel-mode instructions than the service workloads.
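The two metrics used in this subsection are simple ratios of counter values. The following is a minimal sketch, assuming whole-run totals of cycles and of user-mode and kernel-mode retired instructions are available; the function and parameter names are hypothetical and the numbers are made up.

```python
def cycles_per_instruction(cycles: int, instructions: int) -> float:
    """CPI: average number of processor cycles per retired instruction."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def kernel_mode_fraction(kernel_instructions: int, user_instructions: int) -> float:
    """Share of retired instructions that were executed in kernel mode."""
    total = kernel_instructions + user_instructions
    if total <= 0:
        raise ValueError("no retired instructions")
    return kernel_instructions / total


if __name__ == "__main__":
    # Illustrative values only, not measurements from this study.
    print(cycles_per_instruction(cycles=770, instructions=1000))
    print(kernel_mode_fraction(kernel_instructions=40, user_instructions=960))
```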
Backend Behaviors
Backend Bound can be further split into Core Bound and Memory Bound. Core Bound refers to either execution starvation or execution ports under-utilization. And Memory Bound reflects the execution stalls related to memory hierarchy. Fig. 5 shows the Backend Bound breakdown. For the big data analytics workloads, Core Bound and Memory Bound nearly contribute to the Backend Bound stalls equally. However, the service workloads suffer from more Memory Bound stalls.
Core Bound
We find that execution port under-utilization dominates the Core Bound stalls (see footnote 2). Our processors have eight execution ports, and each execution port can execute one micro operation at a time. Each port executes certain kinds of micro operations. Ideally, the micro operations issued to the execution units are balanced enough to keep every port busy. In practice, however, the issued micro operations cannot always be distributed uniformly across the execution ports. Fig. 6 (see footnote 3) shows the execution port utilization for each workload. In Fig. 6, 0 Port Utilized measures the fraction of core cycles in which the CPU executes no micro operation on any execution port. 1 Port Utilized and 2 Ports Utilized indicate the fraction of core cycles in which the CPU executes 1 or 2 micro operations, respectively, in total per cycle on all execution ports. 3m Ports Utilized represents the fraction of core cycles in which the CPU executes 3 or more micro operations in total per cycle on all execution ports.
From Fig. 6, we find that the big data analytics workloads have lower execution port utilization than most of the HPC workloads, whereas they have better port utilization than the service workloads. Even though there are eight ports in our processors, the fraction of core cycles in which the CPU uses 3 or more execution ports is around 30 percent for the big data analytics workloads. In most of the cycles, the CPU executes only 2 or fewer micro operations concurrently. In about 30 percent of the cycles in Fig. 6, there is no micro operation to execute. For the service workloads, the port under-utilization is even worse: a large percentage of cycles have no micro operation to execute.
Implications. For both the big data analytics workloads and the service workloads, increasing the execution port utilization will improve the overall performance, since they have a large number of Backend Bound cycles in which the CPU uses only two or fewer execution ports concurrently. This problem may be relieved by adopting vectorization or better instruction scheduling [11]. Because the number of execution ports and their functionalities are specific to the processor architecture, this may offer an opportunity for hardware and software co-optimization.
Memory Bound
The manufacturers of processors introduce a deep memory hierarchy to reduce the performance impacts of memory wall. Nearly all of the modern processors own three-level caches. The miss penalty of last-level cache can reach up to several hundred cycles in modern processors. Fig. 7 shows the L2 cache misses per thousand instructions (MPKI). Fig. 8 reports the ratio of L3 cache hits over L2 cache misses. This ratio is calculated by using Equation (1). Please note that we do not analyze the L1 data cache statistics because the miss penalty can be hidden by the out-of-order cores [32].

$$\text{ratio} = \frac{\text{L2 cache misses} - \text{L3 cache misses}}{\text{L2 cache misses}} \tag{1}$$

2. The Core Bound can be further classified into Divider Bound and Port Utilization Bound. We find that the Divider Bound only occupies a very small fraction for all workloads.
3. Fig. 6 represents the Port Utilization fraction under the Backend Core Bound category in the Top-Down method. Here we give the normalized value so as to make it easier to interpret.
Most of the big data analytics workloads have fewer L2 cache misses (the average L2 cache MPKI is about 9) than the service workloads (the average L2 cache MPKI is about 30), but more than the PARSEC and HPCC workloads. The L2 cache statistics indicate that the big data analytics workloads have better locality than the service workloads. The HPCC workloads differ in locality, as the official web site mentions, which can explain the different cache behaviors among the HPCC workloads.
From Fig. 8, we find that for both the big data analytics workloads and the service workloads, the average ratios of L2 cache misses that hit in the L3 cache are higher (89.2 percent for the big data analytics workloads and 90.1 percent for the service workloads) than those of PARSEC and most of the HPCC workloads. We conclude that for most of the big data analytics and the service workloads, the last level cache (LLC) of modern processors is large enough to hold most of the data missed from the L2 cache. Fig. 9 shows the data TLB misses per thousand instructions. It includes data TLB misses caused by both load operations and store operations. For most of the big data analytics workloads, the data TLB misses are fewer than those of the service workloads, but more than those of the PARSEC and HPCC workloads with the exception of HPCC-RandomAccess.
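The misses-per-thousand-instructions (MPKI) metric and the L3 hit ratio of Equation (1) can be sketched as follows. This is a minimal illustration with hypothetical function names and made-up counter values; it assumes the L2 and L3 miss counts are collected over the same interval.

```python
def mpki(misses: int, instructions: int) -> float:
    """Misses per thousand retired instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return 1000.0 * misses / instructions


def l3_hit_ratio(l2_misses: int, l3_misses: int) -> float:
    """Equation (1): fraction of L2 misses that are served by the L3 cache."""
    if l2_misses <= 0:
        raise ValueError("l2_misses must be positive")
    return (l2_misses - l3_misses) / l2_misses


if __name__ == "__main__":
    # Illustrative values only: 9,000 L2 misses and 972 L3 misses
    # over one million retired instructions.
    print(mpki(misses=9_000, instructions=1_000_000))
    print(l3_hit_ratio(l2_misses=9_000, l3_misses=972))
```

The same `mpki` helper applies unchanged to the TLB and instruction cache miss counts reported per thousand instructions.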
Implications. The big data analytics workloads have fewer L2 cache misses than the service workloads, but more than the PARSEC and HPCC workloads. Considering the data-intensive nature of the big data analytics workloads, the number of L2 cache misses is modest. Meanwhile, most L2 cache misses hit in the L3 cache, indicating that the L3 cache is quite effective for the big data analytics workloads. Modern processors dedicate approximately half of the die area to caches, and hence properly tuning the last level cache capacity may not only reduce the memory access latency but also improve the energy efficiency of the processor and save die area.
Frontend Behaviors
When we drill down into the Frontend Bound part, there are two categories: Frontend Latency Bound and Frontend Bandwidth Bound. For instance, an instruction cache miss will be classified into the frontend latency bound, whereas, a stall caused by the instruction decoder's inefficiency belongs to frontend bandwidth bound [11] . Fig. 10 shows the Frontend Bound breakdown for each workload. We find that both the big data analytics workloads and the service workloads have more Frontend Latency Bound than Frontend Bandwidth Bound. The service workloads are more sensitive to Frontend Latency Bound than the big data analytics workloads, whereas the traditional HPC workloads are more Frontend Bandwidth Bound.
Instruction cache misses and instruction Translation Look-aside Buffer (TLB) misses directly increase the instruction fetch latency, and they are classified as Frontend Latency Bound. The instruction cache and the instruction TLB are two fundamental components that must be accessed when the frontend fetches instructions from memory. The instruction cache is where the fetch unit directly gets instructions. The TLB stores page table entries (PTEs), which are used to translate virtual addresses to physical addresses. Whenever virtual memory is accessed, the processor searches the TLB for the virtual page number of the accessed page. If a TLB entry with a matching virtual page number is found, a TLB hit occurs and the processor uses the retrieved physical address to access memory. Otherwise there is a TLB miss, and the processor has to look up the page table, which is called a page walk. The page walk is an expensive operation. For a three-level page table, the walk requires three memory accesses. In other words, a memory access that misses the TLB needs four physical memory accesses in total: three for the page walk and one for the requested data.
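The cost difference between a TLB hit and a TLB miss can be illustrated with a toy model. The sketch below makes several simplifying assumptions that are not taken from the measured hardware: 4 KiB pages, an LRU replacement policy, a flat dictionary standing in for the three-level page table, and a fixed cost of three memory accesses per page walk plus one access for the data itself. The class name `TinyTLB` is hypothetical.

```python
from collections import OrderedDict

PAGE_SHIFT = 12        # assumption: 4 KiB pages
PAGE_TABLE_LEVELS = 3  # three-level page table, as in the text


class TinyTLB:
    """Toy TLB with LRU replacement that counts physical memory accesses."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()  # virtual page number -> physical frame number
        self.hits = 0
        self.misses = 0
        self.memory_accesses = 0

    def access(self, vaddr: int, page_table: dict) -> int:
        """Translate vaddr, charging the page-walk cost on a TLB miss."""
        vpn = vaddr >> PAGE_SHIFT
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)            # mark as most recently used
        else:
            self.misses += 1
            self.memory_accesses += PAGE_TABLE_LEVELS  # one access per table level
            self.entries[vpn] = page_table[vpn]
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)     # evict least recently used
        self.memory_accesses += 1                    # access to the data itself
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        return (self.entries[vpn] << PAGE_SHIFT) | offset


if __name__ == "__main__":
    page_table = {0: 7, 1: 3, 2: 9}  # made-up mapping for illustration
    tlb = TinyTLB(capacity=2)
    for vaddr in (0x0010, 0x0020, 0x1004, 0x2008, 0x0010):
        tlb.access(vaddr, page_table)
    print(tlb.hits, tlb.misses, tlb.memory_accesses)
```

In this run the last access misses again because page 0 was evicted when page 2 was brought in, which is the kind of capacity pressure that a large instruction footprint puts on a real instruction TLB.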
Figs. 11 and 12 present the L1 instruction cache misses and the instruction TLB misses per thousand instructions, respectively. On average, the big data analytics workloads generate about two L1 instruction cache misses per thousand instructions. They have more L1 instruction cache misses than SPECINT, SPECFP, PARSEC and all the HPCC workloads. Most of the big data analytics applications have fewer L1 instruction cache misses than the service workloads, including Media Streaming, Data Serving, Web Serving, TPC-W and SPECweb. Data Serving has a larger instruction footprint and suffers from severe L1 instruction cache misses, which are 15 times more than the average of the big data analytics workloads. Higher L1 instruction cache misses indicate a less efficient frontend. For most of the other benchmarks, L1 instruction cache misses are rare. The HPCC workloads in particular have relatively small instruction footprints.
Consistent with the performance trend of L1 instruction cache misses, the big data analytics workloads' instruction TLB misses are more frequent than those of the SPECFP, PARSEC, and HPCC workloads. Most of the service workloads have more instruction TLB misses than those of the big data analytics workloads. Lots of TLB misses will cause long latency instruction fetch stalls, and hence result in more Frontend Latency Bound stalls.
Implications. For both the big data analytics and the service workloads, though the frontend bound is only the second bottleneck (after the backend bound) as discussed in Section 4, they do suffer from notable instruction cache and instruction TLB misses. Those misses prolong frontend latency and further increase the frontend stalls. This may be caused by two factors: the deep memory hierarchy with long latency in modern processors [12], and large binary sizes inflated by high-level languages, third-party libraries and deep software stacks [18]. The third-party libraries and software stacks used by the big data analytics workloads may enlarge the binary size of applications and further aggravate the inefficiency of the instruction cache and TLB. Enhancing L1 instruction cache and instruction TLB efficiency can improve the performance of the big data analytics workloads, and especially that of the service workloads. Several possible methods can be applied to address the instruction cache issue, as summarized by Kanev et al. [18], including a larger instruction cache, more sophisticated instruction prefetchers [33], cache partitioning [34] and advanced replacement policies [35].
Speculation Execution
Modern out-of-order processors introduce a functional unit (e.g., the Branch Target Buffer (BTB)) to predict the next instruction address and avoid pipeline stalls due to branch instructions. If the prediction is correct, the pipeline continues. However, if a branch is mispredicted, the pipeline must flush the wrong instructions and fetch the correct ones, which causes a penalty of at least a dozen cycles. So branch prediction is not a trivial issue in the pipeline. Branch mispredictions dominate the bad speculations, which can directly affect the pipeline performance. Fig. 13 presents the branch misprediction ratio of each workload. We find that most of the big data analytics workloads have low branch misprediction ratios in comparison with those of the service workloads and the SPEC CPU workloads. The low misprediction ratios of the big data analytics workloads indicate that most of their branch instructions follow simple patterns. Simple patterns make it easier for the Branch Target Buffer to predict whether the next branch will be taken. For big data, simple algorithms always beat sophisticated algorithms [36], which may be a reason for the low misprediction ratios. The HPCC workloads have very low misprediction ratios because the branch logic of the seven micro benchmarks is simple and their branch behaviors are highly regular.
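To illustrate why regular branch patterns lead to low misprediction ratios, the sketch below simulates a single 2-bit saturating counter, a textbook prediction scheme. It is swapped in here purely for illustration and is not the predictor of the processors measured in this study; the function name and the branch traces are made up.

```python
def misprediction_ratio(outcomes, initial_state: int = 2) -> float:
    """Simulate one 2-bit saturating counter over a trace of branch outcomes.

    States 0 and 1 predict not taken; states 2 and 3 predict taken.
    """
    if not outcomes:
        raise ValueError("empty branch trace")
    state = initial_state
    wrong = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            wrong += 1
        # Move toward 3 on taken and toward 0 on not taken, saturating at the ends.
        state = min(3, state + 1) if taken else max(0, state - 1)
    return wrong / len(outcomes)


if __name__ == "__main__":
    # A loop branch taken 9 times and then falling through, repeated 100 times.
    long_loop = ([True] * 9 + [False]) * 100
    # A much shorter loop: taken twice and then falling through.
    short_loop = ([True] * 2 + [False]) * 100
    print(misprediction_ratio(long_loop))
    print(misprediction_ratio(short_loop))
```

In this toy model only the loop exit is mispredicted, so the long, regular loop mispredicts once in ten branches while the short loop mispredicts once in three.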
Implications. A predictor with 100 percent prediction accuracy is impractical. The branch predictor of modern processors is quite effective for the big data analytics workloads. The misprediction ratio (less than 2 percent on average) is lower than that of most of the compared workloads, including the CPU benchmark SPECINT. Since the big data analytics workloads have small fractions of Bad Speculation in the Top-Down breakdown (Fig. 2) and low misprediction ratios, putting great effort into increasing the branch prediction accuracy may not yield a performance improvement. On the other hand, modern processors have invested heavily in silicon real estate and algorithms for the branch prediction unit in order to minimize the frequency and the impact of branch mispredictions. So for the big data analytics workloads, a simpler branch predictor may be preferred so as to save power and die area.
CORRELATION ANALYSIS ON THE SEQUENCES OF SAMPLED PERFORMANCE DATA
For all the workloads investigated in this paper, we capture the hardware performance counters at an interval of one second. So for each metric of each workload, there is a sequence of values with different time stamps. In Section 4 we characterize the big data analytics workloads by analyzing the average micro-architectural data of each workload. In this section, with the sequences of sampled data, we perform a correlation analysis between cycles per instruction and other micro-architectural characteristics. Correlation analysis is used to measure the relationship between two items and show the statistical relationships involving dependence [37]. Correlations are useful because they can indicate a predictive relationship that can be exploited in practice. Pearson's correlation coefficient is one of the most popular methods for correlation analysis. It is a measurement of the linear correlation between two variables. It is defined as the covariance of the two variables divided by the product of their standard deviations, which is represented by

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

where $\operatorname{cov}(X, Y)$ is the covariance of $X$ and $Y$, and $\sigma_X$ and $\sigma_Y$ are their standard deviations.
Pearson's correlation coefficient ranges from −1 to 1. The absolute value of the correlation coefficient shows the strength of the dependency: the bigger the absolute value, the stronger the correlation between the two variables. A positive number means a positive correlation, and a negative number means a negative correlation. A value of 1 implies that a linear equation describes the relationship between X and Y perfectly. A value of 0 implies that there is no linear correlation between the variables.
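As an illustration of the definition above, the following is a minimal sketch of the coefficient computed over two equal-length sample sequences. The function name `pearson` and the sample values are hypothetical and are not taken from the measurements in this paper.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation: covariance divided by the product of the standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two samples")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors of the covariance and the standard deviations cancel,
    # so plain sums of products and squares are sufficient.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical one-second samples of CPI and of one metric (made-up numbers).
cpi_samples = [1.0, 1.2, 1.4, 1.6, 1.8]
metric_samples = [8.0, 9.0, 10.0, 11.0, 12.0]  # rises linearly with CPI

r = pearson(cpi_samples, metric_samples)
print(round(r, 6))
```

Because the made-up metric is an exact linear function of the made-up CPI, the coefficient is 1 up to floating-point rounding; reversing the metric sequence would give a value close to −1.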
As explained in Section 4.1, Cycles Per Instruction can be identified as the metric to evaluate the application performance on a pipeline from the perspective of micro-architecture. The more cycles a processor takes to complete an instruction, the poorer the performance of the application in the pipeline. In order to decide which factors affect the CPI, we compute the correlation coefficients between CPI and the micro-architectural data discussed in Section 4, and show them in Tables 5 and 6. We compute the correlation coefficients for both the metrics defined in the Top-Down method, which calculate the penalty of micro-architectural events in cycles, and the conventional metrics, e.g., miss rates. All the metrics defined in the Top-Down method have a suffix of Bound in those tables.
In Tables 5 and 6, we only present the five metrics that have the highest correlation coefficients with CPI. The metrics and the corresponding correlation coefficients are shown in descending order. From Tables 5 and 6, we make the following observations.
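The selection step described above can be sketched as follows: derive a CPI value for each sampling interval, correlate every metric sequence with it, sort in descending order, and keep the first five. This is only a sketch under stated assumptions. The function names, the counter values and the choice to rank by the signed coefficient (rather than its absolute value) are assumptions made for illustration; the metric names are borrowed from the text, but the numbers are invented.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def top_correlated(cycles, instructions, metrics, k=5):
    """Return the k metrics with the highest correlation with CPI, in descending order."""
    # CPI of each one-second interval: cycles divided by retired instructions.
    cpi = [c / i for c, i in zip(cycles, instructions)]
    scores = [(name, pearson(cpi, values)) for name, values in metrics.items()]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Invented samples for five one-second intervals (not measured data).
cycles = [2.0e9, 2.4e9, 2.8e9, 3.0e9, 2.6e9]
instructions = [2.0e9] * 5
metrics = {
    "L2 cache miss": [8.0, 9.0, 10.0, 10.5, 9.5],
    "data TLB miss": [1.0, 1.3, 1.4, 1.7, 1.2],
    "instruction TLB miss": [0.5, 0.6, 0.9, 0.8, 0.7],
    "Memory Bound": [0.20, 0.21, 0.26, 0.25, 0.24],
    "kernel-mode instruction ratio": [0.10, 0.14, 0.13, 0.17, 0.12],
    "Retiring": [0.30, 0.26, 0.22, 0.20, 0.24],  # falls as CPI rises
}

ranked = top_correlated(cycles, instructions, metrics)
for name, coefficient in ranked:
    print(f"{name}: {coefficient:.2f}")
```

With these invented values, the metric that is an exact linear function of CPI comes first, and the negatively correlated one is dropped because only five of the six metrics are kept.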
For most of the big data analytics workloads, the CPI is sensitive to L2 cache misses and TLB misses, including both instruction TLB and data TLB misses. That is to say, these metrics have strong positive correlations with CPI. We find that most of the metrics listed in the tables are backend bound metrics, including data TLB miss and L2 cache miss, which means the backend needs to be optimized with high priority. This observation corroborates the one in Section 4 that those workloads spend the major part of the pipeline time stalled on the backend. For the service workloads, the metrics that affect the CPIs significantly are more diverse. Some are very sensitive to frontend latency, such as Web Serving. Some are sensitive to branch instruction execution, including both the branch misprediction ratio and the branch instruction ratio.
Also, we find that the CPIs of the big data analytics workloads have strong positive correlations with instruction TLB misses, but the Frontend Bound related metrics, e.g., Frontend Latency Bound, do not appear in the table. This is because the Frontend Latency Bound also contains other events, such as Length Changing Prefix stalls, in addition to instruction TLB misses. That is to say, instruction TLB misses increase the CPI of the big data analytics workloads, but their impact is not that significant and the frontend is not a primary bottleneck of the whole pipeline.
The chip multiprocessor (PARSEC) and the high performance (HPCC) workloads are also very sensitive to L3 cache performance, including the metrics of L3 cache miss (L3 cache misses per kilo instructions) and L3 Bound (stalled cycles due to L3 accesses), in addition to L2 cache performance. We also find that the HPCC workloads are sensitive to the performance of nearly the whole memory hierarchy. In Table 6, the metrics of Memory Bound (stalled cycles due to memory accesses), L3 Bound, L2 Bound (stalled cycles due to L2 accesses) and data TLB miss all appear for the HPCC workloads. The execution port utilization, i.e., 0 Port Utilized Bound, also has a positive correlation with CPI for PARSEC and several HPCC workloads, which indicates that the under-utilization of the execution ports has a significant impact on CPI.
For almost all the big data analytics workloads, and most of the service workloads and the traditional workloads, the proportion of kernel-mode instructions has a positive correlation with CPI. Most of the correlation coefficients between kernel-mode instructions and CPI are no less than 0.6, which indicates strong positive correlations. Both instruction and data TLB performance also have impacts on CPI for most of the workloads we investigate, because of the large miss penalty generated by the TLB miss. 4
Implications. There are many potential optimization points in modern superscalar processors as discussed in the previous work, e.g., on-chip bandwidth and die area [12]. The above observations can give us some implications about how to alleviate the bottlenecks in pipelines, although one well known consequence is that right after alleviating a bottleneck, the next bottleneck emerges [38].
4. Our Xeon processor has a two-level TLB. The first level has separate instruction and data TLBs. The second level is shared.
According to our analysis in this section, architects should focus on improving TLB performance and L2 cache performance with the highest priority for the big data analytics workloads. Since a page walk is a very expensive operation, optimizations should focus on reducing the TLB miss cost, either by enlarging the TLB capacity to hold more entries or by speeding up TLB refills. For the L2 cache, considering that the big data analytics workloads have modest L2 cache misses (the average L2 cache MPKI is about 9 in Section 4.2), further reducing the number of cache misses may need a large amount of effort but achieve limited performance gains. Therefore, optimizing the miss penalty of the L2 cache should have higher priority than optimizing its miss ratio. As mentioned in Section 4.2, the last level cache can serve most of the misses from the L2 cache, so appropriately reducing the capacity of the last level cache may be a good choice. A smaller last level cache can not only reduce the L2 cache miss penalty but also improve the energy efficiency and save die area. However, for the chip multiprocessor (PARSEC) and high performance (HPCC) workloads, reducing the last level cache capacity may not be a good choice since their performance is very sensitive to the L3 cache miss ratio.
HEAVIER SOFTWARE STACK'S IMPACT
Complex software stacks are being proposed to facilitate the development of big data analytics applications. Those software stacks, such as Spark [39] and Hadoop [26], have attracted a large number of users and companies in a short period of time [4], [40]. On one hand, the big data software stack lets programmers write applications without considering the messy details of data partitioning, task distribution, load balancing, failure handling and other data center-wide system details [41]. On the other hand, the big data software stacks may affect the application behaviors, because the heavier software stacks deepen the call hierarchy. All the big data analytics workloads we characterize run on Hadoop, which has three different operation modes: standalone (local) mode, pseudo-distributed mode and fully distributed mode [42]. In the standalone mode, Hadoop runs completely on the local machine. It does not use the Hadoop Distributed File System (HDFS), nor does it launch any of the Hadoop daemons. The pseudo-distributed mode runs Hadoop in a "cluster of one" with all daemons running on a single machine. The fully distributed mode provides a production environment, which can manage a large number of nodes. In the previous sections of this paper, all the big data analytics workloads are executed in the fully distributed mode.
We take a glimpse at the impacts of the heavier big data software stack by comparing the application behaviors between the stand-alone mode and the pseudo-distributed mode. The stand-alone mode eliminates the impacts of HDFS and the daemon processes, and gives us the chance to execute the same user application code with a shallower call hierarchy. Table 7 shows the call hierarchies for those two modes. The pseudo-distributed mode eliminates the inter-node network factor brought by the fully distributed mode.
We choose eight applications, including IBCF, Sort, WordCount, Grep, PageRank, K-means, Fuzzy K-means and Naive Bayes, to investigate the impacts of the typical big data software stack. We remove the other three applications, which use third party libraries heavily. 5 Because all eight applications run on a single node, we must drive them with smaller data sets to avoid overloading. We generate a data set of about 10 GB for each workload and use the same data set to drive the application in the different operation modes. We run the same application in the different modes and collect the micro-architectural metrics.
Top Level Breakdown
Fig. 14 shows the top level breakdown for both the pseudo-distributed mode and the stand-alone mode workloads. We find that for all applications, the stand-alone mode workloads have fewer Frontend Bound stalls than their pseudo-distributed mode counterparts. This indicates that the heavier software stack puts more pressure on the pipeline frontend. It seems that even though the speculative execution engine is quite efficient for the big data analytics workloads, which have small fractions of Bad Speculation (Section 4.4), a thinner software stack still further reduces Bad Speculation cycles. In Fig. 14, we find that for nearly all applications, the proportion of Bad Speculation is reduced when we change from the pseudo-distributed mode to the stand-alone one. The stand-alone mode workloads have more Backend Bound stalls than their pseudo-distributed counterparts for all applications. In the following sections, we will drill down into the pipeline backend and frontend behaviors. We do not perform deeper analysis on Bad Speculation because of its small fraction in Fig. 14 and the space limitations.
"Y" means the corresponding mode will invoke the item. "N" means the corresponding mode will not invoke the item.
5. HMM invokes ICTCLASS [43] ; SVM invokes LIBSVM [44] and Hive-bench invokes Hive.
Backend Impacts
The pipeline backend suffers from a large percentage of stalled cycles for all workloads we investigate in this paper and has plenty of potential room for optimization. Fig. 15 shows the Backend Bound breakdown for both the pseudo-distributed mode and the stand-alone mode workloads. We find that the stand-alone mode workloads have fewer Backend Memory Bound stalls than their pseudo-distributed counterparts. The Backend Memory Bound refers to the execution stalls related to the memory subsystem, e.g., long latency memory accesses that miss in all cache levels. This phenomenon implies that the participation of the heavier software stack increases memory access latency for the big data analytics applications. In order to process big data, the software stack needs to provide a unified scalable file system for user applications, such as HDFS, a scalable, fault-tolerant, distributed storage system that is integrated into the Hadoop software stack. The unified file system is a new layer upon the Linux native file system and it increases the length of the data access path. At the same time, the fault-tolerant mechanism must be executed to replicate newly generated data. 6 All those features make the heavier big data analytics applications (pseudo-distributed mode) incur larger working sets and longer memory access latency.
From Fig. 15, we also find that the stand-alone workloads have more Backend Core Bound stalls than their pseudo-distributed counterparts. This phenomenon can be explained by the added pressure on certain execution ports that serve specific kinds of micro operations. For instance, in our Haswell processors, there are eight ports and only port 4 can process data store operations. If lots of instructions in the pipeline need to store data, port 4 will be very busy while other ports may be idle. The overall result is that the execution ports are under-utilized. The lower pressure on certain execution ports in the pseudo-distributed mode workloads may be caused by two factors. The first one is that the heavier software stack diversifies the kinds of micro operations issued simultaneously and spreads the pressure across the execution ports. The second one is that the heavier software stack shifts the pressure from the pipeline backend to the pipeline frontend. For the pseudo-distributed mode workloads, more micro operations are stalled at the pipeline frontend and fewer micro operations can be issued to the backend in comparison with the stand-alone counterparts, which alleviates the pressure on certain ports in the pipeline backend.
In summary, the pseudo-distributed mode workloads have more Backend Memory Bound stalls but far fewer Backend Core Bound stalls than their stand-alone counterparts.
Frontend Impacts
In Section 4.3, we infer that the heavier software stack increases the instruction footprint and incurs more frontend stalls. Fig. 16, which shows the Frontend Bound breakdown when the applications run in the stand-alone mode and the pseudo-distributed mode, verifies our inference to some degree. We find that the heavier software stack has impacts on both frontend latency and frontend bandwidth. For all the applications, the pseudo-distributed mode has more Frontend Latency Bound stalls than the stand-alone mode. For some applications, the pseudo-distributed workloads also have more Frontend Bandwidth Bound stalls than the stand-alone workloads. This implies that the heavier software stack not only puts more pressure on the instruction fetch unit and prolongs frontend fetch latency but also increases the burden on the instruction decoder and incurs more Frontend Bandwidth Bound stalls.
6. In the pseudo-distributed mode, there is an HDFS but no data replication. We mention data replication here since it is a common feature of a regular big data system.
Kernel-Mode Instruction Ratios
Fig. 17 illustrates the retired kernel-mode instruction ratios. We find that the applications running in the pseudo-distributed mode have lower kernel-mode instruction ratios than their stand-alone counterparts for all big data analytics workloads. This implies that most of the functions added in the pseudo-distributed mode are implemented at the application level. So most of the instructions introduced by the heavier software stack are executed in user mode (i.e., executed on ring 1 to ring 3), and the kernel-mode instruction ratios are diluted. The notable one is Sort, which triggers a lot of system calls as explained in Section 4.1. With the heavier software stack, a large number of user-mode instructions dilute its kernel-mode instruction ratio, which drops from 30 to 16 percent.
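To make the dilution effect concrete, the sketch below works through the Sort figures (a kernel-mode instruction ratio of 30 percent in the stand-alone mode and 16 percent in the pseudo-distributed mode). It relies on a simplifying assumption that the paper does not state: the number of kernel-mode instructions is the same in both modes, so that only added user-mode instructions change the ratio. The function name is hypothetical.

```python
def total_instruction_growth(ratio_before, ratio_after):
    """Growth factor of total retired instructions, ASSUMING the kernel-mode
    instruction count is unchanged and only user-mode instructions are added."""
    if not (0 < ratio_before <= 1 and 0 < ratio_after <= 1):
        raise ValueError("ratios must lie in (0, 1]")
    # ratio = kernel / total, so for a fixed kernel count:
    # total_after / total_before = ratio_before / ratio_after
    return ratio_before / ratio_after


# Sort: kernel-mode ratio of 30 percent (stand-alone) and 16 percent (pseudo-distributed).
growth = total_instruction_growth(0.30, 0.16)
print(f"total instructions grow by a factor of about {growth:.3f}")
```

Under that assumption, a drop from 30 to 16 percent corresponds to 0.30 / 0.16 = 1.875 times as many retired instructions in total. If the heavier stack also adds kernel-mode instructions, the real growth would be larger than this estimate.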
Observations and Implications
From the above experiments, we find that the heavier software stack has the following impacts on application behaviors from the perspective of micro-architecture. 1) The heavier big data software stack increases the instruction footprint and puts more pressure on the pipeline frontend. 2) The heavier big data software stack reduces the Backend Core Bound stalls. 3) The heavier big data software stack enlarges the working set and prolongs memory access latency. 4) The heavier big data software stack reduces the overall kernel-mode instruction ratio of the applications.
At the software layer, thinning the software stack relieves the impacts it brings, e.g., frontend pressure and longer memory access latency. Unfortunately, the current trend is the opposite: the software stacks are becoming more and more complicated [45]. Better scheduling of the micro operations introduced by the heavier big data software stacks, so as to fill the under-utilized execution ports, may be one way to increase the port utilization and achieve performance gains. At the same time, the potential burden introduced by the heavier software stacks should be kept in mind when writing big data analytics applications or performing optimizations.
CONCLUSION
In this paper, after investigating the three most important application domains in terms of page views and daily visitors, we choose eleven representative big data analytics workloads and use the Top-Down analysis method to characterize their micro-architectural behaviors on systems equipped with modern superscalar out-of-order processors. Our study reveals that the big data analytics workloads share many inherent characteristics, which place them in a different class from the desktop, HPC, chip multiprocessor, and service workloads. Also, we perform correlation analysis to identify the key factors that affect cycles per instruction. We find that the increasing complexity of the big data software stacks puts higher pressure on modern processor pipelines.
Wangling Gao received the BS degree from Huazhong University of Science and Technology, in 2012. She is working toward the PhD degree in computer science in the Institute of Computing Technology, Chinese Academy of Sciences, and University of Chinese Academy of Sciences. Her research interests focus on big data benchmark and big data analytics.
Lixin Zhang received the BS degree in computer science from Fudan University, in 1993, and the PhD degree in computer science from the University of Utah, in 2001. He is a professor in the Institute of Computing Technology, Chinese Academy of Sciences. His main research areas include computer architecture, data center computing, high performance computing, advanced memory systems, and workload characterization. He was previously a research staff member with IBM Austin Research Lab and a master inventor of IBM.
