Abstract Cache has long been used to minimize the latency of main memory accesses by storing frequently used data near the processor. Processor performance depends on the underlying cache performance. Therefore, significant research has been done to identify the most crucial metrics of cache performance. Although the majority of research focuses on measuring cache hit rates and data movement as the primary cache performance metrics, cache utilization is also important. We investigate the application's locality using cache utilization metrics. Furthermore, we present cache utilization and traditional cache performance metrics as the program progresses, providing detailed insights into the dynamic behavior of parallel applications from four benchmark suites running on multiple cores. We explore cache utilization for APEX, Mantevo, NAS, and PARSEC, mostly scientific benchmark suites. Our results indicate that 40% of the data bytes in a cache line are accessed at least once before line eviction. Also, on average a byte is accessed two times before the cache line is evicted for these applications. Moreover, we present runtime cache utilization, as well as conventional performance metrics, that illustrate a holistic understanding of cache behavior. To facilitate this research, we build a memory simulator incorporated into the Structural Simulation Toolkit (Rodrigues et al. in SIGMETRICS Perform Eval Rev 38(4):37-42, 2011). Our results suggest that variable cache line size can result in better performance and can also conserve power.

This work was performed in part under the auspices of the US Government contract DE-AC52-06NA25396 for Los Alamos National Laboratory, operated by Los Alamos National Security, LLC, for the US Department of Energy (LA-UR-17-24198). Also, this work was partially supported through funding provided by the U.
Introduction
The memory system is one of the main components of modern processors and one of the main bottlenecks of processor performance, due to the memory wall [2]. The memory wall is the speed gap between the ultra-fast central processing unit (CPU) and the relatively slow main memory. To compensate for this gap, computer architects introduced on-chip cache, a small and fast memory comparatively near the processor. Cache alleviates the performance degradation caused by the speed gap but does not eliminate it completely.
Typically, cache improves the performance of the processor by exploiting locality of reference (referred to as locality [3] ) exhibited by all applications. There are two types of locality:
1. spatial locality: the use of data items in memory near recently used items
2. temporal locality: the reuse of a data item over time.
Generally, the compiler optimizes the code and maps data items in such a way to achieve optimal data locality during execution. Although the cache performance of an application is predominantly dependent on how it exploits locality, there is no direct method to determine locality at runtime.
Traditionally, the performance of cache is explained by performance metrics such as cache miss rate, data movement between levels of the memory hierarchy, data access latency, and power consumption. Besides these performance metrics, another metric, cache utilization, can be used to demonstrate an application's behavior in cache [4]. Cache utilization is a combination of two utilization metrics:
1. cache line utilization: the fraction of a cache line accessed at least once before eviction
2. cache byte utilization: the frequency of accesses of data bytes in a cache line before eviction
Therefore, line utilization and byte utilization are measures of spatial locality and temporal locality, respectively. From a performance perspective, poor cache utilization implies that the application is wasting both cache space and memory bandwidth.
Even though the idea of cache line utilization has been explored in the literature [5] [6] [7] [8] to design energy efficient caches with variable cache line sizes, the methodology and the purpose of studying cache utilization in this work are different. Most research on evaluation of cache performance reports the performance for total benchmark execution but does not demonstrate the variation of cache performance throughout execution. Due to variation in locality during application lifetime, cache performance can vary throughout execution. Therefore, it may be misleading for the application performance analyst and memory designers to rely on only the overall performance metrics. Specifically, power aware cache designs that sacrifice storage space in order to decrease power usage will benefit from runtime performance metrics. Moreover, runtime cache performance evaluation can aid in determining the most critical portions of an application that need to be simulated for testing new architectural designs, minimizing simulation time.
Contemporary research on reducing execution time for simulation and testing purposes uses representative samples [9] . Instead of executing the entire application, only the critical portion is targeted for execution. Currently, this procedure has been enhanced by SimPoint [10] . In SimPoint, the basic block distribution of a program is identified for a set of inputs to execute. PinPoint [11] is a precise implementation of SimPoint implemented with the Pin tool [12] . The methodology of SimPoint differs from our work since our work focuses on cache performance instead of basic block representation. Further, we investigate the time-varying nature of cache utilization.
Among various types of applications, we focus on scientific, graph, and stream applications to study the utility of both cache line and byte utilization. Typically, scientific applications are compute intensive. Such applications normally fully utilize the processor's compute power as long as the cache hierarchy and underlying memory system can support the application requirements adequately to exploit spatial and temporal locality. On the other hand, graph applications are memory intensive [13].
Our objectives in this article are twofold: first, to identify the locality of the applications by measuring both cache line and byte utilization, and second, to investigate the time-varying behavior of cache utilization and general cache performance metrics.
The rest of the paper is organized as follows: Sect. 2 discusses cache performance metrics and application properties. Methodology and experimental setup are explained in Sect. 3. Our results are presented in Sect. 4, and Sects. 5 and 6 discuss related work and conclusions, respectively.
Methodology
In this section, we define cache utilization performance metrics and explore the relationship to the application's inherent cache access characteristics, spatial and temporal locality. Moreover, we explain the methodology that we use to gather time-varying cache utilization and conventional cache metrics.
Cache performance metrics
Cache is a small, fast memory that stores frequently used data in fixed size data blocks called cache lines. The access pattern of variables in cache lines may not match the mapping of the variables in virtual address space. This mismatch, together with the limited cache size, causes eviction and replacement of cache lines. The cache line replacement policy controls which cache lines are to be replaced. Cache line replacement policies are implemented in hardware and include least recently used (LRU), most recently used (MRU), and random. All bytes of a cache line might not get accessed before the line is replaced, possibly resulting in inefficient use of the cache line [3]. This phenomenon can be assessed using the cache performance metrics explained in the following sections.
Cache utilization
Cache utilization is the efficiency of the use of the cache lines during the span of execution or interval of interest for an application. It can also be used to estimate the proportion of power wasted in the cache due to storing and transferring the unused portions of cache lines. We define two types of cache utilization:
1. cache line utilization: the fraction of data in a cache line that is accessed at least once before the cache line is evicted
2. cache byte utilization: the access frequency of data bytes in a cache line
Cache line utilization (CLU) is the number of cache line bytes that are accessed, per cache line, divided by the cache line size (64 bytes in our experiments), and averaged across the number of cache lines in the cache, Eq. 1. B_n is the number of bytes accessed in cache line n, and N_L is the total number of cache lines.
CLU = (1 ÷ N_L) × Σ_{n=1}^{N_L} (B_n ÷ 64) (1)
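The CLU definition above can be illustrated with a minimal Python sketch. The function name and the sample byte counts are hypothetical and chosen only for illustration; the input is assumed to be the number of distinct bytes touched in each cache line before that line was evicted.

```python
def cache_line_utilization(bytes_touched, line_size=64):
    """Average fraction of a cache line's bytes touched before eviction.

    bytes_touched holds one entry per evicted cache line (B_n in the text);
    line_size is 64 bytes, as in the experiments described here.
    """
    if not bytes_touched:
        raise ValueError("need at least one evicted cache line")
    return sum(b / line_size for b in bytes_touched) / len(bytes_touched)


# Hypothetical sample: four evicted lines with 64, 32, 8, and 0 bytes touched.
example_clu = cache_line_utilization([64, 32, 8, 0])
print(f"CLU = {example_clu:.5f}")
```

For this made-up sample the result is 0.40625, that is, about 40% of the fetched bytes were used.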
The unused data items in a cache line represent potential savings in terms of used bandwidth and power consumed to fetch, store, and evict unused data bytes. This is one major source of power wastage in caches. For most applications, CLU fluctuates at the beginning due to cold misses, cache misses that occur initially due to empty cache. Cold misses cause an increase in cache misses initially that often stabilizes as the application progresses.
Cache byte utilization (CBU) is the access frequency of a data byte before the cache line is evicted and is determined using Eq. 2. A_n is the number of times byte n is accessed, the cache line size is 64 bytes, and N_a is the number of times the cache line is accessed.
We use two additional cache performance metrics, line access distance and byte access distance for application characterization in the computation of degree of locality as explained in Sect. 2.2. Line access distance is the number of references between consecutive accesses to the same cache line. Similarly, byte access distance is the number of references between consecutive accesses to the same byte.
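As a rough illustration of the two access distance metrics, the following Python sketch computes the average distance from a list of byte addresses. It rests on assumptions that are not stated in the text: the distance counts only the intervening references (so back-to-back accesses have distance 0), each reference is treated as touching a single byte, and the function name is hypothetical.

```python
def average_access_distance(addresses, granularity=64):
    """Average number of references between consecutive accesses to the same block.

    granularity=64 groups addresses into 64-byte cache lines (line access
    distance); granularity=1 tracks individual bytes (byte access distance).
    Returns None when no block is accessed more than once.
    """
    last_seen = {}   # block id -> index of its most recent reference
    total = 0
    pairs = 0
    for index, address in enumerate(addresses):
        block = address // granularity
        if block in last_seen:
            # Assumption: count only the references strictly in between.
            total += index - last_seen[block] - 1
            pairs += 1
        last_seen[block] = index
    return total / pairs if pairs else None


trace = [0, 64, 8, 128, 70, 0]          # hypothetical byte addresses
line_distance = average_access_distance(trace, granularity=64)
byte_distance = average_access_distance(trace, granularity=1)
```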
Data miss rate
Miss rate is defined as the fraction of memory references not found in cache, Eq. 3. If a memory reference is not found in cache, it is fetched from the next levels of the memory hierarchy (other cache levels or main memory). The new cache line will replace another that the replacement policy predicts will not be used in the near future [3].
Miss Rate = Cache Misses ÷ References (3)
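To make Eq. 3 concrete, here is a minimal sketch of a set-associative cache with LRU replacement that counts misses and references. It is a toy model for illustration, not the simulator used in this work; the class name, parameters, and the tiny configuration in the usage lines are all assumptions.

```python
from collections import OrderedDict


class LRUCache:
    """Toy set-associative cache with LRU replacement (illustrative only)."""

    def __init__(self, num_sets, ways, line_size=64):
        self.num_sets = num_sets
        self.ways = ways
        self.line_size = line_size
        # One ordered dict per set; the first key is the least recently used line.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.references = 0
        self.misses = 0

    def access(self, address):
        """Record one reference; return True on a hit and False on a miss."""
        line = address // self.line_size
        cache_set = self.sets[line % self.num_sets]
        self.references += 1
        if line in cache_set:
            cache_set.move_to_end(line)        # mark as most recently used
            return True
        self.misses += 1
        if len(cache_set) >= self.ways:
            cache_set.popitem(last=False)      # evict the least recently used line
        cache_set[line] = True
        return False

    def miss_rate(self):
        """Miss Rate = Cache Misses / References."""
        return self.misses / self.references if self.references else 0.0


cache = LRUCache(num_sets=1, ways=2)           # hypothetical tiny configuration
for addr in [0, 64, 0, 128, 64]:
    cache.access(addr)
```

In this made-up trace four of the five references miss, so `cache.miss_rate()` evaluates to 0.8.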
Data movement
Data movement is the movement of data between levels of the cache hierarchy. Typically, when a cache miss occurs, the missed data are read from a lower level cache and stored in the cache line. When replacing a cache line is necessary, dirty cache lines (i.e., cache lines with modified data) from higher level caches are written back to lower levels and are also included in data movement measurements. For this article, only the data movement between L1 and L2 cache is shown.
Application characteristics
Among many properties, the two most important application properties that affect cache performance are spatial locality and temporal locality [14] . We define two metrics: degree of spatial locality and degree of temporal locality to quantify these properties for each application. Another property that can drive cache performance is the working set size of the application. Details of these three properties are explained as follows:
Spatial locality (locality in space)
Spatial locality is the property of accessing data located near recently accessed data. Therefore, if many of the bytes in a cache line are accessed at least once before the cache line is evicted from the cache, this cache line exhibits spatial locality. Thus, the metric cache line utilization describes spatial locality. Furthermore, if most of the bytes of a cache line are accessed at least once, then the access pattern of the application exhibits significant spatial locality [3]. Typically, spatial locality of an application changes as the application progresses in time [3]. Spatial locality is quantified by the metric degree of spatial locality, the ratio of line utilization to line access distance. Line access distance is the average number of references between consecutive accesses to the same cache line. It matters because, due to the limited size of cache, a cache line can be replaced by a requested line that is not currently allocated before it is accessed again.
Temporal locality (locality in time)
Temporal locality is the property of accessing recently accessed data again in the near future. Therefore, if a byte or a group of consecutive bytes in a cache line is accessed multiple times before the cache line is evicted, the data exhibit temporal locality. Thus, byte utilization corresponds to temporal locality. Moreover, if most of the bytes of an application are accessed multiple times, then the access pattern of the application exhibits significant temporal locality [3] .
Temporal locality is quantified by the metric degree of temporal locality (DL_tmp), the ratio of byte utilization to byte access distance, the average number of references between consecutive accesses to the same byte.
Working set
The working set of an application is the total of all pages accessed during the entire execution. In our experiment, the page size is 4 KBytes. This definition differs somewhat from the conventional one, in which the working set size is the summation of pages accessed during a specific time interval [3]. Most applications access the majority of pages at the beginning of execution.
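Under this definition, the working set size can be estimated by counting the distinct 4-KByte pages touched over the whole trace. The sketch below assumes the trace is available as a sequence of byte addresses; the function name and sample addresses are hypothetical.

```python
def working_set_size_bytes(addresses, page_size=4096):
    """Total size of all distinct pages touched during the whole execution."""
    pages = {address // page_size for address in addresses}
    return len(pages) * page_size


# Hypothetical addresses touching pages 0, 1, and 10.
wss = working_set_size_bytes([0, 100, 4096, 8191, 40960])
```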
Runtime performance metrics
Characterization of application cache performance by using metrics accumulated for the entire execution, especially for applications with irregular runtime behaviors, can be misleading. Therefore, we investigate the time-varying nature of cache performance metrics. Typically, cache utilization, data movement, miss rate, and working set size change during execution. We place checkpoints for every million references and collect the events to assess cache performance metrics throughout execution.
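The checkpointing idea can be sketched as follows. This is an illustrative assumption about how such interval statistics might be gathered, not the simulator's actual code: the function takes a stream of hit/miss outcomes and reports one miss rate per interval, resetting the counters at each checkpoint. Whether the reported metrics are per interval or cumulative is not stated in the text; per interval is assumed here.

```python
def interval_miss_rates(hit_flags, interval=1_000_000):
    """Miss rate for each block of `interval` references.

    hit_flags is an iterable of booleans (True = hit, False = miss).
    A final partial interval, if any, is reported as well.
    """
    rates = []
    misses = 0
    count = 0
    for hit in hit_flags:
        count += 1
        if not hit:
            misses += 1
        if count == interval:               # checkpoint reached
            rates.append(misses / count)
            misses = 0
            count = 0
    if count:                               # leftover references after the last checkpoint
        rates.append(misses / count)
    return rates


# Tiny hypothetical trace with a checkpoint every 3 references.
rates = interval_miss_rates([False, True, True, False, False, True, False], interval=3)
```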
Experimental setup
The hardware model of the cache hierarchy is implemented in the Structural Simulation Toolkit (SST) [1] created by Sandia National Laboratories (SNL). We implement our cache functional simulator in Ariel, a processor framework incorporated in SST. In this framework, the application executes on multiple Ariel cores, as illustrated in Fig. 1. Memory references incurred by the application for each core of the Ariel processor model are recorded using PIN [12], integrated in Ariel by design.
Cache configuration
We implement a cache model with similar base configuration to the cache hierarchy of the Intel Sandy Bridge processor. The cache configuration is listed in Table 1 . Each core has private L1 data and instruction cache, as well as private L2 cache. However, L3, last level cache, is shared by all cores. 
Benchmarks
In this section, we present the runtime performance of applications from four benchmark suites representing a variety of domains. The application suites are APEX [15] , Mantevo [16] , NAS [17] , and PARSEC [18] . All the benchmarks are written in C/C++ or Fortran and parallelized with OpenMP. Table 2 presents details about each benchmark.
APEX
Application Performance at Extreme Scale (APEX) [15] is an alliance consisting of teams from Los Alamos National Laboratory (LANL), Sandia National Laboratories (SNL), and the National Energy Research Scientific Computing Center (NERSC) at Berkeley Laboratory. The APEX benchmarks are used to assess the performance of various workflows expected for high-performance computer systems. We use three benchmarks from APEX to characterize cache behavior.
DGEMM (dense-matrix multiply) is an example of a simple, multithreaded, dense-matrix multiply application. The code is designed to measure the sustained, floating point computational rate of a single node.
HPCG (high-performance conjugate gradient) is designed to exercise computational and data access patterns of a broad set of important scientific applications. These patterns include sparse matrix-vector multiplication, sparse triangular solve, vector updates, global dot products, and a local symmetric Gauss-Seidel smoother.
PENNANT is a mini-app with data structures for manipulating 2-D unstructured finite element meshes containing arbitrary polygons. It implements a Lagrangian staggered-grid hydrodynamics algorithm on a 2-D unstructured finite-volume mesh composed of arbitrary polygons with arbitrary connectivity. Because of the staggered-grid algorithm, some state variables are defined on mesh points while others are defined on zones, which leads to irregular memory access patterns.
Mantevo
The Mantevo mini-apps [16] represent classes of scientific applications. The following descriptions are for the four mini-apps we use to illustrate characterization of cache behavior:
CoMD (co-design molecular dynamics) is an example of a typical classical molecular dynamics algorithm and workload featuring the Lennard-Jones potential and the embedded-atom method potential.
HPCCG (high-performance computing conjugate gradient) generates a synthetic linear system with a 27-point finite difference matrix in a sparse iterative format.
miniFE (mini-application finite element) assembles a sparse linear system from the steady-state conduction equation on a brick-shaped problem domain of linear 8-node hex elements and solves the linear system using a simple un-preconditioned conjugate gradient algorithm. At the end, the solution is compared with the analytic solution of the problem.
miniMD (mini-application molecular dynamics) is a typical molecular dynamics application that computes the force.
NAS
The NAS Parallel Benchmarks (NPB) are a small set of programs, developed by NASA, that are designed to help evaluate the performance of parallel supercomputers [17]. The benchmarks are derived from computational fluid dynamics (CFD) applications. In our analysis, we use seven applications. Selection of application input (i.e., W (90's work station), A, B, C, D, E, or F) for NAS benchmarks is determined by the working set size and number of memory references. We select the input size in such a way that the application generates a working set size greater than 1 MByte and results in more than one billion memory references.
BT (block tri-diagonal) is a computational fluid dynamics (CFD) application that uses an implicit algorithm to solve 3-dimensional (3-D) compressible Navier-Stokes equations.
CG (conjugate gradient) uses a conjugate gradient method to compute the smallest eigenvalue of a large, sparse, unstructured matrix. Further, it tests unstructured grid computations and communications by using a matrix with randomly generated locations of entries.
EP (embarrassingly parallel) is an embarrassingly parallel benchmark that generates pairs of Gaussian random deviates according to a specific scheme. As a result, this can establish the reference point for peak performance of a given platform.
FT (Fourier transform) solves a 3-D partial differential equation using FFTs. FT performs three one-dimensional (1-D) FFTs, one for each dimension.
IS (integer sorting) performs an integer sorting algorithm of the kind used in particle method codes in physics. The sorting operation is used to reassign particles to the appropriate cells.
MG (multigrid) computes the solution of the 3-D scalar Poisson equation. The algorithm works continuously on a set of grids to find the shortest and longest distances.
PARSEC
The Princeton Application Repository for Shared-Memory Computers (PARSEC) is a well-known benchmark suite composed of multithreaded programs [18] . The suite is commonly used to evaluate shared-memory chip-multiprocessors. Most of the applications of this suite are Intel RMS benchmarks and are related to practical life.
Blackscholes calculates portfolio prices using the Black-Scholes partial differential equation (PDE). There is no closed-form expression for the Black-Scholes equation, and it must be computed numerically, making it computationally intensive.
Bodytrack tracks a human body with multiple cameras through an image sequence. This benchmark is representative of computer vision algorithms in areas such as video surveillance, character animation, and computer interfaces.
Freqmine employs an array-based version of the FP-growth (frequent pattern-growth) method for frequent itemset mining (FIMI). Freqmine represents data mining techniques.
Cache performance and analysis
In this section, we present the performance results for cache utilization, miss rate, and data movement for a set of applications from the APEX, Mantevo, NAS, and PARSEC suites running on a multicore system. Also, we measure the overall working set size of the applications. Moreover, we report performance metrics as the program progresses (i.e., runtime evaluation), characterizing the time-varying cache performance for each application. In our experiments, we run the applications with 1, 2, 4, and 8 threads on 1, 2, 4, and 8 cores, respectively.
Cache utilization
Figure 2 presents the average cache line utilization for all the benchmarks running on 1 to 8 cores. For almost all applications, cache line utilization decreases with the core count since cache lines are evicted to keep the cache coherent in a multicore system. However, for some of the applications (DGEMM, HPCCG, CG, and BODYTRACK), cache line utilization improves slightly with the number of cores. These applications show overall poor utilization, illustrating poor spatial locality. As the number of cores increases, their threads are scheduled to run on different cores, so data are evicted from cache less frequently. As a result, the utilization increases slightly with the number of cores. This observation also predicts the high miss rate of these benchmarks, which decreases as the number of cores increases (Fig. 8). However, although BODYTRACK possesses poor line utilization, its miss rate is low (2.78) due to sequential access to the same data item (1.2 times on average), achieving good byte utilization (Fig. 6). Additionally, these benchmarks share less data, resulting in fewer issues in keeping the data coherent in a multicore system.
DGEMM exhibits the lowest utilization (10%) while FT shows the best utilization (80%). DGEMM is a basic matrix multiplication with floating point data. Since in matrix multiplication rows of one matrix are multiplied by columns of another matrix, the access distance to the same cache line is large (18 memory requests), leading to poor cache line utilization. On the other hand, FT performs a fast Fourier transform (FFT) and does all-to-all data computation that achieves the best cache line utilization. For all other applications, cache line utilization decreases as the number of cores increases. Generally, multithreaded applications running on multiple cores exercise the cache coherence mechanism, which keeps data coherent and removes non-updated shared data. Therefore, for most cases, line utilization decreases with an increase in the number of cores.
The distribution of cache line utilization has been grouped into five bins, as shown in Fig. 3. The first bin contains the cache lines in which between 1 and 13 bytes have been touched before the line is evicted. Similarly, the second, third, fourth, and fifth bins contain the cache lines with 14-26, 27-38, 39-51, and 52-64 bytes of data, respectively, that have been touched before the corresponding cache line is evicted. This cache line utilization histogram is collected while the benchmarks are running on 8 cores. DGEMM and BODYTRACK perform poorly, showing low levels of cache utilization. About 95% of their cache line accesses fall in the lowest bin. In addition, CoMD, HPCCG, and IS access only 1-13 bytes for more than 60% of cache lines. On the other hand, FT exhibits the best cache line utilization. About 82% of the total cache lines are in the bin with 52 or more bytes touched, followed closely by HPCG, BT, and MG with approximately 65% of cache lines having 39 or more bytes accessed. The rest of the applications show moderate cache line utilization.
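The five-bin grouping can be reproduced with a short Python sketch. The bin upper bounds (13, 26, 38, 51, 64) are taken from the ranges given in the text; the function name and the sample input are hypothetical, and lines with zero touched bytes are assumed not to occur because a line is fetched only when it is referenced.

```python
from bisect import bisect_left

# Upper bounds of the five bins: 1-13, 14-26, 27-38, 39-51, 52-64 bytes.
BIN_UPPER_BOUNDS = [13, 26, 38, 51, 64]


def utilization_histogram(bytes_touched):
    """Fraction of evicted cache lines that fall into each of the five bins."""
    counts = [0] * len(BIN_UPPER_BOUNDS)
    for touched in bytes_touched:
        if not 1 <= touched <= 64:
            raise ValueError(f"touched byte count out of range: {touched}")
        # bisect_left finds the first bin whose upper bound is >= touched.
        counts[bisect_left(BIN_UPPER_BOUNDS, touched)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


# Hypothetical sample of four evicted lines.
histogram = utilization_histogram([1, 13, 14, 64])
```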
The cause of variability in cache line utilization between these applications is explained partially in Fig. 4. The y-axis presents reuse distance, the average number of memory requests between accesses to the same cache line. One cause of poor cache utilization is a high reuse distance combined with a high data miss rate. For example, CG and IS access the same cache line after more than 20 references, demonstrating poor utilization (more than 80% of the cache blocks in CG and IS utilize less than 26 bytes). As the associativity of the cache is 4-way, a reuse distance of 20 references indicates that the cache line(s) can get evicted before reuse, leading directly to poor utilization. The pattern of cache line reuse distance varies with the number of cores for each application.
The average cache line reuse distance for the applications running on 8 cores is shown in histogram format in Fig. 5. Typically, a reference that accesses the same cache line as the previous reference may improve utilization. However, accessing the same data item does not improve line utilization. For example, although 80% of references in DGEMM and CS access the same cache line that has been accessed 1-2 references before, they achieve poor cache line utilization. On the other hand, FT accesses the same line for 80% of references but shows 80% cache line utilization. DGEMM accesses the same data item, whereas FT accesses a different data item in the same cache line. HPCG, however, achieves 75% line utilization when 85% of the references access the just previously accessed cache line.
Figure 6 presents the average accesses of data bytes in a cache line. For the majority of the applications, CBU is between 1.2 and 1.5. However, the average CBU for all applications is 2.11. CoMD exhibits the maximum average number of accesses per data byte (7-8 accesses), and DGEMM (0.1 access on average) exhibits the minimum. Typically, the average number of accesses of data bytes decreases with an increased number of cores due to early eviction to keep the data coherent in private caches. However, for miniMD, BT, and MG the average number of accesses increases slightly.
Figure 7 presents byte reuse distance, the average number of memory requests between accesses of the same data byte. The reuse distance of the same data byte for IS is the largest compared to all other benchmarks, whereas FREQMINE exhibits the smallest distance. DGEMM also shows moderate distance (1900 references).
Cache miss rate and data movement
Figures 8 and 9 show the cache miss rate and the average data movement, respectively, for the different core counts. The data movement is measured by summing the total data movement between L1 and L2 caches in both directions. These results suggest that, due to the coherence mechanism, static/global variable dominated applications possess a high miss rate and consequently high data movement. Typically, due to coherence misses, data can get evicted from private caches, increasing the miss rate. However, local variable dominated applications may take advantage of the cumulative size of caches in a multicore system where the cache miss rate is reduced with increased core count. DGEMM shows the highest miss rate (50%), and BLACKSCHOLES possesses the lowest miss rate. MiniFE and IS show increased miss rate as the core count increases, while HPCG and PENNANT exhibit reduced miss rates as the core count increases. The pattern of average data movement corresponds to the pattern of miss rate.
Fig. 10 Working set size (WSS). In most cases, the working set size increases slightly with number of cores (1-8)
Application properties
In this subsection, we investigate the inherent properties of the applications that influence the cache performance and describe them with the results presented in the previous subsection(s). Furthermore, we estimate the degree of locality from the performance data and measure the working set size with our simulation infrastructure. Figure 10 describes the working set size of the applications running on multiple cores. It is evident that the working set size increases with the number of cores for almost all applications. Our experimental results show that BT and DGEMM have the smallest working set, and FREQMINE has the largest working set.
The results also show that EP extends its working set about 6 times while running on 8 cores compared to running on a single core, whereas CoMD extends it by only about 1.03 times. Typically, local and temporary variables have private copies for each thread in virtual address space, increasing the size of the working set. On the contrary, static and global variables share the same virtual address space with multiple threads. CoMD has more static and global variables that are shared between threads.
Fig. 11 Degree of spatial locality versus degree of temporal locality, representing application locality: Applications farthest from the origin achieve the best locality. However, applications with higher degree of temporal locality show better performance
Table 3 presents a summary of performance for each of the applications running on 8 cores. DGEMM has the highest miss rate and data movement, consistent with its having the lowest degrees of locality, even though its working set size is small (1.77 MBytes). Similarly, with slightly better degrees of locality and a large working set size (13.8 MBytes), CG shows lower miss rate and data movement compared to DGEMM. This demonstrates that the effect of locality dominates working set size in cache performance for the applications. On the other hand, the best performing application (in the context of miss rate and data movement), BLACKSCHOLES, has a low degree of spatial locality and a moderate degree of temporal locality. However, due to its small working set (4.29 MB), the effect of the low degree of spatial locality on performance is very limited. Moreover, FREQMINE, the application with the largest working set size (80 MB), possesses a high degree of temporal locality and a low degree of spatial locality and has low miss rate and data movement. The dominance of the degree of temporal locality over the degree of spatial locality can be shown with the application FT. Though it achieves a very high degree of spatial locality (10), the degree of temporal locality is low (0.41), resulting in poor performance (miss rate 9.44%). Likewise, applications with similar sized working sets vary in performance due to locality. For example, HPCG (11 MB), miniMD (13 MB), and CG (14 MB) possess roughly the same working set size but cache performance varies (miss rates for HPCG, miniMD, and CG are 7.6, 2, and 20%, respectively). The degrees of locality follow the same behavior as miss rate.
The comparison for degree of spatial and temporal localities of all the applications is presented in Fig. 11 . Applications farthest from the origin achieve the best locality. Similarly, applications near the origin indicate poor locality. However, degree of temporal locality is the dominating effect on application performance in cache compared to degree of spatial locality.
We can conclude from the studied benchmarks that some applications suffer from low utilization at the cache line and byte levels. We have conducted some experiments to show that a cache with a variable cache line size [6] or a scratch pad memory [19] that is hardware-managed with some limited compiler support can improve performance and lower power consumption.
We build a 32-KB data structure for each core functioning as a buffer that stores variable sized data blocks according to the size of the data sequences. Initially, we store each of the unique memory addresses in the buffer and add the consecutive memory addresses of the memory requests that have neighboring memory addresses already in the buffer. This tunes the blocks to the spatial locality of the application. We keep the effective buffer size at half of the total L1 cache size, since the buffer stores only the data that are accessed at least once and, in the most common case, cache utilization is less than 50%. Moreover, the restriction on size eliminates having to keep data blocks for the entire program execution in the buffer. Thus, we remove the least recently used data blocks when a new unique request appears that does not have neighbors stored in the buffer and the buffer is full. The generated data blocks possess variable sizes according to the spatial locality of the data. For a design with variable sized blocks, the possible block sizes can vary from one byte, the size of a single char variable, to 4 KB, the size of an entire page. However, our experimental results suggest that the best block size ranges from 4 bytes to 1 KB. Since most of the variables in an application are integers (4 bytes), this is the minimum block size. If the block size is larger than 1 KB, the on-chip memory can become bloated and performance degrades. Our experiments therefore suggest lower and upper limits of block sizes of 4 bytes and 1 KB, respectively. To estimate the best block size, we create 11 bins that contain blocks from 1 byte to 1 KB by powers of two. The first bin contains the blocks of size one byte, the second bin contains the blocks of two bytes, and the third bin contains the blocks of 3-4 bytes. The 11th bin contains blocks with sizes from 513 bytes to 1 KB.
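The buffer described above might be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the text: every reference touches a single byte, a block grows only when the new address lies inside it or directly next to it, abutting blocks are not merged, and least recently used blocks are evicted whenever the stored bytes exceed the capacity. The class and method names are hypothetical.

```python
class VariableBlockBuffer:
    """Toy buffer that grows variable sized blocks from neighboring byte addresses."""

    def __init__(self, capacity_bytes=32 * 1024):
        self.capacity_bytes = capacity_bytes
        self.blocks = []        # (start, end) pairs, end exclusive, LRU block first
        self.used_bytes = 0

    def access(self, address):
        for i, (start, end) in enumerate(self.blocks):
            if start - 1 <= address <= end:
                # Address is inside the block or directly adjacent: grow if needed.
                new_start = min(start, address)
                new_end = max(end, address + 1)
                self.used_bytes += (new_end - new_start) - (end - start)
                del self.blocks[i]
                self.blocks.append((new_start, new_end))   # now most recently used
                self._evict_lru()
                return
        # New unique address with no stored neighbor: start a one-byte block.
        self.blocks.append((address, address + 1))
        self.used_bytes += 1
        self._evict_lru()

    def _evict_lru(self):
        # Drop least recently used blocks until the buffer fits again,
        # always keeping the block that was just touched.
        while self.used_bytes > self.capacity_bytes and len(self.blocks) > 1:
            start, end = self.blocks.pop(0)
            self.used_bytes -= end - start

    def block_sizes(self):
        return [end - start for start, end in self.blocks]


buffer = VariableBlockBuffer(capacity_bytes=4)   # tiny capacity for illustration
for addr in [100, 101, 102, 200, 99]:
    buffer.access(addr)
```

In this made-up run the addresses 99 to 102 coalesce into one 4-byte block, and the isolated block at address 200 is evicted once the capacity is exceeded.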
Then we identify the bin that contains the largest percentage of blocks and choose the upper bound of that bin's block size range as the best line size for the application. Table 4 presents the best line size for all the applications. For BT and MG, however, the best line size is not determined, because for these applications multiple bins contain a significant number of blocks.
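The binning and selection rule described above can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the code used for the experiments: the function names and the `min_share` cutoff are hypothetical, and the text gives no numeric threshold for when a bin holds a "significant" share of blocks.

```python
from collections import Counter

def bin_upper_bound(block_size):
    """Return the power-of-two upper bound of the bin that holds block_size bytes."""
    if not 1 <= block_size <= 1024:
        raise ValueError("block size must be between 1 byte and 1 KB")
    # Smallest power of two >= block_size: 1 -> 1, 2 -> 2, 3-4 -> 4, ..., 513-1024 -> 1024.
    return 1 << (block_size - 1).bit_length()

def best_line_size(block_sizes, min_share=0.35):
    """Return the upper bound of the most populated bin, or None if no bin dominates.

    min_share is an assumed cutoff (not taken from the paper): the largest bin
    must hold at least this fraction of all blocks to be reported.
    """
    counts = Counter(bin_upper_bound(size) for size in block_sizes)
    if not counts:
        return None
    total = sum(counts.values())
    upper, count = counts.most_common(1)[0]
    return upper if count / total >= min_share else None
```

With this rule, a block population concentrated in one bin yields that bin's upper bound, while a population spread over several bins yields `None`, corresponding to an undetermined best line size.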
Fig. 12 Distribution of block sizes: the wide range of block sizes in use shows the inefficiency of a fixed-size cache line, suggesting that a variable-size cache line (e.g., scratchpad memory) can improve performance
Figure 12a shows the distribution of the data block count over block sizes for HPCG, CoMD, and DGEMM running on eight cores. We select these applications as representatives of the three cases of locality presented in Fig. 11. HPCG represents applications with a high degree of spatial locality, and CoMD represents applications with a high degree of temporal locality. DGEMM, on the other hand, represents applications with a low degree of both temporal and spatial locality. For HPCG, about 80% of the data blocks are 64-128 bytes in size, whereas for CoMD 40% of the data blocks are 256 bytes. For DGEMM, 97% of the data blocks are 8 bytes. From these observations, the best line sizes estimated for HPCG, CoMD, and DGEMM are 128, 256, and 8 bytes, respectively. Figure 12b presents the distribution of block counts for BT and MG, for which no single cache line size contains a significant number of blocks. For example, BT has 30% of its blocks at 32 bytes and 10% each at 8, 64, 256, and 1K bytes, so there is no best cache line size for BT.
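The per-core buffer described at the start of this section can be sketched in Python as follows. This is a simplified illustration, not the simulator used for the experiments. It assumes that every request touches a single byte address, that a block stops growing at a hypothetical `max_block` limit, and that least recently used blocks are also evicted when growing a block overflows the buffer (the text only specifies eviction for new requests without neighbors). The class and method names are invented for this example.

```python
from collections import OrderedDict

class VariableBlockBuffer:
    """Toy buffer of variable-sized blocks with LRU eviction (illustrative only)."""

    def __init__(self, capacity=32 * 1024, max_block=1024):
        self.capacity = capacity     # total bytes the buffer may hold
        self.max_block = max_block   # assumed cap on a single block, in bytes
        self.blocks = OrderedDict()  # start -> end (exclusive); least recently used first
        self.used = 0                # bytes currently stored

    def _find(self, addr):
        """Return (start, kind) for a block containing or adjacent to addr, else None."""
        # Check for a hit first so that growing a neighbor never overlaps another block.
        for start, end in self.blocks.items():
            if start <= addr < end:
                return start, "hit"
        for start, end in self.blocks.items():
            if end - start < self.max_block:
                if addr == end:
                    return start, "grow_up"
                if addr == start - 1:
                    return start, "grow_down"
        return None

    def access(self, addr):
        """Record one byte-address request."""
        found = self._find(addr)
        if found is None:
            # New unique address with no neighbor in the buffer: start a 1-byte block.
            self.blocks[addr] = addr + 1
            self.used += 1
        else:
            start, kind = found
            end = self.blocks.pop(start)
            if kind == "grow_up":
                end += 1
            elif kind == "grow_down":
                start -= 1
            if kind != "hit":
                self.used += 1
            self.blocks[start] = end  # re-insert as most recently used
        # Evict least recently used blocks until the contents fit again.
        while self.used > self.capacity:
            _, (old_start, old_end) = None, self.blocks.popitem(last=False)
            self.used -= old_end - old_start

    def block_sizes(self):
        """Return the sizes in bytes of the blocks currently in the buffer."""
        return [end - start for start, end in self.blocks.items()]
```

Feeding an address trace through `access` and then calling `block_sizes` gives a block size population of the kind that is binned to estimate the best line size.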
Application runtime performance
In this section, we present the cache performance metrics as the program progresses while running on eight cores. We compare the runtime performance with the overall performance results for cache utilization, miss rate, and data movement for a set of applications from the APEX, Mantevo, NAS, and PARSEC suites. We also measure the overall working set size of each application. These data sets illustrate the time-varying cache performance of each application. In all the graphs, the x-axis shows the percentage of application progress, defined as the proportion of the total number of memory references that have occurred up to that point of execution.
Figure 13 presents the runtime cache line utilization for the four application suites. Typically, most applications reach their stable phase of cache utilization within the first 25% of execution. However, some applications, such as DGEMM and MG, reach stable utilization after only about 1 to 2% of execution. HPCCG and miniFE show better utilization at the beginning of the application. For HPCCG, utilization decreases as the program progresses because of its large working set. For miniFE, utilization decreases as the application progresses because of its low degree of temporal locality (0.50). CG, IS, and BODYTRACK also show better utilization at the beginning of the application. For HPCG, FREQMINE, and miniMD, utilization improves as the program progresses.
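To make the progress axis and the utilization measurement concrete, the following Python sketch samples cache line utilization at regular points of application progress. It is a toy model under stated assumptions, not the simulator used in this study: it uses a small fully associative LRU cache, treats each request as one byte address, and reports the mean fraction of bytes touched over all lines evicted so far. The function name and parameters are hypothetical.

```python
from collections import OrderedDict

def utilization_over_progress(addresses, line_size=64, num_lines=4, samples=4):
    """Return [(progress_percent, mean_line_utilization_so_far)] for a trace."""
    cache = OrderedDict()      # line number -> set of touched byte offsets, LRU first
    evicted_fractions = []     # fraction of bytes touched, one entry per evicted line
    points = []
    total = len(addresses)
    step = max(1, total // samples)
    for i, addr in enumerate(addresses, start=1):
        line, offset = divmod(addr, line_size)
        if line in cache:
            cache.move_to_end(line)
        else:
            if len(cache) == num_lines:
                # Evict the least recently used line and record its utilization.
                _, touched = cache.popitem(last=False)
                evicted_fractions.append(len(touched) / line_size)
            cache[line] = set()
        cache[line].add(offset)
        if i % step == 0 or i == total:
            # Progress is the share of memory references issued so far.
            mean = (sum(evicted_fractions) / len(evicted_fractions)
                    if evicted_fractions else None)
            points.append((100 * i / total, mean))
    return points
```

A sample whose utilization is `None` means that no line had been evicted up to that point of the trace.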
Cache line utilization
To further investigate cache line utilization, we attempt to identify the best line size for each application as the application progresses (Fig. 14). Our experimental results show that BT, FT, and BODYTRACK can achieve improved performance with a cache line of 256 bytes to 1K bytes during the first 15% of execution. The best cache line size for MG is 512 bytes for the entire execution. However, all other applications except HPCG can perform better with a cache line smaller than the conventional 64 bytes.
We select two applications, HPCG and CoMD, to show runtime utilization for variable core counts, since for these applications the degree of locality changes significantly as the number of cores changes. For all other applications, the core count has very little effect on utilization. Figure 15 shows the runtime byte utilization for the HPCG and CoMD benchmarks running on 1, 2, 4, and 8 cores. Utilization decreases as the number of cores increases for HPCG and CoMD, showing the effect of coherence: shared data can cause data to be removed prematurely to maintain coherence, reducing utilization.
Working set size
Figure 16 shows the working set covered by each application as the program progresses. For many of the applications, the whole working set is accessed within 5% of execution. However, BODYTRACK takes the entire execution to touch the complete working set, and PENNANT takes 50% of execution.
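The working set coverage curve can be illustrated with a small Python sketch. It assumes, for illustration only, that the working set is counted in distinct cache lines and that progress is the share of memory references issued so far; the function names are hypothetical.

```python
def working_set_progress(addresses, line_size=64):
    """Return [(progress_percent, fraction_of_final_working_set_touched)]."""
    lines = [addr // line_size for addr in addresses]
    total_refs = len(lines)
    total_ws = len(set(lines))   # final working set, in distinct cache lines
    seen = set()
    curve = []
    for i, line in enumerate(lines, start=1):
        seen.add(line)
        curve.append((100 * i / total_refs, len(seen) / total_ws))
    return curve

def progress_to_full_working_set(addresses, line_size=64):
    """Return the progress percentage at which the whole working set has been touched."""
    for percent, fraction in working_set_progress(addresses, line_size):
        if fraction == 1.0:
            return percent
    return None  # empty trace
```

For a trace that touches all of its lines early, `progress_to_full_working_set` returns a small percentage; for a trace that keeps touching new lines until the end, it returns 100.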
Miss rate and data movement
From Table 3, we observe that miss rate and data movement depend on utilization and working set size. Therefore, we present runtime profiles of miss rate and data movement in Figs. 17 and 18, respectively. Typically, the miss rate increases at the start of execution because of unavoidable compulsory misses, triggering high data movement. After this initial phase, the miss rate decreases for some applications: CoMD, miniMD, MG, PENNANT, and HPCG. However, for HPCCG, BODYTRACK, and CG the miss rate increases, as they cannot exploit spatial locality with a 64-byte cache line. HPCG performs poorly for the entire execution because of its poor temporal locality. At the beginning, miniFE has high spatial locality; it maintains a low miss rate after 15% of execution progress, but the miss rate then increases drastically after 40% of execution.
Figure 18 illustrates the runtime behavior of data movement, normalized to the data movement for the entire execution. Data movement increases linearly with application progress, except for PENNANT, miniFE, and FREQMINE. The data movement stops increasing for PENNANT after 65% of execution, as the application has collected its most required data in cache and performs computation on that data. This observation is supported by the behavior of working set size and miss rate for PENNANT, as shown in Figs. 16a and 17a: the working set size stops increasing and the miss rate decreases after 65% of application progress. FREQMINE shows a data movement pattern very similar to that of PENNANT. On the other hand, miniFE shows poor locality after 40% of progress, triggering an increased number of misses and eventually high data movement.
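The two runtime curves discussed above can be illustrated with the following Python sketch. It is a toy model, not the simulator used in this study: it assumes a fully associative LRU cache, one byte address per request, and data movement equal to one cache line transferred per miss, so that normalized data movement reduces to the cumulative share of misses. The function name and parameters are hypothetical.

```python
from collections import OrderedDict

def miss_and_movement_curves(addresses, line_size=64, num_lines=512):
    """Return (cumulative_miss_rate, normalized_data_movement), one value per reference."""
    cache = OrderedDict()        # resident line numbers, least recently used first
    misses = 0
    miss_rate = []               # misses so far divided by references so far
    cumulative_misses = []
    for i, addr in enumerate(addresses, start=1):
        line = addr // line_size
        if line in cache:
            cache.move_to_end(line)
        else:
            misses += 1
            if len(cache) == num_lines:
                cache.popitem(last=False)   # evict the least recently used line
            cache[line] = True
        miss_rate.append(misses / i)
        cumulative_misses.append(misses)
    # Bytes moved = misses * line_size; the line size cancels after normalization.
    movement = [m / misses for m in cumulative_misses] if misses else []
    return miss_rate, movement
```

In this model, a normalized data movement curve that flattens before the end of the trace indicates that later references hit in the cache, which is the pattern described for PENNANT.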
Related work
In this section, we summarize the most relevant related work. Most related works study benchmark memory performance metrics to find similarities between applications [14, 20-23]. They investigate application characteristics such as miss rate, data movement, and locality, but they do not investigate utilization as a performance metric. Moreover, we investigate the runtime performance of cache metrics for characterization. Little research has been done on specific application suites to characterize the benchmarks for future architecture concepts [18, 24, 25]. Coplin et al. [25] characterize 34 applications from five GPGPU benchmark suites. They measure energy and execution time for different GPU frequencies, memory frequencies, and ECC configurations. Seo et al. [24] characterize OpenCL implementations of the NAS benchmarks running on heterogeneous parallel platforms consisting of general-purpose CPUs and GPUs. Bienia et al. [18] characterize the PARSEC benchmark suite and its implications for future manycore memory architectures.
There is no concrete way to identify the locality of an application accurately. However, researchers propose investigating different performance metrics to estimate locality. Most research focuses on the "reuse distance" of a data item or data block to determine locality [3, 7, 26-30]. Among these works, Kumar et al. [7], Johnson et al. [29], Ding et al. [28], and Torrellas et al. [30] propose using the reuse distance to identify spatial locality. Song et al. [31] use a compiler-generated representation of the entire application code to improve code locality and identify temporal locality. On the other hand, Matterson et al. [32], Beyls et al. [27], and Bunt et al. [26] use the stack distance to define locality.
Another line of previous research presents only memory performance metrics, such as cache miss rates obtained from hardware performance counters or simulation events, to understand application performance on a specific simulator [33-37]. Some approaches measure processor-dependent metrics to find the similarity between benchmarks [23, 24, 38-41]. Conway et al. [40] measure performance for an AMD Opteron processor but do not explore cache utilization as a performance metric. Panda et al. [23] characterize benchmarks from SPEC CPU2006 and SPECjbb2013 using hardware performance counters and microarchitecture-independent performance metrics. Ye et al. [41] characterize SPEC CPU2006 running on an AMD x86-64 Opteron processor.
In our previous work, we investigated cache utilization for some of the Mantevo mini-applications [4]. In this article, we extend that work to include three other suites (APEX, NAS, and PARSEC). We also extend our investigation of the runtime performance of the Mantevo and APEX suites running on a single core [42]. We further investigate the performance metrics of all the benchmarks when run on a multicore system.
Computer architects have proposed several techniques to improve cache utilization. However, such techniques increase overhead or access time. Kumar et al. [6] conducted an extensive study of cache line utilization for L1 and L2 caches and tracked cache lines at word granularity. Srinivasan et al. [8] tracked cache utilization at cache line granularity and found that most of the benchmarks utilize less than 70% of a cache line. Both Srinivasan and Kumar estimated power consumption with CACTI. These studies do not relate data movement or miss rate to cache utilization.
Alkohlani et al. [43] thoroughly analyzed temporal and spatial locality for both the SPEC and the Mantevo benchmarks on a single core. Our study expands that portion of their work to include cache utilization and locality for multicores. Weinberg et al. [14] thoroughly investigate the memory system performance of benchmarks from the HPC Challenge and APEX-MAP benchmark suites and compare them with each other. The main motivation of the comparison is the degree of spatial and temporal locality tuned from the L (a parameter for spatial locality) and K (a parameter for temporal reuse) values of the applications. In contrast, our degree of locality is estimated from the cache utilization. Perarnau et al. [44] propose a user-suggested software cache mechanism to improve overall cache utilization. Deshpande et al. [45] propose modeling approaches that help capture and better understand cache utilization in the various levels of the memory hierarchy in order to estimate data movement between multiple levels of cache. They define Average Cache References per Eviction (ACRE) as a measure of cache utilization for the Mantevo and GraphBIG [46] benchmarks. Their work clearly differs from ours, as we define utilization as a footprint of memory addresses.
Conclusions and future directions
This work builds upon our previous work on cache utilization as a performance metric to estimate locality and to compare the data movement and miss rates between different applications. We report results on some of the benchmarks in the APEX, Mantevo, NAS, and PARSEC benchmark suites and find that the benchmarks exhibit diverse behavior in all dimensions, prescribing different memory architectures for different applications. This work presents cache utilization as a performance metric to estimate locality and compare performance for running the benchmarks in parallel. Additionally, we use two metrics, namely degree of spatial locality and degree of temporal locality to estimate the locality of an application based on measured utilization of cache lines. Our measured cache performance follows the estimates of these two parameters. Furthermore, our experiments show that performance metrics collected during the execution of a benchmark can give better insights into its behavior.
Our results show that the average cache line utilization for the benchmarks used is 41%, varying from 10 to 80%. Additionally, on average each cache byte is accessed a little more than twice before the cache line is evicted. However, byte utilization varies from 0.18 to 6.97 times depending on the application's locality. Our runtime cache performance depicts the time-varying nature of the performance of the applications. This evaluation suggests that adapting the cache line size can lead to better performance.
We can assert from our results that a fixed-size cache line is not a good design choice for many applications. A variable-sized cache line, or a hardware-managed scratchpad memory with compiler support, can attain improved performance by avoiding underutilized cache lines that are energy inefficient. An example of such a solution is our Local Memory Store (LMStr) [47-49]. In our preliminary research on LMStr, we identify, at compile time, variable-sized data blocks of same-type variables and store these data blocks in the scratchpad at runtime. LMStr is a memory hierarchy composed of both a scratchpad and a regular caching hierarchy. In a multicore system, each core possesses a private cache as in a conventional system, and a scratchpad is shared among multiple cores. Our initial estimates show that, for some of the Mantevo benchmarks running on eight cores, LMStr reduces data movement by 21% compared to a cache-only design and achieves 100% utilization of the on-chip scratchpad memory [48, 49]. For an 8-core system, the main advantage is that the area overhead of the scratchpad memory used is only 5% relative to the area of the cache. This in turn reduces power consumption by approximately 40%.
Our future plans for this work are to enhance our approach by extending our methodology to include more memory performance aspects. We also intend to apply our characterization methodology to non-scientific application domains and to other benchmark suites. Additionally, this work can be extended to evaluate performance for other processor models. Although we recognize that in reality multiple applications run simultaneously, we evaluate the performance of each benchmark separately to obtain an untainted picture of how the application behaves and to examine each benchmark's intrinsic characteristics without destructive interference from other applications competing for cache resources.
We plan to perform a verification study to compare our results on a physical system using hardware performance counters.
