Chip Multi-Processors (CMPs) have emerged as a mainstream architectural design alternative for high performance parallel and distributed computing. Last Level Cache (LLC) management is critical to CMPs because off-chip accesses incur long latencies. Owing to its short access latency, good performance isolation and easy scalability, a private cache is an attractive design alternative for the LLC of CMPs. This paper proposes program Behavior Identification-based Cache Sharing (BICS) for LLC management. BICS is based on a private cache organization to retain the shorter access latency. Meanwhile, BICS tries to simulate a shared cache organization by allowing blocks evicted from one private LLC to be saved at peer LLCs. This technique is called spilling. BICS identifies the cache behavior types of applications at runtime. When a cache block is evicted from a private LLC, the cache behavior characteristics of the local application are evaluated to determine whether the block should be spilled. Spilled blocks are allowed to replace some valid blocks of the peer LLCs as long as the interference stays within a reasonable level. Experimental results using a full system CMP simulator show that BICS improves the overall throughput by as much as 14.5%, 12.6%, 11.0% and 11.7% (on average 8.8%, 4.8%, 4.0% and 6.8%) over private cache, shared cache, the Utility-based Cache Partitioning (UCP) scheme and the baseline spilling-based organization Cooperative Caching (CC), respectively, on a 4-core CMP for SPEC CPU2006 benchmarks.
key words: chip multi-processors (CMPs), performance, program behavior, last-level cache (LLC), spilling
Introduction
Chip Multi-Processors (CMPs) have emerged as a mainstream architectural design alternative for high performance parallel and distributed computing. Consequently, a key design issue facing processor architects is the organization and management of the on-chip Last Level Cache (LLC) [1]. Compared to a shared cache, a private cache organization offers short access latency, good performance isolation, easy scalability and non-uniform access latency [2] by nature [3], which makes it an attractive design alternative for the large on-chip LLCs of CMPs. However, in a private organization, cache resources are statically partitioned among cores without regard to the diverse application mixes running on them. This may lead to undesirably low utilization of the precious on-chip cache resources.
Cooperative Caching [4] (CC) improves the capacity utilization of a private cache design by allowing blocks evicted from one private LLC to be saved at peer private LLCs [4]. This technique is called spilling [4]. We call peer caches that can spill blocks to others spillers, and peer caches that can receive blocks from others receivers [4]. Spilling plays a significant role in private cache design optimization. However, current spilling techniques [3], [4] do not take the cache behavior characteristics of applications into account when spilling. Thus, they may spill blocks that are prone to cause cache pollution and omit blocks that promise a significant performance boost. Another drawback of the current spilling techniques is that they spill evicted blocks only into invalid blocks or replicate blocks of receivers. Thus, there is great potential for performance improvement.
To address these shortcomings, this paper proposes a new LLC organization, i.e. program Behavior Identification-based Cache Sharing (BICS). BICS bases its organization on a private cache design to take full advantage of the short access latency, but tries to simulate a shared cache organization by adopting spilling. BICS treats all the on-chip private LLCs as a whole, and each one is called an LLC slice. BICS monitors the cache behavior characteristics of the application running on each LLC slice online and independently, and identifies the cache behavior types of applications dynamically so that evicted blocks are spilled selectively. Receivers allow evicted blocks to replace some of their valid blocks as long as the interference is under control. We evaluate BICS on a 4-core CMP with 16 multi-programmed workloads from SPEC CPU2006 using a full system simulator. Experimental results show that BICS improves the overall throughput by as much as 14.5%, 12.6%, 11.0% and 11.7% (on average 8.8%, 4.8%, 4.0% and 6.8%) over private cache, shared cache, the Utility-based Cache Partitioning (UCP) scheme and the baseline spilling technique CC, respectively.
Throughout this paper, we assume that each core in the CMP executes one application and that L2 is the LLC. The rest of this paper is organized as follows. Section 2 elaborates the related work. Section 3 motivates the program behavior identification. Section 4 details the basic ideas, algorithm and hardware support of BICS. Section 5 describes the experimental methodology, and evaluation results are presented and analyzed in Sect. 6. Section 7 presents the detailed sensitivity studies on the algorithm parameters used in BICS. Finally, Sect. 8 concludes our work and gives a short discussion of future work.
Copyright © 2010 The Institute of Electronics, Information and Communication Engineers

Related Work
Related Work on Cache Spilling
As a private cache organization based cache sharing mechanism for LLC of CMPs, Cooperative Caching [4] (CC) is the first to explicitly introduce spilling for cache capacity utilization improvement. When a block is evicted from a private LLC, CC [4] decides whether to spill it randomly with a pre-set probability. However, one problem with CC is that neither the control with a probability nor the probability value selection takes into account the cache behaviors of applications, whereas actually the benefit that can be obtained through spilling depends highly on the cache behaviors of applications.
CMP-NuRAPID [6] suggests stealing capacity from neighboring caches when there is not enough capacity for private data in the local cache, and thus supports spilling implicitly. But the spilling in CMP-NuRAPID is simply demand-based, with no proper control. In Dynamic Spill-Receive [3] (DSR), each private cache learns whether it should act as a spiller or a receiver using set dueling monitors. DSR does not take the cache behaviors of applications into account, either.
To the best of our knowledge, existing spilling techniques spill evicted blocks only into invalid or replicate blocks of peer caches. More benefit could be derived if evicted blocks were allowed to replace some valid blocks of peer caches while the interference is kept under control.
Related Work on Program Cache Behavior Identification
Qureshi et al. [7] and Jaleel et al. [8] discuss the cache behaviors of applications on large on-chip caches. Qureshi et al. [7] classify cache behaviors into three types: low utility, saturating utility and high utility. Jaleel et al. [8] classify cache behaviors into four types: cache friendly, cache fitting, cache thrashing and streaming. Neither of the two works provides an online mechanism to identify the cache behaviors of applications dynamically. Xie et al. [9] are the first to propose an online mechanism for dynamic cache behavior identification. Our identification method is partially inspired by this work. However, the classification of cache behaviors in [9] is not precise enough: it identifies some applications that are cache intensive but also cache friendly as "devil". Thus, there is great potential for improvement.
Related Work on Cache Partitioning
Cache partitioning schemes try to optimize capacity utilization of shared cache by partitioning it among the concurrently running threads/applications to minimize the total misses or other targets, with each application holding a fraction of the cache space. A large fraction of cache partitioning schemes partition the cache using way partitioning [4] , that is, partition the cache at way granularity. Each application is only allowed to use specific ways or specific number of ways within each cache set. In this way, cache partitioning can greatly reduce the inter-application interferences. Suh et al. [5] introduce hardware LRU counters to dynamically adjust cache partitions based on online miss rate statistics to account for the dynamic behaviors of applications. Qureshi et al. [7] propose Utility-based Cache Partitioning (UCP), in which they introduce another hardware architecture Auxiliary Tag Directory (ATD), i.e. an extra tag array, into the dynamic cache partitioning of shared caches in CMPs. ATD can capture the miss statistics of each individual application exclusively online, allowing for more precise cache space partitioning.
BICS is inspired by cache partitioning in that it sets a lower bound for the local application of private LLC that is designated as a receiver. Besides, it introduces the hardware structure ATD and LRU counters proposed in UCP [7] to collect statistics used by online application cache behavior identification.
Cache Behavior Identification
Matick [10] models the Cycles Per Instruction (CPI) calculation equation, as shown by Eq. (1).
CPI_ideal is the ideal CPI assuming an infinite cache, and FCP (Finite Cache Penalty) is the extra CPI caused by finite cache capacity. Let api_l1 be the average L1 accesses per instruction, mr_l1 and mr_l2 be the miss rates of L1 and L2, mpi_l1 and mpi_l2 be the average misses per instruction of L1 and L2, and L_offchip be the off-chip memory access latency; then FCP can be obtained using Eq. (2). For our study, the former two items and L_offchip can be regarded as constants, thus Eq. (2) can be reformulated as Eq. (3).
Let CPI_0 and CPI_1 be the CPI before and after a strategy is applied respectively; then the speedup (denoted by Sp) can be obtained following Eq. (4). Δmpi_l2 denotes the reduction of mpi_l2, and ipc_0 stands for the IPC (Instructions Per Cycle) before the strategy is applied. Equation (4) indicates that speedup is mainly determined by the variables Δmpi_l2 and ipc_0, both of which correlate with the cache behavior characteristics of applications. Figure 1 shows the L2 cache capacity sensitivity of several SPEC CPU2006 benchmarks, collected using the full system simulator Virtutech Simics [11]. It can be observed that the value of MPKI (Misses Per Kilo Instructions) and its varying trend with cache size differ from application to application. Recall the two key variables of Eq. (4): Δmpi_l2 corresponds to the varying trend of miss rate with cache size, i.e. the slope of the cache capacity sensitivity curve, and ipc_0 is related to the amount of L2 misses. This means that the speedup of an application when a new strategy is applied is highly dependent on its amount of L2 misses and on the varying trend of its miss rate with cache size.
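Equations (1) to (4) are not reproduced in this text. As a hedged reading based only on the symbol definitions above (the exact form of the original equations may differ), the derivation can be sketched as follows. The sketch assumes that FCP counts only the off-chip penalty of L2 misses, that CPI_ideal is unchanged by the strategy, and that ipc_0 = 1/CPI_0.

```latex
\begin{align}
CPI &= CPI_{ideal} + FCP \\
FCP &= api_{l1} \cdot mr_{l1} \cdot mr_{l2} \cdot L_{offchip}
     = mpi_{l1} \cdot mr_{l2} \cdot L_{offchip} \\
FCP &= mpi_{l2} \cdot L_{offchip}
     \quad\text{with } mpi_{l2} = mpi_{l1} \cdot mr_{l2} \\
Sp &= \frac{CPI_0}{CPI_1}
    = \frac{CPI_0}{CPI_0 - \Delta mpi_{l2} \cdot L_{offchip}}
    = \frac{1}{1 - \Delta mpi_{l2} \cdot L_{offchip} \cdot ipc_0}
\end{align}
```

Under these assumptions, the speedup grows with both the miss reduction Δmpi_l2 and the initial ipc_0, which matches the observation that the two variables jointly determine the benefit of a strategy.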
We form our classification of cache behaviors according to the varying trends of miss rate with cache size as well as the amount of misses:
• Light Weight (LW): These applications have very few L2 misses, either because they rarely access L2 or because they only make use of a few ways of L2. For example, the average number of misses of gamess.c and dealII.r (as shown in Fig. 1) in 5 million cycles is no more than 100. These applications do not benefit from more cache capacity. Moreover, they can "lend" some cache space to other applications.
• Moderate Friendly (MF): These applications have moderate L2 accesses and they continue to benefit from more cache space, such as gobmk.1 in Fig. 1.
• Moderate Streaming (MS): These applications have moderate L2 accesses, but their L2 miss rates cease to fall at a certain point of the cache sensitivity curve. An example is cactusADM.r in Fig. 1. Extra cache space does not benefit this kind of application.
• Intensive Friendly (IF): These applications have high L2 access rates and high miss rates, yet they benefit significantly from more cache space, such as omnetpp.r in Fig. 1. This is because their working set is larger than the baseline cache size. They cause cache thrashing on the baseline cache but work well if sufficient cache space is given.
• Intensive Streaming (IS): These applications access L2 very frequently and have very high miss rates, but they do not benefit from more cache space due to their poor cache reuse, such as libquantum.r in Fig. 1. If permitted to use the cache space of other applications, they cause severe cache pollution, which may degrade overall performance.
Note that the cache behavior type of an application does not always remain the same. It differs with different input sets and from execution phase to execution phase [12], which adds to the necessity of identifying the cache behavior types of applications online.
Design of BICS
General Framework
BICS bases its organization on LRU managed private cache design to take full advantage of the short access latency, but tries to simulate a shared cache organization by adopting spilling. Figure 2 illustrates the general framework of BICS. BICS consists of three major parts: ATD [7] along with LRU counters [7] , spill decision algorithm and spilling enforcement, as depicted by Fig. 2 (b) , (a) and (c) respectively.
ATD along with LRU counters is used to collect the Miss Rate Curves (MRCs) of each core, which are required for spill decision generation. Each private L2 is augmented with one ATD and its associated LRU counters, implemented in hardware. The spill decision algorithm generates the spill decision for each private L2 independently according to the MRCs tracked by its ATD. It can either be supported at the OS level or realized in hardware. ATD and the spill decision algorithm act independently for each core, requiring no inter-core interaction. Enforcement of spilling requires support from replacement policies and coherence protocols, but only rather small changes to both. This structure makes BICS easy to scale to larger systems. BICS spills an evicted block only to the corresponding set of peer L2s so as not to add much complexity to the search policy.
Basic Ideas
BICS is based on two key insights. One is that program cache behavior identification enables more precise spilling. The other is that more blocks could be "lent" to spillers to exploit the potential for higher performance.
Behavior identification can be performed based on the varying trend of miss rate with cache size and the amount of misses of each application, according to Sect. 3. More blocks can be dedicated to spillers by associating receivers with an appropriate Lower Bound (LB) (in ways). Evicted blocks of spillers are allowed to replace the LRU block of the local application in a receiver if and only if the local application uses more ways than LB. The use of LB is inspired by the cache partitioning schemes [5], [7] used in shared caches. However, we partition each private LLC into only two parts, i.e. between the local application and remote applications.
We define the following policy metrics [9] , [13] , [15] for quantifying BICS:
• Misses_local: The total number of local L2 misses if the application had sole use of all the ways of the local L2.
• MR_local: The miss rate if the application had sole use of all the ways of the local L2.
• Rate_m: The percentage reduction in miss rate if the cache space (in ways) allocated to an application is increased to m times that of the local L2.
• W_+k%: The least number of ways required to keep the increase in misses within k% of Misses_local.
Rate_m quantifies the varying trend of L2 miss rate against cache size and Misses_local represents the amount of L2 misses. W_+k% is used to select LB. These policy metrics can be easily calculated using the MRCs tracked by ATD and LRU counters [7]. For an A-way associative cache, the MRC of an application consists of A elements, with the kth one standing for the miss rate if k out of A ways are allocated to the application. ATD is an extra copy of the main private L2 tag array except that it has a different associativity. All local L2 accesses are also directed to ATD, and ATD uses LRU for replacement decisions. Each set in an LRU managed cache can be treated as an LRU stack. A hit in the kth LRU position increases the kth LRU counter C_k by one, whereas a miss increases the Ath LRU counter C_A by one. MRCs can be easily calculated using the numbers tracked in the LRU counters.
BICS uses LRU counters only for the calculation of Rate_m and W_+k%, and thus requires an ATD with associativity no more than mA. Using LRU counters (C_0, C_1, ..., C_(mA−1), C_mA), Rate_m and W_+k% can be calculated following Eq. (5) and Eq. (6).
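Equations (5) and (6) are not reproduced in this text. The following is a minimal sketch of how the two metrics might be derived from the LRU counters, relying on the LRU stack property. It assumes that the first mA counters count hits per stack position (position 0 being the most recently used) and that the last counter counts ATD misses; the function names are hypothetical and the arithmetic is an illustration of the metric definitions, not the paper's exact formulas.

```python
from typing import Sequence


def misses_with_ways(counters: Sequence[int], ways: int) -> int:
    """Misses the application would incur with only `ways` ways.

    counters[i] (i < len - 1) counts hits at LRU stack position i, and
    counters[-1] counts ATD misses. By the LRU stack property, a hit at
    position i >= ways would have been a miss with only `ways` ways.
    """
    return sum(counters[ways:])


def rate_m(counters: Sequence[int], local_ways: int) -> float:
    """Fractional miss reduction when growing from A ways to the full ATD (m*A ways)."""
    misses_local = misses_with_ways(counters, local_ways)
    if misses_local == 0:
        return 0.0
    return (misses_local - counters[-1]) / misses_local


def w_plus_k(counters: Sequence[int], local_ways: int, k_percent: int) -> int:
    """Fewest ways whose misses stay within (100 + k)% of Misses_local."""
    misses_local = misses_with_ways(counters, local_ways)
    for ways in range(1, local_ways + 1):
        # Integer comparison avoids floating-point rounding at the boundary.
        if misses_with_ways(counters, ways) * 100 <= misses_local * (100 + k_percent):
            return ways
    return local_ways
```

Because the number of accesses is the same for every allocation, ratios of miss counts equal ratios of miss rates, so the sketch works on raw counts.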
Qureshi et al. [7] prove that as few as 32 sets are sufficient for ATD to track MRCs accurately by using Dynamic Set Sampling (DSS). This conclusion helps to reduce the storage overhead.
Spill Decision Algorithm
Qureshi [3] argues that a given private L2 should be either a spiller or a receiver but not both at the same time. Otherwise, the given L2 would try to get more cache capacity from remote L2s while at the same time giving away its own local capacity to others. Thus, BICS allows a given private L2 to act as a spiller or a receiver, but not both. The decision of whether a private L2 should act as a spiller or a receiver is called the spill decision.
According to the discussion in Sect. 3, LW, IS and MS applications do not benefit much from extra cache space, so we call them receiver applications and make the private L2s that execute these applications work as receivers. For IF and MF applications, more cache space is beneficial, so we call them spiller applications and allow the private L2s that run these applications to spill. Figure 3 describes the simple heuristics used for the spill decision algorithm. Each private L2 invokes the spill decision algorithm periodically to adapt to the dynamic change of program behaviors. Related work suggests that an invocation period of 5 million cycles trades off well between accuracy and overhead. A Spill Decision Register (SDR) is appended to each private L2 to hold the spill decision and the LB of the local L2.
(Thr_LW, Thr_I, m, k, R_STR) are the parameters used in the spill decision algorithm. Thr_LW and Thr_I are used to distinguish the L2 access intensiveness of applications. An application is identified as LW when its Misses_local is smaller than Thr_LW, and as an intensive application when its Misses_local is larger than Thr_I. Otherwise, it is a moderate application. The miss rate reduction against MR_local when m times the capacity of the local private L2 is allocated to an application (i.e. Rate_m) is adopted to approximate the descending trend of the miss rate of the application. k is used to ensure that the extra misses caused by spilling are within k% of Misses_local. Of the five parameters, we expect R_STR to be the most important one because it sets the boundary between cache friendly applications, which are allowed to spill, and streaming applications, whose cache capacity is allowed to be "stolen". Thus the aggressiveness of BICS is tuned by R_STR.
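Figure 3 is not reproduced in this text, so the heuristic can only be sketched from the description above. The sketch below assumes that an application counts as cache friendly when Rate_m is at least R_STR and that a receiver's LB is set to W_+k%; both the comparison direction and all identifiers (`Params`, `classify`, `spill_decision`) are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Behavior(Enum):
    LW = "light weight"
    MF = "moderate friendly"
    MS = "moderate streaming"
    IF = "intensive friendly"
    IS = "intensive streaming"


@dataclass(frozen=True)
class Params:
    thr_lw: int    # Thr_LW: below this many misses the application is LW
    thr_i: int     # Thr_I: above this many misses the application is intensive
    r_str: float   # R_STR: boundary between friendly and streaming (fraction)


def classify(misses_local: int, rate_m: float, p: Params) -> Behavior:
    """Map the two policy metrics of one period to a behavior type."""
    if misses_local < p.thr_lw:
        return Behavior.LW
    friendly = rate_m >= p.r_str  # assumed comparison direction
    if misses_local > p.thr_i:
        return Behavior.IF if friendly else Behavior.IS
    return Behavior.MF if friendly else Behavior.MS


def spill_decision(misses_local: int, rate_m: float, w_plus_k: int,
                   p: Params) -> Tuple[bool, Optional[int]]:
    """Return (is_spiller, lower_bound); LB only matters for receivers."""
    behavior = classify(misses_local, rate_m, p)
    if behavior in (Behavior.IF, Behavior.MF):
        return True, None
    return False, w_plus_k  # assumed: LB is taken from W_+k%
```

In this reading, the pair returned by `spill_decision` is what each private L2 would store in its SDR at the end of every invocation period.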
Although the parameter values could be determined empirically based on the cache size sensitivity curves of SPEC CPU2006 benchmarks and corresponding qualitative analysis, we perform a series of sensitivity studies of BICS on each parameter to locate the optimal parameter value more precisely. The detailed sensitivity studies and analysis is deferred to Sect. 7.
Enforcement of Spilling
In BICS, an evicted block of a spiller is spilled only if it holds private data. Spilled blocks are allowed to replace invalid blocks, replicate blocks and some local blocks of receivers on the condition that their local applications occupy more ways than LB. A spill bit is added to each block for the spilling technique to distinguish spilled blocks from local blocks.
When there is a miss in the local private L2, all the other L2s are snooped. This is also required in the baseline private cache for coherence [3]. Assuming that the local L2 is a spiller, if a copy of the missed block is found in a remote L2, the remote copy is exchanged with the evicted block. Otherwise, the block is fetched from off-chip memory and in the meantime the evicted block is spilled into one of the receivers in the system. A receiver is chosen according to the access latency. A private cache organization is a Non-Uniform Cache Architecture (NUCA) by nature. In view of that, BICS tries to spill an evicted block to the nearest peer L2. Figure 4 illustrates the spill process.
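The receiver-side replacement rule described above can be sketched as follows. This is an illustration under assumptions, not the hardware design: the `Block` fields other than the spill bit, the priority order among invalid, replicate and local blocks, and the choice to decline a spill when no eligible victim exists are all hypothetical, since the text does not specify them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    valid: bool = False
    spilled: bool = False    # the per-block spill bit
    replicate: bool = False  # hypothetical flag: another on-chip copy exists
    lru_age: int = 0         # larger value = less recently used


def pick_victim(cache_set: List[Block], lower_bound: int) -> Optional[int]:
    """Choose the way of a receiver set that an incoming spilled block replaces."""
    # 1) Prefer an invalid block.
    for way, block in enumerate(cache_set):
        if not block.valid:
            return way
    # 2) Otherwise the least recently used replicate block.
    replicas = [w for w, b in enumerate(cache_set) if b.replicate]
    if replicas:
        return max(replicas, key=lambda w: cache_set[w].lru_age)
    # 3) Otherwise the local LRU block, only if the local application
    #    currently occupies more ways than LB.
    local = [w for w, b in enumerate(cache_set) if not b.spilled]
    if len(local) > lower_bound:
        return max(local, key=lambda w: cache_set[w].lru_age)
    # 4) The text does not cover this case; the sketch declines the spill.
    return None
```

A spiller would call this on the corresponding set of the nearest receiver, and set the spill bit of the way that is returned.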
From the above discussion, we can see that BICS requires rather small changes to coherence protocols and replacement policies. It should be mentioned that spill process is done concurrently with the fill operation of the miss, thus it's not in the critical path of memory access.
Experimental Methodology
Configurations
We use g-cache of Virtutech Simics [11] , a full system simulator for our performance studies. To fully evaluate BICS, we extend the simulator with four other schemes for comparison, i.e. Private, Shared, CC [4] and UCP [7] . Evaluation is performed on a 4-core CMP with parameters given in Table 1 . An in-order core model is used so that we can evaluate our proposal within a reasonable time.
For all schemes, each core is associated with a 512 KB L2. For private based organizations, i.e. Private, CC and BICS, this means a 512 KB private L2 per core with a 10-cycle hit latency to the local L2, a 38-cycle hit latency to two peer L2s and a 46-cycle hit latency to one peer L2. As for shared based organizations, i.e. Shared and UCP, this results in a unified 2 MB L2 with a hit latency of 19 cycles. We use relatively small cache configurations so that the cache is under more access pressure, with more obvious contention and interference. The rationale is that a large fraction of current real-world applications have working sets much larger than those of the selected benchmarks.
Mesh networks are used for intra-chip data transfers, modelling non-uniform access latencies for private organizations. A unified organization for shared based schemes is used, with access latency bounded by the slowest bank. We model on-chip network and cache latencies with CACTI 6.0 [14] . All caches are of a uniform block size of 64 B and use LRU as baseline replacement policy. Off-chip memory access latency is 350 cycles.
The optimal values for BICS parameters, as shown in Table 2 , are determined experimentally. BICS with experimentally determined optimal parameters is denoted as BICS-S. We defer the parameter selection process to Sect. 7, and focus on evaluation of BICS-S through comparison with other schemes in Sect. 6.
Workloads
For our study, we use 19 SPEC CPU2006 benchmarks to create 16 4-benchmark workloads, as listed in Table 3. To fully evaluate BICS, we randomly select several benchmarks from each program behavior type, according to the statistics of SPEC CPU2006 benchmarks in Sect. 3. WL 00 through WL 12 are formed so that they contain both benchmarks predicted as spillers and benchmarks predicted as receivers (denoted by SR). WL 13 and WL 14 are formed with benchmarks all predicted as spillers (denoted by 4S). WL 15 is formed with benchmarks all predicted as receivers (denoted by 4R). All workloads are simulated until each benchmark in the workload has executed at least 250M instructions. When a benchmark reaches 250M instructions, its statistics are "frozen", but it continues to execute so that it still competes for cache resources.
Metrics
Throughput, Weighted Speedup [17] (WS) and Harmonic Mean [18] (Hmean) are the three metrics commonly used to quantify the aggregate performance of a system with multiple applications running concurrently. Throughput reflects the overall performance boost but may favor high IPC applications too much. WS weights the relative speedups of all applications evenly and indicates the improvement on execution time [17] . Hmean is a fairness metric and balances both fairness and performance [18] . We use all three metrics for performance comparisons. Besides, since the number of accesses to each private cache is different, the commonly used miss rate metric is not applicable. For the purpose of memory access analysis, we adopt the L1 misses breakdowns and the average memory access latency breakdowns used in CC [4] to show how much memory access is done to each level.
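For concreteness, the three metrics can be sketched with their commonly used definitions. The text does not state which baseline IPC is used for the per-application speedups, so the `ipc_base` argument below (for example, the IPC of each application when it runs alone) is an assumption, as are the function names.

```python
from typing import Sequence


def throughput(ipc: Sequence[float]) -> float:
    """Sum of the IPCs of all concurrently running applications."""
    return sum(ipc)


def weighted_speedup(ipc: Sequence[float], ipc_base: Sequence[float]) -> float:
    """Sum of per-application speedups relative to the baseline IPC."""
    return sum(shared / base for shared, base in zip(ipc, ipc_base))


def hmean_speedup(ipc: Sequence[float], ipc_base: Sequence[float]) -> float:
    """Harmonic mean of per-application speedups; penalizes starved applications."""
    return len(ipc) / sum(base / shared for shared, base in zip(ipc, ipc_base))
```

Throughput is dominated by high-IPC applications, whereas the harmonic mean drops sharply when any single application slows down, which is why the three metrics are reported together.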
Results and Analysis
We compare the performance of BICS-S to four other schemes: the baseline configuration Private cache without spilling, Shared cache, UCP and CC. Then, we examine the sensitivity of BICS-S to cache size and associativity as well as its scalability to larger CMP systems.
Performance on Throughput Metric
Figure 5 shows the throughput of Shared, UCP, CC and BICS-S, normalized to Private. Geomean is the geometric mean of all 16 workloads. We expect BICS to outperform Shared and UCP for several reasons: (1) first and foremost, we adopt online cache access behavior identification to guide the spill decision generation; (2) the private based organization of BICS allows a shorter hit latency for most of the cache accesses as opposed to a shared based organization, whose hit latency is bounded by the slowest bank and which may sometimes store local blocks of a core far away due to its global addressing style; (3) although BICS is inspired by UCP, the "partitioning" of BICS is quite conservative, i.e. it is highly guided by the online application cache behavior identification and confined by parameter k%. Thus it is more likely to provide better performance isolation.
As can be seen in Fig. 5, BICS-S outperforms both Shared and UCP for 14 of 16 workloads, with an average improvement of 5% and 4% respectively, which confirms our expectation. In contrast, CC degrades throughput when compared to Shared or UCP for 11 out of the 16 workloads. This suggests that BICS can effectively simulate a shared cache organization and exploit the capacity sharing potential by using spilling, improving the cache resource utilization of a private cache organization. There are fewer replicate blocks in multi-programmed workloads, which limits the efficacy of CC. This explains why the average throughput of CC is lower than that of Shared. Shared and UCP work better than BICS-S for WL 11 and WL 13. According to the cache size sensitivity curves collected in Sect. 3 (not shown due to space limitation), mcf.i, milc.s, soplex.r, hmmer.r and omnetpp.r are roughly cache friendly applications, whereas perlbench.c and dealII.r are LW applications. We believe BICS-S fails to achieve performance comparable to Shared and UCP here because of its rather conservative "partitioning", i.e. receivers are conservative in giving away space to spilled blocks. For this kind of workload mix, cache friendly applications need to be provided with more space for extra performance gain, and special efforts could be made to exploit more performance potential.
All benchmarks in WL 04 work as receivers throughout the whole statistical stage on BICS-S. Theoretically, BICS is equivalent to a private cache organization under this circumstance. As confirmed by Fig. 5, the throughput of BICS-S for WL 04 is equal to that of both CC and Private. On the other hand, although WL 13 and WL 14 are predicted as all-spiller workloads, and WL 15 is predicted as an all-receiver workload during workload creation, the actual execution contradicts our assumption. The three workloads all have execution phases in which some of their benchmarks are identified as spillers while the others work as receivers. This discrepancy is caused by the different execution phases [12] of applications. Cache behavior may differ greatly from phase to phase, which further necessitates online cache behavior identification. Nevertheless, as shown by Fig. 5, BICS-S outperforms Private and CC for all workloads except for WL 04. Several factors contribute to the benefit of BICS over CC: (1) use of online application behavior identification for spill decisions; (2) use of partitioning in receivers to provide more space to spilled blocks.
In general, BICS-S improves throughput by as much as 14.5%, 12.6%, 11.0% and 11.7% (on average 8.8%, 4.8%, 4.0% and 6.8%) over Private, Shared, UCP and CC respectively. The results indicate that cache behavior identification is of great importance to throughput improvement, and it is necessary that cache management strategies take it into account.
Performance on Weighted Speedup and Hmean
Figure 6 shows the WS of Shared, UCP, CC and BICS-S, normalized to Private. The WS with BICS-S is higher than those with Shared and UCP for 14 out of 16 workloads. Over all 16 workloads, BICS-S achieves better WS than Private and CC, except for WL 04. As depicted by Fig. 6, BICS-S achieves a WS equivalent to those of Private and CC for WL 04, which reconfirms that BICS is equivalent to a private cache organization when all applications work as receivers. The results of WS are consistent with those of throughput. Generally, BICS-S improves WS by as much as 12.7%, 12.1%, 11.1% and 11.9% (on average 8.4%, 4.8%, 4.2% and 6.1%) over Private, Shared, UCP and CC respectively. This indicates that BICS can bring a considerable reduction in overall execution time.
Although BICS-S improves the overall throughput and WS significantly, it is important that this does not come at the expense of the fairness of the system. Figure 7 shows the Hmean fairness of Shared, CC, UCP and BICS-S, normalized to Private. For all 16 workloads, BICS-S has a better Hmean value than Private and CC except for WL 04. BICS-S also has a better Hmean value than Shared and UCP for most workloads. The results are consistent with those for throughput and WS. To sum up, BICS-S improves Hmean fairness by 9.9%, 5.2%, 5.5% and 7.2% on average over Private, Shared, UCP and CC respectively. Thus, BICS not only improves performance but also balances fairness well.
Memory Access Breakdowns
Figure 8 shows the L1 miss breakdowns for the Private, Shared, UCP, CC and BICS-S schemes. BICS-S can effectively reduce the amount of off-chip accesses and increase the on-chip L2 hit rates when compared to Private and CC. For most of the workloads, BICS-S increases both local and remote L2 hit rates. This suggests that the spilled blocks truly promise more hits for most applications. For several workloads, the on-chip hit rates (local combined with remote) of BICS-S are even higher than that of Shared or UCP.
The average memory access latency breakdowns for Private, Shared, UCP, CC and BICS-S (normalized to Private) are shown in Fig. 9. As in CC [4], we break down the average memory access latency into L1, local L2, remote L2 and off-chip access latency. The results are roughly consistent with those of Fig. 5, with lower average access latency resulting in better performance. One notable phenomenon is that for some workloads, although UCP tries to minimize the total misses, it does not necessarily result in fewer misses than Shared, as in WL 10. Besides, under certain circumstances, although off-chip access latency takes up a large fraction of the total memory access latency, it may correspond to a relatively small off-chip miss rate. This discrepancy is caused by the large amount of L2 memory accesses of the workloads, such as in WL 13 and WL 14. In general, BICS-S reduces off-chip miss rates and results in a smaller average access latency for most of the workloads.
Sensitivity Studies
We now evaluate the performance robustness of BICS. The main idea here is to examine the benefit of BICS across a spectrum of memory configurations (e.g. different ways, different size).
The performance of BICS-S is not quite dependent on the associativity of cache. Figure 10 compares the throughput of Shared, UCP, CC and BICS-S with different associativity (4-way, 16-way or 32-way 512 KB L2 per core, respectively). Combining Fig. 5 and Fig. 10 , we can see that BICS-S achieves comparable average performance boost for all 4 associativity configurations over Private, though the performance benefit of several individual workloads varies slightly across configurations. It also can be observed that with associativity increasing under constant cache size, the performance benefit of Shared, UCP and CC decreases gen- erally. In contrast, BICS-S has more stable performance boost with increasing ways, thus is suitable for caches with large associativity, which is the case for large on-chip LLCs in many of modern CMP platforms. In general, BICS-S improves the average throughput by 6.6%, 2.3%, 2.4% and 3.9% over Private, Shared, UCP and CC for 4-way 512 KB L2 per core respectively, by 6.9%, 4.4%, 3.3% and 4.8% for 16-way 512 KB L2 per core respectively, by 6.9%, 5.1%, 5.1% and 5.2% for 32-way 512 KB L2 per core respectively.
The benefit of BICS-S is affected more by the aggregate cache capacity. Figure 11 shows the throughput of Shared, UCP, CC and BICS-S with larger cache sizes (the number of sets is kept constant, and the cache size is varied by varying the associativity). As the aggregate cache size increases, the advantage of BICS-S over Private gradually decreases. This is expected: with larger on-chip cache capacity, the working sets of more applications fit in the cache, which leaves less contention and interference for a cache management scheme to remove. In fact, as the aggregate cache size increases, the benefit of UCP also diminishes, as shown in Fig. 11. However, we believe that with increasing problem sizes and numbers of applications, the access pressure on future large on-chip caches is more likely to increase than to decrease, even with larger cache capacity. In general, BICS-S improves the average throughput over Private, Shared, UCP and CC by 5.4%, 5.4%, 5.0% and 2.4% respectively for 16-way 1 MB L2 per core, and by 3.9%, 3.5%, 3.9% and 2.0% for 32-way 2 MB L2 per core.
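As a quick check of the cache geometries used in the two sensitivity studies, the following sketch computes the number of sets for each configuration. It assumes the 64 B block size of the baseline applies to every configuration; the helper name `num_sets` is introduced here for illustration only.

```python
KB = 1024
MB = 1024 * KB


def num_sets(size_bytes, ways, block_bytes=64):
    """Number of sets of a set-associative cache (64 B blocks assumed)."""
    blocks, remainder = divmod(size_bytes, block_bytes)
    if remainder or blocks % ways:
        raise ValueError("cache size must be a whole number of sets")
    return blocks // ways


# Capacity study: size grows with associativity, so the set count is constant.
capacity_sweep = {
    "8-way 512 KB": (512 * KB, 8),
    "16-way 1 MB": (1 * MB, 16),
    "32-way 2 MB": (2 * MB, 32),
}

# Associativity study: size is fixed at 512 KB, so the set count shrinks.
associativity_sweep = {
    "4-way 512 KB": (512 * KB, 4),
    "16-way 512 KB": (512 * KB, 16),
    "32-way 512 KB": (512 * KB, 32),
}

for name, sweep in (("capacity", capacity_sweep),
                    ("associativity", associativity_sweep)):
    for label, (size, ways) in sweep.items():
        print(f"{name:13s} {label:14s} sets = {num_sets(size, ways)}")
```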
Scalability of BICS to Larger Systems
We also evaluate BICS on an 8-core system with ten 8-benchmark workloads formed by randomly combining benchmarks drawn from 19 SPEC CPU2006 benchmarks. Figure 12 shows the throughput of Shared, CC and BICS-S, also normalized to Private. BICS-S outperforms CC for all 8-core workloads except 8pWL 07. BICS-S also achieves significantly better performance than Shared for all workloads considered. In general, BICS-S improves throughput by up to 7.8%, 10.7% and 11.7% (on average 4.0%, 5.9% and 4.4%) over Private, Shared and CC respectively.
Hardware Overhead
For the baseline 8-way 512 KB private LLC per core with 64 B cache blocks, assuming a 40-bit physical address space, the hardware storage overhead of BICS-S is detailed in Table 4. BICS-S incurs a storage overhead of no more than 0.6% of the LLC storage, and this fraction does not increase with the number of cores. Note that none of the structures or operations required by BICS is on the critical path, and none is resource-intensive, complex, or power hungry.
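To put the 0.6% figure in absolute terms, the sketch below derives the address split of the baseline LLC and the byte budget that 0.6% corresponds to. It assumes power-of-two sizes and that the percentage is taken relative to the 512 KB data array; the itemized contents of Table 4 are not reproduced.

```python
def address_split(size_bytes, ways, block_bytes, addr_bits):
    """Return (offset_bits, index_bits, tag_bits) for a set-associative cache.

    Assumes the block size and the number of sets are powers of two.
    """
    sets = size_bytes // (block_bytes * ways)
    offset_bits = block_bytes.bit_length() - 1  # log2 of a power of two
    index_bits = sets.bit_length() - 1
    tag_bits = addr_bits - index_bits - offset_bits
    return offset_bits, index_bits, tag_bits


LLC_BYTES = 512 * 1024  # baseline private LLC per core
offset_bits, index_bits, tag_bits = address_split(
    LLC_BYTES, ways=8, block_bytes=64, addr_bits=40)

# Upper bound on BICS-S storage, assuming 0.6% is relative to the data array.
budget_bytes = LLC_BYTES * 6 / 1000

print(f"offset/index/tag bits: {offset_bits}/{index_bits}/{tag_bits}")
print(f"0.6% of 512 KB is about {budget_bytes / 1024:.2f} KB per core")
```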
Parameter Sensitivity Studies
BICS uses five parameters: (Thr_LW, Thr_I, m, k, R_STR). The general range of each can be obtained from the cache capacity sensitivity curves of the SPEC CPU2006 benchmarks collected in Sect. 3. To find the optimal parameter values, however, we also conduct a series of parameter sensitivity studies. For each parameter under study, we first define its general range according to the cache capacity sensitivity curves of SPEC CPU2006, then vary its value across that range while holding the other parameters constant at their values in the BICS-S configuration.
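The one-parameter-at-a-time procedure can be sketched as follows. The candidate values are the ones reported in the studies of this section; the baseline dictionary is an assumption that uses the values this section reports as preferred, and the generator only enumerates configurations (running the simulator on each one is left out).

```python
from fractions import Fraction


def one_at_a_time(baseline, ranges):
    """Yield (name, value, config), varying one parameter per configuration
    while every other parameter stays at its baseline value."""
    for name, values in ranges.items():
        for value in values:
            config = dict(baseline)
            config[name] = value
            yield name, value, config


# Assumed baseline: the values reported as preferred in this section.
baseline = {"R_STR": 0.05, "m": 2, "k": Fraction(1, 32),
            "Thr_LW": 500, "Thr_I": 4000}

# Candidate values stated in the text (m is studied jointly with R_STR,
# so it is not swept on its own here).
ranges = {
    "R_STR": [0.05, 0.08, 0.10],
    "k": [Fraction(1, 64), Fraction(1, 32), Fraction(1, 16), Fraction(1, 8)],
    "Thr_LW": list(range(100, 1001, 100)),
    "Thr_I": list(range(3000, 8001, 1000)),
}

sweep = list(one_at_a_time(baseline, ranges))
print(f"{len(sweep)} configurations to simulate")
```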
Sensitivity Study on R_STR
R_STR reflects the aggressiveness of BICS. A smaller value of R_STR identifies more applications as cache friendly, so more applications act as spillers in BICS. For the sensitivity study, we set R_STR to three different values: 5%, 8% and 10%. Figure 13 shows the throughput of each workload for CC and BICS with different R_STR values, normalized to Private. Figure 14 provides the geometric mean of WS and Hmean for CC and BICS with different R_STR values, also normalized to Private. As shown in Fig. 13, BICS achieves better or comparable performance relative to Private and CC with all three R_STR values for all workloads except WL 07. For WL 07, although BICS improves throughput significantly when R_STR = 5% and R_STR = 10%, it degrades throughput slightly compared to CC when R_STR = 8%. This indicates that an improper value of R_STR has been chosen, so that the benefit from spilled blocks is not sufficient to offset their interference with receivers. This kind of degradation can be avoided by selecting a more suitable value for R_STR.
In terms of both throughput and WS, none of the three BICS configurations with different R_STR values significantly outperforms the other two. As shown in Fig. 13 and Fig. 14, they achieve comparable average throughput and WS improvements. However, BICS with R_STR = 5% is more stable than the other two, because its performance boost varies less across workloads. Given that BICS with R_STR = 5% also improves average Hmean fairness more than the other two, as shown in Fig. 14, R_STR = 5% is preferable for BICS.
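For reference, the metrics compared in this study can be computed as in the sketch below. It assumes the definitions commonly used in the cache partitioning literature (throughput as the sum of IPCs, WS as the sum of per-application speedups over running alone, and Hmean as the harmonic mean of those speedups); the function names are introduced here for illustration.

```python
from statistics import geometric_mean


def throughput(ipc):
    """Sum of the IPCs of all co-running applications."""
    return sum(ipc)


def weighted_speedup(ipc, ipc_alone):
    """WS: sum of each application's IPC relative to running alone."""
    return sum(s / a for s, a in zip(ipc, ipc_alone))


def hmean_fairness(ipc, ipc_alone):
    """Hmean: harmonic mean of the per-application relative IPCs."""
    return len(ipc) / sum(a / s for s, a in zip(ipc, ipc_alone))


def normalized_geomean(scheme_values, private_values):
    """Geometric mean over workloads of a metric normalized to Private."""
    ratios = [v / p for v, p in zip(scheme_values, private_values)]
    return geometric_mean(ratios)


# Made-up IPCs for a two-application workload, for illustration only.
ipc, ipc_alone = [1.0, 0.5], [2.0, 1.0]
print(throughput(ipc), weighted_speedup(ipc, ipc_alone),
      hmean_fairness(ipc, ipc_alone))
```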
Sensitivity Study on m
Parameter m is also important because it determines the maximum cache size that online identification explores to distinguish applications' cache behaviors. Since m is mainly used by the policy metric Rate_m, too small a value is insufficient to capture how the miss rate varies with increasing cache size, whereas too large a value looks too far ahead to reflect the miss rate trend at sizes just above an application's current share. Another reason to choose m carefully is that the size of the ATDs is directly proportional to m: a larger m means a larger storage overhead.
As m increases, Rate_m increases correspondingly, so a larger R_STR is needed to maintain the same aggressiveness of BICS. We therefore consider different (m, R_STR) combinations rather than varying m alone. Figure 15 shows the throughput of each workload for BICS with different (m, R_STR) values, normalized to Private. It is clear that a larger m does not necessarily result in better performance. Although a larger m notably enhances the performance boost for some workloads, it significantly harms the performance of others. This is because a larger m looks too far ahead to reflect the current miss rate trends of certain workloads. Overall, m = 2 provides the best average performance boost, and is thus the most suitable value for our system configuration with regard to both performance and storage overhead. However, for larger systems with more cores, we expect that a larger m would be more appropriate, which could be determined experimentally.
Sensitivity Study on k
Parameter k determines how much space in the receivers can be dedicated to spilled blocks, i.e., it determines the partitioning between local blocks and spilled blocks in the receivers' private LLCs. A good value of k should balance the extra misses of local applications against the additional hits due to spilling. We set k% to 1/64, 1/32, 1/16 and 1/8. Figure 16 shows the throughput of each workload for BICS with different k% values, normalized to Private. We can see that BICS with k% = 1/32 achieves the best performance for almost all workloads, as well as the best average performance among all candidates.
Sensitivity Studies on Thr_LW and Thr_I
Thr_LW and Thr_I set the boundaries that distinguish LW, moderate and intensive applications. We also search for their optimal values experimentally. We vary Thr_LW from 100 to 1000 in steps of 100, and Thr_I from 3000 to 8000 in steps of 1000.
We evaluate all 16 workloads for all candidate values, and Fig. 17 compares the average throughput of BICS for all candidate values of Thr_LW and Thr_I, respectively. Varying Thr_LW or Thr_I has only a minor impact on the overall throughput as long as the value stays within the general range. The global optima for Thr_LW and Thr_I are 500 and 4000, respectively.
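The role of the two thresholds can be sketched as a three-way classification. The argument `intensity` stands for whatever access-intensity metric the identification mechanism measures, and the treatment of values exactly on a boundary is an assumption made here; only the threshold values and the three class names come from the text.

```python
THR_LW = 500   # reported optimum for Thr_LW
THR_I = 4000   # reported optimum for Thr_I


def classify(intensity, thr_lw=THR_LW, thr_i=THR_I):
    """Classify an application as LW, moderate or intensive.

    `intensity` is a placeholder for the measured metric; boundary
    handling (< and >=) is assumed, not taken from the paper.
    """
    if intensity < thr_lw:
        return "LW"
    if intensity >= thr_i:
        return "intensive"
    return "moderate"


for sample in (100, 2000, 9000):
    print(sample, classify(sample))
```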
Conclusion and Future Work
Traditional private LLC organizations use spilling to improve cache resource utilization in CMPs. However, the benefit that an application gets from spilling is highly dependent on its cache behavior. This paper proposes program Behavior Identification-based Cache Sharing (BICS) to exploit the benefit of spilling based on the cache behavior types of applications. The main contributions are as follows:
• It proposes a new dynamic program behavior classification method, based on the analysis of cache statistics of SPEC CPU2006 benchmarks.
• It proposes a new LLC organization, i.e., BICS, which uses online program behavior type identification to guide the spilling decisions of private caches, and then adopts the idea of cache partitioning [7] to provide more ways to spilled blocks. Evaluation shows that BICS outperforms private cache by up to 14.5% and on average 8.8%.
Although we apply online program behavior identification to a private cache organization in this paper, it could be extended to a shared LLC. For example, we could make the cache partitioning [7] of a shared cache behavior-aware. The proposed framework could also be adapted to be QoS-aware. Exploring these extensions is part of our future work.
