Resource sharing can cause unfair and unpredictable performance of concurrently executing applications in Chip-Multiprocessors (CMP 
Introduction
Data-intensive applications usually spend a large proportion of their execution cycles on memory accesses because of the long latency of off-chip requests. The on-chip last-level cache (usually the L2 or L3 cache), which is the key component for hiding off-chip request latency, is typically shared by multiple cores in a Chip-Multiprocessor (CMP) architecture. As a result, sharing the last-level cache has a significant impact on the performance of applications executing concurrently on different cores. Server consolidation and virtual machine technologies increase the demand for scheduling heterogeneous workloads together. The different memory access characteristics of heterogeneous workloads lead to unfair cache sharing, which breaks the hardware fairness assumption of the operating system's process scheduler and may cause thread starvation, priority inversion and other scheduling problems [6].
Prior work has noticed the problem of unfair cache sharing and the resulting unfair performance. [6] proposed a cache partitioning mechanism to enhance the fairness of cache sharing. Intuitively, the ideal fairness of cache sharing is equal slowdown, relative to running with a dedicated cache, for all co-scheduled applications. However, the mechanism in [6] uses equal increments of cache miss counts or miss rates as fairness metrics, arguing that performance fairness (the ideal fairness) is hard to measure and that cache miss fairness usually correlates highly with performance fairness.
The mechanism proposed in [6] is an improvement over unmanaged caches. However, because cache miss fairness is not identical to performance fairness, an improvement in cache miss fairness cannot guarantee the same degree of improvement in performance fairness, and enforcing fairness on cache misses does not necessarily lead to performance fairness. The correlation between cache miss fairness and performance fairness actually depends on two factors. (1) The performance sensitivity of each application to cache misses varies, because applications spend differing fractions of execution time stalled on cache misses. Applications with a large fraction of execution cycles stalled on misses are more sensitive to changes in the number of cache misses. These fractions are diverse; for example, data-centric applications spend many more cycles on memory access (stalled by cache misses) than computation-oriented applications. (2) The stalls arising from each miss vary as a function of Memory Level Parallelism (MLP) [13] [2] and ILP. When cache misses are clustered, their latency cycles overlap with those of recent misses, which reduces the average penalty of each miss. The actual penalty of each cache miss is therefore smaller than the memory access latency and differs with memory access behavior. In addition, computation cycles and memory access latency cycles may overlap, which further reduces the actual penalty of cache misses.
Thus, to enforce performance fairness on cache sharing, we must first build an analytic model that accounts for the performance impact of cache sharing. The model should divide an application's execution cycles into a "private part", which is not related to cache sharing, and a "vulnerable part", which is susceptible to the additional cache misses caused by cache sharing. Based on this model, we can then develop a mechanism that provides a reference point for the performance fairness metric by estimating the application's performance with a dedicated cache while it is actually running with a shared cache.
This paper makes two contributions. (1) We build a model to analyze the performance impact of cache sharing for CMPs, considering not only the additional cache misses caused by cache interference among concurrent workloads, but also the variation of the actual penalty of each cache miss. (2) To the best of our knowledge, this paper proposes the first mechanism that partitions a shared cache with the explicit goal of performance fairness. The mechanism is dynamic and adaptive and needs no static profiling. In our experiments, the proposed mechanism always improves the performance fairness metric and provides throughput no worse than a cache without any management mechanism.
The rest of the paper is organized as follows. Section 2 defines performance fairness as well as the cache miss fairness used in prior work. Section 3 introduces a performance model, which connects cache miss rate and overall performance, to account for the performance impact of cache sharing. Section 4 describes in detail the hardware mechanism for enforcing fairness on a shared cache in a typical CMP architecture. Section 5 describes the experimental methodology and discusses the experimental results. Section 6 briefly introduces related work and Section 7 concludes.
Defining Performance Fairness
We define performance fairness, which is the ideal fairness discussed above, as identical slowdown for each concurrent workload running with a shared cache compared to running with a separate, dedicated cache; this is called execution time fairness in [6]. Let $T^{ded}_i$ denote the execution time (count of execution cycles) of workload $i$ with a dedicated cache and $T^{shr}_i$ its execution time with the shared cache. Performance fairness is achieved when:

$$\frac{T^{shr}_i}{T^{ded}_i} = \frac{T^{shr}_j}{T^{ded}_j}$$

for every pair of concurrent workloads $i$ and $j$, where $T^{shr}_i / T^{ded}_i$ is the slowdown of workload $i$ running with the shared cache compared to a dedicated cache. We define the performance fairness metric of concurrently running workloads under a certain cache partition as $M_{perf}$:

$$M_{perf} = \sum_{i} \sum_{j > i} \left| \frac{T^{shr}_i}{T^{ded}_i} - \frac{T^{shr}_j}{T^{ded}_j} \right|$$
Intuitively, $M_{perf}$ is the sum of the slowdown differences between every two co-scheduled workloads. The smaller $M_{perf}$ is, the smaller the slowdown difference among all co-scheduled workloads, and thus the better the performance fairness. If $M_{perf}$ equals zero, perfect performance fairness is achieved.
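For the purpose of comparison, we also define that cache miss fairness is achieved when:

$$\frac{MPKI^{shr}_i}{MPKI^{ded}_i} = \frac{MPKI^{shr}_j}{MPKI^{ded}_j}$$

for every pair of concurrent workloads $i$ and $j$, in which $MPKI^{shr}_i$ and $MPKI^{ded}_i$ denote the miss count per thousand instructions of workload $i$ running with the shared cache and with a dedicated cache, respectively. The cache miss fairness metric is defined as:

$$M_{miss} = \sum_{i} \sum_{j > i} \left| \frac{MPKI^{shr}_i}{MPKI^{ded}_i} - \frac{MPKI^{shr}_j}{MPKI^{ded}_j} \right|$$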
Similar to $M_{perf}$, the smaller $M_{miss}$ is, the better the cache miss fairness, and perfect cache miss fairness is achieved when $M_{miss}$ equals zero. $M_{miss}$ is the same as $M_3$ in [6] and $FM_3$ in [7]. [6] proposed five fairness metrics, of which $M_2$ has been shown to correlate poorly with performance fairness, while $M_1$ and $M_3$ show the highest correlation with performance fairness in most cases. If the workload runs the same number of instructions with the dedicated cache and the shared cache, $M_{miss}$ is also the same as $M_1$. $M_{miss}$ is therefore representative enough.
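As a small illustration of the two metrics, the following Python sketch computes a pairwise fairness metric from per-workload measurements. It assumes that each unordered pair of workloads is counted once; the function names and the sample numbers are hypothetical and are not taken from the experiments.

```python
from itertools import combinations


def ratios(shared, dedicated):
    """Per-workload ratio shared/dedicated (execution cycles or MPKI)."""
    return [s / d for s, d in zip(shared, dedicated)]


def fairness_metric(shared, dedicated):
    """Sum of |X_i - X_j| over every unordered pair of co-scheduled workloads.

    Passing execution cycles gives a performance fairness metric;
    passing MPKI values gives a cache miss fairness metric.
    Assumption: each pair (i, j) is counted once.
    """
    x = ratios(shared, dedicated)
    return sum(abs(a - b) for a, b in combinations(x, 2))


# Hypothetical two-workload example: slowdowns of 1.5 and 1.2.
m_perf = fairness_metric(shared=[150, 120], dedicated=[100, 100])
```

A value of zero means that all workloads see the same ratio, which corresponds to perfect fairness under the chosen metric.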
Modeling Performance Impact of Cache Sharing
We use a typical CMP architecture configuration for the following analysis: $N$ cores on chip; each core has a private L1 instruction cache and data cache; a unified on-chip L2 cache shared by all cores is the last-level cache, and a miss in the L2 cache issues an off-chip request. All caches are set-associative.
To model the performance impact of cache sharing, the cycles consumed during an application's running time are categorized into two classes: private operation cycles ($T_{pri}$) and vulnerable operation cycles ($T_{vul}$). Intuitively, private operation cycles are the part of the execution cycles that depends only on the characteristics of the workload. Vulnerable operation cycles are sensitive to the workloads co-scheduled on other cores and may vary widely because of resource sharing. Private operation cycles are consumed by operations that need only the private resources of the core, such as computation and fetching instructions and data from the private L1 caches. An L2 request is a hybrid operation. The tag lookup latency is counted as private operation cycles because it is constant whether the request misses or not. However, if the request misses in the L2 cache, the cycles of off-chip request latency are vulnerable operation cycles, because the miss may be caused by cache sharing and the off-chip request costs extra cycles.
However, the total execution time is not simply equal to the sum of $T_{pri}$ and $T_{vul}$, because private operations and vulnerable operations can overlap. For example, in an out-of-order processor, even if an instruction is stalled by an L2 cache miss, other instructions in the scheduling window can still enter the pipeline if they have no data dependence on the miss. As a result, the overlap cycles ($T_{ovl}$) must be introduced, and then:

$$T = T_{pri} + T_{vul} - T_{ovl} \tag{6}$$
The example in Figure 1 illustrates Equation 6. In this example, two L2 cache misses occur during the execution time $T$. After the first miss, the instruction window still holds instructions with no data dependence on this miss, so the pipeline is not stalled until the second miss occurs. The computation time of the pipeline is part of $T_{pri}$, and the off-chip request latency of the L2 misses is $T_{vul}$. Note that when L1 misses occur, the L1 request latency (the tag lookup latency in the L2 cache), such as slice A and slice B in the figure, is part of $T_{pri}$; however, if the request ends up as an L2 miss, the latency of the off-chip request does not belong to $T_{pri}$. $T_{ovl}$ is formed by the cycles in which $T_{pri}$ and $T_{vul}$ overlap.
Because the vulnerable operation cycles are mainly the cycles of off-chip request latency, $T_{vul}$ can be approximated by the total penalty of off-chip requests. The average MLP must be considered to estimate the average penalty of each cache miss. We adopt the definition of average MLP in [2]: the average number of useful outstanding off-chip requests when there is at least one outstanding off-chip request, denoted by $MLP_{avg}$. With $N_{miss}$ denoting the number of L2 misses, we have the following equation:

$$T_{vul} = \frac{N_{miss} \times MissLatency}{MLP_{avg}} \tag{7}$$

Then Equation 6 can be rewritten as:

$$T = T_{pri} + \frac{N_{miss} \times MissLatency}{MLP_{avg}} - T_{ovl} \tag{8}$$
In Equation 8, the latency of an off-chip request ($MissLatency$) can be treated as a constant (ignoring other factors such as bus congestion). Equations 6 and 8 show that the total execution time of an application includes three parts. $T_{pri}$ is the inherent part that does not change with cache sharing. $T_{ovl}$ and $T_{vul}$ are related to cache sharing; they are the cause of unfair performance and the part of the execution time that cache partition mechanisms aim to adjust. According to Equation 8, if the parameters $T_{ovl}$, $N_{miss}$ and $MLP_{avg}$ can be obtained through a hardware profiler, the total execution time can be estimated dynamically.
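The estimate described above (private cycles, plus miss count times miss latency divided by average MLP, minus overlap cycles) can be sketched in Python as follows. The function name and the sample values are hypothetical and serve only to show the arithmetic.

```python
def estimate_cycles(t_pri, n_miss, miss_latency, mlp_avg, t_ovl):
    """Estimate total execution cycles from the model's parameters.

    t_vul is approximated as n_miss * miss_latency / mlp_avg, and the
    overlap between private and vulnerable cycles is subtracted once.
    """
    if n_miss == 0:
        t_vul = 0.0  # no off-chip requests, so no vulnerable cycles
    else:
        if mlp_avg <= 0:
            raise ValueError("mlp_avg must be positive when misses occur")
        t_vul = n_miss * miss_latency / mlp_avg
    return t_pri + t_vul - t_ovl


# Hypothetical values: 2000 misses of 300 cycles each with an average MLP
# of 1.5 contribute 400,000 vulnerable cycles.
total = estimate_cycles(t_pri=800_000, n_miss=2000, miss_latency=300,
                        mlp_avg=1.5, t_ovl=100_000)
```

With these inputs the estimate is 1,100,000 cycles; doubling the average MLP would halve the vulnerable part while leaving the private part untouched.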
Hardware Mechanism
The necessary hardware support includes two interdependent parts: the cache partition mechanism and the hardware profiler. Figure 2 shows how they work together. The hardware profiler gathers statistics from each core's pipeline, each private L1 instruction cache and data cache, and the unified L2 cache during the last period. At the end of the period, a cache partition decision is made and the cache partition mechanism applies the decided partition to the cache.
Cache Partition Mechanism
The cache partition mechanism is the simpler part. Assuming there are $N$ cores on chip sharing the L2 cache, we add $\log_2 N$ bits to each cache block to mark which core the block belongs to. Each core also has a one-bit overAlloc flag that indicates whether the core has been allocated too much cache space. When a cache miss from core $i$ occurs, the correct cache set is located first. If the overAlloc flag of core $i$ shows that core $i$ is not over-allocated, the original cache replacement policy selects a victim block, and the mark bits of this block are changed to core $i$. If the overAlloc flag shows that core $i$ has been allocated too much cache space, one cache block marked as belonging to core $i$ is randomly selected as the victim. This mechanism guarantees that over-allocated cores cannot gain more cache lines; their surplus cache blocks are gradually taken by under-allocated cores, so the allocation moves toward the partition target set by the overAlloc flags of the cores. As a special case, if the overAlloc flag of core $i$ shows over-allocation but no cache block in the set is marked as belonging to core $i$, we still allocate a cache line for core $i$.
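The victim selection just described can be sketched in Python. This is a software illustration only, under the assumption that a cache set is modeled as a list of owner core IDs (one per way); `lru_victim` is a hypothetical callback standing in for the original replacement policy.

```python
import random


def choose_victim(cache_set, core, over_alloc, lru_victim, rng=random):
    """Pick a victim way for a miss from `core` and re-mark it as owned by `core`.

    cache_set  -- list of owner core IDs, one entry per way (the mark bits)
    over_alloc -- mapping core ID -> overAlloc flag
    lru_victim -- hypothetical callback implementing the original policy
    """
    own_ways = [way for way, owner in enumerate(cache_set) if owner == core]
    if over_alloc[core] and own_ways:
        # Over-allocated core: evict one of its own blocks, chosen at random.
        way = rng.choice(own_ways)
    else:
        # Not over-allocated, or the special case where the core owns no
        # block in this set: fall back to the original replacement policy.
        way = lru_victim(cache_set)
    cache_set[way] = core
    return way
```

Because an over-allocated core can only replace its own blocks, its share of a set never grows, while misses from other cores can still claim its blocks through the original policy.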
Hardware Profiler
The hardware profiler is more complex. To achieve performance fairness for a shared-cache CMP, the ideal goal is identical slowdown of all workloads compared to running with a dedicated cache. The key problem is to estimate the cycles a workload would need with a dedicated cache while it is actually running with a shared cache.
According to the definition of $T_{pri}$, we obtain the relations shown in Figure 4.
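Since the figure is not reproduced here, the following LaTeX block gives one way to derive the dedicated-cache estimate from the model. It assumes, as the definition of $T_{pri}$ suggests, that the private cycles are the same in the shared and the dedicated run; the superscripts $shr$ and $ded$ mark the run in which a quantity is measured or estimated.

```latex
% Shared run: solve the model for the private part.
T^{shr} = T_{pri} + T^{shr}_{vul} - T^{shr}_{ovl}
\;\Longrightarrow\;
T_{pri} = T^{shr} - T^{shr}_{vul} + T^{shr}_{ovl}

% Dedicated run: reuse T_{pri} and plug in the dedicated-cache parameters.
T^{ded} = T_{pri}
        + \frac{N^{ded}_{miss} \times MissLatency}{MLP^{ded}_{avg}}
        - T^{ded}_{ovl}
```

Under this reading, the profiler needs $T^{shr}$, $T^{shr}_{vul}$ and $T^{shr}_{ovl}$ from the shared cache, and $N^{ded}_{miss}$, $MLP^{ded}_{avg}$ and $T^{ded}_{ovl}$ from an emulated dedicated cache.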
Profiling parameters for shared cache
The profiler computes its statistics every $PT$ cycles (the profiling period), so $T^{shr} = PT$. To obtain $T^{shr}_{vul}$, we add a $\log_2 N$-bit flag to each entry of the L2 MSHR (Miss Status Holding Register) to identify which core caused the miss, and then monitor the number of entries carrying a given core's flag to see whether that core has a request in the MSHR. At the beginning of the profiling period, $T^{shr}_{vul}$ is initialized to zero. In every cycle in which the MSHR holds at least one entry for this core, $T^{shr}_{vul}$ is incremented, because there is at least one outstanding off-chip request in that cycle.
To decide whether a cycle of the execution time belongs to $T^{shr}_{ovl}$, three conditions are considered: (1) whether the pipeline is stalled in this cycle; (2) whether there is at least one L1 request in this cycle; (3) whether there is at least one outstanding L2 miss for this core in this cycle. Condition (2) covers only the on-chip part of a request; if the L1 request turns out to be an L2 miss, the off-chip latency cycles do not count toward condition (2) (see the example in Figure 1). Condition (1) can easily be obtained by monitoring the status of the pipeline, and condition (3) by monitoring whether the L2 MSHR has entries for this core. To decide condition (2), we add a timer to each L1 cache, both instruction cache and data cache. Let $L2_{LAT}$ denote the lookup latency of the L2 cache ($L2_{LAT}$ is always a fixed value). When an L1 cache misses and issues a request to the L2 cache, its timer is set to $L2_{LAT}$. The timer is decremented by 1 in every cycle. Condition (2) is then decided by checking whether the timer is zero. Algorithm 2 describes the logic in detail.
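Because the referenced algorithm is not reproduced here, the following Python sketch shows one plausible way to combine the three conditions. It assumes that a cycle counts as overlap when an off-chip request is outstanding and private work is also in progress (the pipeline is not stalled, or an L1 request is still in its L2 lookup). It models a single L1 timer instead of separate instruction and data timers, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SharedProfile:
    t_vul: int = 0  # cycles with at least one outstanding off-chip request
    t_ovl: int = 0  # cycles where private and vulnerable work overlap


def profile_period(cycles, l2_lat):
    """Accumulate T_vul and T_ovl for one core over one profiling period.

    cycles -- iterable of (stalled, l1_miss_issued, mshr_entries) per cycle
    l2_lat -- fixed L2 lookup latency used to arm the L1 timer
    """
    prof = SharedProfile()
    l1_timer = 0
    for stalled, l1_miss_issued, mshr_entries in cycles:
        if l1_miss_issued:
            l1_timer = l2_lat                  # L1 miss sent to L2: arm the timer
        l1_request_active = l1_timer > 0       # condition (2)
        off_chip_pending = mshr_entries > 0    # condition (3)
        # Assumption: private work = pipeline running, condition (1) false,
        # or an L1 request still being looked up in L2.
        private_active = (not stalled) or l1_request_active
        if off_chip_pending:
            prof.t_vul += 1
            if private_active:
                prof.t_ovl += 1
        if l1_timer > 0:
            l1_timer -= 1                      # the timer counts down every cycle
    return prof
```

The per-cycle tuples stand in for the signals that the hardware would read directly from the pipeline, the L1 caches and the L2 MSHR.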
Profiling and estimating parameters for dedicated cache
The three dedicated-cache parameters in Figure 4 ($N^{ded}_{miss}$, $T^{ded}_{ovl}$ and $MLP^{ded}_{avg}$) correspond to the application running with a dedicated cache. However, because the workload is actually running with a shared cache, we need special hardware to estimate its behavior with a dedicated cache. To achieve this, we use the Auxiliary Tag Directory (ATD) technique [13] to attach a virtual "private" L2 cache to each core. An ATD has the same associativity as the main tag directory of the shared L2 cache and uses the same replacement policy. However, an ATD contains only the tag and other functional bits of each cache block and does not keep data. We also add an auxiliary MSHR to each ATD, so that an ATD can act just like a private L2 cache. Each entry in the auxiliary MSHR has a timer to simulate a memory access request. When a memory request is issued to an auxiliary MSHR, the timer of the inserted entry is set to the round-trip latency of a memory access in cycles. In every cycle, the timers of all entries are decremented by 1. A timer value of 0 means that the memory request has returned and the requested block is ready for the ATD.
Because there is an ATD for each core, the total hardware cost is substantial. We employ set sampling [13] to reduce the storage cost. An ATD with set sampling only tracks cache sets selected at a specified interval, and the behavior of the whole cache is approximated by the sampled sets. An ATD with a large sampling interval requires less hardware overhead but gives less accurate statistics, so choosing the sampling interval is a tradeoff between hardware cost and accuracy.
With an ATD and an auxiliary MSHR, it is possible to simulate a private L2 cache for each core. When a request arrives from an upper-level L1 cache, it is forwarded both to the real L2 cache and to the ATD attached to the core that the requesting L1 cache belongs to. The ATD then responds to this request just as the L2 cache does, including touching MRU bits, performing replacement, counting misses and issuing requests to the auxiliary MSHR. $N^{ded}_{miss}$ and $T^{ded}_{ovl}$ can therefore be obtained from the ATD and auxiliary MSHR attached to each core, using a mechanism similar to the one used to obtain $N^{shr}_{miss}$ and $T^{shr}_{ovl}$. $MLP^{ded}_{avg}$ can also be obtained with the help of the auxiliary MSHR. $MLP_{avg}$ is defined as the average number of useful outstanding off-chip requests when there is at least one outstanding off-chip request. Two counters, $MLP_{sum}$ and $MLP_{cycles}$, store the accumulated off-chip latency and the number of cycles in which there is at least one outstanding off-chip request. The algorithm is described as follows:
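The counting logic can be sketched in Python as follows. This is a simplified software model rather than the original algorithm: it assumes every outstanding entry counts as a useful request, a fixed miss latency, and a 32-entry auxiliary MSHR; the class and method names are hypothetical.

```python
class AuxMSHR:
    """Timer-based auxiliary MSHR that accumulates the two MLP counters."""

    def __init__(self, miss_latency, capacity=32):
        self.miss_latency = miss_latency
        self.capacity = capacity
        self.timers = []      # remaining cycles of each outstanding request
        self.mlp_sum = 0      # accumulated outstanding requests over cycles
        self.mlp_cycles = 0   # cycles with at least one outstanding request

    def issue(self):
        """Insert an entry for a new ATD miss; returns False if the MSHR is full."""
        if len(self.timers) >= self.capacity:
            return False
        self.timers.append(self.miss_latency)
        return True

    def tick(self):
        """Advance one cycle: update the counters, then age and retire entries."""
        if self.timers:
            self.mlp_sum += len(self.timers)
            self.mlp_cycles += 1
        # An entry whose timer would reach 0 has returned and is retired.
        self.timers = [t - 1 for t in self.timers if t > 1]

    def mlp_avg(self):
        """Average outstanding requests over cycles with at least one request."""
        return self.mlp_sum / self.mlp_cycles if self.mlp_cycles else 0.0
```

For example, two requests of 4 cycles each whose lifetimes overlap by 2 cycles occupy 6 busy cycles and accumulate 8 request-cycles, which gives an average MLP of 8/6.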
Note that $T^{shr}_i = PT$ for each core $i$. The slowdown of core $i$ can be calculated at the end of the profiling period by:

$$Slowdown_i = \frac{T^{shr}_i}{T^{ded}_i} = \frac{PT}{T^{ded}_i}$$
The core with the smallest slowdown is considered to have been allocated too much cache space, and its overAlloc flag is set. Through the cache partition mechanism, the slowdowns of the cores move closer together and performance fairness improves.
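The end-of-period decision can be illustrated with a short Python sketch. It assumes that the slowdown of a core is the profiling period divided by its estimated dedicated-cache cycles, and that the flags of all other cores are cleared when the least-slowed core is flagged; the latter is an assumption, as the text only states that the selected core's flag is set. The function name and sample values are hypothetical.

```python
def update_over_alloc(period_cycles, t_ded):
    """Return the new overAlloc flag for each core.

    period_cycles -- length of the profiling period (shared-cache cycles)
    t_ded         -- mapping core ID -> estimated dedicated-cache cycles
    """
    slowdown = {core: period_cycles / cycles for core, cycles in t_ded.items()}
    least_slowed = min(slowdown, key=slowdown.get)
    # Assumption: exactly one core is flagged per period; all others are cleared.
    return {core: core == least_slowed for core in slowdown}


# Hypothetical period of 1000 cycles: core 0 would have needed 900 cycles
# alone (slowdown about 1.11), core 1 only 600 (slowdown about 1.67).
flags = update_over_alloc(1000, {0: 900.0, 1: 600.0})
```

Here core 0 suffers less from sharing, so it is the one marked as over-allocated for the next period.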
Hardware cost
For each cache line, $\log_2 N$ bits are added to indicate which core the cache line belongs to, where $N$ is the number of cores sharing the cache. In addition, some monitoring circuits are needed in the L2 MSHR, the L1 caches and the pipeline. Each core has 2 additional L1 timers and 1 auxiliary MSHR, typically with 32 entries.
ATDs are the major part of the cost. Set sampling can significantly reduce the storage of an ATD. For a 2-way CMP with a 1 MB, 8-way set-associative L2 cache with 64-byte lines, sampling one in every 16 cache sets, the hardware cost is:
• For each ATD entry: (24-bit tag) + (4 LRU bits) + (1 valid bit) + (2 additional bits) = 31 bits
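As a rough check of the scale involved, the following Python sketch works out the tag storage of one sampled ATD for the configuration above. It covers only the ATD entries, not the auxiliary MSHR, timers or per-line owner bits, and should not be read as the complete cost breakdown; the function name is hypothetical.

```python
def atd_storage_bits(cache_bytes, line_bytes, ways, sampling_interval,
                     bits_per_entry):
    """Storage of one set-sampled ATD, in bits."""
    sets = cache_bytes // (line_bytes * ways)   # sets in the full L2 cache
    sampled_sets = sets // sampling_interval    # sets tracked by the ATD
    entries = sampled_sets * ways               # one entry per way per set
    return entries * bits_per_entry


# 1 MB, 64-byte lines, 8 ways -> 2048 sets; 1 in 16 sampled -> 128 sets,
# i.e. 1024 entries of 31 bits each.
bits = atd_storage_bits(1 << 20, 64, 8, 16, 31)
print(bits, bits // 8)  # 31744 bits, 3968 bytes per ATD
```

Under these assumptions each ATD needs just under 4 KB, so the two ATDs of a dual-core system together need just under 8 KB of tag storage.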
Experimental Methodology

Configuration
The evaluation is performed using Simics [8], a full-system simulator that supports CMPs. The memory timing model is derived from GEMS [9], which enables detailed cycle-accurate simulation of multiprocessor systems for Simics. Table 1 shows the basic parameters of the simulated architecture. The simulated CMP cores are out-of-order superscalar processors with private L1 instruction and data caches, sharing a unified L2 cache and all lower levels of the memory hierarchy. All caches are set-associative and use a Pseudo LRU policy for replacement decisions.
Metrics
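To evaluate the fairness improvement of different schemes relative to the baseline of uncontrolled L2 cache sharing (original Pseudo LRU policy), we define the normalized fairness metrics:

$$F^{scheme}_{perf} = \frac{M^{scheme}_{perf}}{M^{base}_{perf}}, \qquad F^{scheme}_{miss} = \frac{M^{scheme}_{miss}}{M^{base}_{miss}}$$

where the superscript $base$ denotes uncontrolled sharing. A value below 1 means that the scheme is fairer than the baseline. If the baseline already achieved perfect fairness ($M^{base} = 0$), the normalized metric would be infinite unless the cache partition scheme also achieves perfect fairness.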
Benchmarks
We select a set of the most memory-intensive benchmarks from the SPEC CPU 2000 benchmark suite and run them concurrently in pairs on the simulated dual-core CMP system to see the benefit of different cache partition schemes: uncontrolled L2 cache sharing using the original Pseudo LRU policy, cache partitioning that enforces cache miss fairness, and cache partitioning that enforces performance fairness. We compare the normalized cache miss fairness metric ($F^{scheme}_{miss}$) and the normalized performance fairness metric ($F^{scheme}_{perf}$) of all cache partition schemes relative to the baseline of uncontrolled L2 cache sharing. For each selected benchmark, a representative slice of instructions is obtained for simulation using the SimPoint tool [10].
To categorize the selected benchmarks, we consider two features of each benchmark that can affect the correlation between cache miss fairness and performance fairness: (1) the portion of the whole execution time taken by vulnerable time, which is $T_{vul}/T$ according to Equation 6; (2) the average MLP during execution of the benchmark. Workloads with a high value of $T_{vul}/T$ are more sensitive to interference, while workloads with a high average MLP are more tolerant of last-level cache misses. These two factors must be combined in the analysis. Table 2 shows the classification of the ten selected benchmarks based on the values of $T_{vul}/T$ and average MLP: mcf, ammp, art, applu, gzip, swim, apsi, equake, vpr and sixtrack. The pair of numbers shown in parentheses denotes the benchmark's values of $T_{vul}/T$ and average MLP. These data are collected by running each benchmark alone with a dedicated cache using the original Pseudo LRU replacement policy. The ten benchmarks are categorized into four classes: high-vulnerability/high-MLP, high-vulnerability/low-MLP, low-vulnerability/high-MLP and low-vulnerability/low-MLP.
We pair the benchmarks presented above and run them concurrently in our simulated system with a shared cache to evaluate the fairness metrics of different schemes. Table 3 shows the classification parameters when the benchmarks run concurrently with a shared cache using the original replacement policy. Although the two parameters may vary widely (especially the portion of cache miss latency), the relationship between the parameter values is preserved. That is, if a benchmark from the high/high class runs concurrently with a benchmark from the low/low class, the first benchmark usually still has a larger portion of cache miss latency and a larger MLP. The same holds for combinations of benchmarks from the other classes.

Table 3. Benchmark pairs and parameters when running concurrently.
Results and Analysis
Model Accuracy Verification
Before evaluating the fairness metrics, we need to verify how accurate the model described in the above sections is, as well as the impact of set sampling in the ATD. In this verification experiment, we ran the benchmarks in two passes. In the first pass, we used the hardware configuration in Table 1 but applied only the profiler component of the hardware mechanism to the shared L2 cache; the cache replacement policy was not modified. We ran the benchmark pairs in Table 3 in the shared cache configuration and collected the statistics needed to estimate each benchmark's performance with a dedicated cache ($\widehat{IPC}$). In the second pass, we ran these benchmarks again on identical hardware, except that each core used a dedicated L2 cache (with the same configuration as the shared cache). This pass gives each benchmark's actual performance with a dedicated cache ($IPC$). We then compare $\widehat{IPC}$ and $IPC$ to obtain the relative error:

$$E = \frac{\left| \widehat{IPC} - IPC \right|}{IPC}$$
To assess the effect of set sampling in the ATD, we compared the estimation error for different sampling intervals. Figure 5 shows that without set sampling in the ATD (sampling interval 1), the average $E$ is 4.7% and the maximum $E$ is 10.9% (equake running with sixtrack). As the sampling interval increases, $E$ increases as well. When the ATD samples sets with interval 16, the average $E$ is 6.7% and the maximum $E$ is 13.3% (art running with gzip), which is still accurate enough. The following experiments therefore use interval 16 for set sampling in the ATD.
Evaluation of fairness metrics
We use $F^{scheme}_{perf}$ and $F^{scheme}_{miss}$ described in Section 4.2 to measure the fairness improvement of two cache partition schemes: the scheme that enforces cache miss fairness (SECF) and the scheme that enforces performance fairness (SEPF). The baseline is uncontrolled cache sharing using the original Pseudo LRU policy, which is widely used in production processors. For each benchmark pair, the two benchmarks are executed concurrently in three passes: in the first pass, the shared L2 cache uses the original, uncontrolled Pseudo LRU replacement policy; in the second and third passes, the cache partition schemes enforcing cache miss fairness (SECF) and performance fairness (SEPF) are used. The two fairness metrics ($F^{scheme}_{miss}$ and $F^{scheme}_{perf}$) are measured for each pass.
In Figure 6(a), we can see that when SECF is used to enforce cache miss fairness on the shared cache, the cache miss fairness metric ($F^{scheme}_{miss}$) improves significantly compared to the uncontrolled Pseudo LRU replacement policy (note that a shorter bar means better fairness). However, when SEPF is used to enforce performance fairness on the shared cache, the cache miss fairness metric ($F^{scheme}_{miss}$) still improves, but not as much as with SECF. There are even benchmark pairs that show worse cache miss fairness with SEPF than with the uncontrolled Pseudo LRU cache (ammp+equake and equake+mcf). On average, SECF achieves a cache miss fairness improvement of $F^{scheme}_{miss} = 0.14$, while SEPF achieves an improvement of $F^{scheme}_{miss} = 0.45$.

Figure 6(b) shows the performance fairness metric ($F^{scheme}_{perf}$) under the different cache partition schemes. The result differs from that of the cache miss fairness metric. Although SECF always shows a better $F^{scheme}_{miss}$ than SEPF, it fails to achieve a better performance fairness metric ($F^{scheme}_{perf}$) than SEPF. Moreover, three benchmark pairs suffer severe performance fairness degradation when SECF is used to enforce cache miss fairness: swim+art (3.87), equake+mcf (18.3) and equake+sixtrack (3.32), which means that SECF may make the unfair performance of concurrent workloads even worse in some cases. In contrast, although SEPF shows a poor $F^{scheme}_{miss}$ for ammp+equake and equake+mcf, it achieves better performance fairness than SECF for these benchmark pairs. On average, SECF gains a performance fairness improvement of $F^{scheme}_{perf}$ [...]

From Figure 6(a) and Figure 6(b), we can get a deeper insight into the correlation between cache miss fairness and performance fairness. Merely enforcing cache miss fairness on concurrent workloads improves performance fairness in most cases, but it cannot guarantee performance fairness and may sometimes harm performance fairness instead of improving it, especially when the co-scheduled workloads have different features.
For example, when benchmarks of type 4 (low-vulnerability/low-MLP) run with benchmarks of type 1 (high-vulnerability/high-MLP), as in the benchmark pairs gzip+mcf, gzip+art and apsi+art, SECF achieves much poorer performance fairness than SEPF; when benchmarks of type 2 (high-vulnerability/low-MLP) run with benchmarks of type 1 (high-vulnerability/high-MLP), as in the benchmark pairs swim+art, equake+mcf and equake+sixtrack, SECF even degrades performance fairness considerably. The reason is that when a low-MLP workload runs concurrently with a high-MLP workload, more shared cache space should be allocated to the low-MLP workload to guarantee overall performance fairness, because the low-MLP workload has a relatively larger cache miss penalty; enforcing cache miss fairness cannot guarantee this. If both concurrently running workloads are highly sensitive to cache interference (type 1 benchmarks running with type 2 benchmarks), the performance fairness degradation is more pronounced when SECF is used to enforce cache miss fairness. By taking more factors into account, SEPF achieves better performance fairness in most cases.

Figure 7 shows the throughput of selected benchmark pairs with uncontrolled Pseudo LRU and the two cache partition schemes. We use the fair speedup defined in Equation 14 to measure the throughput of concurrently running benchmark pairs. Note that a taller bar in the figure means higher throughput. SEPF achieves competitive throughput for most benchmark pairs, and for gzip+mcf, ammp+applu and swim+mcf, SEPF provides higher throughput than both the original Pseudo LRU and SECF. On average, SECF achieves a fair speedup of 1.04 and SEPF a fair speedup of 1.07 relative to the baseline of the original Pseudo LRU replacement policy.
Related Work
Prior work has noticed the impact of cache sharing on concurrent threads. [15] proposed using hardware counters to estimate the cache miss rate as a function of cache size, which can be used to optimize the cache partition to minimize the overall miss rate. [13] designed a runtime mechanism that partitions a shared cache according to the cache utility of multiple concurrent applications. Both [15] and [13] focused on optimizing the overall miss rate. [6] pointed out the necessity of enforcing fairness for co-scheduled workloads, defined a set of fairness metrics for cache sharing and proposed both a static strategy and a dynamic mechanism to improve fairness. [4] proposed a framework to provide QoS for resources including shared caches. [14] designed architectural support for the OS to manage shared caches.
Some studies indicate that a decrease in cache miss rate does not necessarily lead to a performance improvement. [12] showed that, because of MLP, not every cache miss has an equal penalty. By taking the MLP-related cost of each cache miss into account, [12] modified the standard LRU replacement policy to achieve higher performance. [2] analyzed the impact of microarchitecture on MLP and developed a detailed model relating MLP to overall performance.
There is also work on performance models for other architectures. [3] proposed a cycle accounting architecture for SMT processors to estimate the performance that each co-scheduled thread would have had if it had run alone. [5] proposed a performance model for superscalar processors.
Conclusion
Uncontrolled cache sharing usually leads to unfair performance of concurrent workloads; that is, some workloads suffer a much more significant slowdown than others. This phenomenon causes further problems, such as priority inversion and thread starvation, for the operating system's process scheduler. Instead of enforcing performance fairness directly, prior work addressing the fairness of cache sharing mainly focuses on fairness metrics based on cache miss counts or miss rates. However, because the cache miss penalty varies, fairness on cache misses cannot guarantee performance fairness. CMP systems need cache sharing management that directly addresses performance fairness.
This paper proposed the concept of a performance fairness metric. We built a detailed model to analyze the performance impact of cache sharing. Guided by this model, we designed a hardware mechanism to enforce performance fairness on a shared cache. The mechanism is adaptive and hardware-efficient. For comparison, the concept of a cache miss fairness metric and a hardware mechanism to enforce cache miss fairness were also introduced. We implemented these two cache partition schemes in a simulator. The experimental results showed that the mechanism enforcing performance fairness always improves the performance fairness metric and provides throughput no worse than the scenario without any management mechanism.
