Chip multiprocessors (CMPs) allow different applications to share the last-level cache (LLC). Since each application has a different cache capacity demand, LLC capacity should be partitioned in accordance with those demands. Existing partitioning algorithms estimate the capacity demand of each core by stack processing, which considers only the LRU (Least Recently Used) replacement policy. However, anti-thrashing replacement policies such as BIP (Bimodal Insertion Policy) and BIP-Bypass have emerged to overcome the thrashing that LRU suffers when the working set is larger than the available cache size. Since existing stack processing cannot estimate capacity demand under an anti-thrashing replacement policy, partitioning algorithms cannot partition cache space under such a policy either. In this letter, we prove that the BIP replacement policy is not amenable to stack processing but BIP-Bypass is. We modify stack processing to accommodate BIP-Bypass. In addition, we propose pipelined hardware for the modified stack processing. With this hardware, we can obtain the success function for various capacities under an anti-thrashing replacement policy and assess, in real time, the shared cache capacity adequate for each core.
Introduction
performance for high locality workloads. However, it can show thrashing behavior when the working set is larger than the available cache size. To alleviate the thrashing problem, BIP (Bimodal Insertion Policy) [4] inserts most new cache blocks at the LRU position to preserve the cache contents and inserts the remaining new blocks at the MRU position to adapt to working set changes. BIP-Bypass [4] bypasses new blocks instead of inserting them at the LRU position.
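The contrast between LRU and BIP-Bypass on a thrashing workload can be illustrated with a minimal Python sketch. It is not the mechanism of [4]: the function name `simulate` and the parameter `insert_every` are hypothetical, and the random insertion decision is replaced by a deterministic counter (one insertion per `insert_every` misses) so that the run is reproducible.

```python
from collections import OrderedDict

def simulate(trace, capacity, insert_every=None):
    """Count hits in a fully associative cache.

    insert_every=None  -> plain LRU (every miss is inserted at MRU).
    insert_every=k     -> bypass-style policy (assumption): only every k-th
                          miss is inserted at MRU, all other misses bypass.
    """
    cache = OrderedDict()          # keys ordered from LRU (first) to MRU (last)
    hits = 0
    misses = 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # promote to MRU on a hit
            continue
        misses += 1
        if insert_every is not None and misses % insert_every != 0:
            continue                       # bypass: cache contents unchanged
        if len(cache) >= capacity:
            cache.popitem(last=False)      # evict the LRU block
        cache[addr] = True
    return hits

# Cyclic working set of 5 blocks on a 4-block cache: the classic thrashing case.
trace = [i % 5 for i in range(1000)]
lru_hits = simulate(trace, capacity=4)
bypass_hits = simulate(trace, capacity=4, insert_every=32)
```

Under these assumptions LRU never hits on the cyclic trace, while the bypassing variant keeps part of the working set resident and obtains hits.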
In TADIP [2], each core chooses either LRU or BIP as its replacement policy to attack the thrashing problem in a CMP. However, TADIP cannot partition the shared cache, and each core cannot preserve its working set in the cache.
Until now, there has been no cache partitioning method for anti-thrashing replacement policies, since there has been no stack processing method for them. In this letter, we prove that BIP is not amenable to stack processing but BIP-Bypass is. In addition, we modify stack processing to accommodate BIP-Bypass, and we propose new hardware to implement stack processing for BIP-Bypass.
Stack Processing

Inclusion Property, Stack and Success Function
Let $x_t$ be the address referenced by a trace at time $t$. The inclusion property means that the cache contents $B_t(C)$ must be a subset of $B_t(C+1)$ for any time $t$ and any capacity $C$ on a trace:
$$B_t(C) \subseteq B_t(C+1).$$
Stack algorithms are replacement algorithms that satisfy the inclusion property [1].
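From the inclusion property, the cache contents for all capacities can be represented in the following way. The stack is an ordered address list $S_t = S_t(1), S_t(2), S_t(3), \ldots$, where
$$S_t(C) = B_t(C) - B_t(C-1), \qquad B_t(0) = \emptyset,$$
so that $B_t(C) = \{S_t(1), \ldots, S_t(C)\}$.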
This can be used to efficiently determine the success function $F(C)$. Let $C_t$ denote the least buffer capacity such that $x_t \in B_{t-1}(C)$. $C_t$ is called the critical capacity because, by the inclusion property, every buffer of capacity $C_t$ or larger contains $x_t$. $C_t$ is simply the position of address $x_t$ in the stack $S_{t-1}$. This position is called the stack distance $\Delta_t$.
Let $n(\Delta)$ denote the number of times the stack distance $\Delta$ is observed in processing a trace. Since the stack distance equals the critical capacity, the number of times that the referenced address is found in a cache of capacity $C$ is
$$N(C) = \sum_{\Delta=1}^{C} n(\Delta)$$
Copyright © 2013 The Institute of Electronics, Information and Communication Engineers
and the success function is given by
$$F(C) = \frac{N(C)}{L},$$
where $L$ is the length of the trace.
$n(\Delta)$ can be determined from a set of distance counters. All counters are initially set to zero, and the counter for distance $\Delta$ is incremented whenever that distance occurs. Figure 1 shows an example of the stack distance frequency function and the success function after a trace of 10 addresses. The number of hits for buffer capacity 3 is $N(3) = n(1) + n(2) + n(3) = 4$, as indicated in Fig. 1 (b). In this case, the hit rate is $F(3) = N(3)/10 = 0.4$.
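The distance counters and the success function described above can be sketched in a few lines of Python for the conventional LRU case. The function names and the 10-address trace are hypothetical; the trace is chosen for illustration and is not the trace of Fig. 1.

```python
def lru_stack_distances(trace):
    """Return the LRU stack distance of every reference (None for a first touch)."""
    stack = []                      # stack[0] is the top of the stack (MRU)
    distances = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            distances.append(pos + 1)   # stack positions are counted from 1
            stack.pop(pos)
        else:
            distances.append(None)      # not in any finite cache yet
        stack.insert(0, addr)           # under LRU the referenced block goes on top
    return distances

def success_function(trace, max_capacity):
    """Return (n, F): distance counters n[d] and hit rate F[C] for C = 1..max_capacity."""
    n = [0] * (max_capacity + 1)        # n[0] is unused
    for d in lru_stack_distances(trace):
        if d is not None and d <= max_capacity:
            n[d] += 1
    F = {}
    hits = 0
    for c in range(1, max_capacity + 1):
        hits += n[c]                    # N(C) = n(1) + ... + n(C)
        F[c] = hits / len(trace)        # F(C) = N(C) / L
    return n, F

trace = list("abcabdacba")              # hypothetical trace, L = 10
n, F = success_function(trace, max_capacity=4)
```

A single pass over the trace yields the hit rate for every capacity up to `max_capacity`, which is the property that makes stack processing attractive for partitioning.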
Total Ordering
Replacement algorithms that induce a total ordering on all previously referenced addresses and use this ordering to make replacement decisions satisfy the inclusion property [1]. Two representative replacement policies with total ordering are LRU and LFU (Least Frequently Used) with a tie-breaking scheme. They maintain just one priority list, independent of the cache capacity. When there is a new block to insert, they choose the lowest-priority block in the cache for replacement. Replacement policies without total ordering, such as FIFO, must maintain one priority list per cache capacity. Therefore, replacement policies with total ordering are more area-efficient when we want to evaluate various cache capacities.
shows that BIP-Bypass updates the LRU priority list and the stack differently. Therefore, BIP-Bypass must maintain two lists (the LRU priority list and the stack). Unlike BIP-Bypass, stack processing for the LRU replacement policy needs only one list, since the LRU priority list is equivalent to the stack at any time, as in Fig. 3 (b).
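The remark that FIFO does not hold a total ordering can be checked on the well-known reference string used to demonstrate Belady's anomaly. The following Python sketch (the function name `fifo_contents` is hypothetical) records the cache contents of a 3-block and a 4-block FIFO cache after every reference and lists the times at which the smaller cache is not a subset of the larger one.

```python
from collections import deque

def fifo_contents(trace, capacity):
    """Simulate FIFO; return (snapshots, misses), one content snapshot per reference."""
    queue = deque()                 # insertion order, oldest block on the left
    resident = set()
    snapshots = []
    misses = 0
    for addr in trace:
        if addr not in resident:
            misses += 1
            if len(queue) == capacity:
                resident.discard(queue.popleft())   # evict the oldest block
            queue.append(addr)
            resident.add(addr)
        snapshots.append(frozenset(resident))
    return snapshots, misses

trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
small, misses3 = fifo_contents(trace, 3)
large, misses4 = fifo_contents(trace, 4)

# Times t at which B_t(3) is not a subset of B_t(4), i.e. inclusion is violated.
violations = [t for t, (s, l) in enumerate(zip(small, large)) if not s <= l]
```

On this trace the larger FIFO cache misses more often than the smaller one, and `violations` is non-empty, so a single stack cannot represent both capacities.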
Correlation between Insertion and Promotion
Recent
Bypass Extended Stack Processing
Conventional stack processing is based on the demand pag- 
The inclusion condition is also sufficient, because if it holds, $B_t(C)$ is a subset of $B_t(C+1)$.
Figure 4 is an example of inserting 'e' into caches of size 3 and 4. Figure 4 (a) is the initial state. When 'e' is inserted into both caches, the inclusion property is satisfied when 'b' (Fig. 4 (b)) or 'a' (Fig. 4 (c)) is the victim of the cache of size 4. Choosing another block such as 'd' (or 'c') violates the inclusion property, since 'd' (or 'c') would remain in the cache of size 3 but would no longer be in the cache of size 4. We can see that, of $Y_t(C)$ and $S_{t-1}(C+1)$, one becomes $Y_t(C+1)$ and the other becomes $S_t(C+1)$.
Figure 5 is an example of bypassing 'e' in the cache of size 3. Figure 5 (a) is the initial state. When 'e' is bypassed by the cache of size 3, the inclusion property is satisfied when 'a' is the victim of the cache of size 4 (Fig. 5 (b)) or when the cache of size 4 also bypasses 'e' (Fig. 5 (c)). Choosing any other block violates the inclusion property. From the inclusion condition, we can see that $B_{t-1}(C+1)$ can bypass a new block only when $B_{t-1}(C)$ bypasses it.
Figure 6 shows the inclusion property violation that occurs when we choose a victim from the total ordering list without the total ordering inclusion condition. Figure 6 (a) is the state after the last reference 'd' was bypassed by the cache of size 3 and inserted into the cache of size 4. Assume that a new reference 'e' occurs and the cache of size 3 bypasses 'e'. In Fig. 6 (b), if we insert 'e' into the cache of size 4, 'd' should be replaced to preserve the inclusion property. However, 'd' is not the victim that the total ordering priority list determines. In Fig. 6 (c), since 'a' is the lowest-priority block, 'a' is replaced instead of 'd' and the inclusion property is violated. To avoid this violation, the cache of size 4 should also bypass 'e'.
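The cases of Figs. 4 and 5 can be enumerated mechanically. The sketch below is illustrative only: the function name `valid_actions` is hypothetical, and the initial contents ({b, c, d} for size 3, {a, b, c, d} for size 4, with 'b' as the victim of the smaller cache) are assumptions chosen to be consistent with the description above, since the figures themselves are not reproduced here.

```python
def valid_actions(small_before, large_before, new_block, small_victim):
    """List the actions of the larger cache that keep small ⊆ large.

    small_victim is the block evicted by the smaller cache, or None if the
    smaller cache bypasses new_block.
    """
    small_after = set(small_before)
    if small_victim is not None:
        small_after.remove(small_victim)
        small_after.add(new_block)

    actions = []
    for victim in sorted(large_before):
        large_after = (set(large_before) - {victim}) | {new_block}
        if small_after <= large_after:          # inclusion property holds
            actions.append(("evict", victim))
    if small_after <= set(large_before):        # larger cache bypasses as well
        actions.append(("bypass", None))
    return actions

small = {"b", "c", "d"}          # assumed contents of the cache of size 3
large = {"a", "b", "c", "d"}     # assumed contents of the cache of size 4

insert_case = valid_actions(small, large, "e", small_victim="b")   # as in Fig. 4
bypass_case = valid_actions(small, large, "e", small_victim=None)  # as in Fig. 5
```

With these assumed contents, `insert_case` allows only 'a' or 'b' as the victim of the larger cache, and `bypass_case` allows only evicting 'a' or bypassing, which agrees with the two examples.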
Stack Update
At every reference, we need to update the stack to keep it up to date, as follows.
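If $Y_t(C)$ is not $\emptyset$, then by the inclusion condition, one of $Y_t(C)$ and $S_{t-1}(C+1)$ becomes $S_t(C+1)$ and the other becomes $Y_t(C+1)$.
If $Y_t(C)$ is $\emptyset$, $Y_t(C+1)$ is also $\emptyset$ by the total ordering inclusion condition. Since the contents of $B_{t-1}(C)$ and $B_{t-1}(C+1)$ do not change, $S_t(C+1) = S_{t-1}(C+1)$.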
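One step of this update, from capacity $C$ to capacity $C+1$, might be modelled as below. This is a sketch under stated assumptions, not the letter's hardware: the name `update_stage` is hypothetical, `None` stands for $\emptyset$, a larger `pri` value means lower priority, the lower-priority block is taken as the victim, and an empty stack slot simply absorbs the incoming block. The case in which $S_{t-1}(C+1)$ is the referenced block itself (a hit at this stack distance) is deliberately left out.

```python
def update_stage(y_in, s_prev, pri):
    """Model one stack stage for capacity C+1.

    y_in   : Y_t(C), the block pushed out of capacity C, or None (empty set).
    s_prev : S_{t-1}(C+1), the current stack entry, or None if the slot is empty.
    pri    : dict mapping block -> priority value (larger value = lower priority).
    Returns (S_t(C+1), Y_t(C+1)).
    """
    if y_in is None:
        # Bypass or nothing displaced: contents unchanged, nothing passed on.
        return s_prev, None
    if s_prev is None:
        # Assumption: an empty slot takes the displaced block, no victim.
        return y_in, None
    if pri[y_in] > pri[s_prev]:
        # The displaced block has lower priority, so it is the victim here too.
        return s_prev, y_in
    # The resident entry has lower priority: it is evicted, y_in takes the slot.
    return y_in, s_prev

pri = {"a": 3, "b": 2}                              # hypothetical priority values
kept, victim = update_stage("b", "a", pri)          # 'a' has lower priority
unchanged = update_stage(None, "a", pri)            # bypass case
```

Chaining this function over $C = 1, 2, \ldots$ mirrors the dependence of each capacity on the previous one, which is what motivates a pipelined organisation.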
Hardware for Bypass Extended Stack Processing
To evaluate the success function of various cache capacities under the BIP-Bypass replacement policy in real time, we need a hardware implementation of bypass extended stack processing.
In this implementation, a pipelined architecture is natural, since the $S_t(C)$ and $Y_t(C)$ values for cache capacity $C$ depend on the values for capacity $C-1$. In addition, pipelining provides high bandwidth and scalability. The priority list is also pipelined to match the pipelined stack. Figure 7 shows the conceptual hardware structure of the priority list and the stack. Since all stages of the stack need the priority value of the new block, the pipelined priority list is accessed first and the pipelined stack second. Each stack stage is augmented with a pri field, which represents the priority of the stack entry. In our implementation, a larger pri value means lower priority. The first stage is slightly different from the other stages: it contains bypass decision logic, which decides whether to bypass the newly referenced block. It generates the bypass signal randomly with high probability to preserve the cache contents.
Conclusion
BIP and BIP-Bypass are anti-thrashing replacement policies. In this letter, we proved that the BIP replacement policy is not amenable to stack processing but BIP-Bypass is. We modified stack processing to accommodate bypass, and we proposed a pipelined hardware architecture of stack processing for BIP-Bypass. With this hardware, we can assess the capacity demand of each core under an anti-thrashing replacement policy in real time. Using this information, the shared cache can be partitioned under an anti-thrashing replacement policy as well as under the LRU replacement policy.
