ABSTRACT A dynamic shared cache partitioning scheme for multi-core processors is presented. Capacity misses produced by the running processes are continuously evaluated and used to assign the available space in a shared cache memory. This management reduces the global miss rate of all the executed processes by up to 20% when compared to a Capitalist management policy. IPC and bandwidth were also evaluated. The proposed dynamic management fulfills its objective of managing the shared cache space for every process while improving performance.
Introduction
Multi-core architectures have become established as the processor organization that can sustain performance growth as the number of available transistors increases. Processors with multiple cores are increasingly common in all computer system markets, including smartphones, tablets, desktops and servers. Each processor core has private and shared memory available [3]. At present, some processor architectures rely on multiple levels of cache memory. For the sake of simplicity and performance, some levels are kept private to each processor core, while other architectures explore sharing the last level of cache among the cores of the processor, as illustrated in figure 1 [7, 4, 15, 13]. Shared cache memories are highly beneficial to performance: in the extreme case in which a single core is active while the others are idle, the running process has the whole shared cache at its disposal, which is larger than the private memories of each core. Moreover, they provide high performance when cores share information among themselves, reducing latency and coherence traffic. However, they are a major management challenge, since multiple processes access them simultaneously, giving rise to situations that never occur in private caches, such as one process overwriting the information of another (causing inter-process misses), as well as other effects that may degrade performance, such as changes in the execution context of the running programs [5].
For this reason, it is desirable to grant each running process an amount of shared cache memory that achieves the best possible performance without reducing the efficiency of the other running processes.
Shared Cache Memory Management
Studying the behavior of the last level of the cache memory hierarchy is of great interest, as this is where the processes running on each of the processor cores share space and information. This sharing causes interference between processes, as those accessing large amounts of memory generally take up more space in the shared cache to the detriment of the space occupied by the other processes. In addition, changes in execution context generally involve a change in the rate of memory accesses, which is reflected in more or less space being occupied by each process. This indicates the need to adopt a policy for managing the space that each process can use in the shared cache. Management policies for shared cache memories can be classified as follows [8]:
• Without Management or "Capitalist": processes that use an unmanaged cache compete for cache space, causing considerable interference between their data and a consequent drop in performance for the whole system. An unmanaged shared cache is simpler to develop and occupies less area, since it has only the logic of a common cache, provided that it offers separate interfaces for all the processor cores that can access it simultaneously. The "Capitalist" cache policy is an unregulated free-for-all and the most common policy in use today.
• Static Management: static management allocates a certain amount of cache space to each process before its execution, to avoid interference between processes in the use of the cache. While this policy improves performance in specific cases, it does not take into account changes in context and the consequent variation in memory use that every process shows throughout its execution. The resulting poor performance means that this policy is discarded in practice.
• Dynamic Management: this policy provides a certain amount of cache space to each running process according to its needs dynamically, i.e. while it is running. Different objectives can be used to allocate space dynamically. For example, giving every process an appropriate amount of space so that all of them have similar performance is a policy known as fairness or equity. Another policy, in which a process can be marked as prominent so that it receives more space than the others, is known as Utilitarian.
In this work a new dynamic cache management policy is presented, whose main objective is to improve the overall performance of the applications by using a metric based on capacity misses to measure and manage the space needed by each process in shared cache memories. The next section introduces the proposed dynamic management policy.
Algorithm for Dynamic Management
In multi-core architectures, the last level of cache memory is typically shared by multiple cores to maximize the use of resources by avoiding duplication of information and by reducing coherence traffic. Furthermore, shared cache memories allow a more flexible dynamic allocation of resources; in an extreme case, a core can use all the cache space when all the other cores are idle [12]. A shared cache memory uses space more flexibly and thereby reduces the number of accesses to main memory. However, it has a higher latency that propagates to the higher levels. In contrast, a private cache memory has lower latency but uses space inefficiently, because the space is assigned to the cores in a fixed way and the hierarchy contains multiple copies of data shared among processes. Finally, the absence of isolation and management in the use of cache space results in performance degradation. These effects are amplified by the competition for resources in multi-process environments. For these reasons, resource management policies are needed to minimize them [16].
Metrics to Evaluate Performance Fairness
Since the fairness of the resource allocation must be evaluated for each process during execution, a metric is needed that determines the actual performance of each process according to its allocated memory resources and the miss rate it produces in a time slice. The metrics for performance fairness in [9, 17] are expressed as follows:
This is the measure of fairness for two processes $i$ and $j$. $T_i^{Sha}$ is the run time of process $i$ in an environment where it shares resources with process $j$, and $T_i^{Ded}$ is the execution time of process $i$ with all resources to itself (dedicated). Of the variables that enter the calculation of run time, the dynamic management proposed in this paper aims to reduce the miss rate in the shared cache, leaving the other variables unchanged. Therefore, equation (1) can be expressed in terms of the miss rate in the shared cache. Because the resource to be managed is space in the shared cache, only capacity misses are considered for the metric. The operational definition of a capacity miss given by the D3C classification of cache memory misses [6] is used in this work: a miss is classified as a capacity miss or not based on the value of its LRU stack distance.
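As a sketch only, assuming the metric keeps the pairwise form used in the fair cache sharing literature that the text cites as [9, 17], equation (1) and its miss-based counterpart (2) can be written as below. The names $X_i$, $\hat{X}_i$, $M_{ij}$ and $\hat{M}_{ij}$ are notation assumed here for illustration, and the exact form used in the cited works may differ.

```latex
% (1) Fairness expressed with execution times (assumed form)
X_i = \frac{T_i^{Sha}}{T_i^{Ded}}, \qquad
M_{ij} = \left| X_i - X_j \right|

% (2) Same structure with capacity misses replacing execution times (assumed form)
\hat{X}_i = \frac{Misses_i^{Sha}}{Misses_i^{Ded}}, \qquad
\hat{M}_{ij} = \left| \hat{X}_i - \hat{X}_j \right|
```

Under this reading, a process whose misses when sharing equal its misses with the whole cache has $\hat{X}_i = 1$.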
To obtain the ideal performance, the value of the fraction for each process should be as close to unity as possible. The equation for multiple processes in progress (2) is defined by the expression FMDCM (Fairness Metric for Deterministic Cache Misses):
The practical technique for implementing (3) is described in the following sequence of steps:
1. Define the LRU stack size as equal to the number of blocks in the cache memory. Define a mark equal to the number of cache blocks allocated to the process being analyzed.
2. If the missing memory reference is not found in the stack, it is counted as both a dedicated and a shared capacity miss (or as a compulsory miss).
3. If the memory reference is in the stack and its LRU distance is higher than the mark (which is the number of cache blocks allocated to the process at that moment), it is a shared capacity miss.
4. Finally, if the memory reference is in the stack and its LRU distance is lower than the mark, it is a conflict miss.
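The four steps above can be sketched in Python as follows. This is a minimal illustration and not the simulator used for the results: the class and function names are hypothetical, the LRU distance is assumed to be 1-based (most recently used block at distance 1), and a distance exactly equal to the mark is assumed to be a conflict miss, a case the steps leave unspecified.

```python
from collections import OrderedDict

# Labels for the three outcomes of steps 2-4 (names are assumptions).
DEDICATED_AND_SHARED = "dedicated_and_shared_capacity"  # also covers compulsory
SHARED_ONLY = "shared_capacity"
CONFLICT = "conflict"


class LRUStack:
    """LRU stack of block addresses, bounded by the total number of cache blocks."""

    def __init__(self, total_blocks):
        self.total_blocks = total_blocks
        self._stack = OrderedDict()  # last key is the most recently used block

    def distance(self, block):
        """Return the 1-based LRU distance of block, or None if it is absent."""
        if block not in self._stack:
            return None
        keys = list(self._stack)  # linear scan: fine for a sketch, slow at scale
        return len(keys) - keys.index(block)

    def touch(self, block):
        """Record a reference: move block to the MRU position, evicting if full."""
        if block in self._stack:
            self._stack.move_to_end(block)
            return
        if len(self._stack) >= self.total_blocks:
            self._stack.popitem(last=False)  # drop the least recently used block
        self._stack[block] = True


def classify_miss(stack, block, mark):
    """Classify a missing reference given the blocks allocated to the process."""
    d = stack.distance(block)
    if d is None:          # step 2: not in the stack at all
        return DEDICATED_AND_SHARED
    if d > mark:           # step 3: would fit only with more space than allocated
        return SHARED_ONLY
    return CONFLICT        # step 4 (equality treated as conflict by assumption)
```

A caller would classify each miss first and then call `touch` for every reference, so that the stack reflects the access history seen so far.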
To achieve the best performance of all running processes, the value of equation (3) must be minimized each time the number of cache sets allocated to each process is managed.
It remains to define how the metric (3) varies when a certain number of cache sets is removed or added as necessary, in order to accomplish the objective in (4). Equation (3) is applied to every executing process at each constant time slice.
Using a technique similar to that proposed in [10] to manage the space allocated to each process, tiles containing a fixed number of cache sets are created and exchanged between processes. Each tile holds information about the number of hits and of shared or dedicated capacity misses that occurred in the execution period before the management heuristic was applied.
Removing a tile allocated to a process must produce an increase in the metric (2) for that process. The new value is obtained by adding the shared misses of the deallocated tile (misses with an LRU distance greater than the number of cache blocks allocated and lower than the total number of shared cache blocks) to the total number of shared misses, while keeping the number of dedicated misses unchanged. Experimentally, this calculated value was found to be a good predictor of the performance decline of the process whose available cache space was reduced. The reduction in capacity misses for the process that receives a new tile (resulting in a performance increase) is less accurate, since it cannot be determined in a simple way: it depends on the use the process makes of the tile (in terms of number of hits). The value adopted was that of the tile with the fewest shared misses. So, the final value was obtained by subtracting the shared misses of the tile with the fewest shared misses from the shared misses of the available tiles, while keeping the number of dedicated misses unchanged. Although it may be objected that the average of all capacity misses, or the misses of the tiles with the lowest and highest number of accesses, could be more accurate, the test results demonstrate that this metric is effective, practical and easy to implement.
The last consideration is that, in certain iterations, the number of dedicated misses is equal to 0. This can happen because the process had mainly shared misses, or because it had only conflict misses, in which case the shared misses are also equal to 0. In the first case, the metric was taken to be the number of shared misses. The reason for this decision is that a process with only this kind of miss will benefit from increased space, since its data size is the same as or lower than that of the shared cache. If multiple processes are in this state, the metric gives more space to the one with the highest number of misses. In the second case, the metric takes the ideal value of 1, since having neither shared nor dedicated misses corresponds to the ideal value for (2).
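The tile predictions and the two zero-miss cases described above can be sketched as follows. This is one possible reading, not the implementation evaluated in this work: it assumes that the per-process value of (2) is the ratio of shared to dedicated capacity misses, that a miss absent from the whole LRU stack counts in both totals, and that granting a tile saves as many shared misses as the tile with the fewest shared misses recorded. All function and parameter names are hypothetical.

```python
def fairness_ratio(shared_misses, dedicated_misses):
    """Per-process miss ratio, with the two special cases for zero dedicated misses."""
    if dedicated_misses == 0:
        if shared_misses == 0:
            return 1.0                # only conflict misses: ideal value
        return float(shared_misses)   # only shared misses: use their count
    return shared_misses / dedicated_misses


def ratio_after_removal(shared_misses, dedicated_misses, removed_tile_shared):
    """Predicted ratio if a tile is taken away: its shared misses are added."""
    return fairness_ratio(shared_misses + removed_tile_shared, dedicated_misses)


def ratio_after_addition(shared_misses, dedicated_misses, tile_shared_misses):
    """Predicted ratio if a tile is granted (assumed reading of the text).

    tile_shared_misses lists the shared misses recorded per tile; the
    smallest entry is used as a conservative estimate of the misses saved.
    """
    saved = min(tile_shared_misses)
    return fairness_ratio(max(shared_misses - saved, 0), dedicated_misses)
```

With made-up counts of 30 shared and 10 dedicated misses, the ratio is 3.0; removing a tile that recorded 10 shared misses raises the prediction to 4.0, and granting a tile when the per-tile counts are 5 and 20 lowers it to 2.5.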
This section presented the metrics applied to evaluate the performance of simultaneously running processes, as well as the dynamic management policy used for the shared cache.
Experimental Methodologies

Simulated Architecture
Bearing in mind the memory hierarchies used in some recent multi-core processor models, the architecture chosen for the simulation is the following:
• 2 cores.
• Private split L1. 32KB, 4 ways and 64B block size. 1 cycle for hits.
• Private and unified L2. 256KB, 8 ways and 64B block size. 7 cycles for hits.
• Shared and unified L3. 2MB, 16 ways and 64B block size. 20 cycles for hits.
• 380 cycles for main memory latency.
• 4 instructions in parallel execution per core.
• 50 cycles of administration time (if applicable).
• 300000 cycles run between administrations (time slice).
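The parameters listed above can be captured in a small configuration sketch that also derives the number of blocks and sets per level. The class and constant names are hypothetical; the sketch assumes that the 32KB figure is the size of one L1 cache and that the 50 administration cycles are spent in addition to the 300000 cycles of each time slice.

```python
from dataclasses import dataclass

KB = 1024
MB = 1024 * KB


@dataclass(frozen=True)
class CacheLevel:
    """Geometry and hit latency of one cache level."""
    name: str
    size_bytes: int
    ways: int
    block_bytes: int
    hit_cycles: int

    @property
    def blocks(self):
        """Total number of cache blocks in this level."""
        return self.size_bytes // self.block_bytes

    @property
    def sets(self):
        """Number of sets, i.e. blocks divided by associativity."""
        return self.blocks // self.ways


L1 = CacheLevel("L1", 32 * KB, 4, 64, 1)
L2 = CacheLevel("L2", 256 * KB, 8, 64, 7)
L3 = CacheLevel("L3", 2 * MB, 16, 64, 20)

MEMORY_LATENCY_CYCLES = 380
ADMIN_CYCLES = 50        # cost of one administration step
SLICE_CYCLES = 300_000   # cycles run between administrations

# Fraction of cycles spent on administration under the assumption above.
admin_overhead = ADMIN_CYCLES / (SLICE_CYCLES + ADMIN_CYCLES)

if __name__ == "__main__":
    for level in (L1, L2, L3):
        print(level.name, level.blocks, "blocks,", level.sets, "sets")
    print(f"administration overhead: {admin_overhead:.4%}")
```

For the shared L3 this gives 32768 blocks in 2048 sets, which is also the LRU stack size required by the classification of capacity misses.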
Three test sets consisting of two benchmarks each were applied.
(A) H264 and Libquantum.
(B) Gcc and Libquantum.
(C) H264 and Gcc.
Every test set was executed for 1000 million instructions without being instrumented to eliminate the startup effects of every benchmark, and then instrumented with the PIN tool [1, 11] for a total of 200 million instructions.
Simulation Results
Pursuing the main objective of the proposed management policy affects several parameters that measure the performance of the processes in progress. For this reason, three metrics were considered to analyze the simulation results obtained. The metrics used were:
1. Shared cache miss rate
2. Instructions Per Cycle (IPC)
3. Bandwidth used
For these three metrics, the proposed policy was compared with the capitalist management policy, which is used in related and reference works.
Shared Cache Miss Rate
The first results to be analyzed correspond to the shared cache miss rate obtained with the proposed dynamic management and with the capitalist management policy. The aim is to minimize the number of misses, as these entail very high penalties that reduce the IPC. Two figures are shown for each test set assessed. The first corresponds to the miss rate in the shared cache for each individual process, while the second shows the global miss rate of the test set.
In test set A, the miss rate declines continuously up to execution cycle 33M, after which it reaches a steady state. This is the result of the number of hits needed to stabilize the shared cache memory. These early misses can be regarded as compulsory misses, so it can be assumed that from execution cycle 37M onwards the shared cache memory is in its execution "regime".
After execution cycle 40M the heuristic developed here starts to differ from the capitalist one, by eliminating inter-process misses and giving space to each benchmark. From this point, the H264 benchmark has the space it needs for its data and suffers no interference from Libquantum, so its miss rate decreases, while for Libquantum the space available for data is restricted and the miss rate increases. Inter-process misses represent 34.4% of the total misses produced in the shared cache under the Capitalist management and 0% under the proposed management. The maximum reduction in the global miss rate achieved for this test set under the proposed management is about 20%, and according to the information obtained after the simulation, Libquantum ended up occupying 51.6% of the shared cache space and H264 the remaining 48.4%.
Each tile includes two sets of the simulated shared cache, which results in a size of 2KB per tile. It is concluded that the heuristic provided more space to the Libquantum benchmark, which has lower data locality than H264, while the latter was assigned the space necessary to achieve its optimal performance according to the metrics applied. Regarding the global miss rate for this test set, the space restriction imposed by separating the cache into disjoint tiles made no difference to the miss rate in the first execution cycles (the heuristic did not perform reallocations), while once space management started, the global miss rate decreased. Although the H264 miss rate decreases and the Libquantum miss rate increases, taken globally, the benefits of applying the dynamic management policy to the shared cache memory can be seen.
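The tile size quoted above can be checked against the parameters of the simulated L3. The worked arithmetic below assumes that a tile spans all 16 ways of its two sets; the resulting tile count is derived here and is not a figure reported by the simulation.

```latex
\text{tile size} = 2\ \text{sets} \times 16\ \text{ways} \times 64\ \text{B} = 2048\ \text{B} = 2\ \text{KB}
\qquad
\text{number of tiles} = \frac{2\ \text{MB}}{2\ \text{KB}} = 1024
```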
The analysis of test set B shows that a number of execution cycles similar to that of test set A is required to reach the cache regime. The individual miss rates of each benchmark show the same effect as in test set A.
However, compared with the results for test set A, a smaller improvement was achieved in the Gcc miss rate, while the range of increase in the Libquantum miss rate was preserved. This is because Gcc makes 67.2% more accesses to the shared cache than H264 does, resulting in an inter-process miss rate of 35.5% for this test set under the capitalist management policy.
The dynamic management ended up providing less space to Gcc than to H264, since the former has higher data locality up to execution cycle 85M, whereas Libquantum has much lower data locality relative to the size of the shared cache studied and all the same fails to make a considerable difference, even though the space it receives is larger than the one obtained in test set A. The maximum reduction in the global miss rate achieved for this test set with the proposed management is about 11%. The tile allocation produced by the dynamic management for this test set confirms what was concluded from the individual analysis of each benchmark's miss rate. Besides allocating cache space efficiently, the space allocation adapted quickly enough to the changes in the access rates of the test set.
To evaluate a more heavily used shared cache memory, a three-core system running the three benchmarks H264, Libquantum and Gcc was simulated. The total miss rate obtained is shown in figure 6. The maximum reduction in the global miss rate achieved for this test set with the proposed management is about 13%, which is consistent with the results obtained with two-core systems.
Instructions Per Cycle (IPC)
Comparing the IPC of a process executed under the proposed management policy with its IPC under the Capitalist policy gives an idea of the performance improvement. This increase can be calculated using the following expression:
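A minimal sketch of such an expression, assuming the increase is the usual relative difference between the two IPC values, is given below. The subscripts $dyn$ (proposed dynamic management) and $cap$ (Capitalist policy) are names assumed here for illustration.

```latex
\Delta IPC\,(\%) = \frac{IPC_{dyn} - IPC_{cap}}{IPC_{cap}} \times 100
```

With this definition a positive value means that the process runs faster under the proposed management, and a negative value means that it runs slower.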
For test set A, the analysis of this metric indicates that the decrease in the H264 miss rate is clearly reflected in an increase in its IPC, while for Libquantum, which has a lower rate of memory hits, the increase in the miss rate does not affect its performance significantly. The maximum improvement achieved for H264 was 10%, while the maximum performance decrease for Libquantum was 3%. The results agree with the corollary stated in [14], which indicates that improving fairness in resource management increases global performance in multi-process environments.
In test set B, the Gcc benchmark benefits from the proposed dynamic management, since its rate of cache hits is higher than that of H264. By efficiently managing the space allotted to each benchmark, varying it fast enough to match the changes in the hit rate of the shared cache, and eliminating inter-process misses by distributing space in tiles, it is possible to increase the IPC of Gcc. This increase is lower than that obtained by H264, because Gcc has lower data locality, but Gcc still benefits from the removal of the interference produced by Libquantum.
Bandwidth Used
This metric measures the use of main memory by the executing processes. By reducing the bandwidth used by the processing cores, other devices may have access to it, thus improving general system performance. Decreasing the miss rate in the shared cache level implies fewer accesses to main memory, which should decrease the bandwidth used by the test sets. As the global miss rate of the shared cache is reduced significantly by the proposed management, so is the bandwidth used. This is clearly seen in the following figures, which show the bandwidth of each test set under the capitalist management policy and under the policy developed in this paper. For test set A, the reduction in bandwidth usage is about 20%, and for test set B it is about 12%.
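The link between shared cache misses and bandwidth can be illustrated with a short sketch. It assumes that every shared cache miss fetches one 64B block from main memory and ignores write-backs and prefetching; the function names and the miss counts in the usage note are made up for illustration and are not simulation results.

```python
BLOCK_BYTES = 64  # block size of the simulated shared cache


def memory_traffic_bytes(l3_misses, block_bytes=BLOCK_BYTES):
    """Bytes fetched from main memory, one block per shared cache miss."""
    return l3_misses * block_bytes


def bandwidth_bytes_per_cycle(l3_misses, cycles, block_bytes=BLOCK_BYTES):
    """Average main memory bandwidth over an interval of the given length."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return memory_traffic_bytes(l3_misses, block_bytes) / cycles


def relative_reduction(baseline, managed):
    """Fraction by which the managed value is lower than the baseline."""
    return (baseline - managed) / baseline
```

For example, 1000 misses under one policy and 800 under another over the same 300000 cycles give a relative bandwidth reduction of 0.2, the same proportion as the reduction in misses.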
Test set C deserves a separate analysis. The Gcc and H264 benchmarks showed no variation under the proposed dynamic management policy with respect to the capitalist one. This is because the two benchmarks did not produce enough inter-process misses for the proposed heuristic to obtain a performance improvement by varying the space allocated to each process. Under the capitalist management policy, inter-process misses represented only 2.4% of shared cache misses, so the policy developed here assigned the same number of tiles to each benchmark and exchanged no tiles, resulting in a negligible variation of the miss rate, IPC and bandwidth metrics.
Conclusion
It was observed that the proposed policy achieves a lower global miss rate in the shared cache than the capitalist management policy for each test set studied. Certain processes increased their miss rates because of space restrictions, so that other processes could make better use of that space, thus increasing the global IPC and reducing the required memory bandwidth. It was also shown that a more efficient use of space in the shared cache memory is achieved by eliminating inter-process misses and by adapting the space available to each running process at a suitable speed according to its hit rate in the shared cache memory. The maximum global miss rate reduction achieved was 20% for test set A and 12% for test set B. The dynamic management heuristic implemented fulfills its objective of managing the shared cache space for every process while improving overall performance.
Future Work
Future work will concentrate on evaluating dynamic management policies in environments with more cores, in order to analyze the behavior of the heuristic in over-exploited shared cache memories. Another aspect to be studied is the application of the heuristic to simultaneous multi-threading (SMT) environments and to multiple cores with multiple processing threads.
