Abstract-General-purpose architectures are designed to offer high average performance regardless of the particular application that is being run. As a consequence, performance and power inefficiencies appear for some programs. Reconfigurable hardware (cache hierarchy, branch predictor, execution units, bandwidth, etc.) has been proposed to overcome these inefficiencies by dynamically adapting the architecture to the application needs. However, nearly all the proposals use indirect measures or heuristics of performance to decide new configurations, which may lead to inefficiencies.
In this paper we propose a runtime mechanism that predicts the throughput of an application on an architecture with a reconfigurable L2 cache. The L2 cache size varies at way granularity, and we predict the performance of the same application on all other L2 cache sizes at the same time. Across the different L2 cache sizes we obtain an average error of 3.11%, a maximum error of 16.4% and a standard deviation of 3.7%. The mechanism needs neither profiling nor Operating System participation. We also give a hardware implementation that reduces the hardware cost to under 0.4% of the total L2 size while maintaining high accuracy. This mechanism can be used to reduce power consumption in single-threaded architectures and to improve performance in multithreaded architectures that dynamically partition shared L2 caches.
I. INTRODUCTION
The limitation imposed by instruction-level parallelism has motivated the adoption of thread-level parallelism (TLP) as a common strategy for improving processor performance. TLP paradigms such as simultaneous multithreading (SMT) [1], [2], chip multiprocessing (CMP) [3] and combinations of both offer the opportunity to obtain higher throughput, but they also face the challenge of sharing architecture resources. Some studies deal with the resource sharing problem in SMTs at the level of core resources [4] such as issue queues, registers, etc. In CMPs, resource sharing is lower than in SMT and is concentrated in the cache hierarchy. Several mechanisms have been proposed to dynamically split the L2 cache in a CMP architecture in order to maximize throughput or fairness [5]-[9].
However, adapting resources to program needs is not only a problem of multithreaded architectures. Several mechanisms have been proposed in superscalar architectures to use reconfigurable hardware that adapts microarchitecture features when the characteristics of the program change [10]-[14]. The common problem with all these self-tuning techniques is that decisions are based on indirect performance metrics or empirical heuristics.
In this paper we focus on dynamic configuration of the cache hierarchy. In particular, we propose a mechanism that accurately predicts the Instructions Per Cycle (IPC) of an application as we vary the amount of cache we devote to it. We vary the L2 cache size by activating or deactivating some ways of a set-associative cache. Our mechanism combines Stack Distance Histograms [15] and an analytical model for predicting processor IPC introduced in [16]. We have also improved the model of [16] in order to increase our IPC prediction accuracy. Over all SPEC CPU 2000 benchmarks, our mechanism obtains an average error of 3.11%, with a maximum error of 16.4% (twolf) and a standard deviation of 3.7%. The ability to predict IPC as we change the cache configuration can be applied in two different scenarios:
Cache sharing in MT architectures. A better sharing of the L2 cache among the running threads can be obtained. Previous work on this topic proposes static and dynamic partitioning of a shared L2 cache in CMP/SMT architectures in order to maximize throughput or fairness [5]-[9]. These proposals use indirect metrics of throughput such as the total number of misses or data reuse. The mechanism we propose provides a direct estimation of performance for different cache configurations, which is the appropriate metric to maximize total throughput.
Power reduction. A better reduction in cache energy dissipation can be obtained by adjusting the hardware resources. With a direct estimate of the performance of the application, it is possible to obtain the desired tradeoff between power consumption and performance. Previous work consisted of statically switching L2 ways on or off [13] or switching off lines after a number of cycles without being accessed [14]. However, these proposals cannot bound performance losses and rely on empirical heuristics. Knowing the real contribution of each way to the final IPC makes it possible to bound performance losses while saving power.
In both areas, previous work relies on empirical heuristics and thresholds to make decisions. To our knowledge, we are the first to combine runtime measurements with analytical models to dynamically predict the actual performance impact of such decisions. The main contributions of this work are:
1) A runtime mechanism to predict IPC for different cache configurations with high accuracy. This proposal can help to reconfigure L2 caches in CMP/SMT scenarios for dynamic cache partitioning and in single-core scenarios to reduce power. The important difference with previous work in these areas is that new configurations are based on real measures of throughput instead of indirect measures or ad hoc heuristics.
2) A modified version of the memory model in [16] that better predicts the cost of an L2 miss.
3) A sampling technique that reduces hardware cost to under 0.4% of the total L2 size without excessive accuracy loss (the average error rises to 4%).
4) A systematic benchmark classification so that results are consistent in every benchmark group. This classification extends previous intuitive classifications that were obtained by hand.¹
1-4244-1058-4/07/$25.00 © 2007 IEEE
The rest of this paper is structured as follows. In Section II we introduce the methods that we employ to predict IPC curves and in Section III we show how to combine them to obtain IPC predictions. Next, in Section IV we describe the experimental environment and in Section V we discuss simulation results. In Section VI we deal with a practical implementation in hardware as well as a sampling technique to reduce its cost. In Section VII we report related work and, finally, we conclude with Section VIII.
II. BASIS OF IPC CURVES PREDICTION
We define the IPC curve of an application as the set of IPC values that the application obtains for different configurations of the L2 cache. These configurations have way granularity. Thus, in a K-way L2 cache, IPC curves have exactly K points (as we assume that at least one way is assigned to the application). In order to accurately predict IPC curves, we combine two instruments: the stack distance histogram (SDH) [15], and an analytical model for superscalar processor performance [16].
¹ This classification is not used in IPC predictions.
Stack Distance Histogram. In [15] the concept of stack distance is introduced to study the behavior of storage hierarchies. [...] [5] or by adding some hardware counters that profile this information [6], [7]. A characteristic of these histograms is that the number of cache misses for a smaller cache with the same number of sets can be easily computed. For example, for a K'-way associative cache, where K' < K, the new number of misses can be computed as:
$$\mathrm{misses}(K') = C_{>K} + \sum_{i=K'+1}^{K} C_i \qquad (1)$$
where $C_i$ is the SDH counter of accesses with stack distance $i$ and $C_{>K}$ is the counter of accesses that miss in the K-way cache.
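The computation behind Equation (1) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `misses_for_ways` and the list layout of the histogram are assumptions, and the first two counter values of `example_sdh` are hypothetical (only the three values that add up to the 20 misses of the paper's 4-way example are taken from the text).

```python
def misses_for_ways(sdh, k_prime):
    """Return the number of misses a K'-way cache would see.

    sdh is assumed to hold K+1 counters: sdh[i-1] counts accesses with
    stack distance i (1..K) and sdh[K] counts accesses that miss even
    with all K ways (assumed layout, not taken from the paper).
    """
    k = len(sdh) - 1
    if not 1 <= k_prime <= k:
        raise ValueError("k_prime must be between 1 and K")
    # Misses of the full cache plus every hit whose stack distance
    # exceeds the reduced associativity K'.
    return sdh[k] + sum(sdh[k_prime:k])


# Hypothetical 4-way histogram: the first two counters are made up,
# the last three reproduce the 5 + 5 + 10 of the example in the text.
example_sdh = [60, 20, 10, 5, 5]

print(misses_for_ways(example_sdh, 4))  # misses with all 4 ways
print(misses_for_ways(example_sdh, 2))  # misses with only 2 ways
```

Because the histogram is collected once for the full associativity, the miss count of every smaller configuration follows from the same set of counters.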
As an example, in Table II we show an SDH for a set with 4 ways. In the example, 5 accesses would have missed in the cache. However, if we had reduced the number of ways to 2 (keeping the number of sets constant), we would have experienced 20 misses (5 + 5 + 10).
The basic difference between instruction and data cache misses is that instruction fetch and issue continue after a data cache miss, so several data misses can occur in parallel. After $\Delta_D$ cycles, data is delivered from main memory. In [16] it is shown that the miss penalty for an isolated data L2 miss can be approximated by $\Delta_D$.
When we have a burst of n L2 data misses, they overlap if they all fit in the reorder buffer (ROB). In this situation, the miss penalty of $\Delta_D$ cycles is shared among all misses. Then, with $M_D$ the total number of L2 data misses and $N_i$ the number of times that we have a burst of $i$ misses that fit in the ROB, we can approximate the average L2 data miss penalty per miss (DCM) with the following formula:
$$DCM = \frac{\Delta_D}{M_D} \sum_{i=1}^{ROB_{size}} N_i$$
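One way to read the description above is that every burst costs one full memory latency, which is then averaged over all data misses. The following Python sketch computes that average under this reading; the function name, the dictionary representation of the burst counts, and the 300-cycle latency are assumptions made for illustration only.

```python
def average_data_miss_penalty(burst_counts, delta_d):
    """Average L2 data miss penalty per miss (DCM).

    burst_counts maps a burst size i to N_i, the number of bursts of
    i overlapping misses (assumed representation). delta_d is the
    memory latency in cycles.
    """
    total_misses = sum(i * n for i, n in burst_counts.items())  # M_D
    if total_misses == 0:
        return 0.0
    # Each burst pays delta_d once, regardless of how many misses it holds.
    total_penalty = delta_d * sum(burst_counts.values())
    return total_penalty / total_misses


# Hypothetical workload: 10 isolated misses, 5 bursts of 2, 5 bursts of 4.
print(average_data_miss_penalty({1: 10, 2: 5, 4: 5}, delta_d=300))
```

With only isolated misses the result equals the full latency, and it shrinks as more misses overlap in the ROB.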
[...] 1.52 instructions per cycle. Using the previous formula, we predict the throughput for a configuration with 16 ways as 1.637, which is very close to the real value of 1.647. The relative error of this prediction is 0.65%.
III. PREDICTION OF IPC CURVES
In this section, we detail how to obtain an Online Prediction of Applications Cache Utility. We call this methodology OPACU. We use SDHs to compute the number of misses for each possible L2 cache size, and the memory model described in the previous section to determine the miss penalty of each L2 miss.
OPACU Methodology. In our baseline configuration, the L2 cache has a variable number of active ways. We start by assigning $w$ ways and measuring the throughput of an application during $C$ cycles. We denote this value as $IPC_{real,w}$, with $IPC_{real,w} = I / C$, where $I$ is the number of committed instructions in this period. This IPC value is valid for the number of ways currently in use. Thanks to the SDH, we know whether an access would be a miss or a hit with a different number of ways $w' \in [1, K]$.
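As a rough illustration of how a measured IPC could be turned into a prediction for another cache size, the Python sketch below adjusts the measured cycle count by the change in total miss penalty. It is a sketch under stated assumptions rather than the paper's exact formula: it assumes that each burst of data misses costs one data latency, that instruction misses never overlap and each costs one instruction latency, and all names and numbers are hypothetical.

```python
def predict_ipc(instructions, cycles,
                bursts_now, bursts_target,
                imisses_now, imisses_target,
                delta_d, delta_i):
    """Predict IPC for a target number of ways from one measurement.

    bursts_* are numbers of data-miss bursts and imisses_* numbers of
    instruction L2 misses, for the current and the target configuration.
    delta_d / delta_i are assumed per-burst and per-miss penalties.
    """
    # Change in total miss penalty when moving to the target configuration.
    cycle_variation = (delta_d * (bursts_target - bursts_now)
                       + delta_i * (imisses_target - imisses_now))
    predicted_cycles = cycles + cycle_variation
    if predicted_cycles <= 0:
        raise ValueError("penalty estimate removes more cycles than were measured")
    return instructions / predicted_cycles


# Hypothetical interval: 1,000,000 instructions in 800,000 cycles (IPC 1.25).
ipc_real = 1_000_000 / 800_000
ipc_more_ways = predict_ipc(1_000_000, 800_000,
                            bursts_now=1000, bursts_target=500,
                            imisses_now=100, imisses_target=100,
                            delta_d=300, delta_i=300)
print(ipc_real, ipc_more_ways)
```

In this hypothetical case, halving the number of data-miss bursts removes 150,000 penalty cycles, so the predicted IPC rises above the measured one.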
Independently of the number of active ways, we store the LRU counters of the last K accesses of the thread and obtain the SDH for the whole K-way associative L2 cache, as we explain in Section VI. In addition, using the analytical model explained in Section II, we can estimate at runtime the number of times that we have a burst of $i$ L2 data misses with an L2 cache with $w'$ ways. We denote this number as $N_i^{w'}$ for $1 \le i \le ROB_{size}$ for any possible number of ways $w'$, and the total number of instruction and data L2 misses as $M_I^{w'}$ and $M_D^{w'}$ respectively. These [...]
A. Accuracy
Figure 3 [...]
Another source of inaccuracy is the use of a fixed value of ROB occupancy after a data L2 miss commits. It is clear that when the ROB size is the bottleneck, the ROB fills nearly always. This is exactly the case for art, which is the H benchmark with the lowest error. In Figure 6(a) we show art's ROB occupancy after an L2 miss. However, in the case of twolf, ROB occupancy after a data L2 miss varies a lot and the mean value is less representative of overlaps (see Figure 6(b)). Thus, when the ROB is not the main bottleneck for performance, we obtain higher prediction errors.
In this section we show a possible implementation of our IPC prediction method. There are two main hardware components. First, we need a set of counters to track the number of committed instructions, the number of cycles and the average ROB occupancy after an L2 miss commits. Second, we need some hardware to track L2 accesses, as other proposals have already shown [6], [7]. For each cache configuration we use hardware to determine at runtime if two L2 accesses overlap, count the number of bursts of overlapped L2 misses, and predict IPC.
Overlap Counters. For each possible cache size we have a counter $overlap_w$ that counts the number of bursts of overlapped L2 misses. When we have a miss, we need to know if this miss overlaps with previous misses. According to our memory model, instruction L2 misses do not overlap.
Thus, bursts are always of only one miss. In case of an instruction L2 miss with stack distance $i$, we can directly increase the counter $overlap_i$. In case of a data L2 miss, we need to know the average ROB occupancy after an L2 miss commits to know if this particular L2 miss overlaps with previous ones. This information can be tracked with a hardware counter that we call AROAL2M (Average ROB Occupancy After an L2 Miss commits). This counter is updated every time a served L2 miss commits. This value can be obtained with some cycles of latency; the latency is not critical because the mean value should not vary much within a period.
Countdown Counter. When a new non-overlapped L2 miss with stack distance $i$ appears, we set a countdown counter to the value of AROAL2M. This counter, called $cdc_i$, is different for each L2 cache size. Each time an instruction is committed, we decrease the counters, which saturate at zero. Thus, when a new L2 miss with stack distance $i$ appears, if $cdc_i \neq 0$, then it overlaps with the previous L2 miss. Notice that we may count committed instructions that are actually prior to the miss. However, this is not a big problem as the number of instructions in the ROB when an L2 miss occurs is normally small [16]. In any case, if we want to check that the committed instruction is after the data L2 miss, we can add an instruction identifier that is assigned at fetch time and orders instructions.
Predicted Cycle Variation Counters. Next, we need a counter for each cache configuration that stores the variation in total miss penalty due to L2 misses. We call this counter $PCV_w$ for a cache configuration with $w$ ways. [...] if the access is a hit.
Basically, our prediction mechanism needs to track every L2 access and store a separate copy of the L2 tag information in an ATD, coupled with the LRU counters. As we have a cache with 16 ways, we need 4 bits to encode the stack distance.
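The interplay between the overlap counters and the countdown counters can be modeled in software as follows. This Python sketch is a behavioral model under assumptions, not the hardware design: the class and method names are hypothetical, the counters are indexed by number of ways (an access with stack distance d is treated as a miss in every configuration with fewer than d ways), and AROAL2M is passed in as a fixed value instead of being measured.

```python
class OverlapTracker:
    """Behavioral model of per-configuration overlap and countdown counters."""

    def __init__(self, num_ways, aroal2m):
        self.num_ways = num_ways
        self.aroal2m = aroal2m                # assumed constant for the period
        self.overlap = [0] * (num_ways + 1)   # bursts per way count; index 0 unused
        self.cdc = [0] * (num_ways + 1)       # countdown counters; index 0 unused

    def commit_instruction(self):
        # Every committed instruction decrements all countdown counters,
        # which saturate at zero.
        self.cdc = [max(0, c - 1) for c in self.cdc]

    def l2_access(self, stack_distance, is_data):
        # stack_distance == num_ways + 1 models a miss in the full cache.
        for w in range(1, min(stack_distance, self.num_ways + 1)):
            if is_data and self.cdc[w] > 0:
                continue                      # overlaps with the current burst
            self.overlap[w] += 1              # a new burst starts
            if is_data:
                self.cdc[w] = self.aroal2m    # open the overlap window


tracker = OverlapTracker(num_ways=4, aroal2m=3)
tracker.l2_access(5, is_data=True)    # data miss in every configuration
tracker.commit_instruction()
tracker.l2_access(3, is_data=True)    # overlaps for 1 and 2 ways
for _ in range(3):
    tracker.commit_instruction()      # countdown counters reach zero
tracker.l2_access(2, is_data=True)    # new burst for the 1-way configuration
tracker.l2_access(5, is_data=False)   # instruction miss: always its own burst
print(tracker.overlap[1:])
```

In this model an overlapped miss does not restart the countdown, which follows the statement that the counter is set when a new non-overlapped miss appears.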
As we described in Section II, an access with stack distance larger than the associativity corresponds to a cache miss. Thus, with this ATD we can determine whether an L2 access would be a miss or a hit in the 16 possible cache configurations. In Figure 8 we sketch a diagram of this hardware implementation. The main contributor to hardware cost is the ATD, for which we propose to use a sampled version.
[...] [19]. A different approach consists of sampling design space points to train artificial neural networks that predict performance in the whole design space with high accuracy [20]. However, our proposal lies in a different scenario, as it is a runtime mechanism. On the other hand, other papers predict IPC at runtime [21]. They analyze IPC in a window of time together with information obtained at compile time, and predict the future value of IPC. The main difference is that the processor configuration remains constant in their IPC predictions.
Several papers have the objective of dynamically adjusting the cache size assigned to each thread in a multithreaded scenario. In [5], column caching is introduced. It allows the caches in a cache hierarchy to be partitioned. In that paper, the control of cache partitions is left to the software (or the programmer). In [7], a dynamic cache partitioning technique is developed for SMT systems, extending the previous paper. Here, partition sizes are varied at runtime using the total number of L2 misses. In [8], a similar dynamic scheme is introduced that decides partitions depending on the average data reuse of each application. In [9], instead of using the number of misses for each application, the authors use the miss rate to ensure fairness among threads. Finally, in [6] the authors present a suitable and scalable implementation of the technique that appeared in [7] using sampling and show similar performance gains with just an extra 0.2% space in the L2 cache.
However, all these proposals use indirect measures of performance such as misses or data reuse. Our proposal is to use direct estimations of performance to obtain optimal partitions.
Regarding reconfigurable hardware for single-threaded processors, several proposals try to reduce power consumption without losing too much performance. In [11], working set signatures are used to represent a program's instruction working set and find the minimal instruction cache size that minimizes instruction misses (data misses are not treated). In [13], selective cache ways are introduced. These caches only precharge lines in active ways. This study is expanded in [12], where some algorithms are given to dynamically decide when to switch cache ways on or off. However, they are unable to ensure a quality of service as they use indirect measures of IPC. For example, they report a maximum IPC reduction of 52%. In [14], Cache Decay is proposed to dynamically switch off a cache line when it is highly probable that no more accesses will be made to this line. This occurs after a fixed number of cycles, or according to other more aggressive variants of this approach. In [22], the authors present a runtime decision algorithm for activating or deactivating cache lines based on the number of accesses to the LRU and MRU active lines. Finally, in [10], issue width and the number of execution units are dynamically modified depending on the present IPC. When the IPC is under a given threshold, issue width is decreased. In this scenario, nearly all proposals cannot estimate performance degradation as they depend on empirical thresholds and heuristics concerning indirect measures of performance. Our proposal gives the opportunity to bound these losses.
VIII. CONCLUSIONS
In this work we have presented a runtime mechanism that accurately predicts IPC as the L2 cache size varies. We have shown average errors of 3.11% with predictions that follow the shape of the real IPC curve. To obtain these results, we have modified previous memory models to obtain higher prediction accuracy. Furthermore, we have systematically classified benchmarks so that results are consistent in every benchmark group. We have also discussed a practical implementation that has an extra cost between 5.68% of the L2 cache size (best accuracy) and under 0.4% (for a 4% error). Hardware cost is reduced using a sampling technique. Our mechanism can be used to reduce power consumption in single-threaded architectures, as it gives the real contribution of each way to the final IPC and can therefore bound performance losses. A second possible application is to improve performance in multithreaded architectures that dynamically partition shared L2 caches. The mechanism we propose gives a direct estimation of performance for different cache configurations, instead of the indirect measures of performance that are currently used to maximize total throughput.
