Abstract: Non-uniform cache architecture (NUCA) aims to limit the wire-delay problem typical of large on-chip last level caches: by partitioning a large cache into several banks, with the latency of each one depending on its physical location and by employing a scalable on-chip network to interconnect the banks with the cache controller, the average access latency can be reduced with respect to a traditional cache. The addition of a migration mechanism to move the most frequently accessed data towards the cache controller (D-NUCA) further improves the average access latency.
Introduction
The memory-wall problem is given by the gap between the access time of off-chip DRAM memory and the clock cycle time of a microprocessor. In order to overcome this problem, the trend followed by modern processor designs is to employ large on-chip level-two (L2) and, in some cases, level-three (L3) cache memories.
For high clock frequency designs, the access latency of such caches is dominated by the wire-delay, i.e., the delay due to the propagation of signals on wires. Non-uniform cache architectures (NUCA) have been proposed as a new paradigm for large on-chip cache memories aimed at reducing this effect. In a NUCA architecture, the cache is partitioned into many independent banks, while the communication between the banks and the cache controller is allowed by means of a scalable switched network ( Figure 1 ). This organisation allows some banks to be closer to the processor, hence, allowing shorter access latencies to these banks with respect to banks that are located farther away. The mapping between cache lines and physical banks can either be static or dynamic (namely S-NUCA and D-NUCA) . In the former each line can exclusively reside in a single predetermined bank, while in the latter each line can be mapped to one of a set of different banks, similarly to a set associative cache, and it can dynamically migrate from one bank to another. In a D-NUCA cache the banks are logically grouped into rows and columns, each bank containing a fixed number of lines. The entire address space spans over the banks belonging to each column. Each column behaves like a single way of a set associative cache, so a line is allowed to reside only in a single bank within a row. When a cache line search is performed, the controller first determines the row that could contain the requested data using the lowest-order bits of the index field from the address, then it broadcasts the request to all the banks belonging to that row. As soon as a hit happens in one of these banks, the request is satisfied without the need of waiting for the replies from the other, farther, banks. 
To further reduce access latencies, a promotion/demotion mechanism is adopted: if a hit happens in a column other than the first, the cache line is promoted by swapping it with the line that holds the same position in the next column closer to the controller. If a miss happens, the new line is inserted in the farthest bank, evicting the line that previously occupied that position, if any. As a consequence, the most frequently accessed lines are likely to be located in the banks closer to the processor, thus improving the access latency. With such a policy, D-NUCA caches succeed in achieving high hit rates while keeping the access latency low, in spite of the wire-delay effects introduced by high clock rates and technology scaling. These characteristics make D-NUCA an attractive cache architecture for next generation high performance chips, where large storage capabilities, high clock rates and low memory access latencies will be required. Figure 2 reports the instructions per cycle (IPC) achieved by a single-processor system adopting an 8 Mbyte L2 cache, for three different L2 architectures: a traditional cache (UCA), an S-NUCA and a D-NUCA. The methodology used to perform this evaluation is described in detail in Section 3. As found in previous work, D-NUCA caches exhibit the highest performance level.
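The promotion and insertion rules described above can be illustrated with a minimal sketch. It models a single line position replicated across the ways of one row, with index 0 standing for the bank closest to the controller. The class name `BankSetPosition` and its interface are hypothetical and serve only as an illustration; timing, the switched network and bank internals are not modelled.

```python
from typing import List, Optional


class BankSetPosition:
    """One line position replicated across the ways of a D-NUCA row.

    Index 0 is the bank closest to the controller; the last index is the
    farthest bank. Each slot holds a tag, or None when it is empty.
    (Hypothetical illustration, not the simulator's data structure.)
    """

    def __init__(self, n_ways: int) -> None:
        self.slots: List[Optional[int]] = [None] * n_ways

    def access(self, tag: int) -> bool:
        """Return True on a hit and False on a miss, updating the placement."""
        if tag in self.slots:
            way = self.slots.index(tag)
            if way > 0:
                # Promotion: swap with the line one way closer to the controller,
                # which is thereby demoted by one way.
                self.slots[way - 1], self.slots[way] = (
                    self.slots[way],
                    self.slots[way - 1],
                )
            return True
        # Miss: insert in the farthest bank, evicting whatever was there.
        self.slots[-1] = tag
        return False
```

In this sketch, a line that is hit repeatedly moves one way closer to the controller on each hit until it reaches way 0, while a newly inserted line always starts in the farthest way.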
Large SRAM structures, like the ones employed in a modern L2 cache, exhibit high energy requirements, especially due to the static component caused by leakage currents typical of deep submicron CMOS technologies (Kim et al., 2003) . A previous work (Bardine et al., 2007) highlights that D-NUCA caches are also more energy efficient than UCA and S-NUCA caches. With respect to UCA and S-NUCA, the D-NUCA scheme exhibits an increase of the dynamic power components due to the higher number of bank accesses and network transmissions, but this is overwhelmed by the static energy savings due to the shorter execution time achieved. However, like in the UCA and S-NUCA schemes, the D-NUCA energy budget is dominated by leakage, which consequently should be the most important target for power-saving techniques. In this work, we present a D-NUCA scheme which is able to retain the performance improvement over UCA and S-NUCA while significantly reducing the static power consumption by adapting the number of active ways to the needs of the running application. This scheme, called way adaptable D-NUCA (WA D-NUCA), would lead to a high performance, power efficient on-chip memory hierarchy. This work extends our original proposal introduced by Bardine et al. (2008) . 
With respect to that work, we explore a wider memory hierarchy's design space by including UCA and S-NUCA architectures, thus reinforcing the validity of our scheme; in addition, we discuss how the intrinsic parameters of the proposed technique (i.e., the reconfiguration interval and the two reconfiguration thresholds) impact on performance and power consumption and we propose a method to tailor the thresholds to specific applications, including considerations about its computational complexity; we propose a suitable hardware implementation characterised by a low overhead in terms of area and energy consumption; we extend the experimental evaluation to a wider set of benchmarks for both single-core and multi-core systems; finally, we include energy and EDP values for all the considered configurations.
In the next sections, we introduce the power-saving technique that we developed, we describe a methodology to determine the intrinsic parameters of this technique and finally we analyse the application of our proposed scheme to CMP systems.
Way adaptable D-NUCA
The promotion/demotion mechanism of a D-NUCA cache mitigates the effects of wire delays and improves the performance by moving the most frequently accessed data toward the controller and the least frequently accessed data in the opposite direction. As a consequence, the cache hits tend to be located in the cache ways that are closer to the controller, while the other ways exhibit different usage patterns as a function of the currently running application and execution phase. As an example of this, Figure 3 shows the distribution of cache hits across the different ways of a D-NUCA cache for the mcf and twolf applications. As expected, for both applications the number of hits decreases when moving from Way0 to Way7. However, while the hits for mcf are distributed over the entire cache, the hits for twolf are concentrated in the first two ways, which are able to fully contain the current working set. Figure 4 shows the distribution of hits for two different phases of the parser application. In the first case, the hits are concentrated in the first four ways, while in the second case they span the entire cache. We analysed the distribution of hits for all the SPEC CPU2000 (Standard Performance Evaluation Corporation, 2000) and Numerical Aerodynamic Simulation (NAS) Parallel Benchmarks applications listed in Table 1 and found that the distribution of hits across the ways of the D-NUCA cache exhibits significant variations between different applications as well as between different execution phases of the same application, similarly to the behaviour depicted in Figure 3 and Figure 4. These results suggest that there are many applications and many execution phases in which some of the ways of a D-NUCA cache are not actively contributing to performance, thus wasting energy because of leakage currents.
Based on these considerations, we propose to introduce a mechanism that allows to dynamically switch on and off the ways of a D-NUCA cache as a function of the associativity level needed by the current execution phase. We call this structure 'way adaptable D-NUCA cache' (WA D-NUCA).
To allow the WA D-NUCA to predict the working set size and to decide when to turn on new ways and when to shut down those that are unnecessary, we developed the prediction algorithm described in Figure 5 . 
It is worth noting that the complexity of the prediction mechanism is very low and it can be implemented in hardware by only employing three counters and the logic for the two steps of the algorithm. Figure 6 depicts the block diagram of a possible implementation of the proposed algorithm. The structure has been derived after applying the following substitutions to the algorithm: All the control logic is assumed to be embedded in the cache controller. The multiplier modules are enabled once every K L2 cache accesses and they involve fixed-point operands, with one of them being always a constant. Each multiplier can be implemented by a shift-add multiplier (Vahid, 2006) , which has small area and energy requirements and whose latency has negligible effects on the effectiveness of the algorithm. As stated in Dropsho et al. (2002) , the impact of counters on both the dynamic and the static power consumption is negligible when compared to a moderately sized cache. The on/off switching of ways can be implemented via the Gated-Vdd technology (Powell et al., 2000) that has been shown to be particularly effective in reducing static power consumption, especially for L2 caches (Meng et al., 2005) .
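A minimal software sketch of such a predictor follows, under assumptions that the text does not state explicitly. It assumes that the locality metric D is the ratio between the hits received by the farthest active way and the hits received by the way closest to the controller, that a way is switched off when D < T 1 and switched on when D > T 2, and that K counts cache hits. The three counters mirror the hardware description above, and the comparisons are written as multiplications by a constant, so no division is needed. The class `WayPredictor` and all its member names are hypothetical.

```python
class WayPredictor:
    """Hypothetical way-adaptation predictor with three counters.

    Assumed rule (not confirmed by the text): D = last_way_hits / first_way_hits;
    D < t1 switches the farthest active way off, D > t2 switches one more way on.
    """

    def __init__(self, n_ways: int = 8, k: int = 100_000,
                 t1: float = 0.005, t2: float = 0.02) -> None:
        self.n_ways = n_ways
        self.k = k
        self.t1 = t1
        self.t2 = t2
        self.active_ways = n_ways
        self.total_hits = 0       # counter 1: hits in the current interval
        self.first_way_hits = 0   # counter 2: hits in the closest way
        self.last_way_hits = 0    # counter 3: hits in the farthest active way

    def record_hit(self, way: int) -> None:
        """Update the counters for a hit in `way`; reconfigure every k hits."""
        self.total_hits += 1
        if way == 0:
            self.first_way_hits += 1
        if way == self.active_ways - 1:
            self.last_way_hits += 1
        if self.total_hits >= self.k:
            self._reconfigure()

    def _reconfigure(self) -> None:
        # Compare D with the thresholds without dividing:
        # D < t1  <=>  last < t1 * first ;  D > t2  <=>  last > t2 * first.
        if (self.last_way_hits < self.t1 * self.first_way_hits
                and self.active_ways > 1):
            self.active_ways -= 1
        elif (self.last_way_hits > self.t2 * self.first_way_hits
                and self.active_ways < self.n_ways):
            self.active_ways += 1
        self.total_hits = 0
        self.first_way_hits = 0
        self.last_way_hits = 0
```

With the default thresholds, an interval in which all hits fall in way 0 switches one way off, whereas an interval in which the farthest active way receives more than 2% of the hits of way 0 switches one way back on.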
Methodology
We evaluated the different architectures considered in this work through execution driven simulation. The selected simulation platform is based on an extended version of the sim-alpha simulator (Desikan et al., 2001) , which supports plain UCA, S-NUCA and D-NUCA (including the WA D-NUCA) L2 caches.
The workload used in our experiments comprises applications from the SPEC CPU2000 and the NAS Parallel Benchmarks suites. These benchmarks were selected for their high L1 miss-rate, so as to achieve a significant activity of the L2 caches considered in this study. For each benchmark we identified and simulated a single representative simulation point of the program using the same methodology described in Kim et al. (2002) . Table 1 lists the chosen benchmarks, the number of instructions skipped to reach the selected simulation point (FFWD) and the number of simulated instructions (RUN). The configurations of the simulated systems are all based on processing cores with micro-architectural parameters matching those of the Alpha 21264 processor, including issue width, fetch bandwidth and clustering. We selected a 16FO4 clock cycle time, which roughly corresponds to a 5 GHz operating frequency for a 65 nm technology node. We assume that the cores are backed by a 64 KB two-way set associative L1 I-cache with a single cycle latency (using way-prediction) and by a 64 KB two-way set associative L1 D-cache with a three-cycle latency. The cache line size is fixed at 64 bytes. The access latency of off-chip DRAM memory is set to 300 cycles; this value is aligned with the average access time of modern DRAM subsystems. The main architectural parameters are summarised in Table 2 . Regarding the L2 cache, we evaluated three different architectures (assuming the same baseline configuration described above): UCA, S-NUCA and D-NUCA architectures. For every L2 cache design point considered in our experiments, we derived the bank latencies from the CACTI tool (Tarjan et al., 2006) . For NUCA caches, the link latencies were calculated according to the RC model, assuming a link length equal to the height/width of banks. The switch traversal latency was obtained with gate-level analysis.
For each L2 cache architecture, we identified the configuration with the highest performance by varying different parameters (n. of banks, cache-associativity, etc.) while keeping a constant capacity of 8 Mbytes. Those configurations are summarised in Table 3 . The UCA cache is made up of two sub-banks and it is four-way set associative; its access latency is 18 cycles. The S-NUCA cache is made up of 32 banks organised in eight rows and four columns; each bank has a 256 Kbytes capacity and it is four-way set associative; the access latency of each bank is five cycles and the resulting hop latency (i.e., the sum of the link traversal latency and the switch traversal latency) is two cycles. The D-NUCA cache is made up of 128 banks organised in 16 bank sets (rows), each one made up of eight banks; each bank has a 64 Kbytes capacity and is direct mapped; the access latency of each bank is three cycles while the resulting hop latency is one cycle. For the S-NUCA architecture, the mapping between blocks and banks is determined by the lowest order bits of the index field; for the D-NUCA architecture, we selected the following implementation policies : simple mapping, broadcast search, and one bank/one hit promotion. The links of the NUCA switched network are assumed to be bidirectional, with a 128 bit width for each direction; the network employs a wormhole routing scheme. The flit size is fixed at 128 bits, so the transmission of a cache block (64 bytes) requires four flits for data and one additional flit to transport command and address information. In order to determine the energy requirements of the different architectures and the energy savings introduced by the WA D-NUCA technique, we applied the model and the energy parameters introduced by Bardine et al. (2007) ; this model, which calculates the overall energy consumption of the memory subsystem made up of the L2 cache and the off-chip DRAM, was integrated into our simulation platform.
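The capacity and flit figures quoted above can be checked with a few lines of arithmetic. The helper names below are hypothetical, and the single header flit reflects the statement that one additional flit carries command and address information.

```python
KIB = 1024
MIB = 1024 * KIB


def nuca_capacity_bytes(n_banks: int, bank_kib: int) -> int:
    """Total capacity of a NUCA cache made of identical banks."""
    return n_banks * bank_kib * KIB


def flits_per_block(block_bytes: int, flit_bits: int, header_flits: int = 1) -> int:
    """Flits needed to move one cache block, plus the command/address flit(s)."""
    data_flits = -(-block_bytes * 8 // flit_bits)  # ceiling division
    return data_flits + header_flits


# S-NUCA: 32 banks of 256 Kbytes; D-NUCA: 16 rows x 8 banks of 64 Kbytes.
s_nuca = nuca_capacity_bytes(32, 256)
d_nuca = nuca_capacity_bytes(16 * 8, 64)
block_flits = flits_per_block(block_bytes=64, flit_bits=128)
```

Both organisations add up to 8 Mbytes, and a 64-byte block needs four data flits plus one header flit, i.e., five flits per block transfer.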
Experimental results
In the previous sections, we have shown that a system equipped with a D-NUCA cache exhibits higher performance than one equipped with a UCA or an S-NUCA cache and that, at the same time, the D-NUCA cache allows a slight reduction of energy consumption. For these reasons, we restrict the experimental evaluation of the WA D-NUCA to performance and energy comparisons with the plain D-NUCA system.
For the WA D-NUCA scheme, the configuration parameters adopted in the experimental evaluation are: K = 100,000, T 1 = 0.005, T 2 = 0.02. Figure 7 compares, under the considered workloads, the instructions per cycle (IPC) achieved by the WA D-NUCA with the IPC of the plain D-NUCA, while the average number of active (i.e., powered on) ways is shown in Figure 8. The IPC achieved by the way adaptable scheme is close to the one achieved by the reference plain D-NUCA for almost all benchmarks and exhibits a slight decrease only for a small number of applications. The average performance loss is limited to 2.96% (last bars in Figure 7). Two benchmarks (i.e., applu and mgrid) even exhibit an increase in IPC for the WA D-NUCA case. The rationale behind this counterintuitive result is that a WA D-NUCA cache can potentially lead to a lower access latency than a plain D-NUCA, since the cache resizing mechanism also helps to bring the working set closer to the cache controller and to decrease the miss detection latency.
The average number of active ways is considerably lower for the WA D-NUCA cache: with respect to the baseline D-NUCA, it is reduced by 35.4% on average (last bars in Figure 8). The largest reduction of the average number of active ways (68.8%) is measured for the twolf benchmark.
Figure 9 shows the energy consumption of the WA D-NUCA normalised with respect to the energy consumption of the plain D-NUCA. Since static energy largely depends on temperature, here we assume a fixed operating temperature of 80°C, which is quite representative of typical working conditions of on-chip L2 caches (Greenhill and Alabado, 2005). Fourteen benchmarks out of 15 benefit from the application of the proposed technique, with energy reductions ranging from 2.1% to 69.8%. The mcf benchmark exhibits a 2% increase of the energy consumption. This is a consequence both of the IPC reduction that affects this benchmark when the WA D-NUCA technique is applied and of the small reduction of its average associativity. However, the application of the technique leads to an overall average reduction of energy consumption of 33.36% (last bars in Figure 9).
Figure 10 shows the energy-delay product (EDP) achieved by the WA D-NUCA scheme normalised with respect to that of the plain D-NUCA. The trend is similar to the one exhibited by the energy consumption, with improvements ranging from 2.3% to 67.9%. This is achieved because the WA D-NUCA technique is able to reduce the overall energy consumption while maintaining almost the same performance level. The overall average improvement is 31.2% (last bars in Figure 10). The experimental results described in this section show that the WA D-NUCA scheme maintains almost the same performance level as the plain D-NUCA while significantly reducing the energy consumption, thus leading to a fast, power-efficient last-level cache design.
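As a rough plausibility check of how the energy and IPC figures relate to the EDP, the sketch below computes a normalised EDP from a normalised energy and an IPC ratio. It assumes that both configurations execute the same number of instructions, so that the delay ratio is the inverse of the IPC ratio. The function name is hypothetical. Since the reported averages are taken per benchmark, combining the average energy reduction with the average IPC loss only approximates the reported average EDP improvement.

```python
def normalised_edp(energy_ratio: float, ipc_ratio: float) -> float:
    """Energy-delay product relative to a baseline.

    energy_ratio: energy of the evaluated scheme / energy of the baseline.
    ipc_ratio:    IPC of the evaluated scheme / IPC of the baseline.
    Assumes the same instruction count, hence delay_ratio = 1 / ipc_ratio.
    """
    delay_ratio = 1.0 / ipc_ratio
    return energy_ratio * delay_ratio


# Average figures quoted in the text: 33.36% less energy, 2.96% lower IPC.
edp_from_averages = normalised_edp(energy_ratio=1 - 0.3336, ipc_ratio=1 - 0.0296)
```

Under these assumptions the result is a normalised EDP of about 0.687, i.e., an improvement of roughly 31%, which is in line with the 31.2% average reported for Figure 10.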
Way adaptable D-NUCA intrinsic parameters
The values for T 1 and T 2 thresholds adopted for the experimental evaluation of the previous section were heuristically determined based on the average behaviour of the considered benchmarks and led to good average results for the WA D-NUCA cache. However, the accuracy of the prediction algorithm and the performance/power trade-off depend on such parameters as well as on the length of the reconfiguration interval (i.e., the value of the K parameter). For this reason, in this section, we propose a methodology that is suitable to determine the values of thresholds T 1 and T 2 for a given application, together with considerations about its computational complexity; then we perform an evaluation of the sensitivity to these parameters; finally, we report some experimental results of the effect of varying the value of the K parameter.
Estimating thresholds
The methodology we propose to estimate the thresholds is made up of two steps:
1 identifying the optimal sequence of reconfiguration events (i.e., the sequence of reconfigurations that leads to an optimum value for some chosen performance metric)
2 applying restrictions on allowed threshold values to force the identified optimal sequence to be followed at run-time.
As a performance metric, many alternatives are possible, such as IPC, miss-rate, EDP and so on. In the following, for illustrative purposes, we will focus on miss-rate, but our methodology can be applied to any other choice. To determine the optimal reconfiguration sequence, we would need to explore all the possible combinations of reconfigurations and calculate the metric for each one; this would result in an extremely high number of experiments to be performed. We have found that the computational complexity of such a brute-force approach is greater than O(2^n), with n being the number of reconfiguration events. Instead, an incremental approach can be adopted: every K cache hits (i.e., every time a reconfiguration should be performed) the program's execution is forked into three different branches. In the first branch the cache size is increased by one way, in the second it is kept constant and in the third it is reduced by one way. Then the same portion of the application is simulated in the three branches and, at their end, the three values of the metric are evaluated. The branch leading to the best one is chosen while the other two are discarded. In particular, since higher associativities and larger capacities naturally lead to lower miss-rates, the reconfiguration event that is chosen is the most conservative one that leads to a miss-rate within 1% of the lowest achievable miss-rate, meaning that we choose the event that allows the highest energy consumption reduction while limiting the miss-rate degradation to 1% with respect to the optimal case. Starting from it, and focusing on the next interval, three new simulations are conducted in the same way. This procedure is repeated up to the end of the selected workload. Figure 11 depicts some steps of the described methodology; in the example, the selected sequence is {+1 way; -1 way; keep current configuration}.
While the procedure described above evaluates the metric locally at each step, and therefore yields only a pseudo-optimal reconfiguration sequence, it determines that sequence rapidly and with a limited number of tests. In fact, its computational complexity is O(n), with n being the number of reconfiguration steps.
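The incremental procedure can be sketched as a greedy loop that runs at most three simulations per reconfiguration interval. In the sketch below, `miss_rate` is a hypothetical callback standing in for the simulation of one interval with a given number of active ways, and the 1% margin is interpreted as relative to the best of the three branches; both are assumptions made for illustration. A real simulation would also carry the cache contents from one interval to the next, which this stateless toy interface ignores.

```python
from typing import Callable, Dict, List, Tuple


def greedy_reconfiguration_sequence(
    n_intervals: int,
    start_ways: int,
    max_ways: int,
    miss_rate: Callable[[int, int], float],
    tolerance: float = 0.01,
) -> Tuple[List[int], int]:
    """Pick, interval by interval, the smallest configuration whose miss-rate
    is within `tolerance` (relative) of the best of the three branches.

    Returns the sequence of events (-1, 0 or +1 way) and the number of
    simulated branches, which grows linearly with n_intervals.
    """
    ways = start_ways
    sequence: List[int] = []
    simulations = 0
    for interval in range(n_intervals):
        # Fork into the three branches, skipping configurations out of range.
        candidates = [w for w in (ways - 1, ways, ways + 1) if 1 <= w <= max_ways]
        results: Dict[int, float] = {w: miss_rate(interval, w) for w in candidates}
        simulations += len(candidates)
        best = min(results.values())
        # Most conservative choice: fewest ways within the tolerance.
        chosen = min(w for w, m in results.items() if m <= best * (1 + tolerance))
        sequence.append(chosen - ways)
        ways = chosen
    return sequence, simulations


# Toy model (hypothetical): each interval "needs" a number of ways; having
# fewer ways than needed raises the miss-rate by 0.05 per missing way.
needed = [3, 3, 5]


def toy_miss_rate(interval: int, ways: int) -> float:
    return 0.10 + 0.05 * max(0, needed[interval] - ways)


events, n_sims = greedy_reconfiguration_sequence(3, 4, 8, toy_miss_rate)
```

For n intervals the loop evaluates at most 3n branches, whereas enumerating every sequence of three-way choices would require on the order of 3^n runs, consistent with the bound greater than O(2^n) given for the brute-force approach.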
After the pseudo-optimal sequence has been determined, we can use it to derive the values of the thresholds. At each step, the locality metric D is evaluated and its value, based on the selected reconfiguration event, places restrictions on the allowed values for T 1 and T 2 . At the end, we obtain a set of inequalities for the two thresholds; by solving them, we obtain the desired values for T 1 and T 2 .
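The derivation of the inequalities can be sketched as follows, assuming a decision rule that is not spelled out in the text: a way is switched off when D < T 1, a way is switched on when D > T 2, and the configuration is kept when T 1 <= D <= T 2. Under this assumption, each pair (D, selected event) bounds one or both thresholds, and intersecting the bounds yields the admissible intervals. The function name and the input format are hypothetical.

```python
import math
from typing import List, Optional, Tuple

Interval = Tuple[float, float]


def threshold_bounds(
    observations: List[Tuple[float, int]],
) -> Optional[Tuple[Interval, Interval]]:
    """Admissible ranges for T1 and T2 given (D, event) pairs.

    event is -1 (switch a way off), 0 (keep) or +1 (switch a way on).
    Assumed rule: D < T1 -> -1 ; T1 <= D <= T2 -> 0 ; D > T2 -> +1.
    Returns ((t1_low, t1_high), (t2_low, t2_high)), meaning
    t1_low < T1 <= t1_high and t2_low <= T2 < t2_high, or None when the
    inequalities cannot all be satisfied.
    """
    minus = [d for d, e in observations if e == -1]
    keep = [d for d, e in observations if e == 0]
    plus = [d for d, e in observations if e == +1]

    t1_low = max(minus, default=-math.inf)   # T1 must exceed every D that shrank
    t1_high = min(keep, default=math.inf)    # T1 must not exceed any D that kept
    t2_low = max(keep, default=-math.inf)    # T2 must reach every D that kept
    t2_high = min(plus, default=math.inf)    # T2 must stay below every D that grew

    feasible = t1_low < t1_high and t2_low < t2_high and t1_low < t2_high
    if not feasible:
        return None
    return (t1_low, t1_high), (t2_low, t2_high)


# Hypothetical observations, chosen only to illustrate the computation.
bounds = threshold_bounds(
    [(0.001, -1), (0.003, -1), (0.010, 0), (0.015, 0), (0.030, +1), (0.050, +1)]
)
```

When the function returns None, no pair of thresholds reproduces the whole sequence under the assumed rule; in that case some reconfiguration events would have to be relaxed before a solution exists.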
In order to show the potential of using thresholds that are accurately tailored to a given application, Figure 12 and Figure 13 show the IPC and the average associativity for two benchmarks, equake and perlbmk, comparing the plain D-NUCA scheme, the WA D-NUCA scheme adopting the previously introduced heuristic thresholds and the WA D-NUCA scheme with the thresholds specifically calculated for the application according to the described strategy (which we will refer to as Est. way adaptable in the following). Using the thresholds calculated with the proposed methodology leads to a larger reduction of the average associativity (-61.4% for equake and -20% for perlbmk), with a consequent reduction of static power, without noticeably affecting performance.
Algorithm sensitivity to threshold values
In order to evaluate the sensitivity of the proposed technique to the values of the T 1 and T 2 thresholds, we measured the IPC (Figure 14) and the average associativity (Figure 15) obtained when using the thresholds estimated for equake (Est. WA D-NUCA 1) and for perlbmk (Est. WA D-NUCA 2) for the entire benchmark set. While giving optimal results for the specific application for which they have been tuned, the two Est. WA schemes also exhibit good results for some other benchmarks, as suggested by the measured values of IPC and average associativity (e.g., mgrid and bt). At the same time, there are other benchmarks incurring a significant performance degradation (e.g., galgel and mcf). So, if the workload that will be run on the system is known in advance, then a set of tailored threshold values can be estimated with the proposed methodology. If the workload changes after the tuning procedure, then the sensitivity of the algorithm is such that using values estimated for other workloads can lead to satisfactory results if they exhibit similar characteristics. In any case, in such scenarios, or if the tuning procedure is not feasible at all, the use of the proposed heuristic values always leads to a good power/performance trade-off.
Reconfiguration interval
The value of the K parameter directly affects the frequency of the cache reconfigurations and, similarly to the T 1 and T 2 thresholds, its value is critical for the performance of the WA D-NUCA scheme. In fact, adopting a small value for K means making a higher number of resizing decisions, which only rely on the very local behaviour of the application. Theoretically, this would improve the algorithm reactivity and consequently its performance but, as we will show, because of the intrinsic nature of the algorithm, excessively reducing the value of K results in IPC and/or average associativity degradation. Adopting a large value for K could affect the algorithm's ability to detect application phase changes with consequent incorrect reconfigurations. Figure 16 and Figure 17 show the variation of the IPC and of the average associativity when varying the value of K in the range from 100 to 1,000,000 cache hits. The value of IPC of the plain D-NUCA is also reported for the sake of comparison. Looking at the average results, the previously chosen value (K = 100,000) is the one that gives the best IPC. 
Figure 17: Average associativity variation versus the value of K in the range from 100 to 1,000,000 memory hits
Adopting K = 1,000,000 excessively reduces the algorithm's reactivity. As expected, the phase changes are not correctly detected and the cache size is not correctly adapted to the application's needs. In particular, we have measured that, for the considered benchmarks, a maximum of four reconfiguration events are scheduled, meaning that almost all the phase changes of the run are completely ignored by the algorithm. Adopting values smaller than 100,000 worsens the average IPC and/or the average associativity, as the cache ways are not correctly turned on according to the application's needs. In particular, when the algorithm decides to turn on a new way, due to the short reconfiguration interval, the way itself has few chances to be populated with new data and consequently it will receive an extremely low number of hits. As the next reconfiguration decision is based on the number of hits received by that way, there is a high probability that the algorithm will decide to shut it down, going back to the initial configuration and entering a loop in which the way is alternately turned on and off.
Considering the single benchmarks in isolation, the same considerations are valid even if different optimal working points can be found. For example, when using K = 10,000, galgel exhibits a performance improvement with respect to setting K to 100,000 at the cost of a slight increase of the average associativity; gcc maintains the same performance while reducing its average associativity.
Way adaptable D-NUCA in the multi-core environment
In a multi-core environment, the bandwidth requirements of the CPUs towards the memory subsystem are a matter of concern for computer architects. In a CMP system, such problems could be eased by employing a NUCA architecture as a shared cache (Beckmann and Wood, 2003; Huh et al., 2005; Chisti et al., 2005) . In this section, we evaluate the applicability of our WA D-NUCA technique in a multi-programmed scenario employing a D-NUCA as the shared on chip L2 cache.
Figure 18: The architecture employed for the evaluation of the way adaptable technique in the multi-core environment.
Reference architecture
We considered a CMP architecture made up of two Alpha 21264 cores sharing an L2 D-NUCA cache and connected to it with a bus; the cache architecture and capacity are the same as those adopted in the previous sections. The reference architecture is shown in Figure 18 and the simulation parameters of the multi-core system are identical to the ones used for the single-core system. In this architecture, the previously shown prediction algorithm can be applied unchanged, because the requests coming from the cores are injected from the same side of the NUCA cache, as in the single-core environment. Thus, for the sake of comparison, we adopted the same heuristic values for T 1 and T 2 .
Simulation methodology and performance metrics
When simulating multi-programmed systems, an additional modelling effort must be made in selecting representative groups of applications and finding a reasonable performance metric.
In order to cover all representative cases with a reasonable subset of simulations, we simulated the concurrent execution of couples of SPEC CPU2000 benchmarks which were representative of different classes of memory requirements. In particular, we identified the average associativity needed by a benchmark for the single-core WA D-NUCA case as a synthetic indicator of its cache resource needs.
Intuitively, if WA1 and WA2 are the average number of ways (estimated by the WA D-NUCA technique) needed by two applications running in a single core environment, they will need WA1 + WA2 ways when running together in the multi-core environment. Then, one could naively expect this couple of benchmarks to have chances to power off ways in the multi-programmed case if WA1 + WA2 is still below the number of available cache ways. If WA1 + WA2 approaches or exceeds the number of available ways, it is reasonable to expect that chances for shutting down ways are lower or null. In the latter case it is also reasonable to expect that the competition for cache lines by the two running applications could degrade performance.
In Table 4, we list the three groups of applications that we identified: the first group is composed of couples of benchmarks whose sum of average associativities in the single-core WA D-NUCA configuration is lower than eight ways (the associativity of the baseline D-NUCA); the second group is formed by couples of applications whose sum of associativities is close to eight ways; the third group is formed by couples whose sum of associativities exceeds eight ways.
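The grouping criterion can be expressed as a small classification function. The text does not quantify what "close to eight ways" means, so the `margin` parameter below is a purely hypothetical choice, as are the function name and the example values, which are not taken from Table 4.

```python
def classify_pair(wa1: float, wa2: float,
                  total_ways: int = 8, margin: float = 1.0) -> int:
    """Return the group (1, 2 or 3) of a couple of benchmarks.

    wa1, wa2: average number of active ways of each benchmark in the
              single-core WA D-NUCA configuration.
    margin:   hypothetical width of the "close to total_ways" band.
    """
    total = wa1 + wa2
    if total < total_ways - margin:
        return 1  # sum clearly below the available ways
    if total <= total_ways + margin:
        return 2  # sum close to the available ways
    return 3      # sum exceeding the available ways


# Illustrative values only.
groups = [classify_pair(2.0, 3.0), classify_pair(3.5, 4.5), classify_pair(6.0, 5.0)]
```

A wider or narrower margin shifts couples between the second group and its neighbours, so the assignment of borderline couples depends on this choice.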
In order to estimate the impact of our technique on performance, we chose the weighted speed-up metric used in Snavely and Tullsen (2000). In Eyerman and Eeckhout (2008), the weighted speed-up is presented as a fair metric for quantifying the performance of a system in which multiple applications share system resources. Let SingleIPC be the IPC of an application when it is executed by the reference single-core system, and MultiIPC be the IPC of the same application when running on a core of the CMP system concurrently with another application (thus sharing the L2 cache). The weighted speedup for two applications is the sum, over the two applications, of the ratio between MultiIPC and SingleIPC, and is given by:
$$\mathrm{WeightedSpeedup} = \sum_{i=1}^{2} \frac{\mathrm{MultiIPC}_i}{\mathrm{SingleIPC}_i}$$
As in the single-core case, we adopted the average associativity exhibited by the multi-programmed workload as a power reduction metric, comparing it with the plain D-NUCA case. Figure 19 shows the weighted speedup obtained in a multi-core system backed by a plain D-NUCA cache compared with the one backed by a WA D-NUCA, when running the same mixes of applications. For both systems, the weighted speedup was calculated by considering as its reference a single-core system backed by the plain D-NUCA cache. The performance loss in the multi-core WA D-NUCA case with respect to the plain D-NUCA case is below 2% on average; this small performance degradation is of the same order as in the single-core case and does not vary substantially among different mixes of applications. It is worth noting that for the mgrid + perlbmk couple we measured a speedup greater than two, as a consequence of the latency reduction given by the WA D-NUCA scheme already described for the single-core case.
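A minimal sketch of the weighted speedup computation follows. It assumes the usual form of the metric, in which each application contributes the ratio of its IPC in the shared system to its IPC on the reference single-core system; the function name and the IPC values are hypothetical.

```python
from typing import Sequence


def weighted_speedup(single_ipc: Sequence[float], multi_ipc: Sequence[float]) -> float:
    """Sum over the applications of MultiIPC / SingleIPC.

    single_ipc[i]: IPC of application i alone on the reference single-core system.
    multi_ipc[i]:  IPC of application i when sharing the L2 cache in the CMP.
    """
    if len(single_ipc) != len(multi_ipc):
        raise ValueError("one SingleIPC and one MultiIPC value per application")
    return sum(m / s for s, m in zip(single_ipc, multi_ipc))


# Illustrative values only: two applications slowed down by cache sharing.
ws = weighted_speedup(single_ipc=[1.0, 2.0], multi_ipc=[0.8, 1.5])
```

With two applications, a value of 2 means that neither application is slowed down by sharing the cache; values above 2 are possible only if at least one application runs faster than on the reference system.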
Results
The performance reduction is encompassed by the reduction of the average associativity shown in Figure 20 . All the considered couples of benchmarks show a reduction varying between 2% and 68%. The reduction is 32% on average, meaning that also in the multi-core environment it is possible to effectively adopt the WA D-NUCA technique to reduce the static power consumption at the cost of a slight performance reduction. Better results are obtained with couples of applications belonging to the first and the second group; however, as shown in Figure 20 , also for couples belonging to the third group the WA D-NUCA technique allows a reduction of the average associativity with negligible performance losses. Note: The reduction of energy consumption is 25% on average. Figure 21 shows the energy consumption for the WA D-NUCA scheme normalised with respect to the energy consumption of the plain D-NUCA. For the evaluation of the CMP case, we used the same energy model and parameters of Section 3 and Section 4 and we assumed the same operating temperature (80°C). For the CMP case, the application of the WA D-NUCA technique results in an overall average reduction of energy consumption by 25%. Figure 22 shows the EDP achieved by the WA D-NUCA scheme normalised with respect to the one of the plain D-NUCA. Also in the CMP environment, the WA D-NUCA technique reduces the overall energy consumption while maintaining the same performance level achieved by the D-NUCA scheme, such that the overall average EDP decreases by 24%. Note: As a consequence of the increased memory traffic in the multi-core environment, the total number of hits is duplicated. However, thanks to the promotion mechanism, the hits are still concentrated in the first ways.
It is worth noting that the actual average associativity achieved by two applications running together often turns out to be significantly lower than the sum of the average associativities achieved by each application when running on a single-core system (indicated by crosses in Figure 20); this counterintuitive result is a consequence of the migration mechanism of the D-NUCA. In fact, migration concentrates the most accessed data in the first ways in the multiprocessor environment as well, even though the load on the shared cache has grown due to the coexistence of two cores. Figure 23 explains this behaviour, comparing the hit distribution for the gcc benchmark in a single-core and in a multi-core execution over a phase of 10 million instructions. The increase in the total number of hits mostly translates into an increase of hits in the first ways, which are used more efficiently in the multi-programmed case, and does not significantly increase the number of hits in remote ways, which can still be turned off as in the single-core case.
Related work
The D-NUCA cache architecture was first proposed by Kim et al. (2002), who showed that a dynamic NUCA structure outperforms a traditional uniform cache architecture (UCA) of the same size and manufacturing technology. Subsequent works have extended this design in order to optimise energy efficiency (Chisti et al., 2003; Foglia et al., 2005) or performance (Chisti et al., 2003), or have proposed shared NUCA designs in the context of multiprocessors (Beckmann and Wood, 2003; Huh et al., 2005; Chisti et al., 2005). Our work is closely related to research on the energy efficiency of cache memory subsystems (Kim et al., 2003; Bardine et al., 2007) and on dynamically adaptable cache architectures (Venkatachalam and Franz, 2005).
NUCA architecture proposals
NuRapid (Chisti et al., 2003) extends the original idea of Kim et al. (2002) by decoupling tags from data and keeping tags in a centralised array in order to improve data mapping flexibility and energy efficiency, reducing the number of bank accesses upon each search. Triangular D-NUCA cache (Foglia et al., 2005 ) is a proposal for high performance embedded systems in which area occupancy of the cache arrays is a primary design concern: in triangular D-NUCA caches the size of each way is not kept constant across the chip and decreases for far, less populated ways. Foglia et al. (2005) show how a judicious mapping policy and the adoption of a heavily sub-banked architecture can result in improved design flexibility for power aware architectures.
In the context of on-chip multiprocessors, the behaviour of D-NUCA as shared L2 caches has been analysed in Beckmann and Wood (2003) , Huh et al. (2005) and Chisti et al. (2005) ; these works evaluate the performance of different data mapping and migration policies but none of them explicitly focuses on power and energy aspects.
Energy aspects
A big issue in modern microprocessor designs is the rise of total power consumption, a large portion of which, for CMOS processes at 65 nm and below, is due to static power dissipated by leakage currents (Kim et al., 2003) . Big SRAM structures, like the ones employed in a NUCA-based system, are responsible for a large fraction of the total leakage power budget (Kim et al., 2003) . In addition, D-NUCA caches utilise switched networks and data movement policies to achieve high IPC, which further increase dynamic power consumption.
An analysis of the balance between static and dynamic power for NUCA architectures, together with an energy/performance trade-off evaluation for NUCA and its comparison with UCA, has been given by Bardine et al. (2007). By developing an energy model for NUCA components and simulating various UCA, S-NUCA and D-NUCA architectures, Bardine et al. (2007) show that D-NUCA caches are candidates to be the best performing and most energy efficient of the three classes of cache architectures. They find that the static components dominate energy dissipation in NUCA designs, making the reduction of static power consumption a substantial issue for NUCA caches in deep sub-micron CMOS processes.
Power-aware cache architectures
Many techniques have been proposed to dynamically adapt the cache size to the working set size (Venkatachalam and Franz, 2005). Most of them focus on performance and dynamic power saving and are not directly applicable to reducing static power consumption. Albonesi (1999) proposed the selective cache ways technique, in which the dynamic energy consumption of L1 set-associative caches is reduced by dynamically disabling cache ways upon software commands. This technique also takes advantage of a sub-banked cache design, but no assumptions are made on the access distribution across the ways, because the reference architecture of Albonesi (1999) is not NUCA. In addition, our technique focuses on L2 caches, for which the static power component is more relevant (Bardine et al., 2007). For this kind of cache, it is straightforward to leverage the sub-banked architecture to directly turn off cache banks, saving both dynamic and static power. Balasubramonian et al. (2000) proposed an adaptive set-associative cache structure, the accounting cache, in which the LRU ordering of data lines is leveraged to implement a two-step incremental search on cache ways, so that on average only a small number of ways is activated upon each access. A data swap mechanism between ways is used to maintain a physical LRU ordering of data. This kind of migration has some similarities with the D-NUCA promotion mechanism; however, in the accounting cache its purpose is to maintain LRU ordering between the ways in a set of an ordinary cache, whereas in a NUCA architecture the non-uniformity of the data access time between different sub-banks is leveraged to lower data access latency. Finally, Dropsho et al. (2002) propose a run-time configurable on-chip cache, in which a single hardware cache is virtually partitioned into two levels. The sizes, associativities and latencies of the first and the second cache level are dynamically chosen by a resizing algorithm invoked at regular execution intervals, based on hardware performance counters (cache miss rate, IPC and branch frequency). Adopting a single reconfigurable NUCA architecture for L1 and L2 caches would lead to an impractical design complexity and potentially harm first-level cache performance. All these techniques require an always powered-on cache to reduce miss rate and dynamic power consumption.
The techniques focusing on the reduction of static power consumption rely on putting memory cells in a low-leakage mode, either losing or maintaining their data contents. MTCMOS caches (Hanson et al., 2003), drowsy caches (Flautner et al., 2002) and slumberous caches (Mohyuddin et al., 2005) are data-preserving techniques, while cache decay (Hu et al., 2002) is a data-losing technique. Although data-preserving techniques such as drowsy caching appear to be a profitable way to lower static power consumption without impacting the cache miss rate, they rely on a fine-grained adjustment of SRAM cell electrical parameters, which become less controllable as transistor feature sizes decrease and process variation effects grow (Calhoun and Chandrakasan, 2006; Bhattacharya et al., 2008). Conversely, data-losing techniques such as decay caching have a non-negligible hardware overhead related to their fine-grained control of cache data structures (i.e., they operate at cache line granularity, while WA D-NUCA caches operate at way granularity). In addition, in order to utilise either kind of technique in a D-NUCA cache, the characteristic concepts and parameters on which they rely need to be redefined: because of the promotion/demotion mechanism, D-NUCA caches exhibit a different access pattern to data lines with respect to a UCA cache. However, it is worth noting that the WA D-NUCA technique is potentially orthogonal to these other techniques: while the former acts by increasing or decreasing the cache size/associativity, the latter work at block granularity without further modifying the cache behaviour. Thus, it is possible to adopt them jointly in a D-NUCA cache, obtaining the benefits of both.
The way adaptable cache (Kobayashi et al., 2005 ) is a data losing technique proposed for traditional cache architectures that predicts cache needs adopting a metric based on the LRU/MRU state of each way and turning on and off entire ways based on a random choice. The decisions are triggered by an algorithm which has a similar structure to the one that we propose; however, the direct application of this technique to a D-NUCA would imply the complexity of calculating the LRU/MRU state; furthermore, in a D-NUCA cache, the random choice of a way to turn off would not take into account the actual usefulness of data.
Finally, our work is related to the more general context of detecting program phases (Sherwood and Calder, 1999) to adapt the number of active resources to the need of the running program. In Dhodapkar and Smith (2003) , a comparison of ideal and actual implementations of such techniques is proposed.
Conclusions
In order to limit the wire-delay effects of large last level SRAM caches, non-uniform cache architecture (NUCA) has been proposed in the literature; in particular, the D-NUCA cache architecture has been identified as a high-performance solution, able to combine the reduced miss rates of high-capacity caches with the low latencies of small SRAM banks. However, like other L2 cache structures, D-NUCA caches are affected by high power requirements.
In this work, we present way adaptable D-NUCA, a lightweight hardware technique which is able to maintain the high performance level of D-NUCA caches while significantly improving their power efficiency by dynamically adapting the size of the cache to the needs of the running application.
This size adaptation technique leverages the non-uniform distribution of accesses on banks due to migration and uses a simple prediction algorithm to trigger reconfigurations at way granularity with a very limited impact on performance: in a WA D-NUCA cache, the average number of active ways can be reduced by 35.4% with respect to a conventional D-NUCA, with a slight 2.96% average IPC reduction. Consequently, the average energy consumption is reduced by 33.36% and the average EDP is improved by 31.2%.
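To make the idea of way-granularity reconfiguration concrete, the sketch below shows one possible shape of a threshold-based controller evaluated at the end of each interval. It is a purely illustrative assumption: the metric (share of hits falling in the farthest active way), the function name and both threshold values are hypothetical and are not the prediction algorithm or the tuned parameters of this work.

```python
def next_active_ways(hits_per_way, active_ways, total_ways,
                     off_threshold=0.01, on_threshold=0.05):
    """Return the number of active ways to use in the next interval.

    hits_per_way[i]: hits observed in way i during the last interval,
                     with way 0 the closest to the cache controller.
    Hypothetical policy: switch off the farthest active way when it
    receives a negligible share of hits, switch on one more way when
    the farthest active way is still heavily used.
    """
    total_hits = sum(hits_per_way[:active_ways])
    if total_hits == 0:
        return active_ways  # no information: keep the configuration
    farthest_share = hits_per_way[active_ways - 1] / total_hits
    if farthest_share < off_threshold and active_ways > 1:
        return active_ways - 1
    if farthest_share > on_threshold and active_ways < total_ways:
        return active_ways + 1
    return active_ways


# Hits concentrated in the first ways: the farthest active way is shed.
shrink = next_active_ways([900, 80, 15, 5, 0, 0, 0, 0], 4, 8)
# Farthest active way still heavily used: one more way is powered on.
grow = next_active_ways([500, 300, 200, 0, 0, 0, 0, 0], 3, 8)
```

A policy of this kind only works because migration concentrates hits in the ways nearest the controller; with a uniform hit distribution across ways, the farthest way would rarely fall below the switch-off threshold.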
We defined a methodology to determine the parameters of the prediction algorithm through simulation and evaluated the sensitivity of the technique to such parameters. We showed that they can either be tuned to target a specific application, in order to achieve a significant energy consumption reduction or to follow a specific energy/performance trade-off, or be chosen to cover a wider class of applications while maintaining a good performance level without the need to be recalculated.
A large portion of the current microprocessor market is dominated by CMP systems (Azimi et al., 2007). Such systems are often characterised by a large shared last level cache whose leakage energy consumption is a growing problem for modern design processes. We showed how our way-adaptable technique can be profitably applied to such systems, evaluating dual-core systems equipped with a D-NUCA cache and running multi-programmed workloads. We showed that in this case, too, it is possible to achieve a significant reduction of the EDP (24% on average) while maintaining negligible performance losses (2% average degradation of speedup with respect to a baseline D-NUCA).
