On-chip memory is a dominant source of power and energy consumption in modern and future processors. This paper explores the use of an emerging non-volatile memory technology, Spin Torque Transfer RAM (STTRAM), as a replacement for SRAM-based lower-level caches. While STTRAM reduces leakage energy by 90% compared to SRAM, the dynamic energy of a write operation is 2X that of SRAM. Consequently, we propose additional microarchitectural optimizations that reduce overall dynamic energy by an average of 30% over the base case, with a range of 16% to 60% across 10 benchmarks.
INTRODUCTION
The projected energy growth of multicore processors, if left unimpeded, will stall the performance scaling that has accompanied Moore's Law. The dominant sources of on-chip energy consumption are the cores and the memory hierarchy. L2/LLC caches are unique consumers of energy: because they are optimized for low miss rates, they have extremely low utilization, e.g., in the range of 10%-20% [1], and their energy consumption is dominated by leakage. This paper explores the use of an emerging non-volatile memory technology, Spin Torque Transfer RAM (STTRAM), as a replacement for SRAM-based L2 and LLC caches. The goal of this work is to understand the challenges posed by the use of STTRAM technology in the cache hierarchy and to develop microarchitecture solutions to address them.
Figure 1: Design Space Exploration of Dynamic Energy Vs Latency
The main challenge of STTRAM is the high energy cost of write operations. Wu et al. [2, 3] have proposed solutions using partitioned caches: fast SRAM partitions coupled with slower MRAM partitions. Zhou et al. [4] proposed a novel circuit optimization for early termination of write operations to an STTRAM array if the write value matches the stored value. Other efforts have investigated the use of non-volatile phase change memory (PCRAM) [5, 6], although these studies have largely focused on the use of PCRAM as a potential DRAM replacement. Our approach differs fundamentally in that we start with the circuit design of STTRAM cells/arrays driven from an architectural perspective (the clock rate) to reach a supporting energy-latency design point. Consequently, the physical access properties of our STTRAM design are coupled with the clock rate goals of the data path and differ from the nominal values assumed in prior studies. These designs also serve as the basis for quantifying the performance of the new architectural optimizations proposed in this study.
The physical access properties of an STTRAM cell/array design are determined by the clock rate of a target microprocessor. For a non-volatile 1T1MTJ cell, the latency and energy of a read operation are comparable to those of its SRAM equivalent, while cell write energies and latencies remain only 2X and 3X higher, respectively. This substantive narrowing of the disparity between SRAM and STTRAM characteristics is determined by a nominal clock rate and highlights the tradeoffs that admit different architectural solutions to mitigate the increase in dynamic energy due to store instructions. Our results are summarized in Figure 1, which places them on an execution time vs. dynamic energy plane for a set of representative benchmarks. We see that drop-in replacement with an STTRAM of equivalent logical size produces an increase in dynamic energy and execution time due to store instructions, while our micro-architecture optimizations tailored to STTRAM caches recover most of the dynamic energy increase over SRAM implementations, maximizing the leakage energy gains. Constant-area replacement produces significant execution time gains due to lower miss rates but retains some increase in energy due to the increased energy of store instructions.
The remainder of the paper describes the technology (Section 2), microarchitecture optimizations (Section 4), and results (Section 5).
STTRAM: ARCHITECTURE AND CELL DESIGN
The structure of an STTRAM cell consists of a magnetic tunneling junction (MTJ) connected in series with a transistor, as shown in Figure 2. The MTJ consists of two ferromagnetic layers separated by a dielectric layer (usually MgO). While the magnetization of one ferromagnetic layer is fixed, that of the other layer can be controlled by the injection of spin-polarized electrons. Switching occurs if a current exceeding the critical value flows through the structure in the proper direction. For storage purposes, parallel magnetization is taken to correspond to 0 and anti-parallel magnetization to 1. In an SRAM, a write is a non-timing-critical cell access operation, but the same does not hold for STTRAM. In STTRAM the write takes much longer, so the cell needs to be designed to meet a target write time (given a clock rate), and reading becomes the non-timing-critical cell access operation. We chose a design target of a 2GHz system clock. First, we design the 45nm SRAM cell to achieve a read access time (considering wordline driver delay and bitline discharge delay) of less than 250ps for a 64x128 (L2) or 128x128 (LLC) sub-array. The remaining clock cycle is divided between row-decoder and sense-amplifier delay. We have ensured the cell timing is met even under worst-case 3σ Vt variations. The SRAM cell is designed to meet the target delay for the 64x128 sub-array. The bitline and wordline capacitances are estimated based on the cell area of 80F^2 (where F is the feature size), the wide cell layout (the cell width along the wordline direction is twice the cell height along the bitline direction, i.e., a 2:1 aspect ratio), and 0.2 fF/mm metal capacitance. The designed cell read, write, and leakage energies are shown in Table 1. Our STTRAM cell has a write time of 5ns and a read time of 0.5ns. We compute the corresponding current capacities from [7]. At the 180nm node this required 450µA, or a current density of 2.6x10^6 A/cm^2.
We considered the 2X area scaling model per generation for MTJ devices and assumed constant current density. Our results were a close match with the properties of fabricated MTJ devices at 65nm [8]. Based on this assumption, we predict a required switching current of 75µA for a 45nm MTJ device at 5ns switching time. Considering a 7% write margin (from the switching current), we have selected a write current of 80µA. The read latencies of STTRAM and SRAM are assumed to be the same, which is consistent with prior studies [4, 9]. Coupling this with the estimated switching current requirements, we find that the STTRAM read energies evaluated with our model are significantly lower than those previously reported in the literature. In designing the STTRAM cell, we restrict our operating condition to the exponential part of the switching characteristic curve [7]. This can be viewed as an attempt to balance the write energy and latency realistically at the knee of the energy-latency characteristic for STTRAM. We use the above-mentioned cell for the L2 and LLC sub-arrays. The additional wiring to/from the sub-arrays and the predecoder energy components are computed using Cacti (version 6) [10]. For the L2, our design uses a 64x128 cell sub-array and 32KB banks interconnected to form a 1 MB L2. We use Cacti in the uniform SRAM cache architecture configuration to model an SRAM data and tag array with 64B cache lines. The energy and latency values are given in Table 1. In our L2 and LLC design we assume the tags use the SRAM cell design and the data array uses the proposed STTRAM cell design. This is because tag arrays hold the replacement stack values, which change more frequently. Most reported work has demonstrated increased read latency for STTRAMs in comparison to SRAMs. Figure 3 shows the impact of such an increased read latency on application performance.
To study the impact of increased latencies, we simulated the benchmarks described in Section 5 at read latencies of 1X, 2X and 3X that of SRAM. The write latency (5ns) and the write energy were held constant. The drop in IPC is application dependent and can be significant for memory-intensive applications, despite the out-of-order core modeling. We did not find a substantial change in energy due to increased read latency. The architectural optimizations discussed in Section 4 are based on a 1X read latency.
BASELINE RESULTS
This section describes relevant features of our simulation methodology and the cache hierarchy that we model (Figure 6). Our analysis is done using Zesto [11] with the system configuration specified in Table 2. We model an out-of-order core with write-allocate and write-back policies for all caches. We first model a drop-in replacement of the SRAM caches with equivalent-sized STTRAM caches using the cell design of Section 2. The studies evaluate the dynamic energy spent in the L2 and LLC of Figure 6. The study serves two purposes. First, it quantifies the energy impact of STTRAM when used as a pure technology replacement for SRAM. Second, it identifies the need and opportunity for micro-architecture optimizations. As pointed out in Section 2, the latency of a write operation for our STTRAM cell is 3X greater than that of its SRAM counterpart. There is, however, a negligible IPC penalty of less than 1% across all the benchmarks. The leakage energy savings from STTRAM as a technology replacement are significant, as pointed out by [3]. Figure 4 shows the total leakage energy consumed in executing 250 million instructions of various SPEC applications. On average we see a 90% leakage energy saving, ranging from 86% to 95.7%, for our STTRAM cell design. This is in contrast to current cache decay techniques, which have shown a 10%-15% reduction in leakage energy [13]. Our aim, however, is to mitigate the dynamic energy for writes through optimizations within the micro-architecture. Figure 5 shows the increase in dynamic write energy from such a drop-in replacement.
As a result of our preceding analysis we propose two optimizations with the aim of minimizing the cost of store instructions. We group stores into two main categories: stores due to cache misses, and stores due to evictions of dirty lines from the (n − 1)th layer cache, n being the position of the cache in the memory hierarchy. While stores from misses are mainly due to limitations on the physical parameters of the cache, the latter are an effect of the eviction policy of the (n − 1)th layer.
ARCHITECTURAL OPTIMIZATIONS
Architectural optimizations are proposed to recover the increased dynamic energy due to store instructions by coalescing stores from the L1 to the L2. The idea is to increase the residency of dirty lines in the L1 to (ideally) accommodate all the stores to that line. This would prevent the line from being prematurely evicted to the L2 and subsequently moved back to the L1 on a near-term store miss. Figure 7 illustrates a 3D histogram of the inter-reference time (IRT) of the store operations from the L1 to the L2 for a sample benchmark, h264 from SPEC06, before and after the optimization. By coalescing the stores to the same line in the L2 one can reduce the peaks (high store frequencies) in the histogram. A similar analysis applies to the reduction of stores from L2 to the LLC. Increasing the residency of dirty lines in L2 minimizes the stores to the LLC, which are twice as expensive. If a store is the last reference before a line becomes dead (never referenced again before eviction), we refer to that store as a dead store. Otherwise the store is live. We now propose two optimizations for minimizing store operations to the L2 and LLC.
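To make the dead/live store distinction defined above concrete, the following is a minimal sketch that labels the stores in the event stream of a single cache line. The trace encoding ('L', 'S', 'E') and the function name are hypothetical, and treating the end of the trace as an eviction is an assumption made only for this illustration.

```python
from typing import List, Tuple

def classify_stores(events: List[str]) -> List[Tuple[int, str]]:
    """Label each store to one cache line as 'dead' or 'live'.

    events is the ordered event stream for a single line, using a
    hypothetical encoding: 'L' = load, 'S' = store, 'E' = eviction.
    A store is dead if it is the last reference before the eviction.
    """
    labels = []
    for i, ev in enumerate(events):
        if ev != 'S':
            continue
        # Assumption: running off the end of the trace counts as an eviction.
        nxt = events[i + 1] if i + 1 < len(events) else 'E'
        labels.append((i, 'dead' if nxt == 'E' else 'live'))
    return labels
```

For example, in the stream S, L, S, E the first store is live (the line is referenced again) and the second is dead.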
Write Biasing
Figure 6: Cache Hierarchy Model with memory operation cost in cycles.
We propose a novel line replacement algorithm for L1/L2 sets that biases replacement in favor of dirty lines to increase their residency, even at the expense of increasing read misses to the next level (since store operations are 2X more energy expensive than read operations). In the conventional Least Recently Used (LRU) replacement policy, the most recently referenced line on a hit or a miss is placed at the top of the reference stack (TOS). We propose new differential promotion and insertion policies that operate differently for loads and stores. The combination of these new insertion/promotion policies is referred to as write biasing. We define a parameter K, which is the distance from the top of the LRU stack at which a line is inserted or to which it is promoted. The following implementation is referred to as write biasing-base (WB-base).
• Insertion policy: Target lines of all load misses are inserted at distance K from the TOS. Target lines of all store misses are inserted at TOS.
• Promotion policy: On a hit, target lines of stores are promoted to the TOS. Target lines of loads are promoted to a distance K from TOS.
• Eviction policy: The line at the bottom of the stack is evicted.
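The three WB-base rules above can be sketched for a single cache set as follows. This is an illustrative model rather than the simulator code: the function name and list-based stack are hypothetical, and the choice that a load hit never moves a line downward (a line already above position K stays where it is) is an assumption, since the text only speaks of promotion.

```python
def wb_base_access(stack, tag, is_store, k, ways):
    """Apply one access to a set under WB-base.

    stack is a list of tags with index 0 = TOS. Returns the evicted tag
    on a miss in a full set, otherwise None.
    """
    evicted = None
    if tag in stack:
        # Hit: stores go to TOS, loads are promoted to distance K.
        cur = stack.index(tag)
        stack.pop(cur)
        pos = 0 if is_store else min(cur, k)  # assumption: never demote on a hit
    else:
        # Miss: evict from the bottom of the stack if the set is full.
        if len(stack) == ways:
            evicted = stack.pop()
        pos = 0 if is_store else min(k, len(stack))
    stack.insert(pos, tag)
    return evicted
```

With K = 2 in a 4-way set, two stores followed by a stream of loads leave the two stored lines at the top of the stack while the loaded lines compete for the positions at and below K.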
WB-base has two major drawbacks. First, the combination of insertion and promotion policies leads to increased miss rates for loads. If there is no dirty line in a set, all entries from TOS to position K will be vacant because clean lines cannot be promoted past position K. Thus we suggest a modification to the insertion and promotion policies for loads.
• Insertion policy for Loads: Target lines of loads are inserted at position K if all stack positions higher than K are dirty. Otherwise they are inserted at TOS, as in LRU.
• Promotion policy for Loads: Target lines of loads are promoted to position K if all stack positions higher than K are dirty. Otherwise they are promoted to TOS, as in LRU.
Second, one-time store operations (dead stores) can cause dirty lines to occupy stack positions higher than K for very long periods. This artificially inflates the load miss rate, which increases energy due to higher load miss allocation. Thus we suggest a modification to the insertion and promotion policies for stores that acts as a dead store filter.
• Insertion policy for Stores: Target lines of stores are always inserted at position K similar to loads.
• Promotion policy for Stores: Target lines of store hits to dirty lines are promoted to TOS. Store hits to clean lines are promoted to position K.
Figure 8 shows an example of the steps in write biasing with all the suggested policy modifications. We utilize write biasing in the L1 and L2. Its use in the L1 is particularly effective, as avoiding a single store due to eviction from the L1 to the L2 saves 261pJ (the write energy per access to the L2). We should point out that write biasing itself is not well suited to SRAM caches, where there is no disparity between the read and write energies and therefore nothing to be gained by simply inflating the read miss rate.
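The modified load and store policies can be combined into a single per-set update, sketched below. The names Line, top_k_all_dirty and write_bias_access are hypothetical, and two details are assumptions of this sketch rather than statements from the text: the dirty check is evaluated on the stack after the accessed line has been removed, and a hit never moves a line to a lower position than it already holds.

```python
from dataclasses import dataclass

@dataclass
class Line:
    tag: int
    dirty: bool = False

def top_k_all_dirty(stack, k):
    """Logical AND of the dirty bits of the k positions above position k."""
    return len(stack) >= k and all(line.dirty for line in stack[:k])

def write_bias_access(stack, tag, is_store, k, ways):
    """Apply one access to a set (index 0 = TOS); return the evicted Line or None."""
    evicted = None
    cur = next((i for i, l in enumerate(stack) if l.tag == tag), None)
    if cur is not None:
        # Hit: remove the line, then decide where to re-insert it.
        line = stack.pop(cur)
        if is_store:
            # Only stores to already-dirty lines reach TOS (dead store filter).
            target = 0 if line.dirty else k
            line.dirty = True
        else:
            target = k if top_k_all_dirty(stack, k) else 0
        pos = min(cur, target)  # assumption: a hit never demotes a line
    else:
        # Miss: evict the bottom line if the set is full, then insert.
        if len(stack) == ways:
            evicted = stack.pop()
        line = Line(tag, dirty=is_store)
        if is_store:
            pos = k
        else:
            pos = k if top_k_all_dirty(stack, k) else 0
    stack.insert(min(pos, len(stack)), line)
    return evicted
```

When a set holds no dirty lines, loads are inserted and promoted at TOS, so the set behaves like plain LRU; the bias only takes effect once the top K positions are all dirty.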
Write Cache
Several benchmarks exhibited higher store inter-reference times. To further extend the ability to coalesce stores in these situations we propose a second optimization: the addition of a small write cache between the L1 and L2. The write cache (WC) is a fully associative 32/64-way cache that contains only dirty lines evicted from the L1, whose contents are mutually exclusive with the L2, and that is accessed in parallel with the L2, with a hit in one inhibiting access to the other (we do account for the energy of parallel look-ups). On a store, if a line is present in the L2 it is transferred to the WC. Lines evicted from the WC are sent to the L2 and inserted as dirty lines. The WC and the L2 share the MSHR, which allows us to allocate store misses exclusively in the WC whereas load misses are allocated in the L2. The insertion and eviction policy for the WC is described in Figure 9. We insert at a fixed position K from TOS in the replacement stack of the 32-way cache. On a conflict we evict the line lowest in the stack. When a store hits in the WC (live store), the line is promoted to the top of the replacement stack. Load hits do not change the priority order. The value of K is a tradeoff between the number of live stores we want to keep in the WC and the amount of time we want to give a dirty line to get a store hit (and so be recognized as a live store) before it is eventually evicted from the bottom of the stack on the assumption that it is a dead store.
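The WC replacement behavior described above (insert at position K, promote on store hits only, evict from the bottom) can be sketched as a small class. The class and method names are hypothetical, and the sketch models only the replacement stack; the parallel L2 lookup, MSHR sharing and energy accounting are left out.

```python
class WriteCache:
    """Replacement stack of a fully associative write cache (index 0 = TOS)."""

    def __init__(self, entries=32, k=16):
        self.entries = entries
        self.k = k
        self.stack = []  # tags of the dirty lines currently held

    def insert(self, tag):
        """Insert a dirty line at position K; return the tag evicted to L2, if any."""
        victim = None
        if len(self.stack) == self.entries:
            victim = self.stack.pop()  # lowest line in the stack
        self.stack.insert(min(self.k, len(self.stack)), tag)
        return victim

    def access(self, tag, is_store):
        """Return True on a hit. Store hits promote to TOS; load hits keep the order."""
        if tag not in self.stack:
            return False
        if is_store:
            self.stack.remove(tag)
            self.stack.insert(0, tag)
        return True
```

A line that never receives a store hit drifts from position K to the bottom and is evicted, whereas a line that is stored to again moves to TOS and gains residency. A smaller K keeps more positions reserved for lines already known to be live; a larger K gives new lines longer to prove themselves.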
PERFORMANCE EVALUATION
All experiments report the energy spent in the L2 and the LLC to execute 250 million instructions. The following evaluation and analysis is structured around understanding two phenomena: write biasing policies in the L1 and L2 data caches, and the behavior of the WC. In particular, the interactions between these two optimizations produce subtle migration patterns of lines, in response to which we further refine our optimizations as described. Whenever a store from the L1 hits in the L2, the line is migrated to the WC, causing high traffic in the WC. Thus we need to refine the migration pattern. Write biasing in the L2 promotes live stores higher in the stack, thus filtering out dead stores. We therefore modify the policy to migrate a line to the WC (from the L2) only if it is in the Most Recently Used (MRU) position in the L2. We call this Selective Write Cache Migration (SWCM). We assume SWCM in all results presented. Write biasing in the L1 determines the number of stores placed in the L1 and their duration. This in turn determines the traffic going into the WC. A high store miss rate in the L2 leads to very high traffic in the WC because all store misses are allocated in the WC. Figure 10 shows the results of write biasing in the L1 with K=1 (without the WC optimization). As can be seen, write biasing tends to increase the L1 miss rate and hence increases the read energy of the L2. However, the 2X factor of write energy and the reduction in store operations help offset most of this energy increase. This behaviour can be seen in benchmarks such as sjeng and gcc. Benchmarks like h264 have a high store miss rate. Thus keeping stores in the L1 helps the overall miss rate, consequently decreasing both the L2 read and write energies.
Figure 11 shows the simulation results for the best performing configurations of write biasing in the L1 and L2 with the WC. The results use the following policy nomenclature: K value in L1 : K value in L2 : K value in WC. For example, k1 k2 k16 implies K=1 in L1, K=2 in L2 and K=16 in WC. The results reflect the exploration of the degree of biasing represented by the selection of the value of K at each level. The WC employs an insertion policy parameterized by K, the insertion point in the LRU stack. Considering a 32-way WC, we analyzed values of K from 12 to 24 and determined that K = 16 was the most effective. Increasing the size of the WC beyond 32 entries provided little additional benefit. Figure 11 also shows how each of the optimizations compares to a base configuration of SRAM caches.
All the policies described are simple local decisions that have minimal hardware overhead. We use write biasing in caches of up to 16 ways. In general a K value of 2 performs well. The additional hardware cost is thus a K-input AND gate per set for performing the logical AND of the dirty bits of the K ways in the set (512 2-input AND gates in our example). We now present some insights into these policies.
Benchmark Analysis
The most effective configuration for a majority of the benchmarks is k1 k2 k16. Some benchmarks do well under k2 k4 k16 because they exhibit high store miss rates: a higher value of K increases the residency of dirty lines, improving the store miss rate and reducing traffic to the WC. Memory-intensive benchmarks like omnetpp and mcf do not have high store miss rates, but still have very high traffic to the WC. Thus lru lru lru does not do well in such applications. This is where Selective Write Cache Migration is effective in filtering dead stores. Benchmarks with good temporal locality of stores, such as h264, show dynamic energy reductions of as much as 60% due to high store coalescing. The same phenomenon is graphically visible in Figure 7.
Interaction of Write biasing and Write cache
The replacement policies for the three caches (L1, L2 and WC) interact with each other in the following way. 1) Write biasing in L1: A moderate intensity of write biasing in the L1 cache is effective in decreasing WC traffic. It potentially increases the store residency in the WC, thus increasing the chances of coalescing stores that have relatively larger inter-reference times. 2) Write biasing in L2: This acts as a dead store filter. By migrating live stores to higher positions in the replacement stack, it allows the implementation of SWCM. One-time stores thus get filtered, giving live stores the opportunity to occupy the WC for a longer time.
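The SWCM decision discussed above reduces to a position check on the L2 replacement stack, contrasted below with unconditional migration. Both function names are hypothetical, and checking the line's position before the current store updates the stack is an assumption of this sketch.

```python
def migrate_always(l2_stack, tag, is_store):
    """Unrefined policy: every store that hits in the L2 migrates its line to the WC."""
    return is_store and tag in l2_stack

def migrate_swcm(l2_stack, tag, is_store):
    """SWCM: migrate only if the hit line is at the MRU position (index 0) of the L2 set."""
    return is_store and len(l2_stack) > 0 and l2_stack[0] == tag
```

Because write biasing in the L2 lets only stores to already-dirty lines reach the top of the stack, a line found at the MRU position has typically been stored to more than once, which is why this check filters one-time stores.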
CONCLUDING REMARKS
In this paper we have proposed and explored optimizations for energy-efficient STTRAM L2 and LLC caches, beginning with a clock-rate-driven cell design and supported by two microarchitectural optimizations that recover most of the dynamic energy increase due to STTRAM write operations, thereby maximizing the leakage energy gains of the technology. The design approach can be used to tailor cache designs for specific datapath environments and their workload characteristics. Future work will focus on furthering the energy gains via adaptive write biasing schemes to match workload variations.
