Abstract-For the sake of higher cell density while achieving near-zero standby power, recent research progress in Magnetic Tunneling Junction (MTJ) devices has leveraged Multi-Level Cell (MLC) configurations of Spin-Transfer Torque Random Access Memory (STT-RAM). However, in order to mitigate the write disturbance in an MLC strategy, data stored in the soft bit must be restored immediately after the hard bit switching is completed. Furthermore, as a result of MTJ feature size scaling, the soft bit can be expected to become disturbed by the read sensing current, thus requiring an immediate restore operation to ensure data reliability. In this paper, we design and analyze novel Adaptive Restore Schemes for Write Disturbance (ARS-WD) and Read Disturbance (ARS-RD), respectively. ARS-WD alleviates restoration overhead by intentionally overwriting soft bit lines which are less likely to be read. ARS-RD, on the other hand, aggregates the potential writes and restores the soft bit line at the time of its eviction from the higher level cache. Both schemes are based on a lightweight forecasting approach for the future read behavior of the cache block. Our experimental results show a substantial reduction in soft bit line restore operations, delivering a 17.9 percent decrease in overall energy consumption and a 9.4 percent increase in IPC, while incurring negligible capacity overhead. Moreover, ARS promotes the advantages of MLC to provide a preferable L2 design alternative in terms of energy, area and latency product compared to SLC STT-RAM alternatives.
enhance cache capacity [5], [6]. Compared to the Single Level Cell (SLC) structure, one more Magnetic Tunnel Junction (MTJ) is placed into a single MLC STT-RAM cell. These two MTJs, whose feature sizes are maintained differentially to meet certain Tunneling Magneto-Resistance (TMR), can be implemented either in plane or perpendicularly. Various line-to-bit mapping strategies (e.g., direct mapping, cell split mapping) are enabled by the distinct access characteristics of the small and large MTJs in an MLC. Despite these design features, reliability remains a current design challenge [7]. In the write operation of an MLC, the switching current for the large MTJ (hard bit) is able to change the magnetization direction of the corresponding small MTJ (soft bit), which is called Write Disturbance (WD). To rectify WD, on every write access to a hard bit, data stored in the soft bit of the same MLC must be sensed out first and restored after the write is completed. On the other hand, in order to read data without disturbing the MTJ configuration, a small sensing current is applied to the bit line in the MLC. However, with the continued scaling of MTJ feature size, the sensing margin narrows because the read current is mostly unchanged [8] while the switching current continues to be reduced. As a result, reading an MLC is likely to disturb the value stored in the soft bit, which is commonly referred to as Read Disturbance (RD). Analogous to the solution for WD, an immediate restoration is required after every read of an MLC. Although a restore scheme guarantees data integrity, it introduces extra reads and writes and is energy inefficient in some scenarios. For instance, suppose a cache block is impacted by WD. If that block is not subsequently read prior to its eviction, then its disturbance is insignificant. Also, if a block is about to be updated, immediately restoring it after RD turns out to be a superfluous write operation.
In this paper, we propose energy-efficient restore schemes for MLC STT-RAM based cache, collectively called the Adaptive Restore Scheme (ARS), for WD (ARS-WD) and RD (ARS-RD), respectively. Leveraging the non-inclusion property of the modern multi-level cache hierarchy and, more importantly, the predictable access behavior of cache blocks, ARS-WD chooses to overwrite the soft bit line when it is less likely to be read in the near future. ARS-RD, on the other hand, postpones the correction of the disturbed block until the eviction of its most updated version from the higher level cache. To enable ARS, we first define the concept of Read Reuse Distance (RRD), which is the timing distance between two consecutive read accesses to the same cache block. We also develop a lightweight yet precise RRD predictor inspired by an existing cache line reuse distance prediction design [9], [10]. The RRD predictor samples memory access streams (hashed values of the program counter) to calculate RRD at run time, and the RRD prediction table is updated for higher accuracy upon each incoming pair of PC and RRD. Furthermore, to determine whether it is harmless to restore adaptively, a threshold RRD with high coverage is adopted and compared with the Estimated Distance to the next Read (EDR). Our comprehensive experimental results show that ARS attains the full potential of MLC for L2 deployment.
The following contributions are developed in this paper:
ARS-WD, a low-overhead restore scheme, is proposed to deal with WD in MLC STT-RAM cache.
ARS-RD is developed to mitigate the restore overhead for RD in MLC.
A lightweight PC-based RRD predictor is developed to forecast the read access behavior of a cache block.
A trustworthy threshold RRD is adopted based on the analysis of a wide variety of memory access traces.
Comprehensive evaluations are conducted in terms of energy, access latency, IPC, etc. The results confirm the significant benefits delivered by ARS.
The remainder of this paper is organized as follows. Section 2 presents the background of MLC STT-RAM cache. In Section 3, WD is analyzed first, then the overhead of the conventional restore scheme is assessed, and finally ARS-WD is presented. Section 4 is organized in the same manner to present the design of ARS-RD. Section 5 presents the RRD prediction method and implementation. In Section 6, extensive application-based experiments are conducted using the PARSEC benchmark suite [11] to evaluate the proposed strategies. Advances beyond related work are summarized in Section 7 and the conclusions are provided in Section 8.
BACKGROUND
In this section, we introduce the basics of MTJ operation, Single Level Cell (SLC) STT-RAM, and Multi-Level Cell (MLC) STT-RAM. Two-step read and write schemes, and mapping strategies are then described for MLC designs. Lastly, the non-inclusive multi-level cache hierarchy model is presented.
Single Level Cell STT-RAM
STT-RAM is an emerging non-volatile memory which can provide SRAM-like read speed, DRAM-like density, and near-zero leakage power. Unlike traditional on-chip cache technologies using either latches (SRAM) or capacitors (eDRAM) as storage units, STT-RAM employs a non-volatile device, the MTJ, to store the data value. Each MTJ consists of two ferromagnetic layers (free and reference layer) and an oxide barrier (MgO) sandwiched between them. The magnetization direction of the reference layer is fixed, while that of the free layer can be switched. MTJ resistance is determined by the relative magnetization direction of the two ferromagnetic layers and is thus programmable. As shown in Figs. 1a and 1b, the magnetization directions of the two ferromagnetic layers can be tuned to either parallel or anti-parallel, indicating whether the MTJ is in a low resistance state (logical 0) or a high resistance state (logical 1), respectively. Fig. 1c shows a popular '1T1J' structured STT-RAM cell where each MTJ is connected to an access transistor. In the write operation, a high voltage is applied between the source line (SL) and the bit line (BL), generating a current across the MTJ and switching the magnetization direction of the free layer. When reading data from an STT-RAM cell, a small sensing current is injected to generate a bit line voltage (V_BL). This V_BL is then compared with a reference voltage in order to decide whether a logical 1 or a logical 0 is stored in the cell. Similar to other non-volatile memories, such as Phase Change Memory (PCM) and NAND Flash, write is more costly than read for STT-RAM in terms of latency and energy consumption. The expensive write operations also wear out MTJs gradually, and thus wear leveling techniques are required to distribute the writes uniformly. Moreover, the MTJ switching behavior is essentially an asymmetrical and stochastic process.
Multi-Level Cell STT-RAM
To further improve the density of STT-RAM, MLC designs have been introduced and studied recently. There are two varieties of MLC structures, namely, serial MLC shown in Fig. 2a and parallel MLC shown in Fig. 2b . The serial MLC stacks two MTJs vertically in a single memory cell and has been proven to be more reliable and easier to fabricate. On the other hand, the parallel MLC utilizes a single MTJ with a free layer having two independent fields, which leads to a smaller critical switching current compared to its serial counterpart. We considered the serial MLC design in this paper since it is more practical and has been adopted in various implementations [12] . However, our proposed technique is also applicable to the parallel structure. In the serial MLC, two MTJs must occupy different cell areas so that four differential resistance states can be achieved. We refer to the small and large MTJs as the soft bit and hard bit storage respectively. Under constant resistance-area product, the soft bit, with larger resistance, is easier to be flipped than the hard bit because the soft bit requires a smaller switching current.
The read and write schemes of MLC STT-RAM have been well studied in [13]. Reading a two-digit value from an MLC requires two comparisons, as shown in Fig. 3a. Recall that the sensing current passing through the MTJs will produce a V_BL. This V_BL is first compared with the reference voltage V_ref1 to decide the value of the soft bit, then a second comparison with V_ref2 or V_ref3 is done to decide the hard bit. For writes, given knowledge of the previous bit values stored in the MLC, the resistance state transitions are depicted in Fig. 3b and described as follows:
No Transition: MTJs remain at the original state when the incoming two-digit value is identical to the previous one.
Soft Transition: A weak current I Low , which will only affect the small MTJ, is applied to switch the soft bit only.
Hard Transition: If the hard bit needs to be changed and the soft bit needs to have the same value as the hard bit, then a strong current I_High is used to switch both MTJs.
Two-step Transition: When the hard bit needs to be flipped and the target two-digit number has different values for both bits, then a hard transition is conducted first followed by a soft transition.
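The four write cases above can be summarized in a short sketch. This is an illustrative classification only; the function name, the string labels, and the (hard, soft) tuple encoding are assumptions introduced here and do not come from the referenced design.

```python
def classify_transition(old, new):
    """Classify an MLC write given (hard_bit, soft_bit) tuples.

    Hypothetical helper: the labels and the tuple layout are
    illustrative assumptions, not part of the referenced write scheme.
    """
    old_hard, _old_soft = old
    new_hard, new_soft = new
    if old == new:
        return "none"        # stored value already matches
    if old_hard == new_hard:
        return "soft"        # weak current I_Low flips only the small MTJ
    if new_hard == new_soft:
        return "hard"        # strong current I_High sets both MTJs alike
    return "two-step"        # hard transition, then a soft transition


if __name__ == "__main__":
    for old, new in [((0, 0), (0, 0)), ((0, 0), (0, 1)),
                     ((0, 1), (1, 1)), ((0, 0), (1, 0))]:
        print(old, "->", new, ":", classify_transition(old, new))
```

Note that the classification depends on knowing the old value; when the old value is unknown, a hard bit write has to be treated as the two-step case.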
Cell Split Mapping Strategy
Due to the distinct features of soft bits and hard bits, line-to-bit mapping strategies, such as direct mapping, cell split mapping, and interleaved mapping [5], [14], have been investigated to enhance MLC cache performance. In direct mapping, as shown in Fig. 4a, both bits in an MLC are mapped to the same cache line, so only one step is required to write the line; however, the fact that soft bits are easier to access than hard bits is neglected. Interleaved mapping leverages the non-uniform access frequency within a data block, but it requires additional address decomposition to support mixed block modes.
We choose the cell split mapping strategy in our design, which constructs an entire cache line using the favorable soft bits. In other words, soft and hard bits in a MLC are mapped to two cache lines, referred as Soft Bit Line (SBL) and Hard Bit Line (HBL), respectively. As shown in Fig. 4b , N MLCs construct two N-bit cache lines, where all the hard bits form an HBL and the corresponding soft bits in the same cells compose an SBL. By doing so, only one-step read and write are required to access the block in SBL. Furthermore, the fast and low-power feature of soft bits can be exploited by migrating the frequently accessed blocks to SBL.
Non-Inclusive Multi-Level Cache
Different inclusion models (inclusive, non-inclusive and exclusive) can be employed for a multi-level cache. In this paper, we adopt a non-inclusive cache hierarchy, as industry has been trending towards it [15]. Also, the non-inclusion property enables write bypassing techniques that enhance cache performance and energy efficiency [16], [17]. Fig. 5 describes a non-inclusive two-level cache with write-back policy. Triggered by a read miss on L2, the data block is fetched from the main memory and allocated to both L1 and L2, which is referred to as a cache fill. On the other hand, when a block is evicted from L2 due to the replacement policy, there is no need to invalidate the victim block's replica in L1 (such invalidation is referred to as back invalidation). In other words, compared to the inclusive cache, which guarantees that L1 is a subset of L2, a non-inclusive cache can accommodate substantially more data blocks in the same physical space. The capacity of a non-inclusive hierarchy lies between the size of L2 and the sum of all cache levels. Therefore, non-inclusive caches typically outperform inclusive caches by trading off the snoop filtering effect [18]. When a block in L1 is selected for eviction, L1 tests its status first. If it is dirty, then this evicted block will be written back to L2. If it is clean, then this block will be overwritten by the incoming block without write-back.
WRITE DISTURBANCE IN MLC
In this section, we first elaborate the root causes of the inevitable write disturbances in MLC. We then demonstrate the significant energy overhead incurred by the naive Immediate Restore Scheme (IRS), which is adopted to overcome the disturbances. Finally, ARS-WD is proposed to mitigate the energy overhead incurred by restores. In order to enable ARS, the read reuse behavior of a block becomes critical. We will elaborate RRD and EDR in Section 5.1, and the adoption of threshold RRD value (RRD th ) in Section 5.3.
To facilitate our discussion, an eight-way associative cell split mapping cache set is depicted in Fig. 6. Here a_HBL is a cache data block stored in an HBL, and b_SBL is another block saved in the SBL corresponding to a_HBL; both are in L2 of a non-inclusive cache hierarchy.
Motivation
As we described in Section 2, the switching current I w flows through an MTJ and changes its magnetization direction. The write current value is proportional to its MTJ area [19] as
where J_c0 is the critical current density at zero temperature; T_v is the switching current duration; C and α denote fitting parameters. As the feature size continues to shrink, the MTJ area will decrease exponentially. Therefore, the switching current amplitude continues to decrease as technology scales to the 22 nm feature size and beyond, as illustrated in Fig. 9. In a serial MLC, switching the large MTJ requires a higher current according to Equation (1), which will overwrite the value stored in the small MTJ. To rectify WD, upon each hard bit write request, data stored in the soft bit has to be read out first, then restored immediately after the hard bit update is completed. In other words, because the original value stored in the MLC is unknown, a two-step transition is always adopted in the hard bit write to ensure data accuracy.
Energy Cost of Handling Write Disturbances
This inefficient yet necessary IRS for correcting write disturbance incurs significant performance and energy consumption overheads. In particular, we illustrate the energy overhead caused by IRS on a cell split mapping MLC STT-RAM cache. Assume E_PC is the dynamic energy consumed per access by the cache peripheral circuitry (e.g., the address decoder), E_WSBL and E_RSBL are the average write and read energy of an SBL per request, and N_WHBL is the number of write accesses to HBLs. Upon every write to an HBL, the corresponding SBL has to be read (E_RSBL) and backed up in the write buffer first; after the HBL write is completed, the address is decoded and the SBL is written back (E_PC and E_WSBL). The energy overhead spent on the IRS for WD is E_IRS-WD, modeled as below:
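Reading the description above term by term, each HBL write pays one SBL read, one peripheral access, and one SBL write, which suggests E_IRS-WD = N_WHBL × (E_RSBL + E_PC + E_WSBL). The sketch below evaluates that form; the formula is inferred from the surrounding definitions, and the function name and the example numbers are purely illustrative assumptions.

```python
def irs_wd_energy(n_whbl, e_rsbl, e_pc, e_wsbl):
    """Restore energy for write disturbance under the immediate scheme.

    Assumed form (inferred from the text, not quoted from it):
        E_IRS-WD = N_WHBL * (E_RSBL + E_PC + E_WSBL)
    All energies must share one unit (e.g., nJ per access).
    """
    return n_whbl * (e_rsbl + e_pc + e_wsbl)


if __name__ == "__main__":
    # Illustrative numbers only; they are not measurements.
    print(irs_wd_energy(n_whbl=10, e_rsbl=1.0, e_pc=0.5, e_wsbl=2.0))
```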
Fig. 7 demonstrates the dynamic energy consumption breakdown of read, write and IRS for WD (IRS-WD) with PARSEC benchmarks (see Section 6.1 for detailed experiment settings). IRS-WD consumes 23 percent on average and up to 26.6 percent additional dynamic energy in write-intensive workloads.
Adaptive Restore Scheme for Write Disturbance
The intuition behind MLC energy reduction is to avoid unnecessary restore operations. For example, in WD, immediately updating a block which will not be read soon is superfluous. In the Adaptive Restore Scheme for WD (ARS-WD) proposed in this section, by overwriting selected SBLs that are less likely to be read, a large portion of restorations can be reduced.
Recall from the write scheme of MLC in Section 2.2 that, without knowledge of the current value stored in a cache line, writing an HBL is a two-step transition. In contrast, ARS-WD selectively adopts a one-step write, i.e., a hard transition for HBLs, based on the read reuse feature of the corresponding SBLs. The flow chart of ARS-WD is shown in Fig. 8 and it works as follows:
Upon a write request to a_HBL, L2 first checks the 'V' bit of b_SBL. If it is invalid ('V' bit is '0'), this line will only serve write accesses, not read accesses, so overwriting it will not introduce any cache miss. The dirty bit of a_HBL is set when the write request is a dirty cache block write-back from L1. When the request is a new write allocation from the main memory, i.e., a cache fill caused by an L2 read miss, the dirty bit remains '0'.

Next, ARS-WD detects whether b_SBL has a replica in L1. If it does, overwriting it will not incur an extra cache miss. Recall that in the non-inclusive hierarchy, upon a read miss in L2, the missing block is retrieved from the main memory to all cache levels. However, the copy stored in L2 will not be accessed until its eviction from L1. Thus, immediately restoring b_SBL will not benefit cache read hits. Also, since L1 maintains the most updated copy of the cache block, there is no need to write back b_SBL at the time of overwrite even if it is dirty.

If b_SBL has no copy in L1, overwriting and invalidating this block may lead to a future read miss. Therefore, it is necessary to predict the read reuse behavior of b_SBL and estimate when this block will next be read (EDR). We then determine whether b_SBL will be involved in subsequent reads shortly by comparing EDR with a threshold RRD value, which is adopted from the analysis of a wide variety of memory traces. When EDR is shorter than the threshold read reuse distance (RRD_th), conventional IRS is applied to avoid a read miss. If b_SBL has no copy in L1 (i.e., no write-back from L1 is expected) and, according to the prediction, will not be read in the near future (i.e., no read fetch from L1 is expected), b_SBL can be overwritten during the write access to a_HBL without causing additional cache misses. Note that the dirty bit of b_SBL needs to be checked first. If it is '1', this block is written back to the main memory and is then overwritten.
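The decision flow just described can be sketched as a small function. This is a minimal illustration under stated assumptions: the `SoftLine` record, the action labels, the fallback to immediate restore when no confident prediction exists, and the treatment of EDR equal to the threshold are all choices made here for the sketch, not details given in the text.

```python
from dataclasses import dataclass
from typing import Optional

RRD_TH = 2 ** 4  # threshold RRD value adopted in Section 5.3


@dataclass
class SoftLine:
    """Hypothetical view of b_SBL as seen by the L2 controller."""
    valid: bool           # 'V' bit
    dirty: bool           # dirty bit
    in_l1: bool           # True if a replica currently lives in L1
    edr: Optional[int]    # estimated distance to next read, None if unknown


def ars_wd_action(sbl: SoftLine) -> str:
    """Return the action taken on b_SBL when a_HBL is written."""
    if not sbl.valid:
        return "overwrite"              # invalid line serves no reads
    if sbl.in_l1:
        return "overwrite"              # L1 holds the most updated copy
    if sbl.edr is None or sbl.edr <= RRD_TH:
        return "immediate_restore"      # likely to be read soon: use IRS
    if sbl.dirty:
        return "writeback_then_overwrite"  # flush to main memory first
    return "overwrite"


if __name__ == "__main__":
    print(ars_wd_action(SoftLine(valid=True, dirty=True, in_l1=False, edr=200)))
```

In every "overwrite" outcome the SBL is expected to be invalidated, so a later misprediction costs an extra fetch rather than returning stale data.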
Although a confidence counter is employed in the RRD predictor as described in Section 5.1, mispredictions can still occur occasionally. The worst case occurs when an SBL is overwritten without restoration and then read soon afterwards. In this scenario, an extra data fetch from the main memory is incurred, while data accuracy is not affected because the SBL is invalidated first.
To leverage the advantages of SBL, various data migration policies have been proposed [5], [20] which relocate frequently accessed cache blocks to SBLs. Although the data migration policy essentially alleviates the write pressure on HBLs, more than 20 percent of total writes are likely to be served by the unfavorable HBLs [20]. Moreover, due to the excessive writes to SBLs after migrating cache blocks, SBLs wear out much faster than HBLs [21]. Therefore, wear leveling techniques will be required to prolong the lifetime of MLC [2], [21]. In this scenario, HBLs will serve even more writes to balance the MLC wear-out. In summary, ARS-WD is orthogonal to the existing data migration techniques, and it becomes even more favorable when wear leveling is applied.
READ DISTURBANCE IN MLC
In this section, we first present the bit error rate caused by read disturbance, then the significant energy overhead of IRS in RD (IRS-RD) is demonstrated. Lastly, we propose ARS-RD, which is based on RRD, to alleviate the restoration overhead.
Motivation
In contrast to the switching current, the sensing current does not continue to scale down with the feature size [22]. Fig. 9 shows the comparison between read and write currents at different technology nodes. For large feature sizes, such as 130 nm, the read current is much smaller than the write current. Therefore, reading data out of the MTJ will not accidentally flip the stored state. Nevertheless, for smaller process technologies, it is difficult to shrink the read current amplitude, since conventional STT-RAM sense amplifiers are unable to guarantee data sensing accuracy with a current below 20 µA [23]. Thus, the read current remains relatively constant in the deep sub-micrometer regimes, and the margin between write and read current values diminishes significantly. For example, at the 32 nm node, the sensing current approaches the switching current so closely that some reads may accidentally write the cells being read [22]. Equation (3) below is adopted to model the read disturbance rate:
where P denotes the read disturbance rate; I is the read current value; t is the read pulse duration; τ is the inverse of the attempt frequency; Δ_0 denotes the magnetic memorizing energy without any impact from current or magnetic field; and I_0 denotes the critical switching current at 0 K. Previous works [24], [25] have utilized this model to estimate the Raw Bit Error Rate (RBER). At the 15 nm node and beyond, the BER with strong ECC, such as BCH, is still larger than the acceptable error rate for on-chip caches (1E-3) [26]. Therefore, similar to the solution for WD, data is immediately restored after every read operation.
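For intuition on how quickly the disturbance rate grows as the read current approaches the critical current, the sketch below evaluates the thermal-activation form commonly used for this purpose, P = 1 − exp(−(t/τ)·exp(−Δ_0·(1 − I/I_0))). That this is the exact form of Equation (3) is an assumption; it is chosen because it uses exactly the symbols defined above. All parameter values are illustrative, not taken from the paper.

```python
import math


def read_disturb_probability(i_read, i_crit, t_pulse, tau, delta0):
    """Per-read disturbance probability under a thermal-activation model.

    Assumed form: P = 1 - exp(-(t/tau) * exp(-delta0 * (1 - I/I0))).
    i_read and i_crit share a unit; t_pulse and tau share a unit.
    """
    rate = (t_pulse / tau) * math.exp(-delta0 * (1.0 - i_read / i_crit))
    return -math.expm1(-rate)  # numerically stable 1 - exp(-rate)


if __name__ == "__main__":
    # Illustrative parameters only.
    for i_read in (20e-6, 30e-6, 40e-6):
        p = read_disturb_probability(i_read, i_crit=50e-6,
                                     t_pulse=10e-9, tau=1e-9, delta0=40.0)
        print(f"I = {i_read * 1e6:.0f} uA -> P = {p:.3e}")
```

With the critical current fixed, the probability rises steeply with the read current, which is the shrinking-margin effect described in the Motivation paragraph.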
Energy Cost of Handling Read Disturbances
Assuming N_R read requests occur in total, the energy consumed by the IRS for RD strategy (E_IRS-RD) is given by Equation (4). Because reading a cache line also destroys the value stored in the SBL, the cache block is copied to the write buffer. The address is decoded and the block is restored upon every read, which incurs energy of E_PC and E_WSBL, respectively. Note that only the soft bit cells are susceptible to read disturbance, as the surface area of the hard bit cell is at least twice as large as the area of the soft bit cell [6], i.e., 1.4 times in terms of feature size. The sensing current is large enough to corrupt the SBL data regardless of soft or hard bit access.
Fig. 9. Read and write current scaling [22].
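Following the description above, each of the N_R reads pays one address decode and one SBL write, so Equation (4) plausibly takes the form below. This is a reconstruction from the surrounding definitions and should be read as an assumption rather than a quotation.

```latex
E_{\mathrm{IRS\text{-}RD}} = N_{R} \times \left( E_{PC} + E_{WSBL} \right)
```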
The dynamic energy consumption breakdown of read, write and IRS for RD (IRS-RD) with PARSEC benchmarks is depicted in Fig. 10. IRS consumes 52 percent additional dynamic energy on average and up to 147 percent in read-intensive workloads.
Adaptive Restore Scheme for Read Disturbance
In RD, however, it is unnecessary to immediately restore a block which will be modified and written back. Aiming to merge the restorations of disturbed blocks with dirty block write-backs, the Adaptive Restore Scheme for Read Disturbance (ARS-RD) is presented in this section.
ARS-RD is a two-fold scheme as shown in Fig. 11. Ideally, a cache line requires at most one restore operation after being read from L2. To this end, ARS-RD first fetches the block to L1 without immediately correcting it, then the victim block is selectively restored to L2 based on its modification status, origin and associated RRD.
Since only the SBL will be disturbed in the read operation, an 'S' bit is attached to each L1 cache line to mark blocks that were read from an SBL. As shown in Figs. 6 and 11a, when b_SBL is read, its replica in L1 should carry the modification status to ensure data integrity. To this end, we attach one more flag bit 'C' to every cache line in L1 to indicate whether or not a block was unmodified when loaded from L2 [24]. Note that the dirty bit in L1 is not inherited directly from that in L2, because doing so can increase the number of write-backs to L2. In particular, when a modified block is fetched to L1 for read only, the inherited dirty bit would force it to be written back to L2 at the time of eviction. Lastly, L2 invalidates the disturbed block after every read without writing it back immediately, and then the block is copied to L1.
On the other hand, when a_HBL is read, the 'S' bit is set to '0'. Next, the block is brought to L1 directly. IRS or even ARS-WD can be applied to address the disturbance of its corresponding block b_SBL.
Upon the eviction of a block in L1, as described in Fig. 11b, its dirty bit is first checked for a write-back decision. If the block has been modified, it will be restored to L2. If the block is clean, however, it cannot simply be discarded as specified in the conventional write-back policy. This is because an unmodified block in L1 originates from either a cache fill or a read fetch on L2 (Fig. 5). In the first case, the clean block can be abandoned, while in the second case, the block with the 'S' bit set was fetched with the disturbance and invalidation of its copy in L2. Discarding it will lead to a data integrity issue. For these reasons, L1 checks the 'S' bit of the clean victim block. If it is not set, this block was loaded either from an HBL read or from a cache fill, and the victim can be dropped directly as there is no block disturbance involved. Otherwise, it is necessary to consider whether or not b_SBL would still exist in L2's replacement policy stack if it had not been invalidated initially. Next, we compare the RRD of the victim (RRD_victim) with that of the first block in L2's replacement policy stack (RRD_LRU). It has been shown that read misses are more performance critical than write misses, as read misses will delay the processor [27]. Therefore, we protect the block which is likely to serve read requests (RRD_victim is smaller than RRD_LRU) by restoring it to L2. Otherwise, L1 tests the 'C' bit for the next decision. If the victim has never been written ('D' is '0' and 'C' is '1'), its read reuse distance approximates its reuse distance [9]. Since the victim block is less likely to be accessed than the first one in the replacement policy stack at this stage, it will be discarded directly. If the 'C' bit is not set, the evicted block is written back to the main memory to ensure data integrity.
ARS-RD differs from the previous Direct Restore (DR) scheme [24] in two major aspects. First, after a block in L2 is read and disturbed, it is invalidated directly in ARS-RD instead of remaining in L2. This effectively improves the cache utilization. Second, after a victim is selected for eviction in L1, the scheme in [24] needs to check whether the victim exists in L2. However, considering the large capacity of L2, this consumes significant resources. Based on RRD, ARS-RD trades cache capacity for performance. Additionally, unlike DR, which can only solve the RD issue, ARS-RD is coupled with ARS-WD to provide a holistic solution to both RD and WD issues in MLC. Due to these trade-offs and in pursuit of the stated goals of this paper, the evaluation of DR has not been emphasized in the experimental results.
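The L1 eviction flow of ARS-RD (Fig. 11b) can be condensed into a sketch like the one below. The function name, argument names, and action labels are hypothetical and introduced only for illustration; the check on L2's replacement policy stack is reduced here to the single RRD comparison, which is a simplification of the flow described in the text.

```python
def ars_rd_eviction_action(dirty, s_bit, c_bit, rrd_victim, rrd_lru):
    """Decide what happens to an L1 victim under ARS-RD.

    dirty:      L1 dirty bit ('D')
    s_bit:      True if the block was read from an SBL (copy in L2 invalidated)
    c_bit:      True if the block was unmodified in L2 when loaded
    rrd_victim: predicted RRD of the victim
    rrd_lru:    predicted RRD of the first block in L2's replacement stack
    """
    if dirty:
        return "restore_to_l2"      # ordinary write-back doubles as the restore
    if not s_bit:
        return "drop"               # came from an HBL read or a cache fill
    if rrd_victim < rrd_lru:
        return "restore_to_l2"      # likely to be read again soon: protect it
    if c_bit:
        return "drop"               # main memory still holds a valid copy
    return "write_to_memory"        # only valid copy: must not be lost


if __name__ == "__main__":
    print(ars_rd_eviction_action(dirty=False, s_bit=True, c_bit=False,
                                 rrd_victim=64, rrd_lru=16))
```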
READ REUSE DISTANCE PREDICTION
In this section, we propose the concept of RRD and the associated PC-based predictor to enable ARS. Then the adoption of RRD threshold value and RRD predictor sampling period are presented based on comprehensive experiments. The details of experimental settings can be found in Section 6.1.
RRD Predictor Implementation
In contrast to the reuse distance measurement considering both reads and writes [28], RRD is the interval between two successive reads to the same block. In other words, RRD is a notion that quantifies data block read reuse frequency. Here, we use the number of intervening L2 accesses to represent it. EDR is defined as the interval from the time an HBL is written to the next time its corresponding SBL is read. We illustrate the concepts of RRD and EDR in Fig. 12. Here the timeline describes the memory access stream, 'B' is a data block saved in an HBL being written at Current time, while 'C' is the block stored in the corresponding SBL of 'B'. EDR is calculated using the last read access timestamp stored in the cache line, the current time, and the predicted RRD as below:
where Timestamp and Current time are from the L2 access counter, and RRD is from the RRD predictor. It has been observed that a specific instruction often executes a highly unique task (e.g., memory read access) and rarely changes this behavior. As instructions are uniquely associated with their Program Counters (PCs), which give the instruction addresses in memory, PCs provide a very effective way of recording program context and predicting program behavior. Previous work [9], [10] proposed a PC-based reuse distance predictor to optimize the cache replacement policy, by leveraging the fact that memory accesses can be grouped based on the instructions that caused them. In light of this design, we exploit the instructions of cache accesses to predict block read reuse behavior.
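Given the three inputs named above, the natural reading is EDR = Timestamp + RRD − Current time: the next read is expected RRD accesses after the last one, and EDR is what remains of that interval. The sketch below computes this under that assumption. The wrap-around handling for a 10-bit access counter and the clamping of overdue predictions to zero are additional assumptions made here.

```python
COUNTER_BITS = 10  # assumed width of the L2 access counter / line timestamp


def estimated_distance_to_read(timestamp, predicted_rrd, current_time,
                               counter_bits=COUNTER_BITS):
    """EDR = Timestamp + RRD - Current time, in units of L2 accesses.

    The elapsed time is taken modulo 2**counter_bits so that a counter
    wrap between the last read and now is tolerated (assumption).
    """
    elapsed = (current_time - timestamp) % (1 << counter_bits)
    return max(0, predicted_rrd - elapsed)  # overdue reads clamp to 0


if __name__ == "__main__":
    print(estimated_distance_to_read(timestamp=100, predicted_rrd=40,
                                     current_time=120))
```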
Similar to prior work [9], [29], there are two parts in our RRD predictor, namely, the read sampler and the RRD prediction table. The read sampler calculates the RRD corresponding to a given PC. The sampling FIFO buffer is scanned for a matching PC on each read access. Simultaneously, the read sampler samples the access stream and stores the samples into the sampling buffer. To calculate RRD, the read sampler simply multiplies the sample period by the relative tail pointer position of the matching entry. For example, in Fig. 13, the sampling period is set to 2, therefore the PCs associated with RdA and RdC are sampled into the buffer from the access stream. WrtB is stored as a stall, i.e., a void state, regardless of its PC. Then the WrtC request comes into the cache, followed by RdC. WrtC is skipped in the PC matching process. When triggered by the next read access RdC, the sampling buffer is scanned for a match with the PC associated with block C (PC2). Since PC2 exists in the buffer already, a match is found and the RRD paired to this PC is computed as 4. The second part of our RRD predictor, the RRD prediction table, is similar to the one in [9]; it is indexed by hashed PC and holds the correlated RRD and a Confidence Counter (CC). Upon every incoming pair of PC and RRD from the read sampler, the CC of the corresponding pair in the RRD prediction table is updated. Only when the CC reaches a certain threshold can the RRD be taken for prediction.
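A compact software model of the two predictor parts may help make the mechanism concrete. It is a sketch under several assumptions that the text does not fix: every `period`-th access is sampled, the scan happens before the current access is sampled, matched entries stay in the buffer, the hash is a plain modulo, and the confidence counter is incremented on a repeated RRD and decremented otherwise. Class and method names are hypothetical.

```python
from collections import deque


class ReadSampler:
    """FIFO of sampled PCs; writes are stored as stalls (None)."""

    def __init__(self, period, depth):
        self.period = period
        self.buf = deque(maxlen=depth)  # left = head (oldest), right = tail
        self.count = 0

    def access(self, pc, is_read):
        """Process one L2 access; return an RRD on a PC match, else None."""
        rrd = None
        if is_read:  # only reads trigger PC matching
            for pos, entry in enumerate(reversed(self.buf), start=1):
                if entry == pc:
                    rrd = pos * self.period  # period x position from tail
                    break
        self.count += 1
        if self.count % self.period == 0:
            self.buf.append(pc if is_read else None)  # stall for writes
        return rrd


class RRDTable:
    """Prediction table indexed by hashed PC, holding [rrd, confidence]."""

    def __init__(self, entries=512, cc_max=3, cc_threshold=2):
        self.entries, self.cc_max, self.cc_threshold = entries, cc_max, cc_threshold
        self.table = {}

    def update(self, pc, rrd):
        idx = pc % self.entries  # stand-in for the real hash
        entry = self.table.get(idx)
        if entry is None:
            self.table[idx] = [rrd, 0]
        elif entry[0] == rrd:
            entry[1] = min(entry[1] + 1, self.cc_max)
        elif entry[1] > 0:
            entry[1] -= 1
        else:
            entry[0] = rrd  # confidence exhausted: adopt the new RRD

    def predict(self, pc):
        entry = self.table.get(pc % self.entries)
        if entry is not None and entry[1] >= self.cc_threshold:
            return entry[0]
        return None  # not confident enough to predict


if __name__ == "__main__":
    sampler = ReadSampler(period=2, depth=8)
    stream = [(1, True), (2, True), (9, False), (8, False), (2, True)]
    print([sampler.access(pc, is_read) for pc, is_read in stream])
```

In this toy stream the last read matches the entry two positions from the tail, so with a sampling period of 2 the sampler reports an RRD of 4.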
Novelty and Overhead of RRD Predictor
Our RRD predictor differs from previous designs in the following aspects. First, at the sampling stage, if a write request is sampled, a stall is stored in the FIFO buffer as a placeholder without its PC. This is because we only care about read operations in the prediction. In [9], by contrast, the PC associated with a sampled write request is stored into the FIFO buffer for matching. Second, at the PC matching stage, only an incoming read request can trigger PC matching. This is because the objective of ARS is to estimate the read reuse distance instead of the general reuse distance in [9]. Lastly, in [9], a write-back access filter is needed, which significantly increases the design overhead. This is because evictions from L1 appear as write accesses at L2. However, these write-backs are not associated with any instructions. Failure to consider this will substantially degrade prediction accuracy. The read sampler explicitly avoids irrelevant PCs brought by evictions without the need for a write-back filter. Therefore, our predictor is more lightweight, but delivers even better performance over a range of workloads.
Regarding the memory overhead incurred by the RRD predictor, a 512-entry RRD prediction table requires 512 entries × 39 bits per entry (32-bit PC + 5-bit bucket + 2-bit confidence counter) = 2.4 KB, which is only 0.06 percent of a 4 MB cache. In addition, the 10-bit timestamp in each cache line consumes 10 bits per 64 B, i.e., 1.95 percent of the cache line size. According to the circuit-level simulation (see Section 6.1 for the setting), the RRD predictor contributes less than 2 percent of the energy consumption of a cache line read, which is consistent with results from [17].
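The storage figures above can be checked with a few lines of arithmetic. The sketch below only restates the numbers given in the text; the variable names are illustrative.

```python
# Prediction table: 512 entries, each 32-bit PC + 5-bit bucket + 2-bit counter
ENTRIES = 512
BITS_PER_ENTRY = 32 + 5 + 2                           # 39 bits

table_bytes = ENTRIES * BITS_PER_ENTRY / 8            # total table size in bytes
table_kb = table_bytes / 1024                         # in KB
cache_bytes = 4 * 1024 * 1024                         # 4 MB L2 cache
table_share_pct = table_bytes / cache_bytes * 100     # share of cache capacity

# Per-line timestamp: 10 bits on top of a 64-byte (512-bit) cache line
timestamp_share_pct = 10 / (64 * 8) * 100

print(f"table: {table_kb:.1f} KB, {table_share_pct:.2f}% of the cache")
print(f"timestamp: {timestamp_share_pct:.2f}% of a cache line")
```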
RRD Threshold Value
To determine whether restoration can be postponed, the EDR of an SBL is compared with a threshold RRD value, which we derive from an analysis of various memory access traces. Note that the RRD value is expressed as a power of two, as this helps define the size of the RRD access counter. Fig. 14 shows the percentage of cache blocks associated with each range of RRD. For example, cache blocks whose RRD is greater than 2^4 account for 62 percent of all blocks on average. In other words, more than half (62 percent) of the cache blocks have a distant RRD and are eligible for overwrite if 2^4 is selected as the threshold. Moreover, the reads associated with these blocks account for only 13 percent of total read accesses, as shown in Fig. 15, which means that overwriting these blocks is unlikely to incur many read misses. Our goal is to find an RRD range in which the associated cache block percentage is maximized, so that more restorations can be skipped, while the number of read operations on these blocks is insignificant, i.e., the increase in miss penalty is tolerable. Hence, we adopted 2^4 as the threshold RRD. Blocks with an EDR greater than the threshold, i.e., non-read-intensive blocks, are selected to skip restoration.
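The selection criterion described above (maximize the share of eligible blocks while keeping the share of reads that hit them small) can be sketched as follows. This is an illustration only: the function names and the read budget are hypothetical, and every profile entry except the 2^4 point (62 percent of blocks, 13 percent of reads) is a made-up placeholder rather than data from Fig. 14 or Fig. 15.

```python
# exponent -> (share of blocks with RRD > 2**exponent, share of reads to them)
# Only the entry for exponent 4 comes from the text; the rest are placeholders.
PROFILE = {
    2: (0.80, 0.35),
    3: (0.71, 0.22),
    4: (0.62, 0.13),
    5: (0.48, 0.09),
    6: (0.35, 0.05),
}


def pick_threshold_exp(profile, read_budget):
    """Pick the exponent covering the most blocks within the read budget."""
    candidates = [exp for exp, (_, reads) in profile.items() if reads <= read_budget]
    if not candidates:
        return None
    return max(candidates, key=lambda exp: profile[exp][0])


def can_skip_restore(edr, threshold=2 ** 4):
    """A block skips restoration when its EDR exceeds the threshold RRD."""
    return edr > threshold


print(pick_threshold_exp(PROFILE, read_budget=0.15))
```

Under this placeholder profile and a 15 percent read budget, the exponents 4, 5 and 6 qualify, and 4 covers the largest share of blocks.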
Read Sampler Sampling Period
To reduce the storage requirement of the RRD prediction table, a reasonable sampling period needs to be adopted. We conducted an experiment to explore the impact of the sampling period on the hit ratio of the L2 cache, as shown in Fig. 16. When the sampling period increases from 2^3 to 2^7, the average hit ratio barely changes across the PARSEC benchmarks. From 2^7 to 2^10, however, it starts to fluctuate and decreases significantly at 2^10, because a large sampling period cannot capture the locality in the access stream. Since a small sampling period introduces intensive PC comparisons and table updates, which in turn impair predictor performance, we chose a sampling period of 2^7 in our final evaluation. The size of the sampling buffer can be determined via the equation:
The maximum RRDs we observed in memory traces are generally under 2^10, and the sampling period is set to 2^7. Thus we selected 2^3 as the FIFO size.
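The quoted values are consistent with sizing the buffer as the largest RRD to be tracked divided by the sampling period. Assuming that form, and using the illustrative symbols N_FIFO, RRD_max and T_s (none of which are taken from the original notation):

```latex
N_{\mathrm{FIFO}} = \frac{\mathrm{RRD}_{\max}}{T_s} = \frac{2^{10}}{2^{7}} = 2^{3}
```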
EVALUATION
In this section, we first introduce our simulation platform setup. Then we evaluate MLC STT-RAM with ARS in terms of energy, latency, EAT product and IPC.
Experiment Setup
The evaluation was conducted using the cycle-accurate simulator MARSSx86 [30]. We modified its cache controller module to implement the proposed schemes. The simulated architecture is shown in Table 2. Eleven benchmarks from the PARSEC 2.1 suite [11] were used, each executing 500 million instructions starting at the Region Of Interest (ROI) after warming up the cache with 5 million instructions. We counted read and write accesses on odd (HBL) and even (SBL) cache lines using the simsmall and simlarge input sets, respectively. Table 1 shows that the cache line access distribution is similar for the two input sets, so either choice leads to similar experimental conclusions. We selected the simsmall input sets for all benchmarks. We adopted the serial MLC STT-RAM cell design from [6] and scaled it to the 32 nm technology node [31]. The small MTJ pillar is configured as a 32 nm × 64 nm ellipse. We used NVSim and CACTI [32], [33] to obtain the key design parameters shown in Table 3. The MLC STT-RAM model is integrated into NVSim by modifying configuration parameters such as set/reset currents and nMOS transistor sizes. The energy contributions from peripheral circuits are also included.
Restore Overhead Reduction
The percentages of dynamic energy consumed by IRS and ARS are shown in Fig. 17. As presented in Section 3.3 and Section 4.3, IRS-WD and IRS-RD dissipate 23.1 and 51.9 percent of dynamic energy, respectively. After applying ARS-WD, 54.6 percent of the WD restore operations are avoided on average, decreasing the WD restore energy overhead to 10.6 percent; in benchmarks such as streamcluster, where a large portion of read intervals are distant, ARS-WD can avoid up to 62.7 percent of restorations. ARS-RD, on the other hand, eliminates 36.9 percent of RD restorations and consumes only 35.1 percent of dynamic energy on average. In write-intensive workloads such as vips, a significant share of restorations (59.0 percent) is merged with the dirty block write-back.
Energy Area Latency (EAT) Product Comparison
A few other technologies can be employed as L2, for instance, SLC STT-RAM and conventional SRAM. SLC almost doubles the cell area of MLC at the same capacity, whereas MLC suffers from asymmetric read and write performance and needs restoration. SRAM, on the other hand, consumes significantly more leakage power and area than MTJ-based technologies. We use the Energy Area Latency (EAT) product as the metric, not only to quantify the energy efficiency improvement that ARS provides, but also to evaluate the preferred L2 candidates. In the following evaluations, the baseline is the MLC with conventional IRS by default, in which no SBL restoration can be omitted. Similar to the soft bit in MLC, SLC will suffer from RD after scaling down to 32 nm and beyond. We consider four L2 candidates in our comparison: SLC with ARS-RD, MLC with IRS, MLC with ARS, and SRAM.
Table 1 (partial): cache line access distribution for the two input sets.
…          …    21%  33%  22%  24%  22%  33%  21%
swaptions  24%  22%  28%  26%  24%  22%  28%  26%
facesim    22%  20%  38%  20%  22%  19%  38%  21%
dedup      23%  19%  37%  21%  21%  18%  39%  22%
vips       20%  17%  35%  28%  21%  19%  38%  22%
Energy Comparison
Fig. 18 compares the dynamic energy consumption of the four L2 candidates. SRAM consumes the least dynamic energy, as the switching process of CMOS transistors in SRAM is symmetrical and much easier than changing the resistance state of an MTJ in STT-RAM. Thus, writes in SRAM are not as power hungry as they are in STT-RAM. With ARS, the dynamic energy of MLC is reduced by 17.3 percent on average. To calculate the static energy of the four candidates, we used the execution time of each benchmark and the unit leakage power. For SRAM, the energy consumed by its large cell structure is exacerbated by subthreshold and gate leakage. Similarly, to implement a cache of the same capacity, SLC doubles the number of transistors and thus has higher leakage power than MLC. Fig. 20 compares the average dynamic and static energy consumption of the four L2 candidates over the PARSEC benchmarks. The leakage energy of SRAM exceeds its dynamic energy consumption by more than an order of magnitude, while MLC and SLC dissipate significantly more dynamic energy than static energy.
The overall energy consumption equals the sum of dynamic and leakage energy as shown in Fig. 19 . SRAM consumes considerably more leakage energy than other candidates, making the overall energy consumption enormous.
ARS reduces the total energy of MLC by 17.9 percent on average. In write-intensive benchmarks such as vips, ARS saves up to 21.4 percent of energy. Furthermore, MLC with ARS consumes less energy than SLC with ARS-RD in read-intensive benchmarks, e.g., blackscholes and streamcluster.
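The energy accounting used in this comparison (static energy from leakage power and execution time, total energy as the sum of dynamic and static parts) can be sketched as below. The class, field names and the numbers in the usage lines are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from the evaluation.

```python
from dataclasses import dataclass


@dataclass
class CacheEnergy:
    dynamic_j: float    # dynamic energy over the run, in joules
    leakage_w: float    # leakage power of the cache, in watts
    exec_time_s: float  # benchmark execution time, in seconds

    @property
    def static_j(self):
        # static energy = leakage power x execution time
        return self.leakage_w * self.exec_time_s

    @property
    def total_j(self):
        # overall energy = dynamic + static (leakage) energy
        return self.dynamic_j + self.static_j


def saving_percent(baseline_j, improved_j):
    """Relative energy reduction of a scheme against a baseline."""
    return (baseline_j - improved_j) / baseline_j * 100


# Hypothetical numbers for illustration only
baseline = CacheEnergy(dynamic_j=2.0, leakage_w=0.5, exec_time_s=4.0)
improved = CacheEnergy(dynamic_j=1.0, leakage_w=0.5, exec_time_s=4.0)
print(saving_percent(baseline.total_j, improved.total_j))
```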
Latency Comparison
We used the read and write latencies together with the numbers of reads and writes to calculate the overall cache access latency. By removing unnecessary restorations, ARS decreases MLC access latency by 18.8 percent on average and by up to 23.8 percent, as shown in Fig. 21. SRAM is the fastest of the four candidates due to its symmetrical and rapid access. Note that the overhead incurred by the peripheral circuits is also included. The latency gap between SLC and IRS-based MLC widens in write-intensive workloads due to the high write latency associated with HBLs. For instance, the access delay of SLC is 43.4 percent less than that of IRS-based MLC in vips. However, the gap narrows quickly in read-intensive workloads, e.g., to 25.4 percent in streamcluster.
EAT Product
The EAT product is defined as the numerical product of Energy, Area, and Latency. Fig. 22 shows that MLC with ARS has the preferable EAT in most cases. Compared to SLC, which intuitively tends to be considered the preferred L2 candidate, MLC with ARS decreases EAT by 2 percent on average. In a read-intensive workload such as streamcluster, EAT is reduced by 35.5 percent. SLC doubles the area compared to MLC regardless of workload, because an MLC stores two bits in the two MTJs of a single memory cell. Meanwhile, for read-intensive workloads, the read latency of SLC is very close to that of the hard bit in MLC. EAT is therefore dominated by area in this case, which results in a significant reduction in EAT product for MLC. ARS also substantially reduces the EAT of the baseline MLC, by 32.9 percent.
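The metric can be sketched in a few lines. The latency model below (access counts weighted by per-access latencies) follows the description in the latency comparison; the function names and all numbers in the usage lines are hypothetical.

```python
def access_latency(reads, writes, read_latency, write_latency):
    """Overall cache access latency from access counts and per-access latencies."""
    return reads * read_latency + writes * write_latency


def eat_product(energy, area, latency):
    """Energy x Area x Latency; lower is better."""
    return energy * area * latency


def normalized_eat(candidate, baseline):
    """EAT of a candidate relative to the baseline (1.0 means equal)."""
    return candidate / baseline


# Hypothetical inputs for illustration only
latency = access_latency(reads=100, writes=50, read_latency=2.0, write_latency=10.0)
baseline_eat = eat_product(energy=2.0, area=3.0, latency=latency)
print(normalized_eat(baseline_eat / 2, baseline_eat))
```

Because the three factors are multiplied, halving any one of them (for example, the area when moving from SLC to MLC at equal capacity) halves the product when the other two are unchanged.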
IPC Comparison
In contrast to the access latency, IPC indicates the overall system performance. Although ARS-RD discards a large number of unnecessary RD restorations, dynamically computing the RRD and comparing it takes extra clock cycles. Nevertheless, MLC with ARS improves IPC by 17.6 percent over SLC (IRS). This is due not only to the larger cache capacity, but also to the reduction in WD and RD restorations. Compared to the baseline, ARS also improves IPC by 9.4 percent.
RELATED WORK

Exploration of MLC STT-RAM
Since MLC STT-RAM was proposed at the circuit level [6], [34], it has attracted tremendous interest as a high-density, low-power and non-volatile memory. Previous research on MLC STT-RAM on-chip caches mainly focused on leveraging the performance disparity between the two MTJs. For instance, Jiang et al. [35] proposed to promote frequently written data into write-fast-read-slow soft bit lines, and frequently read data into read-fast-write-slow hard bit lines, within parallel MLC structured STT-RAM. Luo et al. [21] proposed to minimize the two-step transition by comparing with the original code and applying the optimal encoding to the data to be written. Whereas the foregoing works utilize the parallel MLC model, several others adopt the serial MLC. Bi et al. [5] advocated a bit mapping strategy that constructs general fast- and slow-lines, which is used in this paper. Wang et al. [14] proposed to dynamically disable the hard bits in a cache line while keeping the associativity. Chi et al. [36] presented an efficient local checkpointing method that stores working data in soft bits and checkpoint data in hard bits.
Reliability Concern of STT-RAM
Meanwhile, the reliability of STT-RAM is attracting significant research interest. Recently, Wang et al. [24] proposed a selective restore method to mitigate the overhead of read disturbance correction in SLC STT-RAM. They also added three flag bits to each cache line to assist their scheme. Our work shares some similarities with theirs, but it also addresses the write disturbance issue in MLC. Jiang et al. [25] also approached the read disturbance issue by adaptively selecting between High Current Restore Required (HCRR) reads and Low Current Long Latency (LCLL) reads. Our previous work [37] presented ARS-WD to correct write disturbance in MLC with low energy overhead; however, the read disturbance issue of the soft bit was not discussed. In [7], [38], a ternary coding technique is proposed to remove the most error-prone state and trade MLC capacity for reliability. Wen et al. [20] considered the nonuniform ECC design requirement for MLC and advocated a tri-region mapping strategy to reduce the write pressure on hard bit lines. Li et al. [39] proposed a compiler-assisted solution to relax the non-volatility of STT-RAM while minimizing the refresh operations.
Prediction of Cache Block Behavior
On the other hand, there has been considerable research on cache block access behavior and its prediction. For example, Ahn et al. [40] proposed a write intensity predictor that identifies write-intensive blocks and places them in the SRAM portion of a hybrid cache architecture. Khan et al. [27] presented a Read Reference Predictor (RRP) that distinguishes between cache blocks that will be read and those that will not. RRP protects the performance-critical read-reused cache blocks over the write-only ones. Some cache management policies also take the reuse distance of blocks into consideration to enhance cache performance [9], [10], [28]. However, none of these previous works predicts the read reuse characteristics of cache blocks based on PC, and thus they do not completely address the unique challenges brought by emerging on-chip cache technologies, such as MLC STT-RAM.
Cache Inclusion Property Trade-Offs
Trade-offs have been made between the inclusion property and various cache design constraints, such as capacity, coherence protocol complexity and write traffic density. Jaleel et al. [18] proposed Temporal Locality Aware (TLA) policies to achieve non-inclusive cache performance by mitigating the appearance of harmful inclusion victims. Various dead block bypass and placement policies have been presented to enhance the performance of non-inclusive cache [16] , [17] . Sim et al. [41] presented a design adaptively selecting between exclusive and non-inclusive configuration based on the workload characteristics. Recently, Cheng et al. [42] proposed a selective inclusion policy to reduce the write traffic, which considers asymmetric read/write properties of STT-RAM based LLC.
CONCLUSION AND FUTURE WORK
In this paper, we proposed ARS to mitigate the energy overhead caused by WD and RD restores in MLC STT-RAM. We first introduced the concept of RRD, which uses the number of memory accesses to quantify the distance in time between two successive reads to the same cache block. We also developed an RRD predictor consisting of a read sampler and an RRD prediction table to provide a trustworthy RRD for each cache block. Upon a write request to an HBL, ARS-WD compares the EDR of the corresponding SBL with a threshold value to determine whether restores can be postponed. We conducted a memory access analysis to adopt a threshold value with high confidence. Furthermore, to correct the soft bit read disturbance issue, ARS-RD delays the restoration until the eviction from the higher level cache, so that a significant number of restore operations can be merged with the dirty block write-back. Our experimental results show that ARS effectively reduces both dynamic and overall system energy dissipation, significantly decreases the EAT product, and improves IPC with negligible storage overhead.
We leveraged the non-inclusion property to enable ARS in this work. However, when an inclusive cache hierarchy is enforced, simply applying ARS violates the nature of inclusion. One way to address this issue is to introduce an additional 'void' state [16]. In particular, ARS can be applied to an inclusive cache by keeping the disturbed block as 'void' in the LRU chain. Since accesses to void blocks are handled as cache misses, it is possible to develop a replacement policy that considers the void state and minimizes the total miss penalty based on RRD. On the other hand, ARS can be readily applied to an exclusive cache hierarchy. Note that in ARS-WD, the step of checking for the duplicated block in the higher level cache can be skipped because of the exclusion property. In ARS-RD, it is unnecessary to restore the disturbed block after read operations, as a block can exist at only one level.
Xunchao Chen received the MS degree in electrical engineering from the University of Texas at Dallas. He is currently working toward the PhD degree in computer engineering at the University of Central Florida. His research interests include data management in NVM-based memory and storage systems, big data infrastructure, and distributed systems.
Navid Khoshavi received the MS degree in computer engineering from Amirkabir University of Technology, in 2012. He is currently working toward the PhD degree in computer engineering at the University of Central Florida. His research interests include minimizing the energy of data movement in emerging non-volatile memory hierarchies, fault-tolerant computer systems, low power design, and energy-efficient and high performance architecture design.
Ronald F. DeMara received the PhD degree in computer engineering from the University of Southern California, in 1992. Since 1993, he has been a full-time faculty member with the University of Central Florida, where he is a professor of electrical and computer engineering and joint faculty of computer science. His research interests include computer architecture with emphasis on resilience and energy-aware applications of emerging devices, and he has published approximately 200 articles on topics related to the field. He is currently an associate editor of the IEEE Transactions on Computers. He received the Joseph M. Biedenbach Outstanding Engineering Educator Award from IEEE in 2008. He is a senior member of the IEEE.
Wujie Wen received the BS and MS degrees in electronic engineering from Beijing Jiaotong University, Beijing, China, and Tsinghua University, Beijing, in 2006 and 2010, respectively, and the PhD degree in electrical and computer engineering from the University of Pittsburgh, in 2015. Before joining the ECE Department at Florida International University, he worked at Advanced Micro Devices (AMD), Inc. and Broadcom, Inc. in various engineering and intern positions. His current research interests span emerging memory, VLSI circuit/chip design and computer architecture, hardware acceleration (neuromorphic computing), and hardware security.
Jun Wang
Yiran Chen received the BS (Hons.) and MS (Hons.) degrees in electrical engineering from Tsinghua University, China, in 1998 and 2001, respectively, and the PhD degree in ECE from Purdue University, W. Lafayette, Indiana. He joined the Electrical and Computer Engineering Department of the University of Pittsburgh as an assistant professor in September 2010 and was promoted to associate professor in September 2014. His research interests include VLSI design/CAD for nano-scale silicon and non-silicon technologies, low-power circuit design and computer architecture, emerging memory technologies, and nano-scale reconfigurable computing systems and sensor systems. He is a member of the IEEE.
