STT-RAM is an emerging memory cell technology for constructing on-chip memories and caches. However, in advanced process technologies, STT-RAM cells are known to be vulnerable to read disturbance. To employ STT-RAM cells in on-chip caches for better energy- and cost-efficiency, appropriate techniques to prevent or avoid read disturbance are essential. In this paper, we propose a novel architectural technique that enables energy- and performance-efficient STT-RAM based L1 instruction caches for future process technologies. Our selective way access with a write line buffer accesses the cache sequentially, first the MRU way and then the non-MRU ways, reducing the energy overhead of restoring data after read operations. In addition, the write line buffer hides the latency of pending or on-going write operations in the L1 instruction cache, minimizing stalls in the processor pipeline. Our proposed techniques improve the performance per Watt of the STT-RAM based L1 instruction cache by 1.6X and 2.6X compared to the conventional SRAM-based cache (denoted as SRAM in this paper) and the STT-RAM based cache with naive data restoring (denoted as STTRAM_dr in this paper), respectively.
This is because of STT-RAM cells' higher energy- and area-efficiency over conventional on-chip memory cells (e.g., SRAM). However, a major impediment to employing STT-RAM cells in on-chip caches has been their inferior write performance and energy-efficiency compared to conventional SRAM cells. Thus, the main focus of previous studies has been to mitigate the adverse impact of write operations in STT-RAM cells deployed for on-chip caches [1, 2, 3, 4, 5, 6, 7, 8].
On the other hand, as process technology scales down (below 32 nm), a new problem of STT-RAM cells has arisen: read disturbance [9]. When a bit is read from an STT-RAM cell, the cell content can be destroyed (i.e., an unintended bit-flip occurs in the cell). In future process technologies, the read and write currents of STT-RAM cells will be close to each other (due to the asymmetry between the scalability of the read and write currents), which may cause an unintended write to the cell during a read operation. Read disturbance does not affect the reliability of the data read from the STT-RAM cells; it only affects the data stored in the cells after the read operation. Though a temporary data corruption (e.g., due to noise or a particle strike) may be recovered by employing parity bits or ECC, a bit-flip in an STT-RAM cell after a read operation may be hard to recover with parity or ECC due to non-negligible bit-flip rates. To fully guarantee correct operation, the data read from STT-RAM cells must be restored after every read operation. Since STT-RAM cells typically have high write energy and latency, merely applying a data restoring scheme (i.e., simply re-writing the previously stored data after a read [10], or the 'restore-after-read' method introduced in [9]) without any optimization may not be efficient for on-chip caches in terms of energy and performance. L2 caches or last-level caches (e.g., L3 caches) may tolerate the write energy and latency overhead since they are infrequently accessed. However, in the case of L1 caches, there could be a huge energy and performance overhead when employing the data restoring scheme without any further optimization (referred to as naive data restoring in this paper).
In this paper, we propose a novel technique for an energy- and performance-efficient STT-RAM based L1 instruction cache under read disturbance. We propose a selective way access technique with a small write line buffer to reduce the energy and performance overhead of data restoring. Our selective way access technique also re-writes accessed data after read operations. However, it reduces the energy overhead of data restoring by adopting a two-step cache access that prioritizes the MRU way. If the cache hit occurs in the MRU way, the energy overhead of data restoring is significantly reduced because only the MRU line in the cache set is restored. In addition, the write line buffer reduces performance overhead by hiding the latency of pending or on-going write operations to the L1 instruction cache. As far as we know, a technology-scalable STT-RAM based L1 instruction cache architecture has not yet been extensively studied in the research community.
The remaining sections of this paper are organized as follows. Section 2 introduces related work on STT-RAM based cache memory design. Section 3 describes our novel optimization technique for energy- and performance-efficient STT-RAM based L1 instruction caches. Section 4 presents our evaluation framework and results. Lastly, Section 5 concludes this paper.
Thanks to their superior energy-efficiency compared to SRAM, STT-RAM cells have been considered promising for on-chip memories. Hence, a large body of work has focused on replacing SRAM cells with STT-RAM cells in on-chip cache memories. Due to the low leakage power consumption of STT-RAM cells, large-scale last-level caches (L2 or L3 caches) have been an attractive candidate for STT-RAM deployment [2, 3, 5, 8]. In addition, there have been several studies that utilize STT-RAM cells for L1 cache memories [1, 4, 6, 7]. However, the studies introduced above do not consider the read disturbance of STT-RAM cells in future technologies.
A few studies have considered read disturbance in STT-RAM cells. In [10], a simple data restoring scheme under read disturbance was proposed. However, the energy and latency overhead of write operations in STT-RAM cells is significant, which would lead to high energy and performance overhead in STT-RAM based on-chip caches under read disturbance. In [11], a dual-mode STT-RAM cell was proposed to minimize read disturbance, though the dual-mode cells were only applied to L2 caches. In [12], an X-mode cache architecture was proposed to improve read performance under read disturbance, though the X-mode cache was applied only to the L1 data cache and L2 cache. Wang et al. proposed a selective restoring scheme to alleviate read disturbance in STT-RAM based L2 caches [9]. In [13, 14, 15], circuit-level read disturbance detection and alleviation techniques were proposed.
To the best of our knowledge, there has been no work that employs STT-RAM in L1 instruction caches under read disturbance. Our proposed technique enables technology-scalable STT-RAM based L1 instruction caches with improved energy- and performance-efficiency.
3 A novel STT-RAM based L1 instruction cache architecture for future process technologies
Motivation
For STT-RAM cells in advanced process technologies (e.g., below 32 nm), it is crucial to restore the STT-RAM cell data after every read operation to guarantee correct operation. The simplest solution is to temporarily hold the accessed data (in the L1 instruction cache) after the read operation and write it back to the previously accessed cache line (we refer to this data restoring scheme without any optimization as the naive data restoring scheme). However, in the case of STT-RAM based cache memories, naive data restoring could incur a huge energy and performance overhead due to the energy and performance asymmetry between read and write operations. Particularly for L1 instruction caches, applying only the naive data restoring scheme may incur a huge energy overhead, as a write operation must follow every read operation. Moreover, the L1 instruction cache is accessed nearly every clock cycle unless there is a long pipeline stall such as a memory (or last-level cache) access, a TLB (translation lookaside buffer) miss, or a branch misprediction. In addition, the conventional L1 instruction cache accesses all ways in parallel along with the tag access, as shown in Fig. 1. This enables fast L1 instruction cache access, which translates into better performance. However, with naive data restoring under read disturbance, the parallel access incurs a severe energy overhead because the data read from all ways (four ways in Fig. 1) must be re-written to the data arrays.
In terms of performance, due to the high write latency of STT-RAM cells, naive data restoring may incur a high performance penalty when re-writing the accessed cache lines. Without an appropriate latency hiding mechanism, restoring data in a cache line can block subsequent cache accesses. In this case, performance would be severely hurt due to low instruction fetch bandwidth.
According to our evaluation (our framework is described in Section 4.1), naive data restoring causes 3.6X more energy consumption and a 62% performance loss compared to the case without data restoring. It completely cancels out the energy benefit of STT-RAM cells over SRAM cells (see Fig. 4). Given the significantly worse performance and energy-efficiency than the SRAM-based cache, there is no reason to employ STT-RAM cells for L1 instruction caches with naive data restoring under read disturbance. Thus, employing STT-RAM cells in L1 instruction caches under read disturbance requires appropriate architectural support.
Our novel techniques for technology-scalable STT-RAM based L1 instruction cache
To mitigate energy overhead of the naive data restoring scheme, we propose to use a selective way access scheme. The selective way access scheme enables better energy efficiency through restoring only one way among multiple ways (four ways in our cache configuration) in the case of MRU (most recently used) way hit. Fig. 2 shows a conceptual description of the selective way access scheme.
We design our technique based on the locality principle: MRU ways are likely to be accessed again in the near future. Our selective way access scheme refers to the MRU bits (stored in the MRU bitmap in Fig. 2) of each cache set. The MRU bitmap records which way in the cache set holds the most recently used (i.e., accessed) cache line. Our selective way access scheme then speculatively accesses only the MRU way among the four ways (in Fig. 2, the MRU way is way 0, though the MRU way differs across cache sets). If there is a cache hit in the MRU line, the rest of the ways are not accessed and only the data in the MRU line is restored. In other words, there is no need to restore data in the rest of the ways. By doing so, in the case of an MRU hit, our proposed scheme significantly reduces energy consumption when accessing the L1 instruction cache under read disturbance. If a cache miss occurs in the MRU way, the remaining ways (ways 1, 2, and 3 in Fig. 2) in the cache set are accessed. If a cache miss occurs again, a data request is sent to the L2 cache. If there is a cache hit in the remaining ways, there is a performance penalty because the hit is resolved only in the second step of the two-step access. However, according to our evaluation, there is a high probability of a cache hit in the MRU way (see Fig. 6). Thus, the adverse impact of the two-step access on performance is almost negligible.
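The two-step lookup described above can be sketched for a single cache set as follows. This is a minimal illustrative model, not the authors' implementation: the names `CacheSet`, `access`, and `naive_ways_restored` are hypothetical, the 3-cycle step latency is the STT-RAM read latency used in the evaluation, and the sketch assumes that every way that is read must afterwards be restored. Miss handling (the L2 request and block fill) is not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

READ_CYCLES = 3  # STT-RAM read latency per access step (value from the paper)


@dataclass
class CacheSet:
    """One 4-way set with an MRU pointer (hypothetical model for illustration)."""
    tags: List[Optional[int]] = field(default_factory=lambda: [None] * 4)
    mru_way: int = 0  # the MRU bitmap entry of this set

    def access(self, tag: int) -> Tuple[bool, int, int]:
        """Return (hit, cycles, ways_restored) for one selective way access."""
        # Step 1: read only the MRU way, so only that way needs restoring.
        if self.tags[self.mru_way] == tag:
            return True, READ_CYCLES, 1
        # Step 2: read the remaining ways. Assumption: all ways read so far
        # (the MRU way plus the remaining ones) must now be restored.
        ways_read = len(self.tags)
        for way, stored in enumerate(self.tags):
            if way != self.mru_way and stored == tag:
                self.mru_way = way  # the hit way becomes the new MRU way
                return True, 2 * READ_CYCLES, ways_read
        # Miss in every way: the request would go to the L2 cache (not modeled).
        return False, 2 * READ_CYCLES, ways_read


def naive_ways_restored(num_ways: int = 4) -> int:
    """Naive data restoring reads all ways in parallel, so all are restored."""
    return num_ways


s = CacheSet(tags=[0xA, 0xB, 0xC, 0xD], mru_way=0)
print(s.access(0xA))  # (True, 3, 1): MRU hit, one way restored
print(s.access(0xC))  # (True, 6, 4): non-MRU hit, way 2 becomes the MRU way
```

Under these assumptions, an MRU hit restores one way instead of the four ways restored by the naive scheme, which is where the energy saving comes from.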
Our second optimization is to employ a small write line buffer (the same size as a cache line: 64 Bytes) composed of conventional SRAM cells. The write line buffer temporarily stores the most recently accessed cache line of the L1 instruction cache. Employing the write line buffer mitigates the latency penalty of pending or on-going write operations. After every cache hit, the corresponding cache line (i.e., the hit cache line) is stored into the write line buffer. When there is a consecutive access to the hit cache line and the data is not yet completely restored or written to the STT-RAM cells of that cache line, the corresponding data can be served from the write line buffer instead of the L1 instruction cache. In this case, the instructions can be fed into the processor pipeline without any stall. When accessing the MRU way of the L1 instruction cache (i.e., the first-step access), the write line buffer is also accessed.
Fig. 3 summarizes an access latency comparison between the conventional cache access and the selective way access with the write line buffer. The conventional STT-RAM based instruction cache access takes 3 clock cycles: address routing, data array access, and data output (Fig. 3(a)). For data caches, a cache access can be either a read or a write operation because data can be modified by store instructions during program execution. For instruction caches, however, a cache access is always a read operation (3 cycles for STT-RAM, see Section 4.1) because instructions are not modified during program execution. The write operation (5 cycles for STT-RAM in our evaluation framework) in STT-RAM based L1 instruction caches only occurs in the case of a cache block fill (i.e., after a cache miss). The block fill operations are typically performed in the background after the requested cache block has already been delivered to the processor pipeline.
Thus, a typical access to an STT-RAM based L1 instruction cache takes 3 cycles. Our proposed technique speculatively accesses the MRU way first, taking 3 cycles. If the requested cache line does not reside in the MRU way, then we additionally access the non-MRU ways, taking an additional 3 cycles (thus, 3 + 3 = 6 cycles in total). Thus, in the case of an MRU hit and a non-MRU hit, our STT-RAM based instruction cache takes 3 cycles and 6 cycles, respectively. In the case of a cache miss, it takes additional cycles to deliver instructions to the processor pipeline because the instructions must be fetched from the L2/L3 caches or main memory. For additional storage, we need 32 Bytes (2 bits for each cache set) and 64 Bytes for the MRU bitmap and write line buffer, respectively. This corresponds to approximately 0.5% of the L1 instruction cache data array area. Thus, the additional area cost of deploying the selective way access with the write line buffer is almost negligible. Please note that this cost analysis is based on the cache configuration used in our evaluation framework (described in Section 4.1).
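The storage and latency figures above follow directly from the cache configuration given in Section 4.1 (32 KB, 64-Byte lines, 4 ways). The short calculation below reproduces them; the variable names are illustrative only. Note that the last ratio is a bit-count ratio, so it need not equal the roughly 0.5% area figure quoted in the text, which is an area ratio.

```python
CACHE_BYTES = 32 * 1024  # L1 instruction cache capacity
LINE_BYTES = 64          # cache line size
NUM_WAYS = 4             # set associativity
READ_CYCLES = 3          # STT-RAM read latency

num_sets = CACHE_BYTES // (LINE_BYTES * NUM_WAYS)         # 128 sets
mru_bits_per_set = (NUM_WAYS - 1).bit_length()            # 2 bits select 1 of 4 ways
mru_bitmap_bytes = num_sets * mru_bits_per_set // 8       # 32 Bytes
write_line_buffer_bytes = LINE_BYTES                      # 64 Bytes
extra_bytes = mru_bitmap_bytes + write_line_buffer_bytes  # 96 Bytes

mru_hit_cycles = READ_CYCLES          # 3 cycles: first step only
non_mru_hit_cycles = 2 * READ_CYCLES  # 6 cycles: first step + second step

# Extra storage relative to the data array, counted in bits (not area).
bit_overhead = extra_bytes / CACHE_BYTES

print(num_sets, mru_bitmap_bytes, extra_bytes)
print(mru_hit_cycles, non_mru_hit_cycles)
print(f"{bit_overhead:.2%}")
```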
Evaluation

Evaluation framework
For performance simulation, we use the M-SIM architectural simulator [16]. We modified the simulator to implement the pipeline structure shown in Fig. 3(b). The core micro-architectural parameters are tuned to model the ARM Cortex-A15 [17] as closely as possible. The L1 instruction cache has a 32 KB capacity with a 64 Byte line size and is 4-way set associative with an LRU (least recently used) replacement policy. We run 16 selected programs from the SPEC2006 benchmark suite. For accurate simulation, we fast-forward 2 billion instructions and then run 500 million instructions. The simulated microprocessor operates at 2 GHz.
For energy evaluation, we use STT-RAM based cache energy parameters from [12] . From cache access traces extracted from M-SIM, we calculated energy consumption for various cache configurations. Please note that our simulation framework also models the impact of refresh operations as in [12] . Energy parameters for additional logic (MRU bitmap and write line buffer) are extracted from CACTI 6.5 [18] . Table I summarizes the energy parameters used in our simulation. The energy parameters shown in 'MRU bitmap and linebuf' column are used only when the selective way access with the write line buffer is employed.
For quantitative comparison, we also evaluate the conventional L1 instruction cache with SRAM cells (denoted as SRAM) and ideal STT-RAM cells (denoted as STTRAM_nodr). We assume that the ideal STT-RAM cells do not have read disturbance and data does not need to be restored after the read operation. Thus, STTRAM_nodr is not implementable in the real-world and we show it just for comparison. In addition, we show the case where the naive data restoring scheme is employed under read disturbance (denoted as STTRAM_dr). The case of the data restoring with our selective way access and the write line buffer is denoted as 'STTRAM_selway w/ linebuf'. For SRAM configuration, the cache access takes 2 and 3 cycles for read and write operation, respectively. For STTRAM configurations, the read operation takes 3 cycles while the write operation takes 5 cycles [12]. As we already mentioned in Section 3.2, when using the STTRAM_selway w/ linebuf configuration, the cache access may take different cycles depending on whether the cache hit occurs in the MRU way or non-MRU way.
Evaluation results
Energy
Fig. 4 describes energy evaluation results across four different cases: SRAM, STTRAM_nodr, STTRAM_dr, and STTRAM_selway w/ linebuf. All of the results shown in Fig. 4 are normalized to the SRAM configuration. Without read disturbance (STTRAM_nodr), the STT-RAM based instruction cache configuration reduces L1 instruction cache energy by 54%, on average compared to the SRAM configuration. In contrast, when employing the naive data restoring in the STT-RAM based instruction cache under read disturbance (STTRAM_dr), there is a huge energy overhead consuming 67% more energy than the SRAM configuration. This is because a read operation in the case of STTRAM_dr consumes 7∼8X more energy than the case of STTRAM_nodr (see Table I). In the case of STTRAM_dr, one read operation includes both read and write operation for data restoring.
On the other hand, compared to the SRAM configuration, STTRAM_selway w/ linebuf reduces cache energy by 35%. Compared with STTRAM_dr, our scheme (STTRAM_selway w/ linebuf) reduces energy by 61%, on average. Though the STTRAM_selway w/ linebuf configuration consumes 41% more energy than STTRAM_nodr, STTRAM_nodr is not feasible for real implementation under read disturbance. In summary, our scheme enables the employment of STT-RAM cells in on-chip L1 instruction caches under read disturbance with significantly higher energy-efficiency than the SRAM-based cache or the STT-RAM based cache with naive data restoring.
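As a consistency check, the relative energy figures quoted in this subsection and in the Motivation can be rederived from the three SRAM-normalized averages (54% lower for STTRAM_nodr, 67% higher for STTRAM_dr, and 35% lower for STTRAM_selway w/ linebuf). The snippet below is only arithmetic on those rounded averages, with illustrative variable names, so small rounding differences from the per-benchmark data are possible.

```python
# Average L1 instruction cache energy, normalized to the SRAM configuration.
sram = 1.00
sttram_nodr = sram * (1 - 0.54)    # ideal STT-RAM, no data restoring
sttram_dr = sram * (1 + 0.67)      # naive data restoring
sttram_selway = sram * (1 - 0.35)  # selective way access with write line buffer

saving_vs_dr = 1 - sttram_selway / sttram_dr        # about 61% less than STTRAM_dr
overhead_vs_nodr = sttram_selway / sttram_nodr - 1  # about 41% more than STTRAM_nodr
dr_vs_nodr = sttram_dr / sttram_nodr                # about 3.6X, as in the Motivation

print(f"{saving_vs_dr:.0%} {overhead_vs_nodr:.0%} {dr_vs_nodr:.1f}X")
```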
In the case of libquantum and mcf, the SRAM configuration consumes more energy than the STTRAM_dr configuration. This is because libquantum and mcf show much longer execution times than the other programs, resulting in a large portion of leakage energy. Thus, the impact of the increased dynamic energy of STTRAM_dr is hidden by the huge leakage energy consumption.
Performance
Fig. 5 shows performance comparison results. As in the previous subsection, all results are normalized to the SRAM configuration. Assuming there exists read disturbance in STT-RAM cells (STTRAM_dr), the performance degradation is 64% compared to the SRAM configuration, on average. The main reason for this huge performance loss is the pending or on-going write operations caused by data restoring in the L1 instruction cache. However, by adopting the selective way access with the write line buffer, performance is comparable to the SRAM configuration (5% difference). Hiding the latency of pending or on-going write operations caused by data restoring translates into much higher performance for STTRAM_selway w/ linebuf than for STTRAM_dr. Compared with the ideal case of STT-RAM (STTRAM_nodr), the performance degradation of STTRAM_selway w/ linebuf is only 1%. In the case of STTRAM_selway w/ linebuf, there are two main reasons for achieving performance comparable to the SRAM configuration. Firstly, the MRU hit rate is high. Fig. 6 depicts the MRU hit rate across our benchmark programs. On average, 99.2% of L1 instruction cache accesses hit in the MRU way, while only 0.8% correspond to non-MRU way hits. This implies that the 3-cycle penalty of a non-MRU way hit is almost negligible. Secondly, the write line buffer (WLB) hit rate is also high in general programs. As shown in Fig. 7, the WLB hit rate is 89.0%, on average. The main reason for this is that instructions typically show high temporal and spatial locality during program execution.
The write line buffer helps to hide the long write latency of STT-RAM cells. Without the write line buffer, when there is a pending or on-going write operation to a certain cache block and the processor pipeline needs to access that block, the processor pipeline must be stalled. On the other hand, the write line buffer temporarily stores and serves the recently accessed cache block with a fixed latency (3 cycles). Thus, in the case of a write line buffer hit, the cache block can be delivered to the processor pipeline without any stall even when there is an on-going write operation to the cache block in the background. Since every read operation entails a write operation under read disturbance, employing the write line buffer contributes greatly to better performance by hiding the long write latency of STT-RAM cells.
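To see why the non-MRU penalty is small, the average hit latency can be estimated from the reported MRU hit rate (99.2% MRU hits, 0.8% non-MRU hits) together with the 3-cycle and 6-cycle hit latencies from Section 3. This is a back-of-the-envelope estimate with illustrative names; it treats the two rates as fractions of all hits and ignores cache misses and the effect of the write line buffer.

```python
MRU_HIT_RATE = 0.992      # average fraction of accesses that hit in the MRU way
NON_MRU_HIT_RATE = 0.008  # average fraction that hit in a non-MRU way
MRU_HIT_CYCLES = 3        # first-step access only
NON_MRU_HIT_CYCLES = 6    # first-step plus second-step access

avg_hit_cycles = (MRU_HIT_RATE * MRU_HIT_CYCLES
                  + NON_MRU_HIT_RATE * NON_MRU_HIT_CYCLES)
relative_increase = avg_hit_cycles / MRU_HIT_CYCLES - 1

print(f"{avg_hit_cycles:.3f} cycles, {relative_increase:.1%} above 3 cycles")
```

Under these assumptions the average hit latency is about 3.024 cycles, i.e., less than 1% above the 3-cycle latency of a conventional STT-RAM read.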
Performance per Watt
To examine the energy-performance trade-off, we present performance per Watt comparison results in this subsection. As shown in Fig. 8, STTRAM_selway w/ linebuf shows better performance per Watt than the SRAM and STTRAM_dr configurations. Since STTRAM_nodr cannot be implemented under read disturbance, we omit it from the discussion in this subsection. Compared to the SRAM configuration, STTRAM_selway w/ linebuf exhibits 55% better performance per Watt. Compared to STTRAM_dr, STTRAM_selway w/ linebuf shows 2.6X better performance per Watt. In summary, our STTRAM_selway w/ linebuf configuration shows the best energy-performance trade-off among the three feasible configurations (SRAM, STTRAM_dr, and STTRAM_selway w/ linebuf).
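One way to relate these numbers to the energy results is sketched below. It rests on an assumption that the text does not state explicitly: performance is taken as the inverse of execution time for a fixed instruction count, and power as cache energy divided by that same execution time. Under that assumption, performance per Watt reduces to the inverse of energy, so the ratios can be estimated from the normalized average energies alone. The variable names are illustrative.

```python
# Average cache energy normalized to SRAM (from the Energy subsection).
energy = {"SRAM": 1.00, "STTRAM_dr": 1.67, "STTRAM_selway": 0.65}

# Assumption: perf = 1 / time and power = energy / time,
# hence perf / power = (1 / time) / (energy / time) = 1 / energy.
perf_per_watt = {name: 1.0 / e for name, e in energy.items()}

gain_vs_sram = perf_per_watt["STTRAM_selway"] / perf_per_watt["SRAM"]
gain_vs_dr = perf_per_watt["STTRAM_selway"] / perf_per_watt["STTRAM_dr"]

print(f"{gain_vs_sram:.2f}X vs SRAM, {gain_vs_dr:.2f}X vs STTRAM_dr")
```

With the rounded averages this gives roughly 1.54X and 2.57X, close to the reported 55% and 2.6X; the small gap is plausibly due to rounding and per-benchmark averaging.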
Conclusions
As process technology scales below 32 nm, employing STT-RAM cells in on-chip caches becomes challenging due to read disturbance. In this paper, we present an energy- and performance-efficient technique for STT-RAM based L1 instruction caches under read disturbance. As revealed in our evaluation results, compared to the SRAM-based configuration and the STT-RAM based configuration with naive data restoring (STTRAM_dr), our proposed technique reduces L1 instruction cache energy consumption by 35% and 61%, respectively. In terms of performance per Watt, our proposed technique achieves 1.55X and 2.58X better performance per Watt compared to the SRAM and STTRAM_dr configurations, respectively. We believe that our novel architectural support enables an efficient employment of STT-RAM cells in on-chip L1 instruction caches in future process technologies.
