ABSTRACT The probability of timing failures in SRAM accesses becomes unacceptably high at low voltages, which makes the SRAM the bottleneck of system performance. The recently proposed timing speculation SRAM (SSRAM) can access bit cells much earlier than the conservative timing margin allows and can detect potential timing failures. Unfortunately, the performance benefit of this speculative scheme can be nullified by the extra cycles used to correct the errors. Therefore, in a cache architecture, cachelines containing bit cells with timing failures cause significant performance degradation in a low-voltage scenario. In this paper, we propose set-associative L1 and L2 Remapping and Reuse-aware timing Speculation caches (RRS caches), based on the cross-sensing SSRAM, to reduce the weak cacheline access penalties under low supply voltages. RRS caches increase the proportion of error-free cachelines (or strong cachelines) by clustering as many bit cells with timing failures as possible into weak cachelines. In addition, RRS caches use an optimized filling/replacement policy to swap frequently reused data into the error-free cachelines, which further reduces the average access latency and the energy consumption. We compare the performance, energy consumption, and area overhead of RRS caches with those of state-of-the-art approaches. According to our simulation results, RRS caches bring the system performance close to that of the Perfect caches in a four-core processor, with only 0.12% extra energy consumption and 4.12% extra area overhead in the L1 cache, and 2.69% extra energy consumption and 5.66% extra area overhead in the L2 cache, compared to cache designs with the raw SSRAM.
I. INTRODUCTION
With the development of the Internet of Things and other mobile devices, power consumption has become the main factor that limits performance improvement. For years, the industry has relied on scaling the supply voltage down to the near-threshold region to further reduce power consumption. Unfortunately, global process variations can weaken both P and N devices by increasing the threshold voltage, which results in potential timing failures during both SRAM reads and writes at low supply voltages [1]. In an SRAM read, discharging the bitlines (BL/BLB) with large capacitances through those weakened memory cells (or weak bit cells) becomes slower, making the small voltage difference between BL and BLB difficult for a sense amplifier to sense. To guarantee the target yield, more timing margin must be applied so that the weak bits can be read out correctly. As a result, conventional SRAM-based caches inevitably become the bottleneck of system performance in low-voltage scenarios.
The associate editor coordinating the review of this article and approving it for publication was Chao Wang.
VOLUME 7, 2019. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
To alleviate this performance degradation, researchers normally address the problem from two aspects. At the circuit level, using larger transistors or more transistors (e.g., assist circuitry) improves SRAM cell resilience [2]-[4]. However, the area overhead and static power increase at the same time. At the architecture level, fault-tolerant cache designs mainly rely on correcting defective bits through error correction codes (ECCs) [5], disabling faulty resources at different granularities [6], or adjusting the cache filling/replacement policy with or without circuit-level improvements [7], [8]. However, in all of these architectural approaches, a significant fraction of the cache capacity is disabled or sacrificed to store the additional information. In recent years, SSRAM (Timing Speculation SRAM) has been proposed, which can tolerate timing failures when operating at low supply voltages [9]-[11] with limited area cost. However, when SSRAM detects that a weak bit cell is accessed, it spends extra cycles to further discharge the bitline and obtain a sufficient voltage swing. This error correction mechanism increases the average read latency and dilutes the performance gain brought by timing speculation. Furthermore, the performance of SSRAM gets worse when it is used in a cache architecture. For example, when the access granularities of the L1 and L2 caches are set to 128 bits and 512 bits,¹ and the bit error rate (BER) is $10^{-3}$, the probability of encountering a weak cacheline (one in which there is at least one weak bit) is as high as $1 - (1 - \mathrm{BER})^{128} = 12.02\%$ in the L1 cache and $1 - (1 - \mathrm{BER})^{512} = 40.09\%$ in the L2 cache. Obviously, more weak cachelines mean more cycles spent on error correction. As shown in Fig. 1, four benchmarks from SPEC2006 [12] are simulated with gem5 [13] to collect their normalized CPIs (Cycles Per Instruction). The SSRAM used in this paper is introduced in detail in Section III.A, and the hardware configurations of the SSRAM-based caches can be found in Section III.B. Because of the extra cycles needed to correctly read the weak cachelines, SSRAM-based L1 and L2 caches cause an average 23.3% performance degradation compared to simulations without the weak cacheline reading penalties.
¹ In this paper, we assume that the cacheline size of both the L1 and L2 caches is 64 Bytes. However, the minimum reading granularity of the L1 cache is 16 Bytes (128 bits), considering the bus width between the CPU core and the L1 cache, while that of the L2 cache is 64 Bytes (512 bits), i.e., the size of a cacheline.
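The weak-cacheline probabilities quoted above follow from assuming that bit cells fail independently with probability BER. A minimal sketch of that arithmetic is given below; the function name `weak_line_probability` is a hypothetical helper introduced only for illustration.

```python
def weak_line_probability(ber: float, bits: int) -> float:
    """Probability that at least one of `bits` independent cells is weak."""
    return 1.0 - (1.0 - ber) ** bits


BER = 1e-3                                # bit error rate used in the text
p_l1 = weak_line_probability(BER, 128)    # L1 access granularity: 128 bits
p_l2 = weak_line_probability(BER, 512)    # L2 access granularity: 512 bits
print(f"L1: {p_l1:.2%}, L2: {p_l2:.2%}")
```

Under the independence assumption, the two values agree with the 12.02% and 40.09% given in the text.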
To reduce the error correction penalties of the timing speculation technique, this paper presents RRS caches, a low-voltage L1 and L2 cache hierarchy based on the cross-sensing SSRAM arrays proposed in [11]. As in a conventional cache, the cachelines of RRS caches are partitioned into several sub-blocks residing in several SSRAM arrays. A sub-block that contains any weak bit cell is called a weak sub-block, while an error-free sub-block is called a strong sub-block. With the proposed remapping scheme, RRS caches group the available strong sub-blocks to construct more strong cachelines. Furthermore, RRS caches use a reuse-aware filling/replacement policy to swap frequently reused data from the weak cachelines to the strong cachelines, which reduces the access frequency of the weak cachelines.
The contributions of this paper are as follows: (1) architectures of the L1 cache and the L2 cache based on the fault-tolerant cross-sensing SSRAM are proposed; (2) to reduce the extra read penalty in the cross-sensing SSRAM, a remapping scheme and a reuse-aware filling/replacement policy are proposed; (3) the system performance, area overhead, and energy consumption of RRS caches are comprehensively compared with those of other state-of-the-art approaches.
The rest of the paper is organized as follows: Section II presents related works, Section III introduces the mechanism of the cross-sensing SSRAM and the RRS cache architectures, Section IV details the experimental setup, Section V compares performance results and Section VI outlines our conclusion.
II. RELATED WORKS
Generally, studies that aim to improve cache performance under low supply voltages fall into two categories: the SRAM circuits and the micro-architecture of caches.
A. THE SRAM CIRCUITS
A cache comprises a group of SRAM arrays, which allows the die floor plan to be arranged efficiently and limits the lengths of the bitlines under the timing requirements. For the SRAM itself, many solutions enable robust operation at low voltages by increasing the cell size or adding various read-/write-assist transistors, such as 8T [2], 10T [3], and Schmitt trigger-based SRAM [4]. However, these solutions inevitably bring more static power consumption and area overhead.
In contrast, timing speculation techniques correct access timing faults by increasing the bitline discharging time. Khayatzadeh et al. [9] proposed Razor SSRAM, which determines the correctness of the data by comparing two successive read samples. However, the long interval between the speculative and the confirming reads of Razor SSRAM limits its applications. For example, a complex roll-back mechanism must be implemented in the processor pipeline to correct the erroneous data read from the Razor SSRAM, which can be extravagant in a low-power processor. Yang et al. [10] presented a double sensing mechanism with selective bitline voltage regulation (DS-SBVR) to improve the throughput of SSRAM, which can detect the timing error much earlier than Razor SSRAM. Shen et al. [11] presented a TS cache based on the cross-sensing SSRAM, which reduces the area and energy overhead by removing the capacitors required by DS-SBVR. However, [11] does not provide a detailed performance analysis or the corresponding optimizations from the micro-architectural perspective. In addition, [11] only focuses on the L1 cache.
B. THE CACHE ARCHITECTURE
Using ECC is one way to address the problem of cache stability. However, when a cache operates with a high bit error rate, such as in a low-voltage scenario, a stronger ECC with larger latency, area, and energy overhead has to be applied. For example, the Orthogonal Latin Square Code (OLSC) proposed in [5] sacrifices half of the cache capacity to store the ECC bits, which hurts system performance not only through its time-consuming encoding and decoding operations but also through the larger cache miss rate caused by the relatively smaller cache size. Yan C et al. [14] presented a technique that uses cache compression to harvest additional capacity for storing the ECC check bits. However, both decompression and decoding must be added to the critical path of reading data from the cache, which degrades the performance.
Disabling the faulty resources of the cache is another micro-architecture-level solution. For example, the block (cacheline) disabling scheme disables any cacheline that contains faulty bit cells [6]. Not surprisingly, excessive cache capacity can be wasted in these schemes. Siddique and Badawy [15] presented a cache architecture that uses spare sub-blocks as backup sub-blocks in a set-associative cache. However, the number of redundant sub-blocks in a cache set is fixed, which is not flexible enough for different scenarios. Chiu et al. [16] presented a scheme called line recycling that uses three disabled weak cachelines to recycle one usable cacheline. Although one third of the disabled weak cachelines can be recycled in their approach, over half of the disabled cachelines are wasted, not to mention the energy consumed by reading three cachelines for one recycled cacheline and the extra hardware overhead. Wang et al. [7] presented a Zero-Counting and Adaptive-Latency (ZCAL) cache based on the characteristic of 8T cells that a read failure can only happen when reading "0" at a low supply voltage. Khan et al. [8] presented a Mixed-Cell architecture that improves performance by using both robust (larger size) and non-robust (normal size) cells, and modifies the cache filling/replacement policy to ensure that dirty data is stored in the robust cache ways. However, both the 8T cells and the robust cells incur larger area overheads, and the robust cells consume more static energy. Ferreron et al. [17] presented an approach named Concertina, which divides both the cacheline and the data into several parts and stores the Null sub-data (data segments that contain only '0's) in the sub-blocks with weak bit cells. However, only a block with a smaller number of weak sub-blocks than the number of Null sub-data can be used to save the data. Consequently, the fewer Null sub-data the incoming data contains, the fewer candidates can be chosen when filling it into the cache. Hence, the miss rate of Concertina is high, which degrades the performance of the system as well.
This paper argues that performance, area, and energy overheads should be considered comprehensively in a low voltage cache. We propose a cache architecture that combines the circuit-level optimization of cross-sensing SSRAM with architecture techniques to improve the low-voltage performance with little overhead.
III. RRS CACHE ARCHITECTURE
A. CROSS-SENSING TIMING SPECULATION SRAM
Due to the large area and energy overhead of the shared capacitors in the DS-SBVR SRAM [10], the RRS caches proposed in this paper are based on the cross-sensing SSRAM (CS-SSRAM) [11], whose structure is shown in Fig. 2. Similar to the DS-SBVR SSRAM, each bit column in the CS-SSRAM includes a latch-style sense amplifier, an XOR logic, and a dynamic latch. All columns in the array share a dynamic OR logic. Unlike DS-SBVR, the CS-SSRAM does not need the area-hungry shared capacitors. Instead, a switch is implemented between the bitlines (BL/BLB) and the sense amplifier in each column. The switch, which consists of 4 PMOS transistors, is controlled by the switch (SWT) signal. During a normal bitline sensing, the bitlines are connected to the inputs, IN and INB, of the sense amplifier (SA). When the SWT signal sends a pulse to the switch, the switch swaps the connections between the bitlines and the SA. The error signals of all columns (Error[i]) are merged by the dynamic OR logic, which generates the Error Flag if any timing failure is detected in the accessed row. With the switches in the CS-SSRAM, two opposite samples can be sensed by the SAs. The correctness of the data can be judged by comparing the two samples taken before and after the connection is reversed.
Fig. 3 demonstrates the mechanism of the cross-sensing scheme. Assume the offset voltage of the SA is $V_{OFFSET}$. Samples in Group A (such as P1) have a sufficient BL/BLB swing ($V_{SWING1} > V_{OFFSET}$) and are correctly read as '1's. On the other hand, samples in Group B (such as P2) are wrongly read as '0's because of their small BL/BLB swings ($0 < V_{SWING1} < V_{OFFSET}$). After the result of the first sensing is latched, the SA inputs are switched by the SWT signal and the SA is triggered again. Thus, the second input voltage becomes negative ($V_{SWING2} = -V_{SWING1}$) and is sensed again, which yields the samples of Group A' and Group B' (i.e., P1' and P2' in Fig. 3) that are symmetric to those of Group A and Group B.² Since $V_{SWING2} < 0 < V_{OFFSET}$, the second sensing outcomes are all '0's. A timing error is identified if the two sensing results are identical (in this case, the samples in Group B and Group B' are wrongly read, e.g., P2 and P2'), which means that the CS-SSRAM has to extend the access by one or more cycles so that the voltage swing of BLB is enlarged by the continued discharging and the correct result can be obtained. For the samples in Group A and Group A', which have opposite sensing results, a reliable read is confirmed and the requested data can be sent out earlier than in the conventional approach.
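The detection rule of the cross-sensing scheme can be illustrated with a small behavioral sketch. It is not the circuit of [11]: the sense amplifier is idealized as a comparator with a positive offset, the stored value is assumed to produce a positive swing, and the names `sense`, `cross_sense`, and the offset value are hypothetical.

```python
def sense(v_in: float, v_offset: float) -> int:
    """Idealized SA: outputs 1 only if the input swing exceeds the offset."""
    return 1 if v_in > v_offset else 0


def cross_sense(v_swing1: float, v_offset: float):
    """Return (first_sample, error_flag) for one bit column."""
    first = sense(v_swing1, v_offset)      # risk sensing on the original connection
    second = sense(-v_swing1, v_offset)    # confirm sensing with swapped inputs
    error = first == second                # identical samples indicate a timing error
    return first, error


V_OFFSET = 0.03                            # hypothetical SA offset in volts
group_a = cross_sense(0.05, V_OFFSET)      # swing above the offset (Group A style)
group_b = cross_sense(0.01, V_OFFSET)      # swing below the offset (Group B style)
print(group_a, group_b)
```

In this model, a Group A sample produces opposite results in the two sensings and is accepted, whereas a Group B sample produces two '0's and raises the error flag, which would trigger the extended discharge.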
The timing diagram of the cross-sensing scheme is shown in Fig. 4. The first pulse of SAE activates the SA to sample $V_{SWING1}$. In the example, the voltage swing on BLB is not sufficient to be detected by the sense amplifier ($V_{SWING1} < V_{OFFSET}$), hence the bit cell is wrongly read (sample1) as '0'. Then the switch (SWT) signal is activated to swap the connections between the bitlines and the SA, and the second pulse of SAE activates the SA again. As discussed above, since this sample belongs to Group B, $V_{SWING2}$ (sample2, which is called the confirm sensing) is read as '0' ($V_{SWING2} < V_{OFFSET}$) as well. Thus, a timing failure is detected and the error flag is triggered. In case of such a read error, extra cycles are needed, in which the WL (wordline) is activated again by the third SAE pulse to guarantee that the swing between BL and BLB ($V_{EXTEND}$) is enlarged enough for the data to be read correctly (sample3). Due to limited space, we refer the reader to [11] for more details about CS-SSRAM.
To quantify the delay of CS-SSRAM, we conduct Monte Carlo HSPICE simulations to evaluate the BER of an SRAM array of 256 rows * 128 columns with different bitline discharging times, operating at 28 nm, 0.5 V, 25 °C, and the TT process corner, as shown in Fig. 5. Note that the threshold voltage of the simulated cells is 0.25 V, and the nominal voltage is 0.9 V. For a discharging time of 3.2 ns, the corresponding BER is $10^{-3}$ (3σ region). To achieve the best frequency boost, we set the bitline discharging time in this paper such that the BER falls in the 3σ region. When the error signal is triggered, the bitline continues to discharge until 7.4 ns, at which point the BER is $10^{-8}$, which is sufficient for the 6σ region. To let the core use the L1 cache more efficiently, we set the core frequency to 400 MHz, which means that an access to a strong cacheline can be completed in 2 CPU cycles (5 ns). When a request accesses a weak cacheline, 2 more CPU cycles are needed to extend the discharge.
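As a rough illustration of what these cycle counts imply, the sketch below converts them to nanoseconds and estimates the expected read latency of a raw SSRAM cache. It assumes, purely for illustration, that accesses are spread uniformly over cachelines, so that the weak-access probability equals the weak-cacheline probability; `expected_read_cycles` is a hypothetical helper.

```python
CORE_FREQ_MHZ = 400
STRONG_CYCLES = 2            # cycles for a strong cacheline access
EXTEND_CYCLES = 2            # extra cycles for a weak cacheline access

cycle_ns = 1e3 / CORE_FREQ_MHZ                        # one CPU cycle in ns
strong_ns = STRONG_CYCLES * cycle_ns                  # strong access time
weak_ns = (STRONG_CYCLES + EXTEND_CYCLES) * cycle_ns  # weak access time


def expected_read_cycles(base_cycles: int, extend_cycles: int, p_weak: float) -> float:
    """Average read latency when a fraction p_weak of accesses needs extension."""
    return base_cycles + extend_cycles * p_weak


# Illustrative only: 12.02% is the L1 weak-access probability at BER = 1e-3.
l1_avg = expected_read_cycles(STRONG_CYCLES, EXTEND_CYCLES, 0.1202)
print(cycle_ns, strong_ns, weak_ns, round(l1_avg, 4))
```

The extended access window of 4 cycles covers the 7.4 ns discharging time quoted above.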
B. SSRAM-BASED L1/L2 CACHE ARCHITECTURE
To explain the organization and the architecture of the SSRAM-based cache (SSRAM cache for short), we use a common L1 cache configuration as the example, in which the cache has 64KB capacity, 4-way associativity, and 64 Bytes per cacheline. The SSRAM cache in this example is built from 16 SSRAM arrays with the size of 256 rows * 128 columns, each of which is marked as (x, y) in Fig. 6 (x refers to its row index, y refers to its column index). Note that the arrangement of the SSRAM arrays in Fig. 6 is chosen for a more concise description and differs from the layout of the actual cache. Hence, we can partition a cacheline into 4 sub-blocks (4 * 128 bits) that are stored in 4 individual SSRAM arrays. We define an SSRAM Column (Column for short) as a column of the SSRAM arrays. Due to the bus width between the CPU core and the L1 cache, most L1 accesses are actually read out one sub-block after another in a pipelined manner. Therefore, we assume that only the requested sub-block is read out in an L1 cache access. When accessing the L1 SSRAM cache, the index and the tag of the target address are sent to the tag arrays³ for the tag comparison, which eventually chooses one hit cacheline (if the access is a hit) from the indexed cache set. The sub-block ID in the address indicates the requested sub-block in a cacheline. For example, if the requested data is located in the 3rd sub-block of the cacheline, then the sub-block ID must be b'10. Hence, Column 2 is selected as the chosen Column, which is marked with the blue frame in Fig. 6. Since the L1 cache is latency-sensitive, all the sub-blocks in the chosen Column of the indexed set (in our example, the four sub-blocks stored in Column 2, i.e., the sub-blocks in the arrays (0, 2), (1, 2), (2, 2), and (3, 2)) are speculatively read out (risk sensing and confirm sensing) in parallel with the tag comparison. However, before the data can be sent out, the error flag of the requested SSRAM array that contains the hit sub-block (i.e., array (0, 2)) is selected by the way multiplexers (Way Mux) and the Column multiplexer (Column Mux). The need-extending signal is the error flag selected by the Column Mux. In our example, the cache controller activates the targeted SSRAM array for 2 more cycles to guarantee that the hit sub-block can be read out correctly. The chosen SSRAM Column only sends the hit sub-block to the buffer in the cache controller through the Way Mux. Then, the requested data is chosen by the sub-block offset in the address and sent to the core. Since write accesses are not on the critical path, write accesses of the SSRAM cache are the same as those of a conventional cache.
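The address fields used in this L1 example can be sketched as follows. The exact bit layout is an assumption for illustration: a byte offset within the 16-Byte sub-block in the lowest bits, then the 2-bit sub-block ID, then the set index for the 256 sets, with the remaining upper bits as the tag. The function `decode_l1_address` is a hypothetical helper.

```python
CACHE_BYTES = 64 * 1024
WAYS = 4
LINE_BYTES = 64
SUB_BLOCK_BYTES = 16
NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)      # 256 sets
SUB_BLOCKS = LINE_BYTES // SUB_BLOCK_BYTES         # 4 sub-blocks per cacheline


def decode_l1_address(addr: int):
    """Split an address into (tag, set_index, sub_block_id, byte_offset)."""
    byte_offset = addr % SUB_BLOCK_BYTES
    sub_block_id = (addr // SUB_BLOCK_BYTES) % SUB_BLOCKS   # selects the SSRAM Column
    set_index = (addr // LINE_BYTES) % NUM_SETS
    tag = addr // (LINE_BYTES * NUM_SETS)
    return tag, set_index, sub_block_id, byte_offset


# Hypothetical address: tag 5, set 17, 3rd sub-block (ID b'10), byte 3
addr = (5 << 14) | (17 << 6) | (0b10 << 4) | 3
print(decode_l1_address(addr))
```

With this layout, a sub-block ID of b'10 selects Column 2, as in the example above.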
To explain the organization of the L2 SSRAM cache, we use a typical L2 cache configuration as our example, with 4MB capacity for a 4-core system, 8-way associativity, and 64-Byte cachelines. The SSRAM cache in this example is built from 1024 SSRAM arrays with the size of 256 rows * 128 columns, like that in [18]. Only 256 of the total 8192 sets are shown in Fig. 7, in which each SSRAM array is marked as (x, y) (x refers to its row ID, y refers to its Column ID). Similar to the L1 cache example introduced before, a cacheline is divided into 4 sub-blocks (128 bits each) that are stored in 4 individual SSRAM arrays. The differences between the L1 and L2 SSRAM caches are: (1) the minimum access granularity of the L2 cache is the cacheline, or 64 Bytes, which is used to fill the missed cacheline in the L1 cache; (2) because an L2 cache set contains more cachelines, i.e., 8 cachelines in this example, accessing the data in parallel with the tag comparison is no longer energy efficient. Consequently, the L2 cache normally accesses the tag array and the data array serially. Only the hit cacheline, which is the result of the tag comparison, is activated. The need-extending signal is generated by an OR gate whose inputs are the 4 error flags coming from the requested sub-blocks. Each Column sends 1 sub-block through the Way Mux.
FIGURE 7. The SSRAM-based L2 cacheline organization.
C. CACHELINE REMAPPING SCHEME
Obviously, when the L2 SSRAM cache introduced above is used directly, with a cacheline as its access granularity, any sub-block with a timing failure in the hit cacheline prolongs the access latency due to the error correction cycles in the SSRAM. As shown in Fig. 8 (a), cachelines 0∼7 (C0∼C7) are in the same L2 cache set. A white block indicates a strong sub-block, while a gray block refers to a weak sub-block, which contains at least one weak bit cell. The sub-blocks of different cachelines are marked in different colors. For the SSRAM cache shown in Fig. 8 (a), C0∼C7 are all weak cachelines because each of them contains at least one weak sub-block, although there are only 8 weak sub-blocks out of all 32 sub-blocks in this cache set. To reduce the performance degradation of accessing these weak sub-blocks, we design a remapping scheme that groups more weak sub-blocks into the weak cachelines so that more strong cachelines can be obtained. In the sub-block remapping, each cacheline selects 1 sub-block from every SSRAM Column to store its data. The strong sub-blocks are selected with a higher priority, while the weak sub-blocks have a lower priority. As shown in Fig. 8 (b), C0 (marked in red) chooses the highest-priority sub-block from each SSRAM Column to make itself a strong cacheline. In the same way as C0, C1∼C5 each choose the 4 sub-blocks with the highest priority from the remaining sub-blocks in each SSRAM Column. Because the strong sub-blocks in all Columns are used up by the previous cachelines, the remaining C6 (purple) and C7 (pink) stay in the weak state. With this remapping scheme, the number of weak cachelines is reduced from 8 to 2.
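The effect of the remapping on the number of weak cachelines can be checked with a short sketch. The fault map below is hypothetical (it is not taken from Fig. 8); it merely places 8 weak sub-blocks so that every cacheline is weak before remapping. Since each cacheline takes one sub-block per Column and strong sub-blocks are handed out first, the number of weak cachelines after remapping equals the largest number of weak sub-blocks found in any single Column.

```python
def weak_lines_before(fault_rows):
    """fault_rows[way][col] is 1 for a weak sub-block; a row is one original cacheline."""
    return sum(1 for row in fault_rows if any(row))


def weak_lines_after(fault_rows):
    """After remapping, the Column with the most weak sub-blocks sets the count."""
    columns = zip(*fault_rows)
    return max(sum(col) for col in columns)


# Hypothetical 8-way set with 4 Columns and 8 weak sub-blocks out of 32
fault_rows = [
    [1, 0, 0, 0], [1, 0, 0, 0],
    [0, 1, 0, 0], [0, 1, 0, 0],
    [0, 0, 1, 0], [0, 0, 1, 0],
    [0, 0, 0, 1], [0, 0, 0, 1],
]
before = weak_lines_before(fault_rows)
after = weak_lines_after(fault_rows)
print(before, after)
```

For this particular fault map, the count drops from 8 weak cachelines to 2, the same reduction as in the example of the text.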
We apply the remapping scheme in both the L1 and L2 RRS caches. The remapping can be performed at system boot (we assume that the PVT condition is stable during system operation) after the completion of the Built-In Self-Test (BIST) [17]. A detailed implementation of the sub-block remapping in an L2 cache set is shown in Fig. 9 (a similar remapping scheme can be applied to the L1 cache as well): (1) First, we get the Fault Map of the sub-blocks through BIST. The '0's and '1's in the Fault Map represent the strong and the weak sub-blocks, respectively. (2) The LCs (Link Codes) are generated from the Fault Maps for each sub-block: the zero-bit counter assigns an LC to each strong sub-block in the Column, counting up from b'000, and the set-bit counter assigns an LC to each weak sub-block in the Column, counting down from b'111. In this case, the LCs for Column 0 are '111-000-001-010-011-100-110-101'. (3) The RCs (Remap Codes) of a cacheline, which are stored in the corresponding tag entry, are generated by scanning the LCs in each Column; they indicate the positions of all its remapped sub-blocks in the corresponding SSRAM Columns. As Fig. 9 depicts, the RCs of cacheline C0 are '001-000-000-000', which denotes that the data of C0 resides in the 2nd (b'001) sub-block of SSRAM Column 0, the 1st (b'000) sub-block of Column 1, and so on. In addition, the weak state bit of each cacheline in the tag array is generated simultaneously by checking whether the Fault Map entries of all its remapped sub-blocks are strong.
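The three steps above can be sketched in software as follows. This is one reading of the description, not the hardware of Fig. 9: it assumes that cacheline Ck is mapped, in every Column, to the sub-block whose LC equals k. Only the Fault Map of Column 0 is derived from the LCs quoted in the text; the other Columns and the names `link_codes` and `remap` are hypothetical.

```python
def link_codes(fault_col):
    """Assign LCs in one Column: strong sub-blocks count up from 0,
    weak sub-blocks count down from (ways - 1)."""
    next_strong, next_weak = 0, len(fault_col) - 1
    lcs = []
    for weak in fault_col:
        if weak:
            lcs.append(next_weak)
            next_weak -= 1
        else:
            lcs.append(next_strong)
            next_strong += 1
    return lcs


def remap(fault_map):
    """fault_map[col][pos] is 1 for a weak sub-block.
    Returns (rcs, weak_state): rcs[k][col] is the position used by cacheline k."""
    ways = len(fault_map[0])
    lcs = [link_codes(col) for col in fault_map]
    rcs = [[lc.index(k) for lc in lcs] for k in range(ways)]
    weak_state = [
        int(any(fault_map[c][rcs[k][c]] for c in range(len(fault_map))))
        for k in range(ways)
    ]
    return rcs, weak_state


fault_map = [
    [1, 0, 0, 0, 0, 0, 1, 1],   # Column 0, consistent with the LCs quoted in the text
    [0, 1, 0, 0, 0, 0, 0, 0],   # Columns 1 to 3 are hypothetical
    [0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
rcs, weak_state = remap(fault_map)
print(link_codes(fault_map[0]), rcs[0], weak_state)
```

For Column 0, the sketch yields the LCs 7, 0, 1, 2, 3, 4, 6, 5, i.e., '111-000-001-010-011-100-110-101', and the RCs of C0 come out as positions 1, 0, 0, 0, matching '001-000-000-000'.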
Regardless of whether data is read, written, or filled in the RRS cache, only the way-selection phase is adjusted to realize the remapping, compared with the SSRAM cache introduced in Section III.B. As discussed above, in the L2 SSRAM cache, the hit signal from the tag array is sent to the indexed set to select the hit sub-blocks from the same SSRAM array row. In the L2 RRS cache, by contrast, all remapped sub-blocks of the requested cacheline are read out according to the corresponding RCs. To identify the remapped sub-blocks of the cacheline, we add a 3-8 decoder to each SSRAM Column to select the requested sub-block by the corresponding RC. For example, if the access hits cacheline C0 with RCs of '001-000-000-000', the decoder in Column 0 reads the first RC of the cacheline (i.e., b'001) and only the 2nd sub-block is activated. The other three Columns activate the corresponding sub-blocks in the same manner. The need-extending signal of the L2 RRS cache is generated from the error flags of the SSRAM arrays that store the data.
In the L1 RRS cache, only the indexed SSRAM Column receives the corresponding part of the RCs to select the requested sub-block. For example, if the access hits cacheline C0 with RCs of '01-00-00-00' (a 4-way L1 cache only needs 2 bits to record the position of one sub-block in a Column), the position of the third sub-block of this cacheline is determined by reading the third part of the RCs, which is '00'. Consequently, the requested data is stored in the 1st SSRAM array in Column 2. As in the SSRAM cache, the need-extending signal is generated from the error flag of the selected SSRAM array.
For the 8-way associative L2 RRS cache in our example, every cacheline needs 4 bits to save its Fault Map and $4 \times \log_2 8 = 12$ bits (for an associativity of 8) to save its LCs. In our L2 cache example, there are 65536 cachelines grouped into 8192 sets, which means that an SRAM (called the Pre-remap SRAM) with a size of 128KB (65536 * 16 bits) is used to store the remapping information. This Pre-remap SRAM is designed with a large timing margin since it is only used at system boot. The 12-bit RCs and the 1-bit weak state of each cacheline are stored in each tag entry so that they are read out simultaneously during the tag comparison, which costs an extra 104KB (65536 * 13 bits) in the tag arrays.
Similar to the L2 cache, for each 4-way associative L1 cache in the example (in a 4-core processor, each core has its own private instruction and data caches), every cacheline needs 4 bits to save its Fault Map and $4 \times \log_2 4 = 8$ bits (for an associativity of 4) to save its LCs. In the L1 cache example, there are 1024 cachelines grouped into 256 sets. The L1 cache needs its own Pre-remap SRAM as well, which takes 1536 Bytes (1024 * 12 bits) to store the remapping information of all cachelines. The 8-bit RCs and the 1-bit weak state of each cacheline are stored in the corresponding tag entry so that they are read out simultaneously during the tag comparison, which requires an extra 1152 Bytes (1024 * 9 bits).
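The storage figures of the L1 and L2 examples can be reproduced with the arithmetic below. The helper `remap_storage_bytes` is hypothetical; it only restates the per-cacheline bit counts given in the text (Fault Map plus LCs in the Pre-remap SRAM, RCs plus weak state bit in the tag entry).

```python
def remap_storage_bytes(num_lines: int, assoc: int, sub_blocks: int = 4):
    """Return (pre_remap_bytes, extra_tag_bytes) for one cache."""
    code_bits = sub_blocks * (assoc - 1).bit_length()   # 4 * log2(assoc) for powers of 2
    pre_remap_bits = sub_blocks + code_bits             # Fault Map + LCs per cacheline
    tag_extra_bits = code_bits + 1                      # RCs + weak state bit
    return num_lines * pre_remap_bits // 8, num_lines * tag_extra_bits // 8


l2_pre, l2_tag = remap_storage_bytes(num_lines=65536, assoc=8)
l1_pre, l1_tag = remap_storage_bytes(num_lines=1024, assoc=4)
print(l2_pre // 1024, l2_tag // 1024, l1_pre, l1_tag)
```

The results correspond to 128KB and 104KB for the L2 cache and to 1536 Bytes and 1152 Bytes for each L1 cache.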
D. REUSE-AWARE CACHE FILLING/REPLACEMENT POLICY
According to our experimental results and the report in [7], over 90% of the L1 cache accesses target the Most Recently Used (MRU) cachelines in each cache set. In addition, we collected the ratio of L2 accesses with a stack distance of 1 or less (i.e., the ratio of MRU accesses) by running 10 billion instructions of several benchmarks. The experimental results show that the average ratio of MRU accesses in the L2 cache is up to 53.5%. Obviously, when the MRU data is located in a weak cacheline, frequent accesses to this cacheline reduce the overall processor performance because of the prolonged correct reads. This is because an unfinished load operation blocks the execution of the subsequent instructions in the single-issue in-order core that is commonly used in a low-power system. On the other hand, the write-back strategy in the processor removes the write operations from the critical path. Thus, we prefer to allocate the data with a higher read frequency to a strong cacheline through a reuse-aware cache filling/replacement policy: (a) in the L1 and L2 RRS caches, cachelines filled from the lower-level memory are inserted into the weak cachelines with a higher priority; (b) as for the replacement, if an incoming read request hits data stored in a weak cacheline, the data is swapped into a candidate cacheline, which is the Least Recently Used (LRU) cacheline among all strong ones in this set.
We use a 1-bit weak state to mark the weak or strong status of each cacheline (i.e., b'1 for a weak cacheline and b'0 for a strong cacheline). When the L1 or L2 RRS cache finds that a read access hits a weak cacheline, the swap is processed after the read operation has finished. The read data is then stored in the buffer of the cache controller introduced in Section III.B. When performing the swap, RRS caches find the candidate cacheline by using both the LRU bits and the weak states of the cachelines in the indexed cache set, and exchange the contents and the tags of the two cachelines. When all the cachelines in a cache set are weak, no candidate cacheline can be found. Consequently, the swap cannot be executed. However, even when the BER is as high as 0.1%, the probability that all the cachelines in a cache set are weak is merely 0.084%, which cannot significantly influence the performance.
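A figure of this magnitude can be reproduced under the following assumptions, which are one reading of the text rather than a derivation given in it: bit cells fail independently, a sub-block holds 128 bits, and after remapping a 4-way L1 set has no strong cacheline exactly when at least one of its 4 Columns contains only weak sub-blocks. The function `p_no_strong_line` is a hypothetical helper.

```python
def p_no_strong_line(ber: float, bits_per_sub_block: int, ways: int, columns: int) -> float:
    """Probability that at least one Column of a set has only weak sub-blocks."""
    p_weak_sub = 1.0 - (1.0 - ber) ** bits_per_sub_block   # one sub-block is weak
    p_col_all_weak = p_weak_sub ** ways                    # every sub-block in a Column is weak
    return 1.0 - (1.0 - p_col_all_weak) ** columns


p_all_weak = p_no_strong_line(ber=1e-3, bits_per_sub_block=128, ways=4, columns=4)
print(f"{p_all_weak:.4%}")
```

Under these assumptions the result is close to the 0.084% quoted above.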
In both the L1 cache and the L2 cache, in order to place the MRU data in the candidate strong cacheline, the data in the candidate cacheline must first be read out. Hence, the power consumption of one swapping operation includes 1 additional read operation and 2 write operations. As for the extra latency, cacheline swapping needs a read operation and stalls the subsequent operations in the cache. We use the configuration in [8] and add an extra read latency for the candidate strong cacheline, so that one swapping operation stalls the L1 cache for 5 cycles and the L2 cache for 12 cycles.
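The filling/replacement policy of this section can be summarized in a behavioral sketch of a single cache set. Several details are assumptions made for illustration and are not specified in the text: the victim of a fill is the LRU cacheline among the weak ones (falling back to the overall LRU if the set has no weak cacheline), recency follows the data when two cachelines are swapped, and reads are assumed to hit. The class `RRSSet` and its method names are hypothetical.

```python
class RRSSet:
    """One cache set with a fixed weak/strong state per way."""

    def __init__(self, weak_state, swap_stall=5):
        self.weak = list(weak_state)              # 1 = weak way, 0 = strong way
        self.tags = [None] * len(weak_state)
        self.lru = list(range(len(weak_state)))   # front = LRU, back = MRU
        self.swap_stall = swap_stall              # stall cycles of one swap

    def _touch(self, way):
        self.lru.remove(way)
        self.lru.append(way)

    def fill(self, tag):
        """(a) Insert incoming data into a weak way with higher priority."""
        weak_ways = [w for w in self.lru if self.weak[w]]
        victim = weak_ways[0] if weak_ways else self.lru[0]
        self.tags[victim] = tag
        self._touch(victim)
        return victim

    def read(self, tag):
        """(b) On a read hit in a weak way, swap with the LRU strong way.
        Returns (way holding the data afterwards, stall cycles)."""
        way = self.tags.index(tag)                # the access is assumed to hit
        stall = 0
        if self.weak[way]:
            strong_ways = [w for w in self.lru if not self.weak[w]]
            if strong_ways:                       # no candidate if all ways are weak
                cand = strong_ways[0]
                self.tags[way], self.tags[cand] = self.tags[cand], self.tags[way]
                i, j = self.lru.index(way), self.lru.index(cand)
                self.lru[i], self.lru[j] = cand, way   # recency follows the data
                way, stall = cand, self.swap_stall
        self._touch(way)
        return way, stall


s = RRSSet([0, 0, 1, 1])
filled_way = s.fill("A")
first_read = s.read("A")
second_read = s.read("A")
print(filled_way, first_read, second_read)
```

In this example, the data is first filled into a weak way, moved to a strong way on its first read hit at the cost of one swap, and served from the strong way without any stall afterwards.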
IV. EXPERIMENTAL SETUP
To evaluate the performance impact of RRS caches, we use gem5 [13] , a full-system cycle-accurate simulator, to model a four-core system with a 2-level cache hierarchy. In our experiments, we have evaluated 25 benchmark sets, each of which contains 4 benchmarks selected from 27 SPEC2006 benchmarks [12] according to the work in [19] . The four programs in each benchmark set are assigned to the four individual cores. Table 1 summarizes our benchmark sets.
All configurations of the cores and the caches are shown in Table 3. The system consists of four single-threaded, in-order X86 cores, each executing 1 instruction per cycle with no branch predictor and operating at 400 MHz and 0.5 V, which matches low-power application scenarios. Together with two levels of caches using the MOESI protocol, the cores run the 4 programs of a benchmark set concurrently. Each core runs 40 billion cycles in our simulations. The first 20 billion cycles are used for cache warming, and the performance metrics are collected from the second half. All benchmarks are compiled for the X86 ISA by the GNU 4.8 compiler with the O2 optimization level.
The separate instruction and data L1 caches are each configured as a single bank with 64KB capacity and 4-way set associativity. As shown in Table 2, the data array of each L1 cache consists of 16 SSRAM arrays (256 rows * 128 columns), and the tag array consists of 4 conventional SRAM arrays (64 rows * 120 columns): 1 tag SRAM per cache way and 4 cachelines per row, with 18 bits for the tag address, 3 bits for the LRU and valid states, 8 bits for the RCs and 1 bit for the weak state of each cacheline. Note that, according to our simulation results, the short bitlines in the tag array need only 2ns to ensure correctness owing to their small capacitance (64-row size). As for the Pre-remap SRAM (128 rows * 96 columns), the bitline discharging time is set long enough to avoid timing failures. Consequently, we assume that the tag array and the Pre-remap SRAM do not have any timing failures.
The 4MB 8-way set-associative L2 cache consists of 1024 SSRAM arrays (256 rows * 128 columns). The tag array of the L2 cache consists of 128 SRAM arrays (64 rows * 120 columns): 16 tag SRAMs per cache way and 4 cachelines per row, with 13 bits for the tag address, 4 bits for the LRU and valid states, 12 bits for the RCs, and 1 bit for the weak state. In addition, a 128KB L2 Pre-remap SRAM (64 SRAM arrays of 128 rows * 64 columns) is used by the L2 cache to store the Fault Maps and the LCs. The detailed configurations of the SRAM arrays in our L1 and L2 caches are listed in Table 2.
In order to quantify the benefit brought by the remapping scheme and the reuse-aware filling/replacement policy in RRS caches, we set up three approaches with timing speculation SRAM: SSRAM (using only the raw CS-SSRAM), SSRAM+Remap (SSRAM arrays with the remapping technique) and RRS (both the remapping and the reuse-aware filling/replacement policy are employed). The latencies for reading a strong cacheline in the L1 cache and the L2 cache are set to 2 cycles and 6 cycles, respectively. 2 more cycles are counted for the extended reading of a weak cacheline (as discussed in Section III.A).
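The data-array capacities and the 120-column tag rows quoted above can be cross-checked with a few lines of arithmetic. The following is only a sketch: the function and variable names are illustrative and do not come from the paper, and the only inputs are the array counts, array sizes and tag-entry field widths stated in the text.

```python
# Sanity check of the cache geometry quoted above.
# Function and variable names are illustrative, not taken from the paper.

def capacity_bytes(num_arrays: int, rows: int, cols: int) -> int:
    """Total storage of identical SRAM arrays in bytes (one bit per cell)."""
    return num_arrays * rows * cols // 8


def tag_row_bits(tag: int, lru_valid: int, rcs: int, weak: int,
                 lines_per_row: int = 4) -> int:
    """Width of one tag-array row holding `lines_per_row` tag entries."""
    return (tag + lru_valid + rcs + weak) * lines_per_row


l1_data_bytes = capacity_bytes(16, 256, 128)    # 16 L1 SSRAM arrays
l2_data_bytes = capacity_bytes(1024, 256, 128)  # 1024 L2 SSRAM arrays
l1_tag_row = tag_row_bits(18, 3, 8, 1)          # L1 tag entry fields
l2_tag_row = tag_row_bits(13, 4, 12, 1)         # L2 tag entry fields

print(l1_data_bytes // 1024, "KB L1 data array")
print(l2_data_bytes // (1024 * 1024), "MB L2 data array")
print(l1_tag_row, "bits per L1 tag row;", l2_tag_row, "bits per L2 tag row")
```

Under these inputs the data arrays come out at 64KB and 4MB, and both tag rows are 120 bits wide, in agreement with the stated configuration.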
For the L1 cache comparisons, we also simulated four other schemes: Perfect, SECDED (21, 16), ZCAL [7], and Mixed-Cell [8]. All configurations of the L1 caches are shown in Table 3. The Perfect cache is configured as a conventional SRAM-based cache without any timing errors or extra penalty. For the other approaches, the L1 cache access latency is set to 2 cycles with a BER of 0.1%, as discussed in Section III.A. We choose SECDED (21, 16) as the ECC example; it uses 5 extra ECC bits for each 16-bit data segment and can correct up to 32 erroneous bits (at most one in each 16-bit segment) without any extra delay, which is close to the Perfect cache. The ZCAL L1 cache uses an 8-bit zero-bit error code to record the number of '0's in each 128-bit data. When the zero count mismatches, i.e., a timing error is detected, 2 more cycles are spent to correct the data. It also swaps cachelines with a 5-cycle penalty for all accesses with errors. In the Mixed-Cell approach, the L1 data cache has 2 robust cache ways constructed with 2X-sized bit cells, while the remaining 2 ways are constructed with non-robust standard-sized bit cells with a parity bit for each 8-bit data segment. Since the L1 instruction cache of Mixed-Cell is a conventional cache, we mainly discuss the L1 data cache of Mixed-Cell. Note that we use the "MC_SWP" strategy in [8]: when a write operation hits a non-robust cacheline, the cacheline is swapped into a robust way. The swap penalty is also configured as 5 cycles.
For the L2 cache comparisons, we simulated Perfect, Mixed-Cell and two other schemes: OLSC (128, 64) [5] and Concertina [17]. As before, the BER of the Perfect cache is set to 0, and the BER of the other approaches is set to 0.1%. Since the L2 cache accesses the tag array and the data array serially, the latency of the Perfect L2 cache is set to 6 cycles, of which only 2 cycles are used for reading the data. Note that the L2 cache of Mixed-Cell uses the SECDED (21, 16) approach for the non-robust part. In addition, the L2 cache in Mixed-Cell is constructed from 2 robust ways and 6 non-robust ways according to [8]. Due to the large capacity of the L2 cache, we choose OLSC (128, 64), which uses 64 extra ECC bits for each 64-bit data segment, as the ECC example; it can correct up to 4 errors in 64-bit data. The latency of OLSC is set to 7 cycles, where the extra 1 cycle is used for decoding and confirming the correctness. Because Concertina records the strong/weak state for each sub-block, the granularity of Concertina is set to 4 Bytes in this paper. When storing data, Concertina marks the Null 4-Byte data segments and stores them in the weak part of the cache. Concertina uses 16 bits to store the Fault Map and another 16 bits to record the Compression Map for each cacheline. Note that we use the "Fit LRU" strategy in [17], i.e., only the cacheline with a greater number of weak sub-blocks (4B) than the number of Null sub-data (4B) in the data can be used to store data. The latency of Concertina is set to 7 cycles, where the extra 1 cycle is for recovering the data by using both the Fault Map and the Compression Map.
In summary, 8 design combinations, each consisting of one L1 approach and one L2 approach, are used for simulating the CPIs: Perfect, ECC, ZCAL, Mixed-Cell, Concertina, SSRAM, SSRAM+Remap and RRS. As shown in Table 4, the Perfect, Mixed-Cell, SSRAM, SSRAM+Remap and RRS designs use their own approach in both the L1 and L2 caches. The ECC design uses the L1 SECDED approach and the L2 OLSC approach. The ZCAL design uses the L1 ZCAL approach and the L2 OLSC approach. The Concertina design uses the L1 SECDED approach and the L2 Concertina approach.
V. PERFORMANCE RESULTS
The CPI is calculated as the ratio of the maximum cycles to the total number of committed instructions on all four cores. Note that the lower the CPI, the better the performance. Fig. 10 shows the normalized CPI (to the Perfect caches) comparisons for different cache designs under different benchmark sets, while the last group in Fig. 10 shows the average performance of the 8 tested cache designs. In addition, Fig. 11 shows the average access hit latency comparisons between all L1 and L2 approaches. The ECC design achieves an average CPI of 1.027X because of its low overhead of error detection and correction. However, even when using SECDED (21, 16) and OLSC (128, 64) for the L1 and L2 caches respectively, the limit of the error correction capability still exists. In the scenarios where the read data cannot be corrected, the corresponding access is treated as a miss, which degrades the CPI. The normalized CPI of ZCAL is 1.074X, which benefits from the reliability of the 8T cells and the reuse-aware swaps compared to SSRAM. An 8T cell may incur a timing failure only when it stores '0', which accounts for 71.5% of all data on average. The normalized CPI of Mixed-Cell is 1.144X, with the penalty coming from the accesses to the non-robust ways and the corresponding data swapping, which are up to 41.3% and 78.3%, according to our simulations, for the L1 D cache and the L2 cache respectively. The normalized CPI of Concertina is 1.054X. However, due to the "Fit LRU" mechanism in Concertina, a cacheline can store data only when the number of weak sub-blocks in the cacheline is more than the number of Null segments in the data. Hence, the average L2 cache miss rate of Concertina is as high as 67.07%, while the average L2 cache miss rates of the other approaches are approximately 37%. Consequently, the CPI of Concertina is degraded by its high L2 cache miss rate.
FIGURE 11. The average access hit latency comparisons (in cycles).
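As a reading aid, the CPI definition above can be written as a small helper. This is a sketch under our interpretation that "maximum cycles" means the cycle count of the slowest core; the function names and the per-core counts below are hypothetical and are not measured values from the simulations.

```python
def system_cpi(cycles_per_core, instructions_per_core):
    """CPI as described in the text: the largest per-core cycle count
    divided by the instructions committed on all cores together."""
    return max(cycles_per_core) / sum(instructions_per_core)


def normalized_cpi(design_cpi, perfect_cpi):
    """CPI of a design relative to the Perfect-cache CPI (lower is better)."""
    return design_cpi / perfect_cpi


# Hypothetical counts in billions, for illustration only.
cycles = [20, 20, 20, 20]
instructions = [10, 8, 12, 10]
print(system_cpi(cycles, instructions))
```

A normalized value such as 1.027X then means that the design needs 2.7% more cycles per committed instruction than the Perfect caches.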
Not surprisingly, the SSRAM cache performs poorly (1.113X), which is caused by the extended reading of the weak cachelines in both the L1 and L2 caches. As discussed in Section III.B, by using the remapping scheme, the RRS cache can group the weak sub-blocks together. For a clearer description, we define the Weak Cacheline in a Set (WCS) as the number of weak cachelines in a cache set. As shown in Fig. 12, in the example L1 RRS cache, the ratio of the cache sets with 4 WCS decreases from 16.84% to 1.19%, and the ratio of the cache sets with 3 WCS decreases from 37.34% to 13.24%. In the 8-way L2 cache introduced in this paper, the average WCS without and with the remapping scheme is 7.1 and 3.1, respectively. However, since the access granularity of the L1 cache is set to 128 bits, the remapping scheme cannot guarantee a performance improvement. Furthermore, it even causes performance degradation when an MRU cacheline consists entirely of weak sub-blocks. For example, as Fig. 13 shows, before the remapping there are four weak sub-blocks in the cache set and C3 is the MRU cacheline. We assume that 90% of the accesses to this cache set target the MRU cacheline and that the required data is uniformly distributed within the cacheline. Before the remapping, a weak sub-block is accessed only when the requested data is in the weak sub-block, which happens with a ratio of 90% * 25% + 10% * 25% = 25%. However, after the remapping, the ratio of accessing a weak sub-block changes to 90% * 100% + 10% * 0 = 90%, which can degrade the performance significantly. As a result, the average L1 latency increases to 2.26 cycles and the average performance of the SSRAM+Remap scheme degrades to 1.129X. Although the remapping scheme alone brings no performance improvement, it brings a great improvement in combination with the reuse-aware cache filling/replacement policy. With the remapping scheme, over 98.81% of the cache sets in the L1 cache have at least 1 strong cacheline as a candidate; hence, the average normalized CPI drops to 1.014X when both the remapping and the reuse-aware filling/replacement policy are used, since the strong cacheline is allocated to the MRU data.
FIGURE 12. The decrease of WCS by using remapping scheme.
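The Fig. 13 example reduces to a weighted sum over the MRU and non-MRU cachelines. The sketch below reproduces the two ratios from the text; the function name and parameter names are illustrative, and the 90% MRU share and the uniform data distribution are the assumptions already stated in the example.

```python
def weak_access_ratio(mru_share, weak_frac_mru, weak_frac_others):
    """Probability that an access to the set lands in a weak sub-block.

    mru_share:        fraction of accesses that target the MRU cacheline
    weak_frac_mru:    fraction of the MRU cacheline that is weak
    weak_frac_others: fraction of the remaining cachelines that is weak
    """
    return mru_share * weak_frac_mru + (1 - mru_share) * weak_frac_others


# Before remapping: one weak sub-block out of four in every cacheline.
before = weak_access_ratio(0.9, 0.25, 0.25)
# After remapping: the MRU cacheline C3 collects all four weak sub-blocks.
after = weak_access_ratio(0.9, 1.0, 0.0)

print(f"before remapping: {before:.0%}")  # prints "before remapping: 25%"
print(f"after remapping:  {after:.0%}")   # prints "after remapping:  90%"
```

The same function also shows why the reuse-aware policy helps: once the MRU data is moved to a strong cacheline, `weak_frac_mru` becomes 0 and the ratio is bounded by the non-MRU share of the accesses.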
The large performance degradations of the SSRAM cache without the remapping and the reuse-aware filling/replacement policy demonstrate that the system performance is highly sensitive to the average hit latencies of the L1 cache and the L2 cache. To further analyze how the RRS cache lowers the average hit latency, we define the Read-Hits-Strong Rate (RHSR) in (1) as the proportion of the read operations that hit the strong cachelines. $N_{\text{all}}$ in (1) refers to the number of all read operations, and $N_{\text{hit-strong}}$ refers to the number of read operations that hit the strong cachelines.
$$\mathrm{RHSR} = \frac{N_{\text{hit-strong}}}{N_{\text{all}}} \tag{1}$$
As shown in Fig. 14, since the access granularity of the L1 cache is 128 bits, the RHSRs of the SSRAM are relatively high: 87.9% and 88.7% for the L1 Instruction cache and the L1 Data cache, respectively. After using the remapping scheme, the RHSR of the I-cache drops slightly to 86.8% while that of the D-cache rises to 88.8%. The reason behind this has been discussed above. Hence, the remapping scheme alone cannot guarantee a performance improvement. However, since the proposed reuse-aware filling/replacement policy allocates the MRU data to the strong cachelines, the remapping does benefit the performance when it is combined with the reuse-aware filling/replacement policy. According to Fig. 12, the average WCS decreases from 2.6 to 1.7 in the L1 cache, and from 7.1 to 3.1 in the L2 cache. The more strong cachelines there are in a cache set, the more candidate cachelines can be selected by the reuse-aware filling/replacement policy. Consequently, the RHSRs of the RRS cache are 99.8% and 98.2% for the Instruction and Data caches, respectively. Furthermore, since the access granularity of the L2 cache is 512 bits, using only the remapping scheme yields an obvious performance improvement as well. The RHSRs of the three schemes for the L2 cache are 60.1%, 75.9% and 91.9%, respectively. In summary, due to the increased RHSR of the RRS caches, the CPI of this approach is the closest one to the Perfect cache.
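To illustrate how the RHSR relates to the hit latency, the sketch below computes (1) and a first-order latency estimate. It assumes that every read that does not hit a strong cacheline is a weak-cacheline hit paying the 2-cycle extended read, so misses, writes and swap penalties are ignored; the function names are illustrative, and the estimate is not the simulated latency reported in Fig. 11.

```python
def rhsr(n_hit_strong, n_all):
    """Read-Hits-Strong Rate as defined in (1)."""
    return n_hit_strong / n_all


def estimated_read_latency(rhsr_value, strong_cycles, weak_penalty=2):
    """First-order estimate: strong-hit latency plus the extended-read
    penalty weighted by the share of reads that are not strong hits.
    Assumption: all non-strong reads are weak hits (no misses modeled)."""
    return strong_cycles + (1.0 - rhsr_value) * weak_penalty


# RHSR values quoted in the text; 2 and 6 cycles are the strong-hit latencies.
l1d_ssram = estimated_read_latency(0.887, strong_cycles=2)
l1d_rrs = estimated_read_latency(0.982, strong_cycles=2)
l2_ssram = estimated_read_latency(0.601, strong_cycles=6)
l2_rrs = estimated_read_latency(0.919, strong_cycles=6)

print(f"L1 D: SSRAM {l1d_ssram:.2f} cycles, RRS {l1d_rrs:.2f} cycles")
print(f"L2:   SSRAM {l2_ssram:.2f} cycles, RRS {l2_rrs:.2f} cycles")
```

Even under this simplified model, the gap between SSRAM and RRS is much larger in the L2 cache than in the L1 cache, which is in line with the RHSR differences discussed above.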
The 256 rows * 128 columns SRAM array layouts are drawn using the Cadence Virtuoso suite [20] in the same 28nm process technology to demonstrate the area overhead. Fig. 15 shows the normalized area overhead (to the Perfect cache) of all simulated L1 and L2 cache approaches. As shown in Fig. 15 (a), the area overhead of SECDED (21, 16) is 31.2%, consumed by the 5 ECC bits in each 16-bit data segment. ZCAL uses 8T bit cells to build the SRAM array and uses 8-bit zero-bit counters to save the information for each 128-bit data. Hence its total area overhead is as high as 59.25%. Similarly, the area overhead of Mixed-Cell is also high (56.25% compared with the Perfect cache), which mainly comes from the 2X-sized cells in the robust ways and the parity bits in the remaining non-robust ways. According to [11], the area overhead of the CS-SSRAM is 3.4% in the 256 rows * 128 columns SRAM array, which is much lower than the 13.18% of the DS-SBVR scheme [10]. With the remapping scheme, every L1 tag entry needs 9 more bits to save the RCs and the weak state, which adds 1.76% area overhead in the tag array. In addition, the Pre-remap SRAM adds an area overhead of 2.36% (we assume the bit cells in the Pre-remap SRAM are the main source of its area overhead). Note that the reuse-aware filling/replacement policy does not introduce significant area overhead, since it only uses the weak state and the LRU bits in the tag to choose and swap the cachelines.
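Several of the L1 area figures above follow from simple bit-count ratios. The sketch below reproduces them; it assumes a 512-bit L1 cacheline (inferred from the 64KB, 4-way geometry with 64 rows of 4 tag entries per way, not stated explicitly here) and counts bits only, ignoring peripheral circuits. The variable names are illustrative.

```python
# Bit-count view of the L1 area overheads quoted above.
# Assumption: 512-bit cachelines; peripheral circuitry is ignored.
CACHELINE_BITS = 512

secded_pct = 100 * 5 / 16                 # 5 ECC bits per 16 data bits
rc_weak_pct = 100 * 9 / CACHELINE_BITS    # 8 RC bits + 1 weak bit per line
pre_remap_pct = 2.36                      # Pre-remap SRAM, taken from the text
remap_extra_pct = 1.76 + pre_remap_pct    # extra area of the remapping scheme

print(f"SECDED (21, 16): {secded_pct:.2f}%")
print(f"RCs + weak bit:  {rc_weak_pct:.2f}%")
print(f"Remapping total: {remap_extra_pct:.2f}%")
```

The remapping total agrees with the 4.12% extra L1 area reported for RRS caches relative to the raw SSRAM design.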
As for the L2 cache approaches, the area overhead of OLSC (128, 64) is as high as 100%, which is mainly consumed by the 64 ECC bits for each 64-bit data segment. The area overhead of Mixed-Cell is 48.4%, which is consumed by the 2 robust ways and the ECC used for the other non-robust ways. For Concertina, the Fault Map and the Compression Map each occupy 16 bits for every cacheline, as the granularity of Concertina is set to 4 Bytes in this paper. Therefore, the extra area overhead of Concertina is 6.25%. Since the L2 RRS cache is an 8-way set-associative cache, each L2 tag entry needs 12 bits to store the RCs and 1 bit for the weak state. Considering the Pre-remap SRAM as well, the L2 area overheads of SSRAM+Remap and RRS are both 9.06%.
Since the energy of one read operation is independent of the workload, instead of collecting the whole energy consumption of the workloads, we only compare the energy consumed by one read operation in each approach. The energy dissipation of one read operation is collected from HSPICE simulations at 0.5V, 25 °C and the TT process corner with the SRAM arrays listed in Table 2; it is mainly consumed in the SRAM arrays by the charging and discharging of the bitlines, voltage sensing, and error detecting [10]. Meanwhile, the energy consumed by the interconnection and buffers of the cache is counted as well. Fig. 16 (a) shows the normalized L1 cache energy consumption (to the Perfect cache) for each read operation. The energy consumption of SECDED is 1.27X, which is caused by storing and reading the extra ECC bits. Although ZCAL has a large swapping overhead (its swapping ratio in the L1 data cache is 2.34X that of the RRS cache) and extra cost from the zero-bit counters, its energy consumption is lower than that of the other approaches owing to its highly energy-efficient 8T cells; at only 0.79X, it is even lower than that of the Perfect cache. Note that only the L1 data cache and the L2 cache of Mixed-Cell are compared, as discussed in Section IV. The main source of power consumption for the Mixed-Cell cache is the larger data arrays. As for SSRAM, its normalized energy is 1.06X due to its high RHSR. When the remapping scheme is added, its RHSR is almost unchanged and the energy overhead rises to 1.12X due to the extra area overhead in the tag array and the Pre-remap SRAM. When the reuse-aware filling/replacement policy is used, since the RHSR is further increased, the energy overhead of the RRS cache is lowered back to 1.06X.
As for the L2 cache approaches, the energy consumption of the OLSC approach is 1.82X, which is mainly consumed by reading, encoding, and decoding the ECC bits. Since the L2 cache of Mixed-Cell uses SECDED to protect the non-robust ways, its energy consumption is higher than that of the L1 Mixed-Cell cache. Concertina only uses the Fault Map and the Compression Map for reading the cacheline. Therefore, its energy consumption is the lowest. As the RHSR of the L2 SSRAM cache is low, the energy consumption of SSRAM is as high as 1.14X due to the extended discharging of the bitlines. However, even though the RHSRs of SSRAM+Remap and RRS are increased, their energy consumption is still higher than that of SSRAM because of the extra reading of the RCs in the tag array.
The performance of a cache can be increased by decreasing the miss rate or the average latency, at the cost of increased area and energy overhead. For a more comprehensive comparison of all the approaches discussed, this paper uses the normalized (to the Perfect cache) figure of merit (FOM) of latency, miss rate, area, and energy (LMAE) defined in Eq. (2), similar to the FOM of PPA in [10]. Note that the indicators in Eq. (2) are all normalized. The indicators that are proportional to the performance are placed in the numerator (the reciprocal of the average latency and the reciprocal of the miss rate), and the indicators of consumption are placed in the denominator (area and energy overheads). Consequently, the FOM of LMAE is a higher-is-better metric. As shown in Fig. 17, RRS caches have the best FOM of LMAE among all L1 approaches, which is 0.85X that of the Perfect cache. The LMAE of the L2 RRS cache is 0.76X, which is also the highest among the remaining approaches.
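For readers without access to the typeset equation, the verbal description of Eq. (2) can be read as the following form. This is a reconstruction from the description only: the symbols $L$, $M$, $A$ and $E$ (average latency, miss rate, area and energy, each normalized to the Perfect cache) and the unweighted product are our assumptions, and the published equation may differ in notation or weighting.

```latex
\mathrm{FOM}_{\mathrm{LMAE}}
  = \frac{\dfrac{1}{L} \cdot \dfrac{1}{M}}{A \cdot E}
  = \frac{1}{L \cdot M \cdot A \cdot E}
```

Under this reading the Perfect cache has an FOM of 1 by construction, and any increase in latency, miss rate, area or energy lowers the value, consistent with the higher-is-better interpretation.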
VI. CONCLUSION
We have proposed RRS L1 and L2 caches for low supply voltages, which use the CS-SSRAM as the data array. RRS caches use 2 primary mechanisms to reduce the read latencies brought by the timing speculation SRAM. First, we propose the remapping scheme to relocate the error fraction in the caches, which decreases the ratio of weak cacheline hits with only 4.12% and 5.66% extra area in the L1 and L2 caches, respectively. Then, we propose the reuse-aware filling/replacement policy to swap the MRU data into the strong cachelines, which further decreases the ratio of weak cacheline hits with no extra area overhead. With the two schemes, the RHSRs (Read-Hits-Strong Rates) of RRS caches are over 95% in both the L1 cache and the L2 cache, so less than 5% of the accesses have timing failures. Consequently, compared to the raw SSRAM caches, the performance of the whole system is increased by 9.8% with only 0.12% and 2.69% extra energy consumption in the L1 and L2 caches, respectively.
He has coauthored over 50 academic papers and holds 40 patents. His current research interests include near-threshold circuit design and global navigation satellite system (GNSS) algorithm.
VOLUME 7, 2019
