This paper proposes a novel adaptable and reliable L1 data cache design (Adapcache) that automatically adapts itself to the supply voltage level while providing the highest possible capacity. Depending on the supply voltage level, Adapcache operates in one of three modes. At high supply voltages, Adapcache provides reliability through single-bit parity. In the middle range of supply voltages, Adapcache writes data to two separate cache-lines simultaneously so that one line can be used for error recovery when the other is faulty. At near-threshold supply voltages, Adapcache writes data to three separate cache-lines simultaneously and provides the correct data through a bitwise majority voter.
INTRODUCTION
A very effective way to reduce energy consumption is to lower the supply voltage close to the transistor threshold voltage. However, the energy saved in this low-power mode comes with a drastic increase in the number of memory cell failures, especially in large memory structures such as on-chip SRAMs. This motivates us to design a cache that is resilient to a large number of cell failures and can operate at lower supply voltages. In this paper, we propose a novel, highly reliable L1 data cache for three distinct supply voltage ranges. Our proposed cache, Adapcache, automatically adapts itself to the supply voltage in order to provide the largest possible capacity with minimum energy consumption and access time. Adapcache operates in three states: (1) Single Copy State (SCS), (2) Double Copy State (DCS) and (3) Triple Copy State (TCS). When the supply voltage is relatively high, Adapcache operates in SCS to sustain high-performance execution. In this state, Adapcache provides reliability based only on single-bit interleaved parity. DCS is used in the medium-low supply voltage range, where the error rate starts to increase. In DCS, Adapcache writes data to two selected cache-lines simultaneously. On every read, Adapcache compares the two copies through XOR circuits to check that they are equal. This approach is similar to the Dual Modular Redundancy (DMR) technique but, unlike DMR, the use of parity bits enables Adapcache to decide which copy is correct when an error is detected. When the supply voltage is near threshold and the error rate increases drastically, Adapcache operates in TCS. In TCS, Adapcache writes data to three selected cache-lines simultaneously to provide very high reliability. On a read, the correct value is decided by bit-wise majority voters, similar to the Triple Modular Redundancy (TMR) technique, but the error correction capability is improved by including the parity bits in the majority vote.
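The DCS read policy described above can be illustrated with a small software model. This is only a behavioral sketch, not the circuit itself: the interleaving degree `PARITY_WAYS`, the bit-list representation of a cache-line, and the function names are assumptions introduced here for illustration.

```python
PARITY_WAYS = 8  # assumed interleaving degree; the paper does not state it


def interleaved_parity(bits):
    """Return PARITY_WAYS parity bits; data bit i feeds parity group i % PARITY_WAYS."""
    parity = [0] * PARITY_WAYS
    for i, b in enumerate(bits):
        parity[i % PARITY_WAYS] ^= b
    return parity


def dcs_read(copy_a, copy_b):
    """Model a DCS read. Each copy is (data_bits, stored_parity).

    Returns the data bits judged correct, or None if neither copy
    matches its stored parity.
    """
    data_a, par_a = copy_a
    data_b, par_b = copy_b
    # XOR comparison: any set bit means the two copies differ.
    if not any(x ^ y for x, y in zip(data_a, data_b)):
        return data_a
    # Copies differ: recompute parity and forward the copy that
    # agrees with its stored parity bits.
    if interleaved_parity(data_a) == par_a:
        return data_a
    if interleaved_parity(data_b) == par_b:
        return data_b
    return None
```

In this model, plain DMR would stop at the XOR mismatch and report an error; the parity comparison is what lets one of the two copies be selected.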
ADAPCACHE CIRCUIT DESIGN
Our sub-bank design is based on the structure proposed in [1]. Figure 1 shows the block diagram of each sub-array structure. When the supply voltage is high, Adapcache operates in SCS and only one cache-line is activated per access. To write the selected cache-line, signal IEU1 is driven high to activate the input buffers, and data is transferred to the selected cache-line via Bus4 and Bus1. Similarly, to read the selected line, signal OEU1 is driven high to activate the output buffers, and data is transferred from Bus1 to Bus3. Error protection in this state relies on bit-parity calculation, which favors speed over accuracy.
In the medium supply voltage range, Adapcache operates in DCS, where two cache-lines are activated simultaneously on each access. On a write, data is written to the two selected lines at the same time; parity calculator circuits generate the parity bits and write them to the parity bit cells as well. On a read, Bus1 is divided into two parts, and the output buffers transfer the two selected lines, each from a separate sub-array slice group (sub-array slices 0 to 3 and sub-array slices 4 to 7), to the XOR circuit, which checks whether they are identical. If at least one bit differs, signal EN10 activates two parity calculator circuits, which recompute the parity bits of the selected lines and compare them with the stored parity bits of each line. When one of the comparators indicates equality, the corresponding output buffer transfers its data to Bus3.
In TCS, on a write, signal IEU1 is driven high and data is written to three selected lines simultaneously. On a read, Bus1 is divided into three parts, and the three selected lines, each from a separate sub-array slice group (sub-array slices 0 to 2, sub-array slices 3 to 5, and sub-array slices 6 and 7 plus an extra slice), are transferred to a majority voter. The majority voter output is DataM for the cache-lines and ParityM for their parity bits. The parity bits of DataM are then calculated and compared with ParityM. If they differ, the parity bits of each selected line are calculated and compared with that line's stored parity bits. When one of the comparators indicates equality, the corresponding output buffer transfers its data to the output. We designed a new decoder circuit that activates one, two or three word-lines per access; its details are described in [2].
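The TCS read path can be modeled in software as follows. This is a hedged behavioral sketch rather than the actual circuit: the 8-way interleaved parity (`PARITY_WAYS`), the bit-list representation, and the function names are assumptions made for illustration, while `data_m` and `parity_m` stand for the DataM and ParityM voter outputs.

```python
PARITY_WAYS = 8  # assumed interleaving degree; not specified in the paper


def interleaved_parity(bits):
    """Return PARITY_WAYS parity bits; data bit i feeds parity group i % PARITY_WAYS."""
    parity = [0] * PARITY_WAYS
    for i, b in enumerate(bits):
        parity[i % PARITY_WAYS] ^= b
    return parity


def majority(a, b, c):
    """Bit-wise majority vote over three equal-length bit lists."""
    return [(x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c)]


def tcs_read(copies):
    """Model a TCS read. `copies` holds three (data_bits, stored_parity) tuples.

    Returns the data bits judged correct, or None if no candidate
    is consistent with its parity bits.
    """
    (d0, p0), (d1, p1), (d2, p2) = copies
    data_m = majority(d0, d1, d2)    # DataM
    parity_m = majority(p0, p1, p2)  # ParityM
    # Accept the voted data if its recomputed parity matches the voted parity.
    if interleaved_parity(data_m) == parity_m:
        return data_m
    # Otherwise fall back to any single copy that matches its own stored parity.
    for data, parity in copies:
        if interleaved_parity(data) == parity:
            return data
    return None
```

The fallback is what separates this model from plain TMR: if two copies share a fault in the same bit position, the vote is wrong, but the parity mismatch exposes it and the remaining clean copy can still be returned.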
Copyright is held by the author/owner(s). GLSVLSI'13, May 2-3, 2013, Paris, France. ACM 978-978-1-4503-1902-7/13/05.
EVALUATION
We evaluate the useful cache capacity, access time and energy consumption of Adapcache and compare them with DMR and TMR caches and with Parichute [3]. We inject random persistent faults at bit failure rates from 0% to 12% and repeat each experiment 100 times. We calculate the useful cache capacity as the portion of the cache that is not disabled, as shown in Figure 2. The bit failure rate for persistent failures at a given Vcc is taken from the study by Miller et al. [3]. Figure 2 also presents the Vcc intervals in which SCS, DCS and TCS can be used efficiently according to the persistent bit failure rate. Figure 3 depicts the normalized energy consumption of each cache structure (normalized to a single-parity-protected cache at Vcc = 1 V). We calculate the latency of all structures for a single cache-line. In Adapcache, if a line is faulty, it takes 3 clock cycles to read and correct it. For DMR, it takes 5 clock cycles to read the selected lines and detect the errors. Similarly, it may take 7 clock cycles for the TMR structure to read the selected cache-lines and check and correct the errors. Finally, Parichute may take up to 20 cycles if the bit failure rate is very high. This latency is almost twice the L2 cache latency, which may not be acceptable.
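The fault-injection procedure can be approximated by a small Monte Carlo model. The sketch below is illustrative only: the paper does not state the rule used to disable lines, so `group_usable` applies a deliberately simple placeholder criterion (a group of copies is usable if no bit position is faulty in a majority of its copies), and the cache dimensions (`lines`, `bits_per_line`) are hypothetical values chosen to keep the example fast.

```python
import random


def group_usable(fault_masks):
    """Placeholder criterion (an assumption, not the paper's rule):
    usable if no bit position is faulty in a majority of the copies.
    With a single copy this means the line must be fault-free."""
    copies = len(fault_masks)
    for column in zip(*fault_masks):
        if 2 * sum(column) > copies:
            return False
    return True


def useful_capacity(copies, bit_failure_rate, lines=96, bits_per_line=64,
                    trials=100, seed=0):
    """Average fraction of physical lines that hold usable, distinct data.

    copies = 1, 2 or 3 models SCS, DCS or TCS; each trial draws a fresh
    random map of persistent faults.
    """
    rng = random.Random(seed)
    groups = lines // copies
    total = 0.0
    for _ in range(trials):
        good = 0
        for _ in range(groups):
            masks = [[rng.random() < bit_failure_rate
                      for _ in range(bits_per_line)]
                     for _ in range(copies)]
            good += group_usable(masks)
        total += good / lines
    return total / trials
```

Calling `useful_capacity(c, rate)` for `c` in 1, 2, 3 over a sweep of rates from 0.0 to 0.12 produces curves of the same kind as Figure 2, although the numbers depend entirely on the assumed criterion and dimensions and should not be read as the paper's results.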
