Geometry scaling of semiconductor devices enables the design of ultra-low-cost (e.g., below 1 USD) battery-powered resource-constrained ubiquitous devices for environment, urban life, and body monitoring. These sensor-based devices require high performance to react to infrequent particular events, as well as extreme energy efficiency to extend battery lifetime during the majority of the time, when only low performance is required. In addition, they require real-time guarantees. The most suitable technological solution for these devices consists of using hybrid processors able to operate at: (i) high voltage to provide high performance and (ii) near-/subthreshold voltage to provide ultra-low energy consumption. However, the most efficient SRAM memories for each voltage level differ and trading off different SRAM designs is mandatory. This is particularly true for cache memories, which occupy most of the processor's area.
INTRODUCTION
Increased semiconductor technology integration due to geometry scaling opens the door to new market segments. In particular, technology evolution enables adding some degree of intelligence to any control or measuring engine such as environment sensor applications to monitor wind, sea level, temperature, tsunamis, the body, urban life, etc., by means of battery-powered ultra-low-cost (e.g., below 1 USD) computing devices running critical real-time applications. Such applications can be characterized by long time periods of minimal data processing interleaved with short bursts of intense computation with larger amounts of data to be processed. For example, consider a human body vital signs monitoring system in which a computer is attached to an activity sensor. Such a system spends most of the time (e.g., 99%-99.99% of the time [Szewczyk et al. 2004; Kusumoto and Goldschlager 2008]) processing noncritical input data until infrequent events arise. During these long periods, only low performance is required, and the system must consume the minimum amount of energy possible in order to extend battery lifetime. However, when infrequent events arise (e.g., 0.01%-1% of the time [Szewczyk et al. 2004]), the system must read and process a large data input set and react quickly [Kusumoto and Goldschlager 2008]. During this short time period, inputs change noticeably and high performance must be provided. Guaranteed performance and reliability are also needed to provide strong functional and timing guarantees [Wilhelm et al. 2008] given that a failure to perform an operation correctly and within a given time may have a catastrophic consequence in this environment. Therefore, the main requirements of these new market segments are:
(1) ultra-low energy consumption in order to extend battery lifetime,
(2) high performance to react to infrequent events,
(3) simple system design for increased yield and reduced costs, and
(4) strong timing and functional guarantees, needed for running critical applications on top.
Basically, processors must provide high-performance and low-power operation when computing requirements are high, as well as ultra-low energy operation when low performance is required. This can be achieved by operating at high/moderate supply voltage (Vcc) during high-performance operation and at near-/subthreshold (NST) voltage¹ during low-performance operation. Therefore, two operation modes with different needs and different optimal Vcc must be distinguished:
(1) high-performance and low-power operation mode under high or moderate voltage (HP mode for short) during relatively short periods of time to react to some infrequent particular events, and
(2) ultra-low energy and reliable operation mode under NST voltage (ULE mode for short) during most of the time until infrequent events arise.
Existing solutions can be used to handle both operation modes, but they are mainly based on having multiple-Vcc domains. Multiple-Vcc domains enable hybrid voltage operation and suit those markets with moderate-cost computing devices such as smartphones, some implanted devices, and the like. However, their design, test, validation, and fabrication costs increase to unaffordable levels for the ultra-low-cost market, where chips can be priced even below 1 USD. Therefore, new, ultra-low-cost, single-Vcc-domain processors [Wilkerson et al. 2008; Ghasemi et al. 2011], which enable efficient hybrid voltage operation, must be developed.
¹ Note that the particular NST voltage where energy is minimized can fall in the near-threshold (moderate-V_TH devices) or subthreshold (high-V_TH devices) regime. It highly depends on the target CMOS technology, as shown in , and the activity profile of the application. Further, other design constraints may also influence the particular NST value chosen.
Using a single-Vcc domain introduces some challenges in CMOS designs when the same circuits are intended to operate at drastically different Vcc levels, because, unfortunately, CMOS technology does not behave equally at different voltage levels. This fact is particularly true for SRAM memory cells because the most power-, delay-, and area-efficient SRAM cells for high and moderate voltage operation do not operate reliably at ultra-low voltage. Conversely, those SRAM cell designs suitable for ultra-low voltage operation [Zhai et al. 2007; Jain and Agarwal 2006; Kulkarni et al. 2007; Calhoun and Chandrakasan 2007] are far from optimal at higher voltage levels due to substantially high power, delay, and area overheads. SRAM cache memories are particularly critical because they:
(1) occupy most of the chip area;
(2) provide increased performance, which is particularly important during high voltage operation; and
(3) avoid many energy-hungry off-chip memory accesses.
Existing approaches that enable operation at different Vcc levels [Roberts et al. 2007; Wilkerson et al. 2008; Abella et al. 2009; Chishti et al. 2009] have been shown to be effective from an average performance perspective and provide functional guarantees, but fail to provide the strong timing guarantees² required for worst-case execution time (WCET) estimation, as needed for critical applications in our target market [Wilhelm et al. 2008; He et al. 2011; Werner-Allen et al. 2006]. Therefore, understanding the trade-offs involved in the design of on-chip SRAM caches for hybrid high and ultra-low voltage operation in our target market from a microarchitectural perspective is of prominent importance.
In this article, we propose hybrid cache ways, that is, single-Vcc-domain, hybrid L1 cache architectures, that satisfy all stringent needs of our target market. The proposed cache architecture combines heterogeneous SRAM cell types by splitting the cache into two sections: (i) HP ways, where some cache ways are made of the SRAM cells optimized for one particular Vcc level (e.g., high or moderate Vcc) and (ii) ULE ways, where the rest of the cache ways are made of the SRAM cells optimized for another Vcc level (e.g., NST Vcc). During ULE mode, data processing is expected to be minimal and workloads are much smaller than during HP mode [Szewczyk et al. 2004; He et al. 2011; Werner-Allen et al. 2006]. Workload discrepancy across HP and ULE mode justifies reducing the hardware resources at ULE mode. Since HP ways would experience many faults at NST Vcc and thus would not provide reliable operation, they are turned off at ULE mode. However, all cache ways are enabled at HP mode to fit larger workloads and provide high performance. ULE ways are reused at HP mode, in spite of their inefficiency at high Vcc, because they reduce the number of slow and energy-hungry memory accesses [Maric et al. 2012]. Strong timing guarantees are achieved at both modes due to the deterministic behavior of the proposed cache designs at any voltage considered. In particular, reliability of all cache space is high enough so that no cache space is disabled and WCET estimates can be obtained. In order to conduct the research of this article, we provide a comprehensive study varying several critical parameters such as SRAM cell type, operation voltage, cache size, associativity, and line size. We show that our hybrid caches can efficiently and reliably operate across a wide range of voltages, consuming little energy at ULE mode as well as providing high performance with small overheads at HP mode, as required for our target market.
The rest of the article is organized as follows. Section 2 provides some background on hybrid voltage operation for on-chip cache memories. Section 3 presents hybrid cache designs. Evaluation results are presented in Section 4. Finally, Section 5 summarizes the main conclusions of this work.
BACKGROUND
In this section, we first highlight the main aspects related to hybrid high and ultra-low voltage operation. Then, we briefly describe existing low/ultra-low SRAM cells. Finally, we review some related work on low-energy techniques for caches.
Hybrid High and Ultra-Low Voltage Operation
High performance required by critical applications at HP mode is only available at saturation voltage (typically above 400-500mV for current technology nodes [Zhao and Cao 2006] ). Dynamic energy at these voltage levels is high due to the quadratic dependence of energy on voltage, and hence energy cannot be neglected even if the main constraint is performance. Thus, these systems require minimizing energy under a given performance constraint. On the other hand, when applications require modest or low performance (but still with real-time constraints), the main target is minimizing energy consumption (ULE mode). The simplest way to minimize the total energy is to scale down Vcc to the NST regime [Calhoun and Chandrakasan 2004; Hanson et al. 2006 ], but without increasing fault rates beyond affordable levels.
CMOS combinational logic shows near-linear voltage-delay scaling in the saturation regime, and exponential delay increase in the NST voltage regime. Dynamic power decreases much faster than delay increases, hence decreasing voltage is beneficial in terms of dynamic power and energy. However, leakage power decreases slowly with voltage scaling, and total leakage energy grows as delay (and hence execution time) increases. Thus, there is a "sweet" point in terms of energy, where the total amount of energy required to execute a given task is minimized. Increasing voltage from this point increases energy due to dynamic energy, whereas decreasing voltage increases energy due to leakage energy. This effect is depicted in Figure 1 (left). In general, the energy-sweetest point for CMOS technology is in the range 200-400mV, thus is in the NST regime. The particular voltage value, and so whether it falls in the subthreshold or near-threshold regime, depends on the application and its activity, which determines dynamic and leakage power, as well as the target CMOS technology. However, performance is low, as shown in Figure 1 (right), because delay grows exponentially when voltage is lowered.
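The trade-off between dynamic and leakage energy can be illustrated with a small numerical sketch. The model below is not taken from the article: it assumes an EKV-style smooth on-current, a leakage power proportional to Vcc, and purely illustrative constants (`V_TH`, `N_SLOPE`, `LEAK_RATIO` are hypothetical values chosen only to make the shape of the curve visible). All energies are in arbitrary units.

```python
import math

# Illustrative parameters only; none of them come from the article.
VT_THERMAL = 0.026   # thermal voltage kT/q at room temperature (V)
V_TH = 0.35          # assumed threshold voltage (V)
N_SLOPE = 1.5        # assumed subthreshold slope factor
LEAK_RATIO = 2e3     # hypothetical weight of leaking devices per switching device


def on_current(vcc):
    """EKV-style on-current (arbitrary units): exponential in Vcc below V_TH,
    roughly quadratic in (Vcc - V_TH) above it."""
    x = (vcc - V_TH) / (2 * N_SLOPE * VT_THERMAL)
    return math.log1p(math.exp(x)) ** 2


def delay(vcc):
    """Gate delay ~ C * V / I_on with C = 1 (arbitrary units)."""
    return vcc / on_current(vcc)


def energy(vcc):
    """Return (dynamic, leakage, total) energy per task in arbitrary units."""
    e_dyn = vcc ** 2                                    # C * V^2 with C = 1
    i_leak = math.exp(-V_TH / (N_SLOPE * VT_THERMAL))   # off-current at V_gs = 0
    e_leak = LEAK_RATIO * i_leak * vcc * delay(vcc)     # leakage power * execution time
    return e_dyn, e_leak, e_dyn + e_leak


def sweet_spot(v_min=0.15, v_max=1.0, step=0.005):
    """Scan Vcc on a grid and return the voltage minimizing total energy."""
    n = int(round((v_max - v_min) / step))
    grid = [v_min + i * step for i in range(n + 1)]
    return min(grid, key=lambda v: energy(v)[2])
```

With these assumed constants, `sweet_spot()` lands inside the 200-400mV range quoted above; at 1V the dynamic term dominates, whereas at 0.2V the leakage term dominates because the execution time has grown exponentially. Different constants would move the minimum, which is consistent with the dependence on technology and activity noted in the text.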
In spite of the challenges related to hybrid voltage operation, both operation modes must be integrated into a single circuit due to several reasons such as reduced fabrication costs, high integration, and power efficiency. In general, using shared circuits in a single chip to perform both kinds of tasks, namely those at high voltage and those at ultra-low voltage, reduces packaging, burn-in, and test costs as well as overheads related to communication between the on-and off-chip circuitry [Wilkerson et al. 2008; Ghasemi et al. 2011] .
Low and Ultra-Low Voltage SRAM Cells
Differential 6T SRAM cells have been commonly used for high-Vcc operation. However, many alternative SRAM cells have been designed targeting different voltage and robustness scenarios [Zhai et al. 2007; Jain and Agarwal 2006; Liu and Kursun 2007; Kulkarni et al. 2007; Calhoun and Chandrakasan 2007]. Results in Gerosa et al. [2009], Jain and Agarwal [2006], and Kulkarni et al. [2007] show that 8T and 10T SRAM cells are superior to 6T SRAM cells at low and ultra-low voltages under various conditions such as iso-robustness, iso-area, and iso-read failure.
A fundamental problem in conventional 6T SRAM cells is read stability (or read upset) at low voltages, which cannot be achieved without significant cell upsizing [Chen et al. 2007]. A read upset occurs when a pull-down transistor in the cross-coupled inverter is unable to hold the correct value in the node. Adding two extra NMOS transistors (T3 and T4 in the 8T SRAM plot in Figure 2) to a 6T SRAM cell decouples read and write paths and blocks noise injection from the read path to the value-holding nodes. Read stability is then achieved by sizing transistors T3 and T4 without affecting write functionality of the cell. Hence, the write stability condition is defined only for the cross-coupled inverters. Moreover, transistor T4 and the pull-down transistors in the cross-coupled inverters can have smaller width because the read margin does not need to be considered, thanks to the separated read port. Adding two extra NMOS transistors introduces around 30% area penalty, which makes the 8T SRAM cell less efficient than 6T ones at high voltage due to power and area overheads. Delay impact is negligible at high voltage [Jain and Agarwal 2006; Gerosa et al. 2009].
The 10T SRAM cell design considered in this article is the Schmitt-trigger fully differential cell [Kulkarni et al. 2007] . Transistors T1-T8 create the cross-coupled Schmitt-trigger inverter. Such a cell does not introduce any architectural change compared to the 6T SRAM cell, as the read/write port is accessed through two passgate NMOS transistors in both types of cells (T1 and T2 for 6T bitcell and T9 and T10 for 10T bitcell in Figure 2) . Read stability at ultra-low voltage is improved by the positive feedback from T7/T8 that adaptively increases or decreases the switching threshold of an inverter depending on the input transient direction in order to preserve the logic state of the cell. Write stability at ultra-low voltage is enhanced by reduced pull-down transistor sizes due to stacked NMOS transistors in the pull-down path (series connected: T1 and T3, T2 and T4). Therefore, 10T cells provide improved process variation tolerance and reliable ultra-low voltage (down to 160mV [Kulkarni et al. 2007] ) operation due to their built-in feedback mechanism, thus making this cell superior to conventional 6T and 8T SRAM cells at ultra-low voltage. The area penalty compared to the 6T SRAM cell is around 45% [Kulkarni et al. 2007 ], which makes 10T SRAM cells unattractive for high voltage operation due to significant power and area overheads. Delay increase for 10T SRAM cells with respect to 6T and 8T ones at high voltage is negligible [Kulkarni et al. 2007] .
In summary, all those cells devised for low or ultra-low voltage operation are less efficient than 6T SRAM cells at HP mode in terms of power, area, and delay. Thus, whereas the same combinational logic can be employed at HP and ULE modes (despite somewhat increased error rates), efficient SRAM cell designs across the different voltage levels simultaneously do not exist. Different SRAM cell designs target particular voltage ranges, but how to combine and lay these cells out at microarchitecture level for on-chip cache memories to allow efficient operation at high and ultra-low voltage with affordable cost is an open issue.
State-of-the-Art on Low-Energy Cache Designs
Literature on low-power techniques for caches is abundant. SRAM cells discussed in Section 2.2 can be used for hybrid voltage operation, but these SRAM cells introduce significant area and power overheads (especially at HP mode) due to additional transistors, which is unaffordable in embedded cache design if used extensively.
Several proposals have exploited different trade-offs between performance and power by reconfiguring cache size and associativity [Albonesi 1999; Zhang et al. 2003], lowering cache Vcc [Flautner et al. 2002] (or even gating it [Powell et al. 2000]) for some cache sections or the whole cache, splitting the cache into different modules [Kin et al. 1997; Abella and Gonzalez 2003; Fujii and Sato 2004], etc. In general, these techniques are unsuitable for our target market since they fail to operate reliably at ULE mode. Techniques based on having multiple-Vcc domains are unaffordable for our target ultra-low-cost (e.g., below 1 USD) market [Dreslinski et al. 2008]. Zhou et al. [2010] propose downsizing 6T cells of on-chip caches combined with error correction codes. Ghasemi et al. [2011] propose mixing heterogeneous cell sizes of the same SRAM cell types. However, they target large on-chip memories (e.g., last-level caches), devised particularly for high-performance commercial systems running server applications on top. Likewise, both designs are devised for high voltage operation. If we put them in the context of hybrid high and ultra-low voltage operation, they are overdesigned to operate reliably at ULE mode.
The simplest way to achieve higher energy efficiency is decreasing the size of SRAM cells at the expense of higher failure rates; this is particularly critical at NST voltage. Faulty cache entries should then be disabled or replaced. Techniques based on replacing faulty cache entries [Wilkerson et al. 2008; Chishti et al. 2009; Ansari et al. 2009; Sasan et al. 2009] introduce significant overheads due to bypassing and signal rerouting. Likewise, techniques based on simply disabling faulty storage [Wilkerson et al. 2008; Abella et al. 2009; Choi et al. 2011; Khan et al. 2013] may produce noticeable performance variation for a given program depending on the faults' location because the distribution of faulty bits is random. Such techniques have been shown to be effective from an average performance perspective and provide functional correctness, but fail to provide any guarantees on available cache size, number of sets, number of ways, etc., thereby making them useless in real-time scenarios where WCET estimation requires full knowledge of the underlying hardware features to provide strong timing guarantees [Wilhelm et al. 2008]. WCET estimation is an expensive task even when cache characteristics are known a priori, and no WCET estimation techniques exist for caches in which some entries may be faulty.
In our previous work [Maric et al. 2011] , we proposed hybrid, single-Vcc-domain cache designs, where the cache ways are implemented with heterogeneous SRAM cells to provide reliable hybrid high-and ultra-low-Vcc operation. However, in that work we briefly analyze standalone caches only. In this article, we extend the analysis towards the whole processor to understand the potential of our hybrid and nonhybrid designs, studying carefully different cache organizations and their sensitivity to parameters such as cache size, associativity, and memory latency.
HYBRID L1 CACHE DESIGN
In general, L1 caches take up a significant energy fraction at HP mode due to frequent accesses and dominant dynamic energy at high voltage. At ULE mode, caches are expected to be the main energy contributor due to both dynamic energy and leakage, which highly correlates with area occupancy³. Despite their significant energy consumption, caches are desirable from an energy perspective since they filter many off-chip accesses whose energy consumption is high and whose latency increases the chip's leakage. On the other hand, caches are not desirable from a time predictability perspective. Only deterministic caches can be used in order to provide strong timing guarantees [Wilhelm et al. 2008; Ferdinand and Wilhelm 1999]. Therefore, devising fast and energy-efficient L1 cache designs with deterministic behavior is of prominent importance for hybrid processors operating at high and ultra-low voltage levels in our target market.
In this section, we present new, deterministic L1 cache designs for reliable hybrid voltage operation. First, we describe our assumptions on the technology. Then we describe the proposed cache designs.
Technological Assumptions
In order to support hybrid voltage operation, we consider different operation modes, at least one for high Vcc and one for ultra-low Vcc. These modes may require SRAM cells different from traditional 6T ones. In fact, some existing processors use non-6T SRAM cells. For instance, Intel Atom [Gerosa et al. 2009 ], Intel Nehalem [Intel Corporation 2008] , and AMD Llano [Kanter 2008 ] use 8T SRAM cells for their L1 caches to operate at high (above 1V) and low voltage modes (0.8V). The need for further voltage scaling in future technologies is very likely to consolidate this shift from 6T to 8T SRAM cells. However, using 8T SRAM cells instead of 6T ones has some cost in terms of area, energy, and delay, as we show later.
In the rest of the article, we use three different technologies for cache design, namely high-performance technology (HPT), low-power technology (LPT), and low-leakage technology (LLT). Each particular technology uses a different operating voltage (Vcc) and different SRAM cells, although our hybrid cache design is not limited to any particular Vcc level, SRAM cell type, or process node. The process node used for this study is 32nm.
HPT is devised to provide very high performance despite its energy consumption. Thus, HPT is optimized for 1V operation (HP mode). The most suitable SRAM cell for such operation mode is 6T, given that all cell types operate properly at such Vcc and 6T SRAM cells provide lower energy, area, and delay than the other cell types.
LPT is devised to provide high performance (lower than HPT), but low dynamic energy. LPT is optimized for 0.7V operation (also HP mode) to trade off between both performance and energy. Results in Gerosa et al. [2009] and Chen et al. [2007] show that conventional 6T SRAM cells are less efficient than 8T ones under iso-robustness conditions at 0.7V due to process variations and noise susceptibility, thus making them unsuitable for such Vcc level. The only way to keep robustness at an acceptable level would be increasing cell size significantly and hence making 6T SRAM cells less attractive than 8T and 10T ones. Among the other cell types considered (8T and 10T), 8T is the most suitable one due to its higher efficiency in terms of energy, area, and performance. Even if 8T cells have one extra bit line and word line with respect to 10T cells, total bit line and word line activity is similar for both 8T and 10T cells. Therefore, this issue is not a disadvantage for 8T cells in terms of dynamic energy.
LLT is devised to minimize the total energy consumption (both dynamic and leakage energy) when performance is not critical. LLT is optimized for 0.35V operation (ULE mode), which is in line with state-of-the-art results [Calhoun and Chandrakasan 2004; Hanson et al. 2006]. The most suitable SRAM cell for such low voltage is 10T due to its high robustness, despite its overheads. 8T SRAM cells could be used if transistor size were increased. However, increasing transistor size for 8T SRAM cells would increase their overheads, making them less attractive than 10T ones. Note that both 8T and 10T SRAM cells have been sized to be robust against process variations at 0.7V and 0.35V, respectively.
In summary, all technologies work at 1V HP mode, but HPT is the most convenient (6T cells). LPT and LLT work at 0.7V HP mode. Among these, LPT is the best choice (8T cells). Only LLT is suitable for ULE mode (10T cells).
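The summary above can be written down as a small lookup. The sketch below is only an illustration of the stated assumptions: the class and function names are hypothetical, and the rule "a cell sized for a given Vcc also works at any higher Vcc" is the reading of the text that the code encodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Technology:
    name: str          # HPT, LPT, or LLT as defined in the text
    cell: str          # SRAM cell type used by the technology
    design_vcc: float  # voltage the cells are sized for (V)


HPT = Technology("HPT", "6T", 1.0)
LPT = Technology("LPT", "8T", 0.7)
LLT = Technology("LLT", "10T", 0.35)
TECHNOLOGIES = (HPT, LPT, LLT)


def usable(vcc):
    """Technologies that operate reliably at vcc, assuming a cell sized for a
    given voltage also works at any higher voltage."""
    return [t for t in TECHNOLOGIES if t.design_vcc <= vcc + 1e-9]


def preferred(vcc):
    """Most efficient usable technology: the one sized for the highest voltage
    that does not exceed vcc."""
    return max(usable(vcc), key=lambda t: t.design_vcc)
```

For example, `usable(0.7)` returns LPT and LLT, and `preferred(0.7)` returns LPT (8T cells), matching the summary. Calling `preferred` with a voltage below 0.35V raises `ValueError`, since no technology considered here is sized for it.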
Proposed Cache Designs
Based on the fact that most L1 caches in existing chips are set associative, we have chosen this type of organization as the target of our study, although significant parts of our study can be easily reused for direct-mapped and fully associative caches. Several hybrid cache designs have been implemented using heterogeneous SRAM cell types at a coarse granularity. Implementation details are presented in Section 4. Two different cache configurations are considered and analyzed.
(1) Nonhybrid. In the first configuration, caches are implemented with the same technology type for all arrays, so we have pure high-performance, low-energy, or ultra-low-energy optimized configurations.
(2) Hybrid Cache Ways. In the second configuration, some cache ways have been implemented with one technology and the rest with another. The minimum voltage level for this configuration is dictated by the cell able to operate at the lowest voltage level. For example, if we consider a 4-way cache configuration with 2 LPT and 2 LLT cache ways, as shown in Figure 3, we can see that operation voltage can be 1V, 0.7V, or 0.35V. 1V provides high-performance operation and all cache ways work; 0.7V provides moderate-performance, low-energy operation and all cache ways still work; and finally, 0.35V provides ultra-low-energy operation where cache ways implemented with 8T SRAM cells must be turned off. During 0.35V operation mode (ULE mode) data processing is expected to be minimal [Szewczyk et al. 2004] and workloads are much smaller than during 0.7V/1V operation mode (HP mode). Workload discrepancy across HP and ULE modes justifies reducing the hardware resources to complete a given computation at ULE mode. Since HPT and LPT ways would experience many faults at NST Vcc and thus would not provide reliable operation, we simply turn them off at ULE mode. This approach is used for all cache ways that do not operate reliably at the current operation mode. Turning off some cache ways may have some impact on performance. However, as long as at least one cache way is turned on, the cache can operate properly. All cache ways are enabled at HP mode in order to use full cache space for high performance. Even LLT ways remain active despite their inefficiency at high Vcc, because they reduce slow and energy-hungry off-chip memory accesses [Maric et al. 2012]. Alternatively, LLT ways could be turned off at HP mode and extra HPT or LPT ways put in place to replace them. However, such an approach substantially increases area and, even in this case, using LLT ways at HP mode can still provide significant energy savings.
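The way-enabling rule for the hybrid configuration can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `HybridCache`, `MIN_VCC`, and the 64-set, 32-byte-line geometry used in the example are assumptions introduced here.

```python
from dataclasses import dataclass

# Lowest voltage (V) at which each way technology is assumed reliable,
# following the technology assumptions stated in the text.
MIN_VCC = {"HPT": 1.0, "LPT": 0.7, "LLT": 0.35}


@dataclass(frozen=True)
class HybridCache:
    ways: tuple       # technology of each way, e.g. ("LPT", "LPT", "LLT", "LLT")
    sets: int
    line_bytes: int

    def active_ways(self, vcc):
        """Indices of the ways kept on at vcc; all other ways are gated off."""
        active = [i for i, tech in enumerate(self.ways) if MIN_VCC[tech] <= vcc]
        if not active:
            raise ValueError(f"no cache way operates reliably at {vcc} V")
        return active

    def capacity_bytes(self, vcc):
        """Guaranteed capacity at vcc: it depends only on the operating mode,
        never on a per-chip fault map."""
        return len(self.active_ways(vcc)) * self.sets * self.line_bytes


# Hypothetical 8KB, 4-way cache with 2 LPT and 2 LLT ways.
cache = HybridCache(("LPT", "LPT", "LLT", "LLT"), sets=64, line_bytes=32)
```

With this example geometry, all four ways (8KB) are available at 1V and 0.7V, and only the two LLT ways (4KB) at 0.35V. Because the active set is a fixed function of the voltage, the associativity and capacity seen by a WCET analysis are known a priori for each mode.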
Note that all configurations provide strong guarantees on available cache size, number of sets, number of ways, etc., at each voltage level considered. Configurations implemented either with a single SRAM cell type or with hybrid cache ways have been evaluated. These cache designs are analyzed in the context of a processor. Performance and energy results are presented in Section 4. 
Changing Operation Mode
The processor, and therefore the cache, is designed to support two distinct operation modes. The software running on top is responsible for deciding when to switch modes based on the observed inputs by using specific instructions. Figure 4 shows the typical code structure for applications to be run on top of these processors. On a ULE-to-HP mode transition, HPT/LPT ways are activated and Vcc is raised. The latency of such an operation depends on the time required to change Vcc and frequency and to activate the cache ways. On an HP-to-ULE mode transition, dirty HPT/LPT cache lines (if any) must first be written back to memory. Then these ways can be deactivated and Vcc and frequency can be scaled down. We consider gated Vdd [Powell et al. 2000] to turn off cache ways. Using a PMOS gated-Vdd transistor significantly reduces the required transistor width, resulting in negligible area and power overheads, as explained in Powell et al. [2000].
In other words, the user or compiler can take advantage of the input activity characteristics through software configuration to switch the mode, but this problem is out of the scope of this article [Truong et al. 2009]. Since HP mode occurs seldom (less than 1% of the time [He et al. 2011; Werner-Allen et al. 2006]), performance and power overheads due to mode switching are expected to be negligible. For example, results in prove that it is reasonable to neglect performance and energy penalty due to mode switching. We refer the reader to section IX.B in to see the implications on energy and performance due to mode switching.
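The transition sequence described in this section can be summarized with a toy model. It is a sketch under simplifying assumptions, not a description of the actual hardware: the `HybridL1` class and its fields are hypothetical, write-backs are merely counted, and the Vcc/frequency change is reduced to updating a mode flag.

```python
class HybridL1:
    """Toy model of the mode-switch sequence; not a cycle-accurate simulator."""

    def __init__(self, way_is_ule):
        # way_is_ule[i] is True for ULE (NST-capable) ways, False for HP ways.
        self.way_is_ule = list(way_is_ule)
        self.enabled = [True] * len(self.way_is_ule)
        self.dirty = [set() for _ in self.way_is_ule]  # dirty line indices per way
        self.mode = "HP"
        self.writebacks = 0

    def write(self, way, line):
        """Mark a line dirty in an enabled way."""
        if not self.enabled[way]:
            raise RuntimeError("way is gated off")
        self.dirty[way].add(line)

    def to_ule(self):
        """HP -> ULE: write back dirty HP lines, gate the HP ways, then scale
        Vcc and frequency down."""
        for way, is_ule in enumerate(self.way_is_ule):
            if not is_ule:
                self.writebacks += len(self.dirty[way])  # flush before state is lost
                self.dirty[way].clear()
                self.enabled[way] = False                # gated Vdd cuts the supply
        self.mode = "ULE"                                # voltage/frequency lowered last

    def to_hp(self):
        """ULE -> HP: raise Vcc and frequency and re-enable the HP ways, which
        come back empty in this model."""
        self.mode = "HP"
        self.enabled = [True] * len(self.way_is_ule)
```

Note that ULE ways keep their contents across both transitions, so only dirty lines held in HP ways contribute to the switching cost in this model.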
EVALUATION
In this section, we present the evaluation framework and results in terms of performance, energy, and area for different L1 cache designs. First, we describe the evaluation methodology used to obtain results for the proposed cache designs. Then, we present and discuss results for the whole chip when such cache designs are deployed in a single-core processor.
Methodology
Next we describe the cache and processor modeling methodologies, the operating modes considered, and the benchmarks used.
4.1.1. Cache Modeling. L1 cache memories have been modeled using CACTI 6.5, a flexible and accurate cache delay, energy, power, and area simulator [Muralimanohar et al. 2009 ]. The technology node considered is 32nm. To support different operating modes, we have extended the CACTI tool with models of other alternative SRAM cells (e.g., 8T and 10T) that are able to operate at low and NST voltages. Given that 6T SRAM cells are already modeled in the CACTI tool, we have extended it with delay, power, and area models for 8T and 10T SRAM cells by adapting capacitances, resistances, and geometry. Here, we provide more details about dynamic and leakage power models for caches including 8T and 10T SRAM cells.
The dynamic power model for 10T SRAM cells is analogous to that already implemented for 6T ones. In fact, implementation of the 10T cells is quite similar to 6T ones due to its fully differential architecture. However, the architecture of the 8T SRAM cell is not differential because of the separated write and read ports. Extra bit lines are modeled by including the proper capacitances and resistances in parameter calculations. These values depend on the transistor sizes, which have been chosen to provide high bitcell stability and operational reliability at low supply voltage [Jain and Agarwal 2006; Kulkarni et al. 2007]. Sense amplifiers are single-ended in the case of 8T SRAM cells due to the single bit line for the read operation. For the sake of simplicity, we use the same differential sense amplifier suitable for 6T and 10T SRAM cells with a reference voltage tied to one input [Verma and Chandrakasan 2008]. The reference voltage is set sufficiently below Vcc to sense the read of a logic 1 correctly. Increased delay and bit-line swing when sensing a logic 0 is recovered by optimal transistor sizing of the stacked read buffer of the 8T cells (T3 and T4 in the 8T cell in Figure 2).
Following the same philosophy for dynamic power modeling that is already used in CACTI, our model tracks the physical capacitance of each stage of the cache model and calculates dynamic power consumed at each stage. Basically, cache dynamic power dissipation is comprised of word-line capacitance dissipation, bit-line capacitance dissipation, and short-circuit power consumption. Since capacitance plays an important role for dynamic power, we take into account the following capacitances: parasitic capacitances of transistors in the SRAM cell, capacitances of the access transistors, capacitance of a precharge transistor, and capacitances of a column select transistor and a word-line driver. Capacitances of the word-line/bit-line wires and wires in decoders are modeled as a distributed RC network.
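As a rough illustration of capacitance-based accounting, the sketch below sums the switching energy of the word line and the bit lines for one read access. It is not CACTI code: the function names, the lumped capacitances, and the treatment of short-circuit power as a fixed fraction of switching energy are all assumptions made here for brevity.

```python
def switching_energy(cap_farads, vcc, swing=None):
    """Energy drawn from the supply to charge a node: C * Vcc * swing (J).
    A full-swing node (swing == vcc) gives the familiar C * Vcc^2."""
    if swing is None:
        swing = vcc
    return cap_farads * vcc * swing


def read_access_energy(vcc, c_wordline, c_bitline, n_bitlines,
                       bitline_swing, short_circuit_fraction=0.1):
    """Energy of one read: one word line plus n_bitlines partially swinging
    bit lines, scaled by an assumed short-circuit overhead."""
    e_wl = switching_energy(c_wordline, vcc)
    e_bl = n_bitlines * switching_energy(c_bitline, vcc, bitline_swing)
    return (e_wl + e_bl) * (1.0 + short_circuit_fraction)


def dynamic_power(energy_per_access, accesses_per_second):
    """Average dynamic power (W) for a given access rate."""
    return energy_per_access * accesses_per_second
```

When every node swings rail to rail, the result scales with the square of Vcc, so moving from 1V to 0.35V reduces the energy per access to about 12% of its 1V value in this simplified model.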
Given that the impact of process variations is high for the 32nm technology node, especially at NST Vcc (ULE mode), our cache leakage power model is updated to take into account process variations. We model random within-die variations in threshold voltage (V_TH) using the analytical model proposed in Narendra et al. [2002]. In particular, the cache is decomposed into smaller building blocks and total leakage power is the sum of leakage power in each block. We refer the reader to Narendra et al. [2002] for more details.
Each SRAM cell is sized by using the analysis based on importance sampling proposed by Chen et al. [2007] assuming 6σ random variations in V_TH for high (1V), low (0.7V), and ultra-low voltage (0.35V), respectively, considering read, write, and hold failures in the 32nm technology node. 8T and 10T SRAM cells are sized to match the same failure rate when operating at low (0.7V) and ultra-low voltage (0.35V), respectively, as for the 6T cells when operating at high voltage (1V) targeting a 99.9% cache yield. The impact in terms of area has also been considered for 8T and 10T SRAM cells and their associated circuitry. The smallest rectangle where the cache fits is chosen in area calculation to keep layout regularity.
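To give a feel for what a 99.9% cache yield implies per cell, the sketch below uses the textbook relation yield = (1 - p)^N. This is an assumption introduced here, not the article's sizing procedure: it presumes independent cell failures, no redundancy or error correction, and a hypothetical 8KB data array in the example.

```python
import math


def cache_yield(p_cell_fail, n_cells):
    """Probability that all n_cells work, assuming independent failures."""
    return math.exp(n_cells * math.log1p(-p_cell_fail))


def max_cell_failure_prob(target_yield, n_cells):
    """Largest per-cell failure probability that still meets target_yield."""
    return -math.expm1(math.log(target_yield) / n_cells)


# Hypothetical example: 8KB of data bits only (tags and status bits ignored).
N_BITS = 8 * 1024 * 8
P_MAX = max_cell_failure_prob(0.999, N_BITS)
```

For this example, `P_MAX` is on the order of 1e-8 per cell, which shows why failure rates this small are estimated with techniques such as importance sampling rather than plain Monte Carlo simulation.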
Several hybrid cache microarchitectures have been implemented using heterogeneous SRAM cell types at a coarse granularity, as explained in Section 3. For example, hybrid cache designs where different cache ways are implemented with different SRAM cell types are allowed. Also, cache tag or data words can be extended with some additional bits (e.g., check bits, valid bits, etc.), considering their area, delay, and power impact.
HSPICE validation. The accuracy of the power model implemented in CACTI is validated using HSPICE. We have first created the 10T SRAM cell model using the low-power 32nm Predictive Technology Model (PTM) [Zhao and Cao 2006] with nominal threshold voltages for NMOS and PMOS transistors of VTHn = 350mV and [...] On the other hand, SRAM cell simplification drastically reduces simulation time, which would otherwise be in the range of many hours (or even days). We have compared dynamic and leakage power and read/write access time of the SRAM array with the corresponding SRAM array in CACTI. We have created digital input vectors for several clock cycles (frequencies in Table I for HP and ULE mode) to perform SRAM array write and read operations. We exercise the array with a reasonable number (i.e., 200) of different write and read patterns in order to measure read, write, and short-circuit 4 dynamic and leakage power. Finally, we have measured read and write access times for the created SRAM array. Read access time is measured as the difference between the time the address bit's voltage reaches Vdd/2 and the time the output (32-bit data value) of the read buffer reaches 90% of its final value. Write access time, in turn, is measured as the difference between the time the address bit's voltage reaches Vdd/2 and the time the voltages of the bit nodes inside the cell reach 90% of their final values. The values obtained are averaged and used as the results of the HSPICE simulations.
The values obtained from CACTI for the corresponding SRAM array are accurate to within 7% (on average) of the reference HSPICE models for all metrics considered. The maximum error observed is 18%. The main reason for this discrepancy is that the analytical models for leakage power [Narendra et al. 2002] used in CACTI rely on some empirical assumptions in their equations for leakage current.
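The access time definition above can be illustrated with a short post-processing sketch over sampled waveforms. This is only an illustration under stated assumptions: both waveforms are rising, samples are linearly interpolated, and the last sample is taken as the final value. The function names and the sample data are hypothetical and are not taken from the HSPICE decks.

```python
def first_crossing(times, volts, level):
    """Interpolated time at which a rising waveform first reaches `level`."""
    for i in range(1, len(times)):
        v0, v1 = volts[i - 1], volts[i]
        if v0 < level <= v1:
            frac = (level - v0) / (v1 - v0)
            return times[i - 1] + frac * (times[i] - times[i - 1])
    raise ValueError("waveform never reaches the requested level")

def access_time(times, addr_volts, out_volts, vdd):
    """Delay from the address reaching Vdd/2 to the output reaching 90% of its final value."""
    t_start = first_crossing(times, addr_volts, vdd / 2)
    # Assumption: the last sample is the settled (final) output value.
    t_end = first_crossing(times, out_volts, 0.9 * out_volts[-1])
    return t_end - t_start

# Made-up waveforms (time in ns, voltage in V) for illustration only.
t = [0.0, 1.0, 2.0, 3.0, 4.0]
addr = [0.0, 0.5, 1.0, 1.0, 1.0]
out = [0.0, 0.0, 0.0, 0.5, 1.0]
print(access_time(t, addr, out, vdd=1.0))
```

The same function covers the write case if the bit-node voltage inside the cell is passed as the output waveform.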
According to our simulations, the access (read/write) time variation between 6T, 8T, and 10T SRAM cells is less than 3% at 1V, so the impact of using different SRAM cell sizes at HP mode on total delay is negligible 5 .
4.1.2. Processor Modeling.
We have chosen a very simple processor architecture with one core and in-order execution, because energy efficiency and design cost are the main drivers for the ultra-low-cost market segment. Our processor configuration resembles a recently fabricated Intel® processor for hybrid Vcc operation, although it is not suited for the ultra-low-cost market [Jain et al. 2012] (see Table I). Both on-chip L1 data (DL1) and instruction (IL1) caches implement the proposed designs.
In order to understand the impact of different cache designs on the whole chip, we have incorporated our custom-modified CACTI tool into a full-chip simulator. As the full-chip simulator, we have used MPsim [Acosta et al. 2009 ], an enhanced version of SMTSim [Tullsen et al. 1995] extended with power models analogous to those of Wattch [Brooks et al. 2000 ], but using our enhanced CACTI version to model all SRAM structures 6 (caches, memory, TLB, etc.). All SRAM arrays except L1 caches have been implemented using 10T SRAM cells so they operate properly at any voltage level considered.
Off-chip memory is also modeled. The relative memory latency is low, given the low speed of the core, the small memory size (typically a few MBs), and its high integration with the processor itself. We have studied the impact of varying memory latency between 10 and 100 cycles at different modes, but no meaningful variation is observed in any metric. Therefore, we omit the details of this study. Memory power and energy are measured with CACTI and these values are included in our results.
4.1.3. Operating Modes. Our system has three distinct operating modes: HP-HPT, HP-LPT, and ULE modes. HP-HPT corresponds to the HP mode implemented with HPT, whereas HP-LPT corresponds to that implemented with LPT. Further, ULE mode is always implemented with LLT. Thus, we have set Vcc to 1V, 0.7V, and 0.35V for HP-HPT, HP-LPT, and ULE modes, respectively. Operating at different voltage levels requires different operating frequencies for each voltage. Thus, we have set operating frequencies to 1GHz for HP-HPT, 300MHz for HP-LPT, and 5MHz for ULE, which is in line with state-of-the-art results [Zhai et al. 2009; Chen et al. 2007; Chen et al. 2006; Jain et al. 2012 ].
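The three operating points can be captured in a small table, which also makes the gap in cycle time between modes explicit. This is a minimal sketch using only the voltages and frequencies stated above; the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OperatingMode:
    name: str
    vcc_volts: float
    freq_hz: float

    @property
    def cycle_time_ns(self) -> float:
        return 1e9 / self.freq_hz

# Voltages and frequencies as stated in the text.
MODES = {
    "HP-HPT": OperatingMode("HP-HPT", 1.0, 1e9),
    "HP-LPT": OperatingMode("HP-LPT", 0.7, 300e6),
    "ULE": OperatingMode("ULE", 0.35, 5e6),
}

def frequency_ratio(fast: str, slow: str) -> float:
    """How many times longer a cycle lasts in `slow` than in `fast`."""
    return MODES[fast].freq_hz / MODES[slow].freq_hz

for mode in MODES.values():
    print(f"{mode.name}: {mode.vcc_volts}V, {mode.cycle_time_ns:.2f}ns per cycle")
print(frequency_ratio("HP-HPT", "ULE"))
```

The frequency ratio between HP-HPT and ULE modes is 200, which gives a first-order estimate of the execution time gap between both modes when cycle counts are similar.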
4.1.4. Benchmarks.
To the best of our knowledge, a set of benchmarks specific for the domain that we target does not exist. We have chosen MediaBench [Lee et al. 1997] because it fits very well the expected needs of the ultra-low-cost segment: abundant data processing during HP-HPT/HP-LPT mode, and relatively small workloads at ULE mode [Szewczyk et al. 2004; He et al. 2011; Werner-Allen et al. 2006]. For instance, sensor applications that monitor wind, sea level, temperature, tsunamis, etc., should be data intensive at HP-HPT/HP-LPT mode while the amount of data to be processed at ULE mode should be much smaller. We classify benchmarks into two categories depending on the cache requirements: (i) SmallBench, where workloads fit into very small cache sizes (e.g., 1KB) due to small data volume (adpcm c, adpcm d, epic c, and epic d) and (ii) BigBench, where larger cache space is required to fit the workload due to large data volume (g721 c, g721 d, gsm c, gsm d, mpeg2 c, and mpeg2 d). SmallBench benchmarks are used during ULE operation whereas BigBench ones are used during HP-HPT and HP-LPT operation.
5 Access time variation between 8T and 10T SRAM cells at 0.7V is less than 5%.
6 We have used the philosophy of microarchitectural power modeling deployed in Wattch, which is based on counting the accesses to different SRAM structures. Such philosophy is regarded as accurate enough as it has been used recently to develop, for instance, the power model of the IBM Power 7 processor [Floyd et al. 2011; Huang et al. 2012].
Due to the lack of specific benchmarks resembling full applications for the target domain, we have built an artificial application in order to report overall energy savings for the whole application lifetime. Essentially, we built the application following the scheme of a typical application: an infinite loop executing the "ULE mode code", a condition check to enter HP mode, execution of the "HP mode code", and a return to the ULE mode routine with the infinite loop, as shown in Figure 4. We assume in our application that the "ULE mode code" is a program from the SmallBench benchmark suite whereas the "HP mode code" is a program from the BigBench set. Note that we did not model the transitions between HP mode and ULE mode. However, the performance impact of such transitions is negligible because they occur seldom [Szewczyk et al. 2004; He et al. 2011; Werner-Allen et al. 2006]. In order to report overall energy for the whole application lifetime, we take average energy results for SmallBench programs and assume they execute 99% or 99.9% of the total application lifetime. We also average results for BigBench programs and assume they execute 1% or 0.1% of the total application lifetime.
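The lifetime weighting can be sketched as follows. This is a simplified illustration, not the simulator's accounting: mode transitions are ignored (as in our model), and the energy-per-time-unit values used in the example are placeholders chosen only to show the mechanics, not measured results. The function name is hypothetical.

```python
def lifetime_energy_shares(hp_energy_rate, ule_energy_rate, duty_cycle):
    """Fraction of lifetime energy spent in HP and ULE modes.

    hp_energy_rate, ule_energy_rate: average energy per time unit in each mode
    duty_cycle: fraction of the application lifetime spent in HP mode
    """
    e_hp = hp_energy_rate * duty_cycle
    e_ule = ule_energy_rate * (1.0 - duty_cycle)
    total = e_hp + e_ule
    return e_hp / total, e_ule / total

# Placeholder rates: HP assumed to draw 100x more energy per time unit than ULE.
for duty in (0.01, 0.001):
    hp_share, ule_share = lifetime_energy_shares(100.0, 1.0, duty)
    print(f"duty cycle {duty:.1%}: HP {hp_share:.1%}, ULE {ule_share:.1%}")
```

With a hypothetical 100x gap in energy per time unit, both modes contribute comparably at a 1% duty cycle, whereas ULE mode dominates at a 0.1% duty cycle.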
4.2. Processor Evaluation
In this section, we present the analysis of the whole processor. All results correspond to the baseline configuration presented in Table I . We have run simulations on the described processor when using different nonhybrid and hybrid caches at different voltage levels (HP-HPT, HP-LPT, and ULE modes). Cache configurations considered must provide full cache space at HP mode (HP-HPT or HP-LPT) and at least one cache way at ULE mode. Therefore, 6T+8T and 6T+10T configurations are not considered at HP-LPT. The cache configurations considered are as follows:
-HP-HPT: 6T, 8T, 10T, 6T+8T, 6T+10T, and 8T+10T;
-HP-LPT: 8T, 10T, and 8T+10T; and
-ULE: 10T, 6T+10T, and 8T+10T.
First, we evaluate different cache configurations and show performance, energy, and power results. Along with this, we study the impact of cache size on performance, energy, and power. Then we study total on-chip energy distribution in terms of dynamic and leakage energy for different cache configurations at different operation modes. Finally, we compare performance, energy, and power across different operation modes for all hybrid and nonhybrid cache configurations considered. As stated before, we use SmallBench benchmarks at ULE mode whereas BigBench ones are used at HP-HPT and HP-LPT modes for all experiments.
4.2.1. Metrics. In order to provide meaningful results, we consider the following metrics for each benchmark:
-execution time;
-dynamic energy per instruction;
-leakage power per cycle; and
-energy-delay product (EDP).
In EDP calculation, we consider total energy, which is the sum of dynamic and leakage energy.
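As a minimal sketch of this calculation, EDP can be computed from the dynamic energy of a run, its average leakage power, and its execution time. The function name, argument units, and the example numbers are assumptions for illustration only.

```python
def energy_delay_product(dynamic_energy_j, leakage_power_w, exec_time_s):
    """EDP using total energy, i.e., dynamic energy plus leakage energy."""
    # Leakage is drawn for the whole run, so it scales with execution time.
    leakage_energy_j = leakage_power_w * exec_time_s
    total_energy_j = dynamic_energy_j + leakage_energy_j
    return total_energy_j * exec_time_s

# Illustrative numbers only: 2J dynamic, 0.5W leakage, 4s run.
print(energy_delay_product(2.0, 0.5, 4.0))
```

Because leakage energy itself grows with execution time, a slower mode is penalized twice in EDP: once through the leakage term and once through the delay factor.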
4.2.2. Results for Different Cache Configurations.
Fig. 5. Normalized average execution time for 8-way nonhybrid and 7+1 hybrid cache ways designs at HP-HPT, HP-LPT, and ULE modes when varying cache size. Note that cache size at ULE mode for 6T+10T and 8T+10T hybrid designs is 512B, 1KB, and 2KB, respectively, due to disabled 6T and 8T ways.
Next, we present results in terms of execution time, dynamic energy per instruction, leakage power per cycle, and EDP for the processor configuration presented in Table I when using different cache configurations. We have chosen a 4KB/8KB/16KB 8-way cache for all nonhybrid configurations. In the case of hybrid designs, a 7+1 hybrid configuration has been considered, where 7 ways are implemented with either HPT or LPT and 1 way with LLT. At HP-HPT/HP-LPT mode, all cache ways are enabled and a 4KB/8KB/16KB cache size is available, whereas at ULE mode 7 ways are turned off, thus providing 512B/1KB/2KB of cache space. We have studied other hybrid configurations with different associativities and number of HPT/LPT/LLT ways at different operation modes, but relative trends hold across different configurations and no meaningful variation is observed in any metric, so we omit details for those configurations. Since execution time varies noticeably across benchmarks, results have been normalized with respect to the 4KB pure 6T, 8T, and 10T cache configurations at HP-HPT, HP-LPT, and ULE mode, respectively, for each benchmark.
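The cache space available in each mode follows directly from the number of enabled ways. A minimal sketch, with a hypothetical function name, assuming all ways have the same capacity:

```python
def available_cache_bytes(total_bytes, total_ways, enabled_ways):
    """Cache capacity that remains when only `enabled_ways` ways are powered."""
    if total_bytes % total_ways != 0:
        raise ValueError("capacity must divide evenly across ways")
    if not 0 <= enabled_ways <= total_ways:
        raise ValueError("enabled_ways out of range")
    return total_bytes // total_ways * enabled_ways

# 7+1 hybrid configuration: 8 ways at HP mode, 1 way at ULE mode.
for size in (4096, 8192, 16384):
    hp = available_cache_bytes(size, 8, 8)
    ule = available_cache_bytes(size, 8, 1)
    print(f"{size}B cache: {hp}B at HP mode, {ule}B at ULE mode")
```

Since the number of enabled ways is fixed per mode, the available capacity is known statically, which is what makes the performance behavior of the hybrid designs deterministic.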
Note that the operating frequency is identical for all configurations under the same operation mode (i.e., HP-HPT, HP-LPT, or ULE), despite the fact that caches may have different latencies. However, adapting the frequency to the cache latency would require redesigning all the remaining components to fit the new cycle time. Thus, we have decided to keep the same frequency across configurations. Similarly, to avoid further sources of variation in the results, all SRAM arrays except L1 caches have been implemented with 10T SRAM cells, so they operate properly at any voltage level. Note that L1 caches are the main energy consumers in our simple core. Thus, varying the type of SRAM cells used at the same mode in the core has little impact on the trends observed (i.e., power).
Relative energy and performance for different benchmarks (BigBench suite for HP-HPT and HP-LPT modes and SmallBench suite for ULE mode) are quite similar, thus indicating that the impact of the cache configuration is not particularly dependent on the program run. This is so because caches are the main energy consumers and access frequency is not drastically different across benchmarks, so the effects on different sources of energy on each benchmark are relatively similar. Therefore, results are presented in the form of normalized average across benchmarks.
Figures 5, 6, 7, and 8 depict normalized average results when running BigBench benchmarks at HP-HPT and HP-LPT modes and SmallBench benchmarks at ULE mode. We have observed the following for each metric.
Fig. 6. Normalized average dynamic energy per instruction for 8-way nonhybrid and 7+1 hybrid cache ways designs at HP-HPT, HP-LPT, and ULE modes when varying cache size. Note that cache size at ULE mode for 6T+10T and 8T+10T hybrid designs is 512B, 1KB, and 2KB, respectively, due to disabled 6T and 8T ways.
Fig. 7. Normalized average leakage power per cycle for 8-way nonhybrid and 7+1 hybrid cache ways designs at HP-HPT, HP-LPT, and ULE modes when varying cache size. Note that cache size at ULE mode for 6T+10T and 8T+10T hybrid designs is 512B, 1KB, and 2KB, respectively, due to disabled 6T and 8T ways.
Fig. 8. Normalized average energy-delay product (EDP) for 8-way nonhybrid and 7+1 hybrid cache ways designs at HP-HPT, HP-LPT, and ULE modes when varying cache size. Note that cache size at ULE mode for 6T+10T and 8T+10T hybrid designs is 512B, 1KB, and 2KB, respectively, due to disabled 6T and 8T ways.
(1) Execution Time. Normalized execution time for all configurations at HP-HPT and HP-LPT modes is similar because operating at such modes does not require turning off any cache way, so all the ways for all configurations are always turned on. There is a small degradation in normalized execution time at ULE mode (2% on average) due to the smaller cache size in 6T+10T/8T+10T hybrid configurations (6T/8T ways are disabled), leading to more memory accesses to serve extra misses. Also, relative trends hold across different cache sizes. Speedup for 16KB caches with respect to 4KB ones (e.g., 7.8% at HP-HPT) is just slightly better than that of 8KB caches (e.g., 6% at HP-HPT) across different operation modes, so we consider 8KB caches as the best choice.
It can be concluded that the proposed hybrid designs exhibit basically the same behavior as nonhybrid ones in terms of performance (execution time) at each different operation mode.
(2) Dynamic Energy per Instruction. Using 8T and 10T SRAM cells instead of 6T ones increases dynamic energy per instruction at HP-HPT mode due to additional transistors in their architectures. For nonhybrid 8T and 10T caches, dynamic energy increases more than 1.5x and 2.0x, respectively. In the case of hybrid designs, values resemble very closely the sum of values for nonhybrid configurations weighted by the fraction of cache space devoted to each particular technology type (i.e., number of cache ways devoted to one technology, divided by the total number of cache ways). Therefore, hybrid designs introduce some overheads at HP-HPT mode (e.g., 14% for the 6T+10T hybrid cache). At HP-LPT mode, pure 10T caches have larger overhead (around 16% on average) with respect to the pure 8T designs than hybrid 8T+10T designs (2.8% on average). However, at ULE mode, our hybrid 6T+10T/8T+10T designs achieve large savings (up to 86%) with respect to the pure 10T designs due to turning off 6T/8T ways.
Dynamic energy per instruction grows quite linearly with cache size, mainly due to the increased number of bit lines to discharge. This trend is similar for all cache configurations across different cache sizes. Finally, the relative increase in dynamic energy per instruction across different cache sizes is larger at HP-HPT and HP-LPT modes than at ULE mode because dynamic energy is the dominant component in total energy at 1V and 0.7V.
It can be concluded that hybrid designs introduce some overheads in terms of dynamic energy per instruction relative to the 6T and 8T nonhybrid designs at HP-HPT/HP-LPT mode. However, 6T and 8T nonhybrid designs cannot be used at ULE mode, so our hybrid caches are the best choices. On the other hand, the proposed hybrid designs achieve significant savings at ULE mode with respect to pure 10T caches, where energy is the primary concern.
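The way-weighted estimate described for hybrid designs can be written down as a short sketch. The relative energies used below are the lower bounds quoted in the text (8T more than 1.5x and 10T more than 2.0x the 6T value at HP-HPT, and 10T around 16% above 8T at HP-LPT), so the results are rough estimates rather than simulated values. The function name is hypothetical.

```python
def hybrid_relative_energy(ways_by_cell, energy_by_cell):
    """Way-weighted average of per-cell-type relative dynamic energy."""
    total_ways = sum(ways_by_cell.values())
    weighted = sum(n * energy_by_cell[cell] for cell, n in ways_by_cell.items())
    return weighted / total_ways

# HP-HPT mode, normalized to a pure 6T cache (lower-bound factors from the text).
hpt = hybrid_relative_energy({"6T": 7, "10T": 1}, {"6T": 1.0, "10T": 2.0})
# HP-LPT mode, normalized to a pure 8T cache.
lpt = hybrid_relative_energy({"8T": 7, "10T": 1}, {"8T": 1.0, "10T": 1.16})
print(f"6T+10T at HP-HPT: {hpt:.3f}x, 8T+10T at HP-LPT: {lpt:.3f}x")
```

The estimates (12.5% and 2% overhead) are close to, and slightly below, the reported 14% and 2.8%, which is consistent with the weighting argument given that the 10T factor at HP-HPT is a lower bound.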
(3) Leakage Power per Cycle. Relative trends for leakage power per cycle at HP-HPT mode across all configurations resemble quite closely those for dynamic energy per instruction. On the other hand, there is a larger variation across configurations at HP-LPT mode. For instance, pure 10T caches exhibit larger overhead (42% on average) with respect to the pure 8T designs than hybrid 8T+10T caches (8% on average). At ULE mode, large savings are achieved (up to 78% with respect to the pure 10T designs) with our hybrid 6T+10T/8T+10T designs due to turning off 6T/8T ways.
Leakage power per cycle grows with cache size. Leakage does not grow much when moving from 4KB to 8KB, since efficient SRAM array arrangements are found, thus keeping bit-line leakage low. However, the array arrangement determined automatically by CACTI for 16KB caches is not that efficient and thus leakage grows noticeably. This trend, however, is different for 6T and 8T arrays when compared to 10T ones, whose leakage is relatively higher for 8KB caches as shown at the ULE mode.
(4) EDP. Relative trends for EDP across configurations and cache sizes at the same mode follow the trends observed for total energy. This is because execution time does not vary at HP-HPT and HP-LPT modes while such variation at ULE mode is negligible (2% on average). The relative values for EDP at HP-HPT and HP-LPT modes have almost the same trend as those observed for dynamic energy per instruction, because dynamic energy dominates total energy when operating at such modes. As opposed to the case of HP-HPT and HP-LPT mode, leakage energy is the dominant factor at ULE mode. We observe that relative trends for EDP at ULE mode highly correlate with those reported for leakage power per cycle.
Overall, this set of experiments shows that energy consumption for the whole processor is highly dependent on the particular cache configuration used for L1 data and instruction caches. Different benchmarks do not introduce noticeable variations in any metric. Based on the results, we conclude that an 8KB, 7+1 hybrid cache configuration is the most convenient choice. We also conclude that, as expected, dynamic energy is the dominant factor at HP-HPT and HP-LPT modes, whereas leakage is the dominant factor at ULE mode. We show that the proposed hybrid designs achieve exactly the same average performance as nonhybrid ones at HP-HPT/HP-LPT mode at the expense of a small dynamic energy and leakage power overhead. Also, we show that the proposed hybrid caches have significantly smaller dynamic energy and leakage power at ULE mode than nonhybrid ones. Moreover, we show that our hybrid cache designs (e.g., 6T+10T/8T+10T) can efficiently and reliably operate across a wide range of voltages (modes), consuming ultra-low energy at ULE mode as well as providing the high performance needed at HP-HPT/HP-LPT mode, as required for our target market. Finally, our hybrid caches provide deterministic performance behavior since each operation mode leads to a deterministic cache size, thus enabling the strong performance guarantees needed for running critical applications on top.
4.2.3. Energy Breakdown. The next set of results focuses on the total on-chip energy 7 and its distribution in terms of dynamic and leakage energy when varying supply voltage for all nonhybrid and hybrid configurations considered. Since cache memories are the main energy consumers in our target scenarios, we break down energy into the following categories: L1 cache dynamic energy for data and instructions (EdynL1), L1 cache leakage energy for data and instructions (EleakL1), dynamic energy for the rest of the chip (Edyn no-L1), and leakage energy for the rest of the chip (Eleak no-L1). Results for some configurations using 6T and 8T cells have been omitted since they do not provide further insights. In particular, 8T-based designs at HP-HPT mode are always worse than 6T-based ones, thus omitted. Similarly, 6T-based ones are unsuitable for HP-LPT operation because 6T ways must be turned off. Finally, 8T-based designs at ULE mode achieve almost identical results to 6T-based ones, so only 6T ones are shown.
Figures 9(a), 9(b), and 9(c) show the on-chip energy breakdown for all configurations when operating at different voltage levels. Most of the energy corresponds to dynamic energy at HP-HPT mode as shown in Figure 9 (a). In particular, 75% of the energy consumed is dynamic energy on average across the different configurations. Most dynamic energy corresponds to L1 caches, whose contribution to the total chip energy is always above 50%. Thus, L1 caches' dynamic energy is the main energy contributor at HP-HPT mode. Interestingly, cache energy contribution is lower for those configurations where 6T cells are used. This is because the absolute energy of the cache is lower whereas the rest of the chip energy remains nearly constant. As explained before, 6T SRAM cells are the preferred choice for HP-HPT operation. Note also that the 6T+10T energy breakdown is highly similar to that of the pure 6T, because only one cache way in 6T+10T is made of 10T SRAM cells.
As supply voltage is decreased, dynamic energy contribution is lower and leakage energy increases. This trend is shown when operating at HP-LPT mode (Figure 9(b) ). In this case, dynamic energy is around 60% of the total energy, most of it due to L1 caches. We observe that 8T SRAM cells consume less energy than 10T ones at HP-LPT, thus reducing the fraction of both dynamic and leakage energy devoted to the cache when compared to 10T-based configurations. For instance, cache energy is lower for the 8T configuration than for the 8T+10T one, which is still lower than that for the 10T one. Again, the energy breakdown of the hybrid configuration (8T+10T) resembles much more that of the pure 8T one than the one of the pure 10T cache.
As expected, operating at NST voltage leads to higher contribution of leakage energy and significant dynamic energy decrease. Moreover, as opposed to HP-HPT and HP-LPT operation where 8KB caches are in place, ULE operation has only 1KB caches for the 6T+10T configuration. This effect is shown in Figure 9 (c), where leakage is around 70%-75% of the total energy. L1 cache dynamic energy is still significant, although not as much as leakage. As shown, dynamic energy is only around 25% of the total energy at ULE mode and most of such energy corresponds to the L1 caches. The pure 10T configuration incurs significant energy overheads due to its larger cache size (8KB) and obtains negligible performance gains. This effect translates into a much higher contribution of L1 caches in the energy breakdown.
4.2.4. Impact of Voltage. Finally, we compare all metrics of interest across different voltage levels in Figure 10 . All configurations considered at HP-HPT, HP-LPT, and ULE modes are normalized with respect to the pure 6T designs operating at HP-HPT mode. Figure 10 plots execution time, dynamic energy, leakage power, total energy, and EDP. All the values reported correspond to the average of all programs run (BigBench for HP-HPT and HP-LPT modes, and SmallBench for ULE mode).
It can be noticed easily that the execution time increases drastically when voltage is decreased. The main reason for such behavior is the difference in operating frequencies at different operation modes. For instance, execution time on average at ULE mode is around 210x higher than that at HP-HPT mode.
Operating at HP-LPT delivers around 50% dynamic energy per instruction reduction for all configurations. Reduction at ULE mode is around 9x for the configuration where all cache ways work normally (pure 10T), whereas such reduction is around 83x for our hybrid caches (8T+10T and 6T+10T) due to disabled 6T/8T ways.
Leakage power per cycle at HP-HPT and HP-LPT modes exhibits smaller variations across all configurations than for dynamic energy per instruction due to the small impact of the lowered voltage (from 1V to 0.7V) on leakage. Reduction at HP-LPT mode is around 33%. As expected, trends observed for leakage power per cycle at ULE mode are similar to those observed for dynamic energy per instruction at ULE mode.
In terms of EDP, ULE mode is the least interesting design point. Although dynamic energy and leakage power savings at ULE mode are around 10x for 10T designs and 85x for hybrid 8T+10T/6T+10T designs, delay increases more than 200x and therefore EDP is dramatically increased (up to 40x for pure 10T and up to 5x for hybrid 8T+10T/6T+10T designs). However, the main concern at ULE mode is the total energy, which is drastically reduced as shown in Figure 10.
4.2.5. Overall Energy Savings. In this section, we report overall energy savings for the whole lifetime of the artificial application that we described in Section 4.1.4. We have selected the proposed 6T+10T hybrid cache and, as a representative baseline, the nonhybrid 10T cache, and compared their total energy for two different scenarios:
-duty cycle of 1%: ULE mode lasts for 99% of the total application lifetime (i.e., HP mode lasts for 1% of the time), and
-duty cycle of 0.1%: ULE mode lasts for 99.9% of the total application lifetime (i.e., HP mode lasts for 0.1% of the time).
Basically, we take average energy results for SmallBench programs and assume that they execute 99% or 99.9% of the application lifetime, and we take average results for BigBench programs and assume that they execute 1% or 0.1% of the application lifetime, respectively. Figure 11 plots the energy breakdown for the whole application lifetime for 10T-based and hybrid 6T+10T cache designs. Besides the dramatic energy reduction of the proposed hybrid 6T+10T cache with respect to the baseline (already explained in the previous sections), we observe that both operation modes (HP and ULE mode) have a nonnegligible contribution to the total energy during the application lifetime.
-HP mode: Its energy consumption per time unit is around 2 orders of magnitude higher than that at ULE mode.
-ULE mode: Its energy consumption per time unit is much lower than that at HP mode, but most of the time is spent in ULE mode.
Figure 11 shows that the energy contribution of each operation mode highly depends on the duty-cycle values. The contribution of ULE mode to total energy is more significant than that of HP mode when the duty cycle is below 1%. However, if the duty cycle is around 1%, HP mode energy consumption is the largest contributor for the 6T+10T cache. Therefore, we can conclude that energy reduction is very important and cannot be neglected, even at HP mode where performance is the primary concern.
4.2.6. Comparison with Existing Approaches. Next, we present a detailed comparison between our proposed hybrid caches and existing cache designs with deterministic behavior [Zhou et al. 2010; Ghasemi et al. 2011]. Comparison is quantitatively done in terms of performance (total execution time), total processor energy per instruction (EPI), and cache area. In order to do a fair comparison, we have implemented these designs for both DL1 and IL1 caches. All caches have been designed to have the same yield at ULE mode (i.e., 99.9%). Our baseline is the 7+1 hybrid 6T+10T cache.
We have implemented the designs proposed by Zhou et al. [2010], where all 8 cache ways use 6T cells and are protected with Single Error Correction Double Error Detection (SECDED) codes [Chen and Hsiao 1984] at cache word (32 bits) granularity 8. We have accounted for the energy and area overheads introduced by SECDED check bits in our simulations, as well as an additional latency of one clock cycle for SECDED encoding/decoding, but we have not accounted for the energy consumed by the encoding/decoding circuits. Ignoring this energy works against our technique in the comparison. The size of 6T cells is calculated according to the methodology proposed in Zhou et al. [2010], using elementary probability calculation to match the target cache yield at ULE mode. According to Zhou et al. [2010], all cache ways are always enabled at both modes.
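To illustrate the storage cost of word-level SECDED protection, the following sketch computes the number of check bits from the standard Hamming bound plus one overall parity bit. The text does not state the check-bit count used in the evaluated design, so the figures below are an assumption based on the usual SECDED construction; the function name is hypothetical.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a SECDED code protecting `data_bits` bits of data."""
    # Single error correction needs r bits with 2**r >= data_bits + r + 1.
    r = 1
    while (1 << r) < data_bits + r + 1:
        r += 1
    # One additional overall parity bit provides double error detection.
    return r + 1

word_bits = 32
check_bits = secded_check_bits(word_bits)
overhead = check_bits / word_bits
print(f"{check_bits} check bits per {word_bits}-bit word ({overhead:.1%} extra cells)")
```

Under this assumption each 32-bit word carries 7 check bits, so the protected arrays hold roughly 22% more cells before any cell upsizing for yield is considered.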
The designs proposed by Ghasemi et al. [2011] are similar in spirit to our baseline. The only difference is that larger 6T cells are used in ULE ways instead of the 10T cells. Therefore, at HP-HPT mode all cache ways are enabled, whereas at ULE mode only 1 cache way implemented with large 6T cells keeps operating. The size of the large 6T cells is determined using the probability calculation methodology proposed in Ghasemi et al. [2011] to match the cache yield of 99.9% at ULE mode. According to our simulations, the access time variation at 0.35V for SRAM arrays implemented with these large 6T cells is around 40% relative to SRAM arrays implemented with 10T cells, so we assume a 3-cycle cache latency instead of the regular 2-cycle latency at ULE mode. Access time variations at 1V are negligible, so a 2-cycle latency is assumed at HP-HPT mode. Figure 12 shows execution time and EPI results at HP-HPT and ULE modes when comparing existing deterministic caches [Zhou et al. 2010; Ghasemi et al. 2011] with hybrid 6T+10T cache designs targeting a 99.9% cache yield. All results are normalized with respect to the baseline cache.
Fig. 12. Normalized execution time, total EPI, and cache area when comparing existing deterministic caches [Zhou et al. 2010; Ghasemi et al. 2011] with hybrid 6T+10T cache designs targeting a 99.9% cache yield.
Results show that the design in Zhou et al. [2010] experiences a performance drop of 9.7% and 14% at HP-HPT and ULE mode, respectively, due to the extra cycle for SECDED encoding/decoding. The design in Ghasemi et al. [2011] achieves the same performance as the baseline at HP-HPT mode, but experiences a significant performance loss (around 49%) at ULE mode due to the additional cycle for each cache access.
The main drawback of the designs in Zhou et al. [2010] and Ghasemi et al. [2011] is their significant area overhead with respect to our proposed design (75% and 32%, respectively, as shown in Figure 12(c)). This directly translates into higher energy consumption at both modes, as can be seen in Figure 12(b). In other words, these designs are overdesigned to operate reliably at ULE mode in comparison with our proposed hybrid cache design.
CONCLUSIONS AND FUTURE WORK
Hybrid high-performance low-energy and ultra-low-energy reliable operation is the road to follow in many emerging market segments such as environment, body, and urban life monitoring, but poses many challenges such as energy efficiency, integration, and fabrication costs. These challenges are particularly tough for SRAM memories due to the inability of high-voltage efficient SRAM cells to work at low voltages and the lack of efficiency at high voltage for those SRAM cells suitable for low voltage operation.
In this article, we propose new, single-Vcc-domain hybrid L1 cache architectures for reliable hybrid voltage operation that meet the specific and stringent needs of battery-powered ultra-low-cost (e.g., below 1 USD) systems. The proposed cache designs rely on combining heterogeneous SRAM cell types so that some of the cache ways are optimized to satisfy high-performance requirements during high-Vcc operation (HP ways), whereas the rest of the ways provide ultra-low energy consumption and reliability during NST Vcc operation (ULE ways). Our results show that, first, the proposed hybrid caches can efficiently and reliably operate across a wide range of voltages, consuming little energy at ULE mode, as well as providing high performance with small overheads at HP mode, as required for our target market. Second, we show that the proposed hybrid designs achieve exactly the same average performance as conventional nonhybrid designs at HP-HPT/HP-LPT mode at the expense of small dynamic energy and leakage power overheads. Third, we show that the proposed hybrid caches significantly reduce dynamic energy and leakage power at ULE mode with respect to nonhybrid ones (around 9x). Likewise, our proposed caches exhibit deterministic behavior since the available cache size is deterministic at all operation modes, thus enabling strong performance guarantees, as needed for running critical applications on top.
Finally, our experiments show these trends are consistent across different combinations of cell types, cache sizes, and associativity values, and open the door to further research in the design of hybrid microarchitectures. The next step will be devising more efficient designs for the rest of the hybrid processor components such as register files, TLBs, and the like.
