Fault-Tolerant Voltage-Scalable (FTVS) SRAM cache architectures are a promising approach to improve energy efficiency of memories in the presence of nanoscale process variation. Complex FTVS schemes are commonly proposed to achieve very low minimum supply voltages, but these can suffer from high overheads and thus do not always offer the best power/capacity trade-offs. We observe on our 45nm test chips that the "fault inclusion property" can enable lightweight fault maps that support multiple runtime supply voltages.
INTRODUCTION
Moore's Law has been the primary driver behind the phenomenal advances in computing capability of the past several decades. Technology scaling has now reached the nanoscale era, where the smallest integrated circuit features are only a couple of orders of magnitude larger than individual atoms. In this regime, increased leakage power and overall power density have ended ideal Dennard scaling, leading to the rise of "dark silicon" [Esmaeilzadeh et al. 2011].
Owing to the extreme manufacturing control requirements imposed by nanoscale technology, the effects of process variation on reliability, yield, power consumption, and performance are a principal challenge [Gupta et al. 2013] and are partly responsible for the end of Dennard scaling. The typical solution is to leave design margins for the worst-case manufacturing outcomes and operating conditions. However, these methods naturally incur significant overheads.
Memory structures are especially sensitive to variability. This is because memories typically use the smallest transistors and feature dense layouts, comprise a significant fraction of chip area, and are sensitive to voltage and temperature noise. Memories are also major consumers of system power, are not very energy proportional [Barroso and Hölzle 2007, 2009], and are frequent points of failure in the field [Schroeder et al. 2011; Sridharan and Liberty 2012]. It is clear that the design of the memory hierarchy needs to be reconsidered in light of the challenges of the nanoscale era.
One way to reduce the power consumption of memory is to reduce the supply voltage (VDD). Static power, dominated by subthreshold leakage current, is a major consumer of power in memories [Flautner et al. 2002] and has an exponential dependence on supply voltage [Weste and Harris 2011]. Since leakage often constitutes a major fraction of total system power in nanoscale processes [Kumar et al. 2009], even a minor reduction in memory supply voltage could have a significant impact on total chip power consumption.
Unfortunately, voltage-scaled SRAMs are susceptible to faulty behavior. Variability in SRAM cell noise margins, mostly due to the impact of random dopant fluctuation on threshold voltages, results in an exponential increase in probability of cell failure as the supply voltage is lowered [Wang and Calhoun 2011] . This has motivated research on Fault-Tolerant Voltage-Scalable (FTVS) SRAM caches for energy-efficient and reliable operation.
Many clever FTVS SRAM cache architectures use complex fault-tolerance methods to lower min-VDD while meeting manufacturing yield objectives. Unfortunately, these approaches are limited by a manifestation of Amdahl's Law [Amdahl 1967] for power savings. Large and flexible fault maps combined with intricate redundancy mechanisms cost considerable power, area, and performance. It is important that designers account for these overheads, as they ultimately limit the scalability of such approaches. In this work, we make several contributions:
(1) The SRAM memories on our custom 45nm SOI embedded systems-on-chip are characterized. We observe the fault inclusion property: particular bit cells that fail at a given supply voltage will still be faulty at all lower voltages.
(2) A new static power versus effective capacity metric is proposed for FTVS memory architectures, which is a good indicator of runtime power/performance scalability and overall energy efficiency.
(3) A simple and low-overhead multi-VDD fault-tolerance mechanism combines voltage scaling of the data array SRAM cells with optional power gating of faulty data blocks at low voltage.
(4) We propose Static Power/Capacity Scaling (SPCS) and Dynamic Power/Capacity Scaling (DPCS), two variants of a novel FTVS scheme that significantly reduces overall cache and system energy with minimal performance and area overheads.
(5) SPCS and DPCS achieve lower static power at all scaled cache capacities than a more complex FTVS cache architecture [BanaiyanMofrad et al. 2011] as well as per-cache-way power gating.
RELATED WORK
There is a rich body of literature in circuit, architecture, and error correction coding techniques for FTVS deep-submicron memories. A summary of related work is provided here, with emphasis on architectural solutions. The reader may refer to Mittal [2014] for a more complete survey of architectural techniques for improving cache power efficiency.
Leakage Reduction
Two of the best-known architectural approaches are those generally based on Gated-VDD [Powell et al. 2000; Cheng et al. 2014] and Drowsy Cache [Flautner et al. 2002; Bardine et al. 2014] . The former dynamically resizes the instruction cache by turning off blocks that are not used by the application, exploiting variability in cache utilization within and across applications. The latter utilizes the alternative approach of voltage scaling idle cache lines, which yields good static power savings without losing memory state. Neither approach improves dynamic power nor accounts for the impact of process variation on noise margin-based faults, which are greatly exacerbated at low voltage [Kumar et al. 2009; Wang and Calhoun 2011] . This issue particularly limits the mechanism of Flautner et al. [2002] .
Fault Tolerance
In the fault tolerance area, works targeting cache yield and/or min-VDD improvement include Error Correction Codes (ECCs) and architectural methods [Shirvani and McCluskey 1999; Agarwal et al. 2005; Ozdemir et al. 2006; Kim et al. 2007; Koh et al. 2009; Ansari et al. 2009; Rossi et al. 2011; Alameldeen et al. 2011a, 2011b; Qureshi and Chishti 2013], none of which explicitly address energy savings as an objective. A variety of low-voltage SRAM cells that improve read stability and/or writability have also been proposed, for example, 8T [Chang et al. 2005] and 10T [Calhoun and Chandrakasan 2006], but they have high area overheads compared to a 6T design. Schemes that use fault tolerance to achieve lower voltage primarily for cache power savings include Hussain et al. [2008], Wilkerson et al. [2008], Abella et al. [2009], and Sasan et al. [2009], two very similar approaches from Ansari et al. [2011] and BanaiyanMofrad et al. [2011], and several others [Sasan et al. 2012; Zhang et al. 2012; Han et al. 2013; Chakraborty et al. 2014; Mahmood et al. 2014]. All of these approaches try to reduce the minimum-operable cache VDD under yield constraints by employing relatively sophisticated fault-tolerance mechanisms, such as address remapping, block- and set-level replication, etc. They are also similar in that they either reduce the effective cache capacity by disabling faulty regions as VDD is reduced (e.g., FFT-Cache [BanaiyanMofrad et al. 2011]), or boost VDD in "weak" regions as necessary to maintain capacity (e.g., Sasan et al. [2012]). Han et al. [2013] and Kim and Guthaus [2013] both utilize multiple memory supply voltages. Ghasemi et al. [2011] proposed a last-level cache with heterogeneous cell sizes for more graceful voltage/capacity trade-offs. Finally, variation-aware Non-uniform Cache Access (NUCA) architectures have been proposed [Hijaz and Khan 2014]. The approach by Hijaz and Khan [2014] mixes capacity trade-offs with latency penalties of error correction.
Two recent papers contain some similarities to the concepts in this article, yet were developed concurrently and independently of our work [Mohammad et al. 2014; Ferreron et al. 2014]. Mohammad et al. [2014] discuss power/capacity trade-offs in FTVS caches and a similar architectural mechanism, primarily focusing on yield improvement and quality/power trade-offs. Ferreron et al. [2014] focus on system-level impacts of block disabling techniques in a chip multiprocessor with shared caches and parallel workloads.
Memory Power/Performance Scaling
With energy proportionality becoming a topic of recent interest [Barroso and Hölzle 2007, 2009], there have been several works that target main memory. Fan et al. [2005] studied the coordination of processor DVFS and memory low-power modes. In MemScale [Deng et al. 2011], the authors were some of the first to propose dynamic memory frequency scaling as an active low-power mode. Independently of MemScale, David et al. [2011] also proposed memory DVFS. MemScale was succeeded by CoScale [Deng et al. 2012], which coordinated DVFS in the CPU and memory to deliver significantly improved system-level energy proportionality.
Novelty of This Work
Our approach differs from the related works as follows. The circuit mechanisms are similar to those of Gated-VDD [Powell et al. 2000] and Drowsy Cache [Flautner et al. 2002], but are combined with fault tolerance to allow lower-voltage operation using 6T SRAM cells. However, the proposed scheme can also be used with any other cell design. To the best of our knowledge, our scheme is the first to use voltage scaling on data bit cells only, optionally power gating blocks as they become faulty for additional energy savings. Because per-block power gating imposes architecture, floor planning, and layout considerations, we also discuss the advantages and disadvantages of this feature. Our approach emphasizes simplicity to get good energy savings using low-overhead multi-VDD fault maps that can be used for static (SPCS) or dynamic (DPCS) power/capacity scaling. SPCS and DPCS both achieve dynamic and static energy savings, as VDD is not boosted for cache accesses.
This work could be supplemented with other innovations. Although soft errors are not currently handled, this scheme can be supplemented with ECC. Circuit-level approaches to coping with aging are orthogonal to this work and could be incorporated as well. Our insights could be augmented with those of Mohammad et al. [2014] and Ferreron et al. [2014] to enable knobs for dynamic power/capacity/quality scaling. Although not explored in this article, we believe that DPCS can also be used to improve system-level energy proportionality by coordinating with existing CPU and main memory DVFS approaches. We leave these possibilities to future work.
THE SRAM FAULT INCLUSION PROPERTY
Once the noise margin of a memory cell collapses at low voltage, continuing to reduce voltage further will not restore its functionality. We refer to this behavior as the fault inclusion property: any faults that occur at a given supply voltage will strictly be a subset of those at all lower voltages. Put more formally:
$$
f_i(v) =
\begin{cases}
1, & \text{if memory cell } i \text{ is faulty at supply voltage } v\\
0, & \text{otherwise,}
\end{cases}
\tag{1}
$$

where $1 \le i \le n$, and $n$ is the number of bit cells in the memory.

Fault Inclusion Property:
$$
f_i(v) = 1 \implies f_i(v') = 1 \quad \forall\, v' < v.
\tag{2}
$$

To verify the previous formulation, tests were performed on four ARM Cortex-M3-based "Red Cooper" test chips manufactured in a commercial 45nm SOI technology [Agarwal et al. 2014] in collaboration with colleagues in the NSF Variability Expedition. Each test chip had two 8kB SRAM scratchpad banks and no off-chip RAM. A board-level mbed LPC1768 microcontroller accessed the SRAM and controlled all chip voltage domains through JTAG. The board and a test chip are shown in Figure 1(a).
With each test chip's CPU disabled, March Simple-Static (SS) tests [Hamdioui et al. 2002] were run on both banks using the mbed to characterize the nature of faults as the array VDD was reduced. For each SRAM bank on each chip, the test suite was repeated five times. In each run, VDD was scaled down one step at a time from the nominal 1V in 25mV increments. At each voltage level, faulty SRAM locations were logged at byte granularity.
No distinction was made between different underlying physical causes of failure (read stability, writability, retention failure, inter-cell interference, etc.). As dynamic failures are not the focus of this study, the test chip was operated at a 20MHz clock to minimize the chance of delay faults occurring in the SRAM periphery logic. This is because the test chip uses a single voltage rail for the memory cells and periphery.
As expected, faults caused by voltage scaling obeyed the fault inclusion property. This trend is depicted graphically in Figures 1(b)-(e) for one of the test chips. Faulty byte locations were consistent in each unique SRAM bank at each voltage level. The patterns could be verified with repeated testing after compensating for noise, indicating that the faults were not caused by soft errors or permanent defects, but rather by variability in cell noise margins. These results suggested the use of a compact fault map such that multiple VDDs can be efficiently supported.
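To make the property concrete, the following Python sketch shows how per-voltage fault logs like ours could be checked for inclusion. This is an illustration only, not part of our test infrastructure; the data layout and names are assumptions.

```python
# Hypothetical sketch: checking the fault inclusion property from
# March-test logs. faults[v] is assumed to be the set of faulty byte
# addresses logged at supply voltage v (in mV).

def obeys_fault_inclusion(faults: dict[int, set[int]]) -> bool:
    """Return True if every location faulty at a voltage is also
    faulty at every lower tested voltage."""
    voltages = sorted(faults)  # ascending, e.g., [700, 725, ..., 1000]
    for lower, higher in zip(voltages, voltages[1:]):
        # Faults at the higher voltage must be a subset of those at the
        # lower voltage (fault sets grow monotonically as VDD drops).
        if not faults[higher] <= faults[lower]:
            return False
    return True

# Example: three voltage steps from a single bank, byte granularity.
logs = {1000: set(), 800: {0x12A}, 775: {0x12A, 0x3F0}}
assert obeys_fault_inclusion(logs)
```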
A CASE FOR THE "POWER VERSUS EFFECTIVE CAPACITY" METRIC
As previously mentioned, related works in FTVS cache architecture commonly use min-VDD as a primary metric of evaluation. While supply voltage is a very important control knob for power, it is only one factor that determines the static and dynamic energy consumption of a cache memory. Most FTVS schemes, including ours, require a portion of the memory to operate at full VDD or otherwise be resilient for guaranteed correctness.
Amdahl's Law Applied to Fault-Tolerant Voltage-Scalable Caches
When evaluating power reduction from voltage scaling, it is important to consider the proportion of cache components operating at high VDD. This is due to a manifestation of Amdahl's Law [Amdahl 1967] when interpreted for power and energy. Note the following relationship, which is similar to the traditional version of Amdahl's Law for speedup [Hennessy and Patterson 2012], but for power reduction via Fault-Tolerant (FT) Voltage Scaling (VS) instead:

$$
\text{PowerReduction}_{total} = \cfrac{1}{(1 - \text{Fraction}_{VS}) + \text{Fraction}_{FToverhead} + \cfrac{\text{Fraction}_{VS}}{\text{PowerReduction}_{VS}}}.
$$

Note the additional $\text{Fraction}_{FToverhead}$ term in the denominator, which accounts for the additional fault-tolerance logic needed to scale voltage to the desired min-VDD on part of the memory ($\text{Fraction}_{VS}$). In fact, this modified formulation of Amdahl's Law also holds more generally whenever constant overheads are incurred for speedup or power reduction. This interpretation of Amdahl's Law might be overlooked when trying to achieve a lower min-VDD using complex fault-tolerance schemes. Thus, supply voltage alone should not receive too much emphasis in FTVS techniques.
For example, consider two competing FTVS approaches, Scheme 1 and Scheme 2, where only the data array voltage can be scaled. Both Scheme 1 and Scheme 2 have the same cache size, block size, associativity, etc. Assume Scheme 2's total tag array power overhead is 20% of the nominal data array power (full VDD) due to a large and complex fault map, and Scheme 1's tag power overhead is 5%, thanks to a smaller and simpler fault map. The baseline cache has a tag array power overhead of only 3% because it has no fault map.
Let the data array voltage be scaled independently for Scheme 1 and Scheme 2 such that Scheme 2's voltage is lower than Scheme 1's due to better fault tolerance. Suppose the data array leakage power in Scheme 1 is now 30% of its nominal value, and the data array leakage power in Scheme 2 is now 20% of its nominal value. Scheme 2 will save 61.1% of static power against the baseline cache, while Scheme 1 will save 66.0%, despite operating at a higher data array voltage.
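Worked out explicitly, normalizing each scheme's total (data plus tag) power to the baseline's:

$$
\text{Savings}_{1} = 1 - \frac{0.30 + 0.05}{1.00 + 0.03} \approx 66.0\%,
\qquad
\text{Savings}_{2} = 1 - \frac{0.20 + 0.20}{1.00 + 0.03} \approx 61.1\%.
$$

The heavier fault map overhead of Scheme 2 more than cancels its lower array voltage, which is precisely the Amdahl effect described above.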
Problems With Yield Metric
Another common metric is memory yield in terms of functional reliability, performance, power, etc. Often, the min-VDD is determined for a particular cache configuration based on expected fault probabilities, the particular fault-tolerance mechanism, and a desired target yield such as 99%. While one can claim that a particular scheme is functional within the design envelope, definitions of functionality vary considerably. For example, one cache architecture may be considered functional if it simply operates in a "correct" manner, regardless of power consumption. Another architecture may define the cache to be functional only if at least 50% of blocks are non-faulty. Thus, in general, min-VDD/yield should not be over-emphasized. The min-VDD metric can be more useful for caches that are on the same voltage domain as the processor core. However, in the future, as indicated by industry trends such as Intel's Fully Integrated Voltage Regulators (FIVR) in its Haswell architecture [Bowhill et al. 2015], this is unlikely to be the case.
Proposed Metric
We believe that a new metric, static power versus effective capacity, should be used to guide the design of FTVS cache architectures in addition to the other metrics. This metric accounts for the supply voltage as well as the efficacy and overheads of the particular fault tolerance or capacity reduction mechanism (e.g., power gating).
Simple FTVS schemes can fare similarly to or even better than complex ones in terms of power, performance, and/or area, all with less design and verification effort. This is because the overall failure rate rises sharply in a small voltage range, as shown earlier in Figure 1. The complexity required for tolerating such a high memory failure rate may not be worth the small array power savings from an incrementally lower voltage.
With this metric, we focus on static power to allow for analytical evaluation. This is a reasonable simplification in the case of large memories where static power constitutes a large fraction of memory energy. We apply this metric during our evaluation in Section 7.
Note that static power versus effective capacity is not necessarily a good metric for choosing the capacity of a cache at design-time and nominal voltage. Nor does the metric imply that smaller caches are more efficient in general. Rather, this metric is meant to capture the runtime scalability of the cache power and performance while operating below its designed maximum capacity. This is useful because the full cache capacity may not be needed at all times to deliver good performance, and hence, its capacity can be temporarily reduced to lower power and improve energy efficiency.
POWER/CAPACITY SCALING ARCHITECTURE
Our scheme has two main components: the architectural mechanism, described in Section 5.1, and the policy. We propose a static policy (SPCS) and two different dynamic policies (DPCS Policy 1 and DPCS Policy 2), described in Sections 5.2, 5.3.1, and 5.3.2, respectively. All three policies share the same underlying hardware mechanism presented next.
Mechanism
Industry trends point toward finer-grain voltage and frequency domains in future chips. For example, Intel has introduced Fully Integrated Voltage Regulators (FIVR), partly on-chip and on-package, with its Haswell architecture [Bowhill et al. 2015]. This allows for many voltage domains on the chip at the level of per-core, per-L3 cache, etc. Thus, future chips might support separate voltage rails for each level of cache in order to further decouple logic Vmin from memory Vmin, especially if architectures can exploit this feature.
The power/capacity scaling mechanism primarily consists of a lightweight fault map and a circuit for globally adjusting the VDD of the data cells. The data array periphery, metadata (Valid, Dirty, Tag, etc.) array cells, metadata array periphery, and the processor core are all on a separate voltage domain running at nominal VDD, where they are assumed to be never faulty. Data and metadata arrays otherwise use identical cell designs. Voltage is not boosted for data access, granting both dynamic and static energy savings. To bridge the voltage domain of the data array with that of its periphery, the final stage of the row decoder is also used as a downward level-shifting gate to drive the wordline. Similarly, column write drivers are also downward level-shifting, while sense amplifiers used in read operations restore voltage levels to the full nominal VDD swing. This circuit approach is validated by a very recent SoC design from Samsung [Pyo et al. 2015], which independently arrived at the decision to employ a dual-rail design to decouple logic and bitcell supply voltage.
Overall access time may be affected by up to 10% in the worst case at low voltage, as found later in Section 6.3. This is because the impact of reduced cell voltage is only one part of the overall cache access time, and near-threshold operation is avoided. Note that voltage boosting during cell access could be utilized to allow even lower array voltages than used in this work, as the proposed design is limited by SRAM read stability.
If the number of voltage domains remains constrained, a version of the architecture with a single voltage rail for the core and all peripheries and a shared voltage rail for all cache data arrays is also feasible. However, this can restrict DPCS capabilities for energy savings by coupling cache voltages together while limiting voltage scaling policies. The overall architecture is depicted in Figure 2 .
5.1.1. Fault Map. The low-overhead fault map includes two fields for each data block that are maintained in the corresponding nearby metadata subarray in addition to the conventional bits (Valid, Dirty, Tag, etc.), as shown in Figure 2. The first field is a single Faulty bit, which indicates whether the corresponding data block is presently faulty. Blocks marked as Faulty can never contain valid or dirty data and are unavailable for reading and writing. The second field consists of several fault map (FM) bits, which encode the lowest non-faulty VDD for the data block. For V allowed data VDD levels, $K = \lceil \log_2(V + 1) \rceil$ FM bits are needed to encode the allowed VDD levels (assuming the fault inclusion property). Figure 2 depicts the V = 3 configuration for a small four-way cache, requiring one Faulty and two FM bits per block.
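As an illustration, the following sketch shows one possible encoding of the FM bits. The specific voltage levels and code assignment are assumptions for illustration; the text fixes only the bit count K.

```python
import math

# Illustrative FM-bit encoding for V = 3 allowed data-array VDDs.
# Code 0 is reserved (by assumption) for "faulty at all allowed levels".
VDD_LEVELS = [600, 800, 1000]                   # mV, lowest to highest
K = math.ceil(math.log2(len(VDD_LEVELS) + 1))   # = 2 FM bits per block

def encode_fm_bits(min_nonfaulty_vdd_mv: int) -> int:
    """Encode a block's lowest non-faulty VDD as a K-bit code
    (1 = works at the lowest level, ..., V = works only at nominal)."""
    for code, vdd in enumerate(VDD_LEVELS, start=1):
        if min_nonfaulty_vdd_mv <= vdd:
            return code          # lowest allowed level that still works
    return 0                     # block unusable at every allowed level

def is_faulty_at(fm_code: int, target_level: int) -> bool:
    """A block is Faulty at the target level if its lowest working
    level is above the target (or it never works)."""
    return fm_code == 0 or fm_code > target_level
```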
5.1.2. Power Gating of Faulty Blocks.
Power gating transistors can be used to attain additional power savings for data blocks that become faulty at reduced voltage. Blocks may span one or more subarray rows. When a block's Faulty bit is set in the metadata array, a downward level-shifting inverter power gates the block's data portion. We assume the gated-PMOS power gating configuration from Powell et al. [2000], chosen because it has no impact on cell read performance, negligible area overhead [Powell et al. 2000; Cheng et al. 2014], and good energy savings. A power-gated block at low VDD is modeled as having zero leakage power. This is a reasonable assumption because the block would likely be power gated at the same low voltage that caused it to be faulty in the first place. We verified the functional feasibility of this mechanism via SPICE simulations.
Designers might choose to omit power gating of faulty blocks for two major reasons. First, if power gating is used, the metadata subarrays should be directly adjacent to their corresponding data subarrays as shown by Figure 2 . This is so the Faulty bit can control the power gate mechanism and the row decoder can be shared between subarrays, although the latter benefit is not explicitly modeled. However, this constrains the cache floor plan.
Second, by allowing for per-block power gating, both power rails must be routed in the wordline direction. Unfortunately, this is not a common approach in industry, where a thin-cell SRAM layout is preferred for above-threshold operation (e.g., as seen in Sinangil et al. [2011] and Ebrahimi et al. [2015]). However, others have used wordline-oriented rails successfully in a variety of low-voltage design scenarios [Calhoun and Chandrakasan 2006; Verma and Chandrakasan 2008; Singh et al. 2008; Cheng et al. 2014; Chang et al. 2015]. Nevertheless, there are other important factors to consider when deciding on power rail orientation.
Without advocating for either approach, we believe that fine-grain power gating of faulty blocks could be viewed as an additional benefit from wordline-oriented rails. Without loss of generality, in the rest of the article we assume the presence of per-block power gating, except in Sections 7.2 and 8.2, where we directly compare the two approaches.
5.1.3. Transitioning Data Array Voltage.
It is the responsibility of the power/capacity scaling policy, implemented in the cache controller, to ensure the Faulty and Valid bits are set correctly for the current data array VDD. After all blocks' Faulty bits are properly set for the target voltage, the data array VDD transition can occur. The controller must compare each block's FM bits with an enumerated code for the intended VDD level. Equation (3) captures this logical relationship for all V:

$$
\text{Faulty}_i =
\begin{cases}
1, & \text{if } \text{FM}_i > \text{code}(v_{target})\\
0, & \text{otherwise,}
\end{cases}
\tag{3}
$$

where $\text{FM}_i$ encodes the lowest non-faulty VDD level of block $i$ and $\text{code}(v_{target})$ is the enumerated code of the intended VDD level. We assume that V = 3 levels are allowed throughout this article, corresponding to K = 2 FM bits. Note that our fault map approach scales well if more intermediate voltage levels are needed.
The procedure to scale array VDD is described by Algorithm 1. A voltage transition has a delay penalty to update the Faulty bits and then set the voltage. The cache controller must read out the entire fault map set by set, with each way in parallel. Next, it compares the FM bits for the set with the target array VDD bits. Finally, it writes back the correct Faulty bits for each block in the set. We assume that it takes two cycles to do this for each set. After all Faulty bits are updated, the voltage regulator can adjust the data array VDD, which takes VoltageRegulatorDelay clock cycles. The total VoltageTransitionPenalty in Algorithm 1 is then equal to 2 × NumSets + VoltageRegulatorDelay clock cycles. Note that VoltageTransitionPenalty in Algorithm 1 does not include the time to write back any cache blocks. The performance and energy impact of any such writebacks is captured accurately in our modified gem5 [Binkert et al. 2011] cache model and simulation framework leveraged in Section 8.
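A minimal sketch of our reading of this procedure follows. The block attributes and the caller-supplied set_vdd regulator hook are illustrative assumptions, not interfaces defined by the paper.

```python
# Sketch of the voltage transition procedure (Algorithm 1). Blocks are
# assumed to carry fm_code/faulty/valid/dirty/tag attributes; set_vdd is
# an assumed regulator interface.

def transition_data_array_vdd(sets, target_level: int,
                              voltage_regulator_delay: int, set_vdd) -> int:
    """Update every block's Faulty bit for the target level, then switch
    the data array rail. Returns VoltageTransitionPenalty in cycles
    (excluding writebacks of dirty victims, modeled separately)."""
    cycles = 0
    for cache_set in sets:                 # fault map is walked set by set
        for block in cache_set:            # all ways compared in parallel
            block.faulty = (block.fm_code == 0 or
                            block.fm_code > target_level)  # Equation (3)
            if block.faulty:               # faulty blocks hold no valid state
                block.valid = block.dirty = False
                block.tag = 0
        cycles += 2                        # read + write back the Faulty bits
    set_vdd(target_level)                  # regulator adjusts data array VDD
    return cycles + voltage_regulator_delay
```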
5.1.4. Cache Operation in the Presence of Faulty Blocks.
Since some blocks may be faulty at any given supply voltage, the cache controller must consult the fault information during an access. Neither a hit nor a block fill is allowed to occur on a block that is marked as Faulty. Thus, all blocks that are currently Faulty must be marked as not Valid and not Dirty, and their Tag bits should be zeroed. When fulfilling a cache miss, the traditional LRU scheme is applied on each set, with the special condition that any blocks marked as Faulty are omitted from LRU consideration. This can change the effective associativity of each set.

5.1.5. 3C+F: Compulsory, Capacity, Conflict, and Faulty Misses. Cache misses can be categorized into three buckets, commonly known as the "3 C's": compulsory, capacity, and conflict. (Note that in practice, it is not feasible to classify misses in this fashion at runtime, as the full access history for all referenced data must be maintained.)
When disabling faulty blocks, it is not clear how resulting misses should be classified. One pitfall would be to simply attribute misses caused by faulty blocks as capacity misses. However, faulty blocks also reduce local associativity. Moreover, consider blocks that are only temporarily faulty during execution with DPCS. An evicted block from a faulty set might be referenced later, causing a miss, even though the set may no longer contain faulty blocks.
Thus, another category of misses can be defined for theoretical purposes. A faulty miss is one that occurs because the referenced data was evicted or replaced earlier from a set containing faulty blocks. To be considered a faulty miss, the reference should otherwise have hit in a cache where faulty blocks never previously occurred in the set. Note that disabling faulty blocks in the L1 cache can affect miss behavior in the L2 cache. For simplicity, we ignore intercache dependencies in the 3C+F classification.
5.1.6. Variation-Aware Selection of Allowed Runtime Voltage Levels. The proposed mechanism requires that each set must have at least one non-faulty block at all allowed voltages, as there is no set-wise data redundancy. This is the constraint that limits the min-VDD.
Higher associativity and/or smaller block sizes naturally result in lower min-VDD as shown later in Section 8.1, but they incur other design trade-offs. Nevertheless, as demonstrated later in the evaluation, a good power/capacity trade-off can be achieved even for four-way caches with blocks of 64 bytes.
The allowed runtime voltage levels for the SPCS and DPCS policies may be decided at design time by margining for yield, at test time to opportunistically remove some yield constraints, or at runtime, which can also account for aging effects. However, the design-time approach falls short of the other two, as it cannot exploit the individualized outcome of each chip. The test-time approach might considerably increase manufacturing cost. In this work, runtime choice of VDD levels is used (as opposed to the design-time choice from our prior work). This eliminates yield margins by customizing the desired voltage levels based on the manifested fault patterns of each SRAM array. Note that voltage margins may still be left for noise resilience.
5.1.7. Populating and Maintaining Fault Maps with BIST. To populate the cache fault maps, one can use any standard Built-In Self-Test (BIST) routine that can detect faults with minimum granularity of a single block (e.g., the march tests from Section 3). The BIST routine can then apply the fault inclusion property to compress the representation of the fault maps by only encoding the minimum nonfaulty voltage for each block.
A drawback of fault map approaches in general is the BIST runtime and storage overhead. The proposed approach incurs longer testing time than typical single-voltage solutions. The overall testing time is O(N × V), where N is the size of the cache and V is the number of runtime voltage levels. However, the actual testing time is likely less than this, as fewer blocks must be tested at successively lower voltages due to the fault inclusion property.
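The following sketch illustrates how a multi-voltage BIST pass might exploit the property to prune retesting. march_test_block() is an assumed per-block test hook, not an interface from the paper.

```python
# Sketch of a multi-voltage BIST pass exploiting fault inclusion:
# testing proceeds from the highest allowed VDD downward, and a block
# already marked faulty need not be retested at lower voltages.

def populate_fault_map(blocks, vdd_levels_mv, march_test_block):
    """Record each block's lowest non-faulty level; O(N*V) worst case,
    but typically less since faulty blocks drop out of later passes."""
    min_ok_level = {b: None for b in blocks}   # None = faulty everywhere
    remaining = set(blocks)
    for level, vdd in sorted(enumerate(vdd_levels_mv, start=1),
                             key=lambda lv: -lv[1]):   # high VDD first
        failed = {b for b in remaining if not march_test_block(b, vdd)}
        remaining -= failed          # inclusion: faulty stays faulty below
        for b in remaining:
            min_ok_level[b] = level  # block still works at this lower VDD
    return min_ok_level
```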
One approach is to run BIST at every system startup, which requires no permanent fault map storage and allows aging effects such as Bias Temperature Instability (BTI) to be considered. Particular data patterns stored in the SRAM cells are not expected to have a major impact on aging caused by BTI. This is because data tends to only briefly "live" in the cache and is assumed to be random across many workloads and over time. If data dependence does result in a noticeable impact on aging, the DPCS architecture can compensate with more frequent fault map characterizations.
Note that the fault inclusion property is still valid under aging conditions that affect threshold voltage (such as BTI or hot carrier injection). This is because static noise margins are strongly dependent on threshold voltages and decrease monotonically with VDD [Wang and Calhoun 2011] .
If testing time is an issue, the impact could be partly mitigated by proceeding concurrently with other system startup procedures, for example, DRAM fault testing and initialization of peripheral devices. If the system is rarely or never power cycled, then BIST can be run periodically during runtime using a temporary way-disabling approach resembling that in Alameldeen et al. [2011b] . Owing to the relatively low frequency and duration of such testing, the impact of BIST on overall system performance is expected to be small. If on-chip or off-chip non-volatile memory is available, the low-overhead fault maps could be characterized at test time or first power-up and stored permanently (similar to online self-test methods from Li et al. [2008] ). This has the advantage of eliminating runtime BIST and its associated performance overheads, but forgoes any compensation for aging aside from design guardbanding.
Idle low-power states are another consideration for the fault map design. In the event of a complete core shutdown without state retention, the fault maps must be (1) maintained in the tag arrays, costing idle power; (2) written to non-volatile storage; or (3) sacrificed, in which case BIST must be run again when the core is brought online. In the first case, tag array static power can limit the effectiveness of long-term core shutdowns. In the latter two cases, there is a significant energy and performance cost to shut down a core for a short duration. We leave the coordination of DPCS and CPU/memory power states to future work.
Static Policy: SPCS
In the static policy, the lowest VDD level is used on a per-chip basis that has at least 99% effective capacity (also subject to the yield constraint described in Section 5.1.6). This is set once for the entire system runtime. The only performance overhead to this policy is due to any additional misses that may be caused by the few faulty blocks.
The primary benefit of using SPCS is that voltage guardbands could easily be reduced to suit the unique manufactured outcome of each cache, while only minor modifications to the cache controller and/or software layers are needed. Power can be reduced with negligible performance impact. Because the SRAM cell and block failure rates rise exponentially as the supply voltage is lowered, additional voltage reduction beyond the 99% capacity point brings more power savings, but potentially a significant loss in performance. This phenomenon is described in more detail during the analytical evaluation in Section 7, and is one reason why more sophisticated fault-tolerance methods get diminishing improvements from voltage scaling.
Dynamic Policies: DPCS
A basic guiding principle behind cache operation is that programs tend to access data with spatial and temporal locality. If a program accesses a large working set of data in a short period of time, then large caches are likely to improve performance. Conversely, if a smaller working set is accessed during an equal length of time, then the cache capacity is less important. Note that in both cases, it is important that data is reused for the cache to be effective.
For these reasons, the SPCS policy described in Section 5.2 can be too conservative. This is because the full cache capacity may be overkill for a given application in a particular phase of execution. The primary motivation to consider a DPCS policy over SPCS is to exploit these situations when the whole cache is unnecessary to deliver acceptable performance.
The proposed DPCS policies described next in Sections 5.3.1 and 5.3.2 are only two possibilities among many. In both cases, we emphasize simple policies that can be easily implemented.
5.3.1. DPCS Policy 1: Access Diversity Based. This DPCS policy (described by Algorithm 2) tries to balance energy efficiency with performance by using spatial locality as the primary guiding principle. At the end of every time interval, if the fraction of available cache capacity that was touched falls below some fixed LowThreshold (LT), the DPCS policy reduces voltage and sacrifices cache capacity for the next interval. Conversely, if the cache pressure is high, then DPCS boosts voltage for the next interval and reenables previously faulty blocks, which increases the capacity once again. This policy benefits workloads that are relatively localized over discrete time intervals (small working sets), or those workloads that access memory infrequently. In both cases, the cache capacity sacrifice can yield significant energy savings with little visible performance impact at the system level.
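A compact sketch of our reading of this decision rule is given below. The LT default is illustrative; the 90% boost threshold matches the behavior described later in Section 8.2.3, but both values are assumptions rather than published constants.

```python
# Sketch of DPCS Policy 1 (Algorithm 2): one decision per Interval
# based on access diversity (fraction of available blocks touched).

def dpcs_policy1_step(touched_blocks: int, available_blocks: int,
                      level: int, max_level: int,
                      LT: float = 0.5, HT: float = 0.9) -> int:
    """Return the data-array voltage level for the next interval
    (1 = lowest allowed VDD, max_level = SPCS/near-nominal VDD)."""
    utilization = touched_blocks / available_blocks
    if utilization < LT and level > 1:
        return level - 1   # small working set: trade capacity for power
    if utilization > HT and level < max_level:
        return level + 1   # high cache pressure: restore capacity
    return level           # hysteresis band: hold the current level
```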
However, the downside of this policy is that it is oblivious to the actual impact of faulty blocks. For example, a piece of code may frequently access a region of memory that maps to only a few cache sets, which may happen to contain some faulty blocks. As a result, miss rates and average access time will dramatically increase, hampering performance. The policy is incapable of reacting to this scenario, as it might only see 5% of the available cache capacity being used for this piece of code. It will not transition to a higher VDD, even though the decreased capacity on a few select sets matters a great deal.
5.3.2. DPCS Policy 2: Access Time Based. In contrast to DPCS Policy 1, this policy (described by Algorithm 3) tries to trade off energy efficiency and performance using relative average access time as the main metric. The policy is similar to that of Section 5.3.1, but it uses different performance counters to guide DPCS transitions. The benefit of this policy is that it focuses on the "actual" cache performance as measured by average access time. Thus, the real impact of faulty blocks is captured. Revisiting the example from Section 5.3.1, this approach will scale up VDD to relieve the performance bottleneck. Furthermore, because this policy does its best to bound access time, it may be easier to estimate the impact of DPCS on system-level application performance.
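The access time-based variant swaps the utilization metric for a slowdown metric; a minimal sketch follows, with illustrative thresholds (the paper does not publish its exact values).

```python
# Sketch of DPCS Policy 2 (Algorithm 3): decide using relative average
# access time instead of access diversity.

def dpcs_policy2_step(avg_access_time: float, hit_time: float,
                      level: int, max_level: int,
                      LT: float = 1.1, HT: float = 1.5) -> int:
    slowdown = avg_access_time / hit_time   # >= 1.0 by construction
    if slowdown > HT and level < max_level:
        return level + 1   # misses are hurting: boost VDD and capacity
    if slowdown < LT and level > 1:
        return level - 1   # cache is comfortably fast: save power
    return level
```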
However, this policy also has a significant drawback. Cache miss rates are strongly dependent on application behavior. Miss rates and average access time might increase simply because the application references a previously unused data structure, causing compulsory misses. In such a case, the access time-based policy will increase the supply voltage in order to reduce the miss rate, but this would have little to no effect on performance. Thus, neither policy is superior in all cases.
MODELING AND EVALUATION METHODOLOGY
We assessed SPCS and DPCS using a combination of analytical and simulation-based evaluations. The overall framework is depicted in Figure 3 . We now describe the models and procedures used in the evaluation.
Modeling Cache Fault Behavior
Probabilistic failure models were used to analytically compare power/capacity trade-offs, predict yield, and guide the generation of random fault map instances for the simulation part of our evaluations. Equation (4) summarizes the essential fault models, where $p_f(v)$ is the probability of cell failure (Bit Error Rate (BER)) at supply voltage $v$:

$$
p_{block}(v) = 1 - \bigl(1 - p_f(v)\bigr)^{B}, \qquad
p_{set}(v) = p_{block}(v)^{a}, \qquad
\text{Yield}_{DPCS}(v) = \bigl(1 - p_{set}(v)\bigr)^{s},
\tag{4}
$$

where $B$ is the number of bits per block, and $\text{Yield}_{DPCS}(v)$ is the yield of an $a$-way DPCS cache with $s$ sets (i.e., the probability that no set is entirely faulty at voltage $v$).
For our yield comparisons with SECDED and DECTED ECC, we assumed each method maintained ECC at the subblock level, which is significantly stronger than conventional block-level application. The number of ECC bits $n - k$ required per subblock of length $k$ bits with total subblock codeword of length $n$ bits, as well as ECC yield, is given by Equation (5):

$$
2^{n-k} \ge \sum_{i=0}^{t}\binom{n}{i}, \qquad
\text{Yield}_{ECC}(v) = \left[\,\sum_{i=0}^{t}\binom{n}{i}\, p_f(v)^{i}\,\bigl(1 - p_f(v)\bigr)^{n-i}\right]^{M},
\tag{5}
$$

where $t$ is the number of correctable bit errors per codeword ($t = 1$ for SECDED, $t = 2$ for DECTED) and $M$ is the number of subblock codewords in the cache.
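These models can be evaluated numerically in a few lines; the sketch below follows the reconstruction above, with parameters B, a, s, n, and t as defined in the equations and example values that are placeholders only.

```python
# Numerical sketch of the fault/yield models in Equations (4) and (5).
from math import comb

def dpcs_yield(p_f: float, B: int, a: int, s: int) -> float:
    p_block = 1.0 - (1.0 - p_f) ** B   # any faulty bit => faulty block
    p_set = p_block ** a               # all a ways faulty => faulty set
    return (1.0 - p_set) ** s          # no set may be entirely faulty

def ecc_subblock_yield(p_f: float, n: int, t: int, num_subblocks: int) -> float:
    """Yield with t-error-correcting ECC per n-bit subblock codeword
    (t = 1 for SECDED, t = 2 for DECTED)."""
    p_ok = sum(comb(n, i) * p_f**i * (1 - p_f)**(n - i)
               for i in range(t + 1))
    return p_ok ** num_subblocks

# Example: 64-byte blocks (B = 512 bits), 4 ways, 64 sets.
print(dpcs_yield(1e-6, B=512, a=4, s=64))
```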
System and Cache Configurations
The proposed mechanism as well as the SPCS and DPCS policies were evaluated for two system configurations: Config. A and Config. B. These are described by Table I . Each system configuration had a split L1 cache and an L2 cache. SPCS and DPCS policies were only applied to the L1D and L2 caches. The same policy was used at both levels. The policies' decisions were not coordinated between cache levels.
Technology Parameters and Modeling Cache Architecture
To maintain relevance to the test chip experiments from Section 3, MOSFET saturation and leakage currents were drawn from the corresponding industrial 45nm SOI technology library. All other parameters are based on International Technology Roadmap for Semiconductors (ITRS) data. BERs were computed using models from Wang and Calhoun [2011] and are shown in Figure 4(a). Since the proposed architectural mechanism allows for SRAM access at reduced voltage, we adopted read stability for the BER calculations, which was the worst case for Static Noise Margin (SNM). Otherwise, we did not distinguish between causes of cell failure.
We calculated delay, static power, dynamic energy per access, and area of the modified and baseline cache architectures using CACTI 6.5 [Balasubramonian et al. 2009]. CACTI generated the optimal design for the nominal VDD of 1V using the energy-delay product as the metric. We modified CACTI to reevaluate the energy and delay as the data array VDD was scaled down, without reoptimizing the design. The L2 cache used slower and less leaky transistors than the L1 cache to keep power in check.
Our CACTI results indicated that as long as the data array VDD is above nearthreshold levels, reducing the data cell VDD impacts the total cache access time by approximately 10% in the worst case of four cache configurations at min-VDD. This performance impact is modeled as a one-cycle hit time penalty when the cache operates at the lowest DPCS voltage level in our simulations. This is based on a conservative assumption that the cache access critical path dictates overall cycle time.
Simulation Approach
To simulate the baseline (no VDD scaling and no faulty blocks), SPCS, DPCS Policy 1, and DPCS Policy 2 caches, we carefully modified the cache architecture in the gem5 [Binkert et al. 2011] framework to implement SPCS and DPCS in detail, and instrumented it for cache, CPU, and DRAM power and energy. These used simple first-order energy models, with constants for static power and dynamic energy per operation. This enabled the overall energy impact of extra CPU stall cycles and DRAM accesses to be captured in addition to the direct cache energy savings from SPCS and DPCS.
We used 16 SPEC CPU2006 benchmarks cross-compiled for the Alpha ISA with base-level optimization using gcc. The benchmarks were fast-forwarded for two billion instructions, then simulated in maximum detail for two billion instructions using the first SPEC reference data inputs. We modeled a single-core system to run the single-threaded benchmarks. The gem5 settings common to all system configurations are summarized in Table II. For each of the nonbaseline policies, 16 SPEC benchmarks, and two system configurations, gem5 was run five times with five unique fault map inputs. Thus, in total, we performed 512 gem5 simulations to provide some insight on the impact of manifested process variations.
6.4.1. Fault Map Generation and Selection of Runtime VDDs. For accurate simulation in the presence of faulty blocks at low voltage, probabilistic power/capacity scaling data based on CACTI results were input to a MATLAB model that generated 10,000 random fault map instances for each cache configuration. A script chose five fault maps corresponding to each quintile in the distribution as sorted by the min-VDD of each cache instance. For each of these five selected fault map instances, three runtime VDDs were selected based on the criteria described earlier in Section 5.1.6. The total static power of each cache instance was then computed based on the number of power-gated faulty blocks. The particular locations of faulty blocks were assumed to not matter, because eventually all blocks will be referenced roughly uniformly [Sánchez et al. 2013] .
The SPCS voltage for each fault map was minimized such that the capacity was at least 99%. The lower DPCS voltage was set such that capacity was at least 75% and no set was totally faulty (recall that we require each set to have at least one nonfaulty block at all voltages). In practice, out of 40,000 total generated fault map instances, the lower DPCS voltage was always limited by the latter constraint.
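The generation and selection flow can be outlined as follows. This is a simplified sketch: bit_error_rate() stands in for the Wang and Calhoun [2011] model, and the sampling scheme merely enforces the fault inclusion property rather than reproducing the exact MATLAB methodology.

```python
import random

def generate_fault_map(num_blocks, bits_per_block, voltages_mv, bit_error_rate):
    """Sample each block's minimum non-faulty VDD. Scanning downward and
    stopping at the first failure enforces the fault inclusion property."""
    fault_map = []
    for _ in range(num_blocks):
        min_vdd = max(voltages_mv)          # nominal assumed non-faulty
        for v in sorted(voltages_mv, reverse=True)[1:]:
            p_block = 1 - (1 - bit_error_rate(v)) ** bits_per_block
            if random.random() < p_block:
                break                        # faulty here and at all lower VDDs
            min_vdd = v
        fault_map.append(min_vdd)
    return fault_map

def select_spcs_vdd(fault_map, voltages_mv):
    """Lowest VDD retaining at least 99% effective capacity (Section 5.2);
    the lower DPCS level would additionally require >= 75% capacity and
    no totally faulty set."""
    for v in sorted(voltages_mv):            # try the lowest levels first
        capacity = sum(m <= v for m in fault_map) / len(fault_map)
        if capacity >= 0.99:
            return v
    return max(voltages_mv)
```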
6.4.2. Parameter Selection for DPCS Policies. The DPCS policy parameters were set to reasonable values to reduce the huge design space, based on our analysis of policy behaviors on different workloads (examples are given later in Section 8.2.3). We assumed that the VoltageRegulatorDelay to scale the data array VDD is enough to minimize noise-induced SRAM errors and allow for a stable transition. For each cache configuration, the Interval parameter was set to be roughly 100× the respective VoltageTransitionPenalty in clock cycles. This is done so that the worst-case system-level performance impact of DPCS transitions would be bounded to 1% of overall runtime. Our choices of Interval are not guaranteed to be optimal across all applications, but they do provide some guarantees on the high-level performance impact of DPCS.
DPCS HighThreshold (HT) and LowThreshold (LT) for both policies were set to provide sufficient hysteresis, avoiding voltage-level oscillation for workloads in general over different phases of execution. For DPCS Policy 1 (access diversity based), the thresholds were set such that mild cache utilization (as measured by the working set size over an Interval [Dhodapkar and Smith 2002] ) would result in a voltage reduction. For DPCS Policy 2 (access time based), we ensured that the cache average access time would not suffer too greatly in relation to its hit time.
The choice of policy parameters might be chosen in other ways, for example, by leveraging other techniques for detecting phases of execution [Dhodapkar and Smith 2003; Dani et al. 2014] . The DPCS policy parameters could also be dynamically adjusted by online monitoring of workload behaviors and/or OS-based management. We leave these possibilities to future work, where we aim to adapt DPCS to the context of multicore systems.
ANALYTICAL RESULTS
In this section, results pertaining to yield, static power versus effective capacity, and area overheads are discussed without the use of architectural simulation.
Fault Probabilities and Yield
Using the derived static power from our technology data integrated into CACTI as well as the analytical fault models from Section 6.1, we were able to compare our proposed mechanism analytically with alternative approaches. The results for L1 Config. A and L2 Config. B are shown in Figure 4 . Similar trends are observed for the L1 Config. B and L2 Config. A caches, whose results are not shown for brevity. For yield versus design-time fixed VDD, we compare with SECDED/DECTED ECC and FFT-Cache [BanaiyanMofrad et al. 2011] , which is a recent FTVS approach achieving one of the lowest min-VDDs. For FFT-Cache and both ECC schemes, fault tolerance is applied at the subblock level of eight bytes (where the full block size is 64 bytes) for additional resilience. For the static power versus effective capacity metric described in Section 4, we compare with FFT-Cache and a generic way-granularity (associativity reduction) power gating scheme.
The probability of faulty bits (BER) is shown in Figure 4(a), while the probabilities of faulty subblocks, blocks, sets, and DPCS yield are depicted in Figures 4(b) and 4(c). It is clear that in the region of 500-600mV, a steep drop-off in yield is encountered. This is due to exponential fault rates at the bit level, which become critical in this region. This trend agrees with our test chip measurements in Section 3, which showed that SRAM failures experience "avalanche-like" behavior.
In Figures 4(d) and 4(e), the yield of DPCS is compared with alternative approaches. Although our approach does not achieve the best min-VDD for constant yield, it did better than SECDED in all cache configurations, even though ECC is applied at eight-byte subblock granularity. In the L1 Config. A configuration, DECTED achieves moderately better min-VDD than our mechanism due to low associativity, which impacts yield of our approach. The 16-way set associativity of L2 Config. B results in lower min-VDD than both ECC schemes and nearly matches FFT-Cache. Note that DECTED typically incurs high area, power, and performance overheads to achieve the yields shown [Kim et al. 2007; Rossi et al. 2011] . Furthermore, SECDED/DECTED may be overkill for sparse voltage-induced faults, and as voltage is reduced, tolerating bit cell failures reduces the ability of these ECC schemes to tolerate soft errors. Nevertheless, these ECC schemes could be combined with our approach to handle both voltage-induced faults as well as transient soft errors. We leave this to future work.
Static Power versus Effective Capacity
Despite some yield weaknesses for low-associativity cache configurations, the proposed mechanism still achieves lower total static power at all voltages compared to FFT-Cache [BanaiyanMofrad et al. 2011]. This is due to the lower overheads of DPCS' fault tolerance compared to the complex approach of FFT-Cache. The difference arises from a significantly smaller fault map with only three extra bits per 64-byte block. In contrast, FFT-Cache needs two entire fault maps for each of the lower VDDs, compounding its existing high overheads and extra necessary logic on the critical path. Furthermore, under certain floor plan and layout conditions described earlier in Section 5.1.2, our approach can power gate blocks as they become faulty at low voltage, providing additional power reduction at low capacities compared to pure voltage scaling. This result is depicted in Figure 5(a) along with a hypothetical version of FFT-Cache with power gating of lost capacity (gray dashed line). If the power gating circuitry described in Section 5.1.2 is omitted, then power savings will be moderately worse at capacities less than 100%, as shown by Figure 5(a). However, overall area overheads would decrease by roughly 2%, as described next in Section 7.3. This simpler architecture may be better for SPCS, where power gating logic is only used for up to 1% of blocks that are allowed to be faulty. While the per-block power gating feature is useful for an improved power versus capacity curve, it is not absolutely essential to DPCS. This makes our architecture compelling even for traditional thin-cell SRAM layouts with almost no modifications to the macro floor plan.
Although our approach achieves lower power than FFT-Cache, it also has a lower effective capacity at all voltage levels than FFT-Cache due to a much simpler but weaker fault-tolerance mechanism, as shown by Figure 5(b) . This reaffirms the better yield of FFT-Cache. These results clearly illustrate a trade-off in power and capacity at each voltage level between the two approaches.
In the end, the superior power savings of the proposed approach make up for the capacity deficit with respect to FFT-Cache, as depicted by Figure 5(c) . Moreover, our approach with power gating of faulty blocks does better than generic associativity reduction via per-way power gating at all cache capacities. This is unlike FFT-Cache, which does worse than power gating below 50% cache capacity because of its high overheads. We found that for the L1 Config. A cache, our mechanism achieves 31.1% lower static power than FFT-Cache at the same 99% effective capacity. In addition, as cache capacity is scaled down with power, the gap increases in favor of our approach.
Area Overhead
Our CACTI results indicate that from the fault map alone, our area overheads compared to a baseline cache lacking fault tolerance do not exceed 4% in the worst case of all configurations. The additional area overheads from the power gating transistor [Powell et al. 2000] plus a small inverter are estimated to be no more than 2%. The DPCS policy implementation is assumed to have negligible area overhead, due to its simplicity and the fact that it could be implemented in software, as the cache controller typically includes the necessary performance counters. Furthermore, the fault map comparison logic is only a few gates per cache way. Thus, we estimate the total area overhead to be up to 6% among all tested cache configurations. These area overheads are a significant improvement compared to the reported overheads of other FTVS schemes, such as those based on 10T SRAM cells [Calhoun and Chandrakasan 2006].
SIMULATION RESULTS
In this section, we discuss aggregate energy savings and performance overheads across a portion of the SPEC CPU2006 benchmark suite on both system configurations for several different fault map instances.
Fault Map Distributions
Using the fault models described in Section 6.1 and the methodology from Section 6.4.1, we generated 10,000 complete fault map instances for each of the four cache configurations, that is, L1-A, L2-A, L1-B, and L2-B. Each fault map contained the min-VDD for each cache block at a resolution of 10mV. Figure 6(a) depicts the aggregate faulty block distribution across all 40,000 randomly generated fault maps. Figure 6(b) depicts the histogram of the minimum global VDD that can be used across a cache instance. For the baseline case, any faulty block at a given voltage will limit the overall min-VDD. Clearly, the long tail from Figure 6(a) hampers the traditional design approach (these numbers agree with our 45nm process guidelines for min-VDD as well). In the baseline caches, as nominal cache capacity increases, so does the min-VDD, regardless of associativity.
However, the cache min-VDD distributions for the proposed architecture follow a very different trend, as indicated by Figure 6(b) . Unlike traditional caches, the benefits of increased associativity outweigh increased nominal capacity for reducing VDD. This is because the voltage scalability of our approach is only limited by the constraint that no set may be totally faulty (recall the min-VDD requirement for our approach from Sections 5.1.6 and 6.1). As associativity increases, the likelihood of any set being completely faulty rapidly decreases.
Thus, our simple fault-tolerance scheme exhibits good scalability as cache memories increase in size and associativity. Moreover, the variance in min-VDD across many cache instances is significantly less than conventionally designed caches, meaning that even design-time choice of VDD can be done with narrower guardbands while maintaining high yield.
Architectural Simulation
We now discuss the impact of SPCS and DPCS on energy and performance. All the depicted results include the architectural design choice to power gate faulty blocks. We also conducted the same set of experiments without per-block power gating; the results are not shown for brevity. For the worst-case fault maps at the lowest voltages, the total cache static energy difference between the two approaches did not exceed 5%. At the system level, the overall energy gap between the two design choices was less than 1%. Henceforth, we only discuss the architecture with the power gating feature included.

Fig. 7. Breakdown of averaged total system energy for baseline, SPCS, and DPCS policies P1 and P2, all normalized to the respective benchmarks' baseline system total energy. Note that because DPCS has little impact on overall runtime and DRAM accesses, the relative energy breakdown across CPU and DRAM can be renormalized to other systems in a straightforward manner.
8.2.1. Breakdown of Energy Savings. Both SPCS and DPCS deliver static and dynamic power savings due to low-voltage operation even during cache accesses. The breakdown of total system energy is depicted in Figure 7 for the baseline, SPCS, DPCS Policy 1 (P1), and DPCS Policy 2 (P2) caches using the mean of five unique fault map runs. The variations in energy breakdowns across benchmarks in total baseline and SPCS energy are due to workload dependencies, that is, the extent to which they are CPU or memory bound.
SPCS achieved good and consistent energy savings across all benchmarks, achieving an average of 62% total cache and 22% total system energy savings with respect to baseline in both Configs. A and B. This is because cache efficacy is not significantly impacted at the 99% capacity point, and no voltage transitions occur to other power/capacity points.
Energy savings were almost always better for both DPCS policies P1 and P2 compared to SPCS and the baseline. This was in line with our expectations, as both policies never go above the SPCS voltage level because little performance would be gained. However, note that DPCS has some system-level energy overhead due to extra time spent in DPCS transitions and potentially more cache misses and CPU stall cycles compared to SPCS or baseline. For example, reduced power and capacity in L1 due to DPCS can cause more L2 accesses, resulting in a trade-off between L1 static power and L2 dynamic energy.
In nearly all cases, the access diversity-based DPCS Policy 1 (P1 in Figure 7) achieved equal or slightly lower total cache energy than the access time-based DPCS Policy 2 (P2). On average, P1 reduced total cache (system) energy by 78% (27%) compared to baseline on Config. A, and by 80% (26%) on Config. B. On average across all benchmarks and configurations, P2 saved nearly identical system-level energy as P1. However, in some benchmarks, such as perlbench on Config. A or lbm on Config. B, the gap between policies was noticeable. Discrepancies in energy savings between DPCS Policy 1 and DPCS Policy 2 could be explained by a combination of threshold parameters and the method of inferring performance impact at low voltage. For example, the cache capacity utilization measured by DPCS Policy 1 loosely relates to performance through working set analysis. In contrast, DPCS Policy 2 measures performance more directly through average access time. However, owing to the huge design space already presented, we do not have the resources to exhaustively test other policy threshold parameters. We leave static and dynamic policy optimization to future work.
Regardless of differences in policies, SPCS and DPCS demonstrate reduced total cache and system energy overall with little impact on the CPU or DRAM. As we will see in the next section, this can be attributed to low performance overheads while still reducing static cache power and some dynamic access energy.
8.2.2. Performance Overheads.
The energy savings presented previously come at a mild performance cost as measured by execution time. Figure 8 depicts the average performance penalty of SPCS, DPCS Policy 1, and DPCS Policy 2 for each benchmark with respect to the baseline system. Across all benchmarks on Config. A, the slowdown of SPCS is only 0.32 ± 0.06% with respect to baseline. For Config. B, the slowdown is 0.50 ± 0.02%. This shows that a 1% loss of cache capacity has a negligible impact on performance. Thus, even with the worst-case fault maps, the per-chip SPCS approach is effective at achieving energy savings with very little overhead. Thanks to the low variance in min-VDD with SPCS (see Section 8.1), an aggressive design-time choice of SRAM VDD (as opposed to our runtime approach) could also result in negligible performance impact while maintaining high yield and good energy savings.

Figure 8 also shows how DPCS performance degrades compared to SPCS. On average, the access time-based DPCS Policy 2 (0.77%-1.76% slowdown) performs equal to or better than the access diversity-based DPCS Policy 1 (0.94%-2.24%) on Config. A. This small performance advantage of P2 holds for Config. B. For both SPCS and DPCS, higher cache associativity improves performance consistency in addition to min-VDD across a variety of fault maps.
Interestingly, the per-benchmark performance of both policies appears to be dependent on the system configuration. For example, in Config. A, P1 does poorly on bwaves and is also very sensitive to fault map variation. In contrast, lbm performed very consistently for SPCS, P1, and P2 under Config. A, but suffered relatively more under P1 for Config. B. We believe these sensitivities are due to a combination of different CPU frequencies, cache hit times, and possibly fault map instances.
8.2.3. DPCS Policy Behavior. Finally, we illustrate some characteristics of each DPCS policy by examining short snippets of their behavior during execution. Figure 9 shows two trace fragments for different workloads, system configurations, cache levels, and DPCS policies. For most of the execution trace depicted in Figure 9(a), the cache capacity is roughly 80%. This indicates that DPCS is in the lowest voltage mode. DPCS Policy 1 does not boost voltage unless the working set size exceeds 90% of the available capacity. This occurs several times around Interval 150. The cache occupancy then increases rapidly, indicating that some performance was probably recovered. However, higher miss rates and average access times do not necessarily cause high block touch rates. This can be seen around Interval 210, where the miss rate jumps to approximately 60%, but only 65% of the nominal capacity is referenced at that time. In this case, a DPCS boost would probably not help performance noticeably.

A very different trend is shown in Figure 9(b). Since the cache is relatively large compared to the needs of the application, the working set size always remains below 20% of nominal capacity. However, note that the application regularly has a miss rate of up to 40%. DPCS Policy 2 adapts by transitioning to a higher cache capacity in an attempt to recover performance. However, the downside is that such an action is not guaranteed to affect the miss rate or recover performance.
CONCLUSION AND FUTURE WORK
In this work, we proposed SPCS and DPCS, a novel FTVS SRAM architecture for energy-efficient operation. Our proposed mechanism and policies leverage several important observations. First, using our 45nm SOI test chips, we observed the fault inclusion property, which states that any block that fails at some supply voltage will also be faulty at all lower voltages. To the best of our knowledge, no other FTVS approaches exploit this behavior. This allows for efficient fault map representation for multiple runtime cache VDD levels. Our approach also has the benefit of a simple and low-overhead implementation, which allows the caches to achieve better power/capacity trade-offs than some competing approaches that primarily focus on achieving low min-VDD at fixed yield.
We believe that DPCS has the potential to complement conventional DVFS for improved system-level energy proportionality. The application of DPCS in a real system would bring interesting challenges and opportunities to the software stack. Our future work seeks to develop coordinated DVFS and DPCS policies at the system level, taking into account heterogeneous system architectures, multicore processors, cache coherence, aging effects, and implications for power management of other system components. This approach to variation-aware design could also be extended to allow for approximate computing.
