Abstract-Modern digital IC designs have a critical operating point, or "wall of slack", that limits voltage scaling. Even with an error-tolerance mechanism, scaling voltage below a critical voltage (so-called overscaling) results in more timing errors than can be effectively detected or corrected. This limits the effectiveness of voltage scaling in trading off system reliability and power. We propose a design-level approach to trading off reliability and voltage (power) in, e.g., microprocessor designs. We increase the range of voltage values at which the (timing) error rate is acceptable; we achieve this through techniques for power-aware slack redistribution that shift the timing slack of frequently-exercised, near-critical timing paths in a power- and area-efficient manner. The resulting designs heuristically minimize the voltage at which the maximum allowable error rate is encountered, thus minimizing power consumption for a prescribed maximum error rate and allowing the design to fail more gracefully. Compared with baseline designs, we achieve a maximum of 32.8% and an average of 12.5% power reduction at an error rate of 2%. The area overhead of our techniques, as evaluated through physical implementation (synthesis, placement and routing), is no more than 2.7%.
I. INTRODUCTION
The traditional goal of IC design is for the product to always operate correctly, even under worst-case combinations of (process, voltage, temperature, wear-out, etc.) non-idealities. It is well-recognized that designing for worst-case operating conditions incurs considerable area, power and performance overheads [5], and that these overheads worsen with increased manufacturing or runtime variations in advanced technology nodes [24]. Better-than-worst-case (BTWC) design [1] allows reliability (in the sense of timing and hence functional correctness) to be traded off against performance and power. The central idea, as exemplified by the shadow-latch technique in Razor [5], is to design for average-case conditions (thus saving area and power) while adding an error detection and correction mechanism to handle errors that occur with worst-case variabilities. System-level techniques, as exemplified by Algorithmic Noise Tolerance [13], allow timing errors to proliferate into the system or application, but then exploit algorithmic and/or cognitive noise tolerance in mitigating errors at the application level. The use of such application- or system-level error detection and correction is assumed in proposed probabilistic SOCs [3] and stochastic processor architectures [18], which are recent classes of BTWC designs.
Our work focuses on the optimized application of voltage overscaling for power reduction in the context of BTWC design. The impact of BTWC design techniques is often limited in high-performance digital designs by a critical operating point or "wall of slack" phenomenon that limits voltage overscaling and, more importantly, is a direct consequence of today's standard approach to power optimization. The Critical Operating Point (COP) hypothesis [19] (cf. Figure 1) states that a modern digital design will have a critical operating voltage V_c above which zero timing errors occur, and below which massive timing errors occur. The COP hypothesis is natural in light of how modern designs are optimized for power and area, subject to a frequency constraint: negative timing slack (on a combinational path) is cured by upsizing and buffering, while positive timing slack is traded off for area and power reductions. Thus, in the final design, many timing paths are critical, and there is a "wall of (critical) slack". The COP hypothesis states that overscaling beyond the critical voltage can abruptly cause error rates beyond what an error-tolerance mechanism can handle. According to [19], this has been confirmed in general-purpose microprocessors.
A key motivation for our work is that COP behavior limits the applicability of voltage scaling in trading off reliability for power, even in the context of BTWC design. Our work seeks to improve the effectiveness of BTWC design techniques through power-aware slack redistribution, a novel design approach that enables extended voltage-reliability tradeoffs. Power-aware slack redistribution reapportions timing slack of frequently occurring, near-critical timing paths to increase the level of overscaling (i.e., reduce the minimum voltage) at which a given error-tolerance mechanism can maintain an acceptable timing error rate. The result is a design that fails more gracefully, and achieves significantly improved power savings with only small degradation of application performance.
In the following, Section II reviews previous work, and Section III formalizes the problem of achieving "gradual slope" in the timing slack distribution, and hence graceful degradation of correctness with voltage overscaling. Section IV describes our power-aware slack redistribution techniques, and Section V discusses implementation and experimental methodology. Section VI presents results and analysis, and Section VII concludes.
II. RELATED WORK

A. BTWC Designs
Better-than-worst-case (BTWC) design approaches allow circuits to save power by optimizing for normal operating conditions rather than worst-case conditions. One class of BTWC techniques allows adaptation to runtime conditions by specifying multiple safe voltage and frequency levels at which a design may operate, and allows for switching between these states. Examples in this class are Correlating VCO [2], [12] and Design-Time DVS [23]. Another class of BTWC designs uses "canary" circuits, including delay-line speed detectors [4] and triple-latch monitors [16], to detect when critical timing (failure) is imminent and thus avoid any unsafe voltage scaling. Finally, Razor [5] and ANT techniques [13] provide error tolerance at the circuit- and algorithm-level, respectively. Their benefits under voltage scaling are limited not only by COP behavior but by the overhead of error correction that is increasingly required as voltage is scaled.
978-1-4244-5767-0/10/$26.00 2010 IEEE 10A-1
B. Design-level Optimizations
Design-level optimizations for timing speculation architectures ([8], [20], [7]) identify and optimize frequently-exercised timing paths, while other (infrequently-exercised) paths are allowed to have timing errors. EVAL [20] trades error rate for processor frequency by using system-level configurations to shift or otherwise reshape path delay distributions of various functional units. BlueShift [8] is an application of EVAL that identifies the most frequently violated timing constraints and optimizes the corresponding timing paths using forward body biasing and path constraint tuning (PCT) (essentially, setup timing over-constraints). CRISTA [7] also addresses variation-tolerant circuit design by using Shannon expansion-based partitioning to isolate critical paths. Critical paths with low activity are separated and de-optimized.
C. Cell Sizing
Many previous works use cell sizing for power or area recovery subject to timing constraints. Generally, positive (setup) timing slack on non-timing-critical cell instances can be flexibly 'traded' for power and area objectives (gate-length increase or V_th swap to reduce leakage, or gate-width decrease to reduce area and total power). Fishburn and Dunlop propose a fast iterative method to meet delay constraints called TILOS [6]. TILOS uses the Elmore delay model for transistor delays, and proposes a heuristic that sizes transistors iteratively, according to the sensitivity of the critical path delay to the transistor sizes, i.e., finding a greedy (maximum delay reduction / transistor width increase) "move" at each iteration. The method of Duet [21] performs simultaneous assignment of threshold voltage (V_th) and transistor width using a merit (sensitivity) function. Gupta et al. propose small biases of transistor gate length to further minimize leakage power [10]. They also present a sensitivity-based downsizing approach for transistor-level V_th assignment [9], and a post-layout, post-signoff gate length biasing technique for parametric yield (leakage and leakage variability) optimization [11]. Jeong et al. revisit a general linear programming (LP) formulation that can concurrently exploit multiple knobs ranging from multi-L_gate footprint-compatible libraries to post-layout L_gate biasing [14].
Our work, detailed below, uses post-layout cell resizing to redistribute timing slack so as to achieve a switching activity-aware 'gradual slope' distribution of timing slack. We believe that ours is the first to do so specifically in the BTWC context; moreover, our proposed methodology can find frequently exercised paths rapidly (without repeated gate-level simulation as in BlueShift).
III. THE GOAL OF DESIGN OPTIMIZATION
We minimize power consumption for a given error rate by minimizing the voltage at which that error rate can still be observed. Traditional designs exhibit a critical wall of slack in which the path slacks for the majority of paths are similar and close to the critical path slack of the circuit (observe the red curve in Figure 2).
For traditional designs, our goal of aggressively reducing the operating voltage to save power is thwarted by the critical wall of slack, because scaling past the wall results in a catastrophic number of timing violations. To alleviate this restraint, we seek to reshape the slack distribution of a circuit to have a gradual slope rather than the steep slope that characterizes the critical wall (observe the blue curve in Figure 2).
To achieve the desired slack distribution that permits aggressive voltage scaling, we must alter the slack of some paths to break down the critical wall. In our optimization approach, we increase the slack of frequently executed critical paths to delay the onset of errors when voltage is scaled. Likewise, we can reduce the slack of rarely exercised paths, since these paths will not have a significant impact on error rate when voltage is scaled. Together, these path slack adjustments will reshape the slack distribution of a circuit and extend the range of voltage scaling. The goals of our optimization are expressed in Figure 2.
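The contrast between a wall of slack and a gradual slope can be illustrated with a small numerical sketch. The slack values and the helper `failing_fraction` are hypothetical, and the sketch assumes that voltage scaling removes the same amount of slack from every path, which is a simplification of real delay behavior.

```python
from typing import List


def failing_fraction(slacks: List[float], slack_loss: float) -> float:
    """Fraction of paths whose slack turns negative once every path loses slack_loss (ns)."""
    failing = sum(1 for s in slacks if s - slack_loss < 0)
    return failing / len(slacks)


# Hypothetical slack distributions (ns). "wall" clusters near the critical slack;
# "gradual" spreads the same number of paths over a wider range.
wall = [0.00, 0.01, 0.02, 0.03, 0.04]
gradual = [0.00, 0.10, 0.20, 0.30, 0.40]

for loss in (0.0, 0.05, 0.25):
    print(loss, failing_fraction(wall, loss), failing_fraction(gradual, loss))
```

With the wall distribution, a small slack loss makes every path fail at once; with the gradual distribution, failures accumulate step by step, which is the behavior the optimization aims for on frequently exercised paths.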
IV. SLACK REDISTRIBUTION AND POWER REDUCTION
We now present our cell swapping approach to achieve a gradual, activity-(and hence power-) aware timing slack distribution for a given circuit design.
A. Power-aware Slack Redistribution Using Cell Swap Method
Our slack distribution optimizer is implemented in C++ and performs cell swapping (gate sizing only, with no logic resynthesis) using the Synopsys PrimeTime vB-2008.12-SP2 [27] tool and its built-in Tcl socket interface. Traditional optimization tools treat all paths equally: all negative-slack paths must be brought up to zero or positive slack. By contrast, to improve the performance-power profile of the design, we spend our optimization efforts on frequently-exercised paths in order to minimize error rates under voltage overscaling.
Our heuristic determines a target voltage corresponding to a specific error rate, and then 'over-optimizes' frequently-exercised paths using upsizing (i.e., increase of transistor width and hence drive strength) cell swaps. Figure 3 illustrates the challenges inherent in setting the target voltage. The figure shows path delay changes after slack optimization for a fixed target voltage, where the optimizer swaps cells in Paths A, B and C to reduce the negative slack of those paths with respect to the fixed target voltage. However, if the voltage at which the maximum acceptable error rate is observed is larger than the target voltage, as specified by the red-dotted line, Paths A and C are optimized unnecessarily beyond the actual scaled voltage, and power is wasted.
Our slack optimization approach finds a target voltage after estimating error rates at each operating voltage, and iteratively optimizes paths while scaling voltage. At the initially selected voltage, the optimizer performs cell swaps to improve timing slack. After performing this timing optimization at the initially selected voltage, the voltage is scaled until the target error rate is reached. Figure 4 illustrates the optimization heuristic. Path A is optimized until the target voltage is reached, but Path C is not optimized, since Path C does not have negative slack at the target voltage.
To enable proper voltage selection, we must accurately forecast error rates without resorting to time-consuming functional simulation. Accurate error rate estimation ensures that we do not over-optimize, resulting in too much area overhead, or under-optimize, thus limiting the possible extent of voltage scaling. For this purpose, we use toggle information of flip-flops that have negative timing slack. The toggle information consists of toggles from both negative-slack paths and positive-slack paths; in the error rate calculation, only toggles from negative-slack paths are considered. Hence, the error rate in a single flip-flop, ER_ff, can be estimated by Equation (1).
In Equation (1), TG_ff is the toggle rate of flip-flop ff, and TG_p_neg and TG_p_all are the toggle rates of a negative-slack path p_neg and of all paths p_all to flip-flop ff. We obtain an aggregate error rate as the summation of error rates in all flip-flops. However, this value will be significantly larger than the actual error rate observed during functional simulation, because it does not account for errors that occur in the same clock cycle (which count as a single erroneous cycle). Moreover, the existence of false paths also imparts pessimism to the estimated error rate. We therefore use a parameter α, obtained from experimental investigations, to compensate for this pessimism. The estimated error rate of a target design D is defined in Equation (2), with the compensation parameter α. In the experiments reported below, a value of α = 0.5 is used.
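Read against the symbol definitions above, Equations (1) and (2) can be written as follows. This is one form consistent with the prose; the summation over the paths ending at ff and over the flip-flops of D is an assumption about the exact notation, not a verbatim copy of the typeset equations.

```latex
\begin{align}
ER_{ff} &= TG_{ff} \times
  \frac{\sum_{p_{neg}} TG_{p_{neg}}}{\sum_{p_{all}} TG_{p_{all}}} \tag{1} \\
ER_{D}  &= \alpha \times \sum_{ff \in D} ER_{ff} \tag{2}
\end{align}
```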
Figure 6 compares estimated error rates and actual error rates at each operating voltage. The estimated error behavior roughly matches the actual error behavior, and we can find an appropriate target voltage based on the estimated error rate.
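The estimate described for Equations (1) and (2) can be sketched in a few lines. The `FlipFlop` record and both function names are hypothetical, and the ratio of negative-slack path toggles to all path toggles is our reading of the text rather than the authors' code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FlipFlop:
    toggle_rate: float              # TG_ff
    path_toggle_rates: List[float]  # TG_p for every path ending at this flip-flop
    path_slacks: List[float]        # slack of the same paths at the voltage under test


def flip_flop_error_rate(ff: FlipFlop) -> float:
    """Equation (1) as read here: scale TG_ff by the share of toggles on negative-slack paths."""
    total = sum(ff.path_toggle_rates)
    if total == 0.0:
        return 0.0
    negative = sum(tg for tg, slack in zip(ff.path_toggle_rates, ff.path_slacks) if slack < 0.0)
    return ff.toggle_rate * negative / total


def design_error_rate(flip_flops: List[FlipFlop], alpha: float = 0.5) -> float:
    """Equation (2) as read here: alpha times the sum of per-flip-flop error rates."""
    return alpha * sum(flip_flop_error_rate(ff) for ff in flip_flops)


# Hypothetical data: only the first flip-flop has a negative-slack path.
ffs = [
    FlipFlop(0.2, [0.1, 0.3], [-0.05, 0.02]),
    FlipFlop(0.4, [0.2, 0.2], [0.10, 0.20]),
]
print(design_error_rate(ffs))
```

Evaluating `design_error_rate` once per characterized voltage gives an estimated error-rate curve from which a target voltage can be picked without functional simulation.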
After finding a target voltage, the slack optimizer finds negative-slack paths by tracing backward from flip-flop cells using a depth-first search (DFS) algorithm. We optimize by swapping (i.e., resizing) cells with other library cells that have identical functionality. We determine priority according to the switching activity of a path, defined as the minimum toggle rate over all cells in the path. Cell swapping is performed on all cells in a given target path, and only swaps that improve timing slack of the path are accepted. There is significant order-dependence (and hence impact of prioritization), since we do not allow the optimizer to touch a previously-swapped cell; the intuition here is that this helps avoid 'cycling' of configurations and also reduces runtime. After cell swapping, the optimizer checks the timing of fan-in and fan-out cells that have been previously touched. When there is no timing degradation in the connected neighboring cells, the cell change is finally accepted.
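The prioritization rule above (path activity as the minimum toggle rate over the cells of a path, with previously swapped cells left untouched) can be sketched as follows. The cell names, toggle rates and function names are hypothetical.

```python
from typing import Dict, List, Set


def path_activity(path: List[str], toggle_rate: Dict[str, float]) -> float:
    """Switching activity of a path: the minimum toggle rate over all of its cells."""
    return min(toggle_rate[cell] for cell in path)


def prioritize(paths: List[List[str]], toggle_rate: Dict[str, float]) -> List[List[str]]:
    """Order negative-slack paths so the most frequently exercised path comes first."""
    return sorted(paths, key=lambda p: path_activity(p, toggle_rate), reverse=True)


def swappable_cells(path: List[str], touched: Set[str]) -> List[str]:
    """Cells of a path that are still eligible, i.e. not swapped earlier in the run."""
    return [cell for cell in path if cell not in touched]


# Hypothetical toggle rates and two negative-slack paths.
toggle = {"u1": 0.5, "u2": 0.1, "u3": 0.4, "u4": 0.3}
paths = [["u1", "u2"], ["u3", "u4"]]
print(prioritize(paths, toggle))
print(swappable_cells(["u3", "u4"], {"u4"}))
```

Because touched cells are excluded from later paths, the order produced by `prioritize` decides which path gets to resize a shared cell, which is the order-dependence noted in the text.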
Pick the critical path p with maximum switching activity;
...
while swap count is not zero do
    for i = 0 to |p| do
        ...
        for all fanin and fanout cell c_fan of c(i) do
        ...
end while
Algorithm 1 presents pseudocode of the optimizer. ComputeErrorRate(V_target) estimates error rates, as defined by Equation (2). ER_target is a target error rate, which can be set to the maximum allowable error rate. The FindCriticalPaths() function finds all negative-slack paths in the design, and ReportTotalPower(V_target) reports the total power consumption from Synopsys PrimeTime. In the pseudocode, the target voltage is iteratively lowered by an additional 0.01V until the error rate exceeds the target error rate. Then, the heuristic optimizes critical paths at the target voltage. If the power consumption is not reduced after the voltage scaling, the latest swaps are restored by the RestoreSwaps() function and the optimization is terminated.
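The control flow described for Algorithm 1 can be sketched against a toy timing model. Everything here is an assumption made for illustration: `ToyDesign`, its `checkpoint` helper, the fixed `gain_mv` per upsizing swap, and the power model (voltage squared times a per-swap overhead) are hypothetical stand-ins for the PrimeTime-based flow, and the method names only mirror the functions named in the text.

```python
class ToyDesign:
    """Hypothetical stand-in for the timing tool; not the authors' implementation."""

    def __init__(self, fail_mv, toggle, gain_mv=40, area_cost=0.15):
        self.fail_mv = dict(fail_mv)  # path -> voltage (mV) below which its slack is negative
        self.toggle = dict(toggle)    # path -> fraction of cycles that exercise the path
        self.gain_mv = gain_mv        # assumed benefit of one upsizing swap
        self.area_cost = area_cost    # assumed relative power overhead per swap
        self.swaps = []               # paths upsized so far, in order
        self.mark = 0                 # number of swaps at the last checkpoint

    def find_critical_paths(self, v_mv):
        failing = [p for p, f in self.fail_mv.items() if f > v_mv]
        return sorted(failing, key=lambda p: self.toggle[p], reverse=True)

    def compute_error_rate(self, v_mv, alpha=0.5):
        return alpha * sum(self.toggle[p] for p in self.find_critical_paths(v_mv))

    def optimize_path(self, path):
        self.fail_mv[path] -= self.gain_mv
        self.swaps.append(path)

    def report_total_power(self, v_mv):
        return (v_mv / 1000.0) ** 2 * (1.0 + self.area_cost * len(self.swaps))

    def checkpoint(self):
        self.mark = len(self.swaps)

    def restore_swaps(self):
        # Undo only the swaps made since the last checkpoint.
        while len(self.swaps) > self.mark:
            self.fail_mv[self.swaps.pop()] += self.gain_mv


def slack_optimizer(design, v_start_mv=1000, v_min_mv=500, er_target=0.02,
                    step_mv=10, max_rounds=100):
    """Return the lowest voltage (mV) at which a round of swaps still reduced power."""
    v = v_start_mv
    best_v, best_power = v, design.report_total_power(v)
    for _ in range(max_rounds):
        # Lower the target voltage in 10 mV steps until the estimate exceeds ER_target.
        while v - step_mv >= v_min_mv and design.compute_error_rate(v) <= er_target:
            v -= step_mv
        design.checkpoint()
        for path in design.find_critical_paths(v):  # most active path first
            design.optimize_path(path)
        power = design.report_total_power(v)
        if power >= best_power:
            design.restore_swaps()  # latest swaps did not pay off: undo and stop
            break
        best_v, best_power = v, power
    return best_v


d = ToyDesign({"a": 900, "b": 600}, {"a": 0.10, "b": 0.10})
print(slack_optimizer(d), d.swaps)
```

Voltages are kept as integer millivolts so that repeated 0.01V steps do not accumulate floating-point drift. In this toy run the first round of swaps lowers power and the second does not, so the second round is undone and the loop stops.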
Algorithm 2 Pseudocode for the power reduction.
B. Power-aware Post-processing
In addition to the above slack optimization, we can also reduce power consumption by downsizing cells on rarely-exercised paths. Algorithm 2 shows this post-processing heuristic. The power reduction procedure downsizes cells to logical equivalents with smaller power consumption. Two parameters govern cell selection and swap acceptance. First, a cell is selected that has positive slack or is in a rarely-exercised path. The cell's toggle rate should be less than β, where the parameter β is set small enough for us to expect an insignificant effect on error rate. Downsizing cell swaps are accepted as long as they do not increase error rate; to this end, a second parameter γ characterizes the cell's effect on neighboring cells. If the timing slack of neighboring cells that have a toggle rate larger than γ is degraded, the downsizing is restored. Within these constraints, the optimizer selects the best candidate cells to reduce power without affecting error rate.
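The selection and acceptance rules of the post-processing step can be sketched as below. This is one reading of the description, not Algorithm 2 itself: the `Cell` record, the precomputed `neighbor_slack_change` map, and the choice to visit cells in order of power saving are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Cell:
    toggle_rate: float
    slack: float            # worst slack through the cell (ns)
    power: float            # power at the current size
    downsized_power: float  # power of the smaller logical equivalent
    # Assumed to be known in advance: slack change of each neighbor if this cell is downsized.
    neighbor_slack_change: Dict[str, float] = field(default_factory=dict)


def downsize_cells(cells: Dict[str, Cell], beta: float = 1e-4, gamma: float = 1e-4) -> List[str]:
    """Greedily downsize cells that are idle or have slack, sparing active neighbors."""
    accepted = []
    by_saving = sorted(cells, key=lambda n: cells[n].power - cells[n].downsized_power,
                       reverse=True)
    for name in by_saving:
        cell = cells[name]
        if cell.downsized_power >= cell.power:
            continue  # no power to gain
        if not (cell.slack > 0 or cell.toggle_rate < beta):
            continue  # neither slack to spare nor rarely exercised
        hurts_active = any(delta < 0 and cells[n].toggle_rate > gamma
                           for n, delta in cell.neighbor_slack_change.items())
        if hurts_active:
            continue  # the swap would be restored, so it is skipped
        for n, delta in cell.neighbor_slack_change.items():
            cells[n].slack += delta
        cell.power = cell.downsized_power
        accepted.append(name)
    return accepted


# Hypothetical netlist fragment.
cells = {
    "u1": Cell(1e-5, -0.10, 2.0, 1.2, {"u2": -0.02}),
    "u2": Cell(0.3, 0.20, 3.0, 2.0, {"u3": -0.01}),
    "u3": Cell(5e-5, -0.05, 1.0, 0.9),
    "u4": Cell(0.2, -0.01, 5.0, 3.0),
}
print(downsize_cells(cells))
```

Here `u1` is rarely exercised but is rejected because downsizing it would degrade the active neighbor `u2`, while `u4` is never a candidate because it is both active and short of slack.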
V. METHODOLOGY
Figure 5 illustrates our overall flow for gradual-slope slack optimization. The switching activity interchange format (SAIF) file provides toggling frequency for each net and cell in the gate-level netlist; it is derived from a value change dump (VCD) file from gate-level simulation using built-in functionality of the Synopsys PrimeTime-PX [27] tool. To find timing slack and power values at the specific voltages, we prepare Synopsys Liberty (.lib) files for each voltage value (from 1.00V to 0.50V in 0.01V increments) using Cadence SignalStorm TSI61 [28]. We use the OpenSPARC T1 processor [25] to test our optimization framework. Table I describes the selected modules and provides characterization in terms of cell count and area. Gate-level simulation is performed using test vectors obtained from full-system RTL simulation of a benchmark suite consisting of bzip2, equake and a sorting test program. These benchmarks are each fast-forwarded by 1 billion instructions using the OpenSPARC T1 system simulator, Simics [17] Niagara. After fast-forwarding in Simics, the architectural state is transferred to the OpenSPARC RTL using CMU Transplant [22]. More details of our architecture-level methodology are available in [15].
Switching activity data gathered from gate-level simulation is fed to the Synopsys PrimeTime (PT) static timing tool through its Tcl socket interface. Timing slack and switching activity information is continually available from PT, through the Tcl socket interface, during the optimization process. After our optimization, all netlist changes are realized using Cadence SoC Encounter v7.1 [29] in ECO (engineering change order) mode.
Module designs are implemented in TSMC 65GP technology using a standard flow of synthesis with Synopsys Design Compiler vY-2006.06-SP5 [26] and place-and-route with Cadence SoC Encounter. As noted above, voltage scaling effects are captured by characterizing Synopsys Liberty libraries (using Cadence SignalStorm TSI61) at a number of operating voltages. Runtime is reduced by adopting a restricted library of 63 commonly-used cells (62 combinational and 1 sequential); the total characterization time for 51 voltage points is around two days, but this is a one-time cost.
Using our slack optimizer, we optimize the module implementations listed in Table I, and then estimate error rates by counting cycles with timing failures during gate-level simulation. We use a SCAN-like test wherein the test vectors specify the value of each primary input and internal flip-flop at each cycle. This prevents pessimistic error rates due to erroneous signals propagating to other registers. We emulate the SCAN test by connecting all register output ports to the primary input ports, allowing full control of module state.
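Counting erroneous cycles, as opposed to summing per-flip-flop failures, can be sketched as follows. The per-cycle sets of failing flip-flops are hypothetical data standing in for the gate-level simulation log, and both function names are assumptions.

```python
from typing import List, Set


def measured_error_rate(failures_per_cycle: List[Set[str]]) -> float:
    """Fraction of simulated cycles in which at least one flip-flop has a timing failure."""
    if not failures_per_cycle:
        return 0.0
    failing_cycles = sum(1 for failed in failures_per_cycle if failed)
    return failing_cycles / len(failures_per_cycle)


def summed_flip_flop_rate(failures_per_cycle: List[Set[str]]) -> float:
    """Per-flip-flop failures summed over the design, which counts coincident errors repeatedly."""
    if not failures_per_cycle:
        return 0.0
    return sum(len(failed) for failed in failures_per_cycle) / len(failures_per_cycle)


# Hypothetical four-cycle trace: two flip-flops fail together in the second cycle.
trace = [set(), {"ff1", "ff2"}, set(), {"ff3"}]
print(measured_error_rate(trace), summed_flip_flop_rate(trace))
```

The gap between the two numbers in this toy trace is the kind of pessimism that the compensation parameter α is meant to absorb in the estimate.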
VI. RESULTS AND ANALYSIS
Our experimental results compare the performance of our slack optimization flow against several alternatives for 10 component modules of the OpenSPARC T1 processor [25]. In addition to traditional CAD flows targeting loose (0.8GHz) and tight (1.2GHz) timing constraints, we also compare against an implementation of BlueShift [8] that optimizes paths in decreasing order of the product of negative slack (magnitude) and switching activity. When voltage is scaled, such paths cause the most timing violations, and we reduce errors by assigning tighter timing constraints during P&R with Cadence SoC Encounter. We perform gate-level simulation of modules to estimate error rates and power consumption at different voltages. For all experiments, we use a compensation factor of α = 0.5 (Equation (2)), and set β = γ = 10^-4 in Algorithm 2.
Table II demonstrates the impact of slack optimization in reducing power consumption for our test modules. Benefits estimated at optimization time are compared to actual simulated results, showing the power reduction afforded for an error rate of 2%. Discrepancies between the actual and estimated results are primarily due to the inaccuracy of the error rate estimation technique. The slack optimizer achieves up to 25.8% power reduction by redistributing slack and extending the range of voltage scaling. The power reduction stage provides additional benefits, up to 3.6%, by downsizing cells on infrequently exercised paths.
We note that not all modules achieve substantial benefits. Power reduction from baseline is limited for sparc_exu_div, sparc_ifu_errdp and tlu_mmu_ctl. These modules have low switching activity, and their error rates remain below 2% even when voltage is scaled down to 0.5V. Consequently, both the baseline and slack-optimized implementations achieve the same benefits for these modules. Significant benefits can be achieved when the original slack distribution of the module dictates that errors increase rapidly. In that case, the slack optimizer is able to redistribute slack and extend the range of voltage scaling.
Figure 7 shows how error rate varies as voltage is scaled for each of the OpenSPARC T1 modules. The slack optimizer redistributes timing slack so that the error rate for a module increases more gradually as voltage is scaled down. Aggressive optimization can in some cases result in a lower error rate for tightly constrained P&R or BlueShift, but our goal is ultimately not to reduce error rate but rather to reduce power consumption. Figure 8 shows the power consumption of the modules at each operating voltage, demonstrating that although aggressive optimization can result in a lower error rate, this comes at considerable power expense due to increased area from cell upsizing. The area overhead of the slack optimizer is significantly lower than with the other approaches, since it targets the specific cells for which upsizing produces the most benefit. The additional power reduction stage even reclaims some of this area overhead, reducing power almost to that of the baseline at the same voltage. Table III shows the average area overhead of each design approach. The slack optimizer maximizes the benefit gained per unit of added area. This efficient slack redistribution approach results in lower power for a given error rate, as shown in Figure 9. Benefits are chiefly due to the ability to scale voltage to a lower level for the same error rate.
Even though aggressive approaches can sometimes increase the range of voltage scaling further than the slack optimizer, the power overhead of these approaches outweighs the power savings of voltage scaling, and total power is even higher than that of the baseline in many cases. Power-aware slack redistribution, on the other hand, does well to reduce power consumption at the target error rates for the diverse set of modules, in spite of its slight area overhead (2.7%).
Figure 10 shows the slack distribution for lsu_dctl under each design technique: traditional SP&R (tightly constrained), BlueShift PCT, and the slack optimizer. We note that power-aware slack redistribution results in a more gradual slack distribution. Thus, the slack-optimized design will have fewer failing paths as voltage is scaled down. Figure 11 compares the slack distribution for all modules before and after slack optimization. For some modules (tlu_mmu_ctl, sparc_ifu_errdp), the slack distribution is relatively unchanged after slack optimization: again, the error rates of these modules are low, and optimization is not performed unless an error rate of 2% is exceeded. For sparc_exu_div, the slack distribution remains unchanged because the optimization heuristic is unable to reduce the delay on critical paths through cell swapping.
VII. SUMMARY AND CONCLUSION
Our work enables an extended power-reliability tradeoff in digital designs by optimizing the slack distribution for 'gradual-slope' in a toggle rate-aware manner. Our power-aware slack redistribution lowers the minimum voltage with acceptable timing error rate, and leads to designs that not only can be run with lower power, but that also fail more gracefully. We demonstrate the impacts of 'gradual-slope' design on voltage overscaling and total system power, using modules from the OpenSPARC T1 benchmark and 65nm SP&R implementation. Our experiments show a maximum of 32.8% and an average of 12.5% total power savings over the baseline design at an error rate of 2% (cf. Table II). The area overhead of our technique is no more than 2.7%.
Our ongoing research seeks CAD techniques for similar extended reliability-power tradeoffs for embedded memories, as well as the exploitation of heterogeneity in multi-core architectures to reduce average-case overhead of our gradual-slack optimization (with heterogeneously reliable and gracefully-degrading cores). Additionally, our present techniques can be augmented to consider metrics of 'architecture-level criticality' in addition to path timing slack, so as to further reduce overhead of increased resilience and more graceful system degradation with voltage overscaling.
Fig. 9. Power consumption at each target error rate of the OpenSPARC T1 modules.
VIII. ACKNOWLEDGMENTS
Work at UIUC was supported in part by Intel, NSF, GSRC, and an Arnold O. Beckman Research Award. Work at UCSD was supported in part by MARCO GSRC and the National Science Foundation. Feedback from Janak Patel, Steve Lumetta, Sanjay Patel, Naresh Shanbhag, and anonymous reviewers helped improve this paper.
