Abstract-In this paper, we present and solve the problem of power-delay optimal soft linear pipeline design. The key idea is to use soft-edge flip-flops to allow time borrowing among consecutive stages of the pipeline in order to provide the timing-critical stages with more time and trade this timing slack for power saving. We formulate the problem of optimally designing the soft-edge flip-flops and setting the clock frequency and supply voltage so as to minimize the power-delay product of a linear pipeline under different scenarios using both deterministic and statistical static delay models. In our first problem formulation, timing violations are avoided by respecting deterministic worst case path delay bounds. Next, the same problem is formulated for a scenario where stage delays are assumed to be random variables, and we minimize the power-delay product while keeping the probability of timing violations bounded. The soft-edge flip-flops are equipped with dynamic error detection (and correction) circuitry to detect and fix the errors that might arise from over-clocking. Although the system is capable of recovering from errors, there is a tradeoff between performance and power saving, which is exploited to further minimize the power-delay product of the pipeline in our third formulation. Experimental results demonstrate the efficacy of our proposed algorithms for solving each of the aforesaid problems.
I. Introduction
WITH THE increase in demand for battery-operated personal computing devices and wireless communication equipment, the need for power-efficient design has increased. In addition, rising levels of power dissipation and the resulting thermal problems have become the key limiting factors to processor performance. Because the pipelined data path is highly utilized in modern processors, it is a major contributor to the power consumption of a processor and, hence, one of the main sources of heat generation on the chip [1]. Many techniques have been proposed to reduce the power consumption of a microprocessor's pipeline, such as pipeline gating [1], clock gating [3], and voltage scaling [4].
Manuscript received March 25, 2011; accepted April 30, 2011. Date of current version September 21, 2011. This work was supported in part by the National Science Foundation. This paper was recommended by Associate Editor J. Hu.
The authors are with the Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90007 USA (e-mail: ghasemaz@usc.edu; pedram@usc.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCAD.2011.2159218
In this paper, we present the problem of power-delay optimal pipeline design in a synchronous linear pipeline by means of voltage scaling and time borrowing through redesigning the flip-flops. We propose mathematical solutions to this problem in deterministic and probabilistic frameworks. Our technique is based on the idea of utilizing soft-edge flip-flops (SEFF) for slack passing and decreasing the error rate in pipeline stages.
A linear pipeline composed of SEFFs is called a soft pipeline. Soft-edge flip-flops have a small transparency window which allows time borrowing across pipeline stages. SEFFs have been used for minimizing the effect of clock skew on circuit performance [7] , [8] and minimizing the effect of process variation on parametric yield [9] . In this paper, SEFF is utilized to compensate for unbalanced pipeline stage delays by means of time borrowing. It is observed that this imbalance of path delays of different pipeline stages is very common in pipelined circuits [6] .
In this paper, we describe a unified methodology for optimally selecting the transparency windows of SEFFs in a linear pipeline so as to achieve the minimum power-delay product for the pipeline by means of opportunistic time borrowing and voltage scaling. We take on three power-delay optimization problems as explained next. In the first problem formulation, timing violations are avoided by respecting the worst case path delays (calculated as deterministic values by static timing analysis) for every stage in a pipeline. Next, we formulate the same problem for a scenario where stage delays are assumed to be random variables, and find the solution with the minimum power-delay product while ensuring that the probability of timing violations in pipeline is lower than a threshold. Third, we allow timing violations to take place while implementing a mechanism to detect and fix the errors and accounting for the power and delay penalties of error correction.
Preliminary versions of this paper appeared in [10] and [11] . This paper substantially extends previous works in several directions.
1) Three general problem formulations are presented, along with one special case of the third formulation that is similar to the problem presented in [10]. The first formulation is similar to the one presented in [11], but with major modifications.
2) This paper uses the power-delay product metric as the objective function of the optimization problems. Also, the timing constraints of time borrowing are redefined.
3) Designs of a number of SEFF circuits are introduced.
4) Experimental results have been redone and extended to reflect the aforesaid changes.
5) Mathematical proofs for convexity of problems and optimality of solutions are provided.
0278-0070/$26.00 © 2011 IEEE
The remainder of this paper is organized as follows. In Section II, we provide some background on pipeline design. Soft-edge flip-flops and their characteristics are introduced in Section III. Section IV describes our proposed techniques for optimizing power-delay in a soft pipeline in different frameworks. Sections V and VI are dedicated to experimental results and a brief summary of related work, respectively, while Section VII concludes this paper.
II. Background
A. Timing Constraints in a Pipeline
A simple (synchronous) 2-stage linear pipeline circuit is depicted in Fig. 1. A linear pipeline is defined as a pipeline with the following properties: 1) processing stages are linearly connected, with no feedback loops; 2) it performs a fixed function; and 3) stages are separated by flip-flops which are clocked with the same clk signal. We refer to the set of flip-flops that separate consecutive pipeline stages as an FF-set, e.g., FF 0 . . . 
where d i and δ i denote the maximum and minimum delays of combinational logic in stage i, T clk denotes the clock cycle time, t s,i and t h,i are setup and hold times of flip-flops in the ith FF-set whereas t cq,i−1 denotes clock-to-Q delay of flip-flops in i − 1st FF-set. N denotes the number of pipeline stages. Inequality (1) gives the constraint set on the maximum delays of combinational logic and flip-flop timing characteristics to prevent setup time violations. Conversely, (2) specifies the constraint set on the minimum delay of pipeline stages in order to prevent short path data race hazards. Notice that to account for the effect of clock skew, t skew , we can simply add t skew to the left side of (1) and subtract it from the left side of (2).
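Because the setup and hold conditions are per-stage inequalities, they can be checked mechanically. The following is a minimal sketch, assuming that (1) has the usual form t_cq,i-1 + d_i + t_s,i ≤ T_clk and that (2) has the form t_cq,i-1 + δ_i ≥ t_h,i; these forms are inferred from the definitions above, and the function and argument names are illustrative rather than taken from the paper.

```python
def check_hard_edge_pipeline(d, delta, t_s, t_h, t_cq, T_clk, t_skew=0):
    """Return a list of (setup_ok, hold_ok) flags, one pair per stage.

    d[i], delta[i]  : max / min combinational delay of a stage
    t_s[i], t_h[i]  : setup / hold time of the FF-set capturing that stage
    t_cq[i]         : clock-to-Q delay of the FF-set launching that stage
    All times use the same unit (e.g., ps).
    """
    flags = []
    for d_i, delta_i, ts_i, th_i, tcq_prev in zip(d, delta, t_s, t_h, t_cq):
        # Assumed form of (1), with skew added to the left side.
        setup_ok = tcq_prev + d_i + ts_i + t_skew <= T_clk
        # Assumed form of (2), with skew subtracted from the left side.
        hold_ok = tcq_prev + delta_i - t_skew >= th_i
        flags.append((setup_ok, hold_ok))
    return flags
```

With hypothetical values such as `check_hard_edge_pipeline([450, 300], [60, 40], [30, 30], [20, 50], [20, 20], 500)`, both stages pass; adding a 15 ps skew makes the first stage fail setup and the second stage fail hold.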
B. Combinational Logic Block Modeling
When the supply voltage of a combinational logic is changed, its delay can be obtained from alpha-power law [8] 
1 Throughout this paper, the interconnect delay is included in the delay of the combinational logic; wherever we refer to combinational delay, it also includes the interconnect delay.
where α is a technology parameter which is around 2 for long channel devices and 1.3 for short channel devices, and V t denotes the magnitude of the threshold voltage of transistors. Coefficient λ j captures the effect of temperature increase (due to power consumption) on delay, including the inverted temperature dependence effect [12]. We assume the only source of temperature increase is the circuit's power consumption (based on circuit thermal models [13]), which is itself a function of voltage as given in (6). Hence, the steady state temperature of a circuit can be calculated for a voltage v j . Note that (3) and (4) are used to calculate worst-case delays under the assumption that V t does not vary (no process variation). For scenarios that consider V t variations, such as Section IV-D, it is more precise to use the probability distribution functions (PDFs) of d ij and δ ij profiled at each voltage.
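To make the voltage dependence concrete, the sketch below scales a delay measured at the nominal voltage using the alpha-power law, in which delay is proportional to v / (v − V_t)^α. The exact way λ_j enters (3) and (4) is not recoverable here, so treating it as a plain multiplier is an assumption, as are the function and parameter names.

```python
def scaled_delay(d0, v, v0, v_t, alpha=1.3, lam=1.0):
    """Scale a delay d0 measured at nominal voltage v0 to voltage v.

    Assumes delay is proportional to v / (v - v_t)**alpha (alpha-power law)
    and that the temperature coefficient lam is a simple multiplier.
    """
    if v <= v_t or v0 <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")

    def shape(x):
        return x / (x - v_t) ** alpha

    return d0 * lam * shape(v) / shape(v0)
```

Under this model, lowering the supply voltage below the nominal value always increases the delay, which is the cost that the time-borrowing slack is meant to absorb.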
Additionally, the total power consumption of combinational logic, P Comb , changes as follows due to voltage scaling:
where E dyn and P leak are total dynamic energy dissipation and leakage power consumption of the combinational logic at nominal supply voltage V 0 .
C. Delay Variations
As technology scales, process, voltage, and temperature (PVT) variations are becoming critical design concerns due to their effect on logic and interconnect delay [14]. Process variations, such as random dopant fluctuations and gate-oxide thickness variations, modulate MOSFET characteristics and parasitic components, causing variation in the switching delays of identical gates [15], [16].
The random maximum and minimum stage delays are described by PDF and cumulative distribution functions (CDF) with corresponding mean, µ, and variance, σ². In some works, e.g., [18], [19], this distribution has been assumed to be a Gaussian (Normal) distribution [17]. However, more precise statistical timing analysis schemes have proposed non-Gaussian distribution models due to the nonlinearity of max/min operations on delays of gates and paths and their correlation [20]-[22].
In order to account for the random variations (Gaussian or non-Gaussian) of the path delays in (1) and (2), one should express the probability of violating the setup or hold conditions as a function of delay variations. The probability of satisfying setup time constraint in pipeline stage i with voltage v j for a given cycle time T clk,j , denoted by p setup,ij , can be written as probability of the maximum delay of combinational logic in that stage, d i , being less than the available time
where F d ij denotes the CDF of delay of pipeline stage i under voltage setting j. The probability of a setup time constraint violation in pipeline stage i is thus calculated as follows:
Similarly, given the CDF of the minimum delay of stage i under voltage setting j, F δ ij , the probability of violating the hold time constraint of stage i, denoted by q hold,ij , may be calculated as follows:
Note that we ignore the effect of variability on flip-flop timing characteristics and only focus on the effect of variability on the combinational logic delays. To a first order, the clock-to-Q and setup-time of input and output flip-flops are much smaller than the maximum delay of combinational logic, and hence, we can ignore variations of flip-flop characteristics compared to the logic. This is, however, not true with respect to the hold-time and the minimum delay of logic. Therefore, we insert an adequate number of delay elements (see Section IV) to eliminate the hold time violation for the minimum value of hold time of flip-flops. 
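As an illustration of these probabilities, the sketch below evaluates them for the Gaussian case only. It assumes a setup violation occurs when the maximum delay exceeds T_clk − t_cq − t_s and a hold violation occurs when the minimum delay falls below t_h − t_cq; non-Gaussian CDFs would replace `normal_cdf`. All names are illustrative.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a Gaussian with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def q_setup(T_clk, t_cq, t_s, mu_d, sigma_d):
    """Probability that the max stage delay exceeds the available time."""
    available = T_clk - t_cq - t_s
    return 1.0 - normal_cdf(available, mu_d, sigma_d)

def q_hold(t_h, t_cq, mu_delta, sigma_delta):
    """Probability that the min stage delay is too short for the hold time."""
    required = t_h - t_cq
    return normal_cdf(required, mu_delta, sigma_delta)
```

For example, if the available time equals the mean of the maximum delay, the setup violation probability is exactly one half, which shows why the clock period must sit several standard deviations above the mean.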
D. Pipeline Delay Model
Average pipeline delay, denoted by D, is the primary performance metric in a pipeline. It is defined as the average time it takes to process one data/instruction unit and produce a valid output, as given by (10). Indeed, the pipeline delay can be interpreted as the inverse of its effective throughput
We assume that the pipeline can process at most one data/instruction unit per clock cycle when it does not encounter timing violations; hence, T clk ≤ D. In a pipeline that processes each data unit in one cycle, the average delay is equal to the clock period, T clk [which is determined by the slowest pipeline stage, see (1)]. However, if the pipeline stalls or gets flushed for any reason, the average processing time per data/instruction unit increases. In other words, the delay is not simply the inverse of the clock frequency; rather, it also probabilistically accounts for the overhead of correcting potential setup time problems in an over-clocked pipeline.
III. Soft-Edge Flip-Flop (SEFF)
The key design idea of a SEFF [5] is to create a transparency window right after (or before, in the case of backward time borrowing) the clock edge, during which the data can still be captured. This allows passing of timing slacks between adjacent pipeline stages [11]. Some SEFF designs are derived by applying modifications to conventional hard-edge counterparts. We focus on some of the most widely used flip-flop circuits in state-of-the-art processors [25]. SEFF designs based on the master-slave FF, hybrid latch FF (HLFF), and monostable-based FF (MBFF) are studied in this paper.
Fig. 2 illustrates the design of the master-slave SEFF, used in the IBM PowerPC 603 processor. The key modification in the SEFF version is that, by delaying the clock of the master latch, both master and slave latches are ON for the duration of the transparency window. Fig. 2(b) illustrates the timing diagram for key signals of a master-slave SEFF. The dashed square highlights the transparency window, which is the overlap of clk and its delayed version, clkd. If the overlap between the edge of clk and the latching edge of clkd is larger than the delay through the master latch, the master-slave pair is transparent to the input during the window after the edge of the main clock, clk. The delayed clock and its reverse-polarity version can be produced locally for each FF-set (or for multiple FF-sets that have equal transparency window sizes) by using an inverter chain whose gate sizes and length are chosen to achieve the desired transparency window size.
The hybrid latch flip-flop [5], shown in Fig. 3, is originally a soft-edge flip-flop; here, we seek to make the size of its transparency window adjustable as required. Fig. 3 also illustrates the timing waveforms corresponding to the operation of the HLFF. In this figure, the shaded area represents the transparency window, which is created by the overlap of the clk and !clkd signals. During the time interval when both of these signals are high, both transistor stacks act as inverter gates to transfer D to S̄ and then to Q. In order to increase the transparency window size in the HLFF, the delay of the delay element in Fig. 3(a) should be decreased by the desired amount.
HLFF is one of the fastest SEFFs used in industrial designs, such as AMD K6 processor [25] for its advantages of high performance and relatively small area. Large power consumption, glitch activity, and somewhat complex implementation are its drawbacks [25] . Note that transparency window of this architecture is located before the clock edge. Hence, it is suitable for backward time borrowing schemes.
The monostable-based flip-flop is another industrial negative-edge flip-flop that we convert to a SEFF. MBFF suffers from large area and high power consumption [25]. In order to modify the MBFF circuit to admit an adjustable transparency window size, a delay element is inserted in its design, as illustrated in Fig. 4(a). In this design, the first stage of the flip-flop generates a short pulse on node S or R to trigger the S-R latch. The delay element essentially extends this pulse width, providing a longer time for D to arrive and get captured in the S-R latch. Fig. 4(b) shows the timing waveform of this SEFF for D = 1 (for D = 0, the pulse applies to R). The triggering pulse can be de-asserted as early as a t 1 delay after the negative edge of clk and is asserted exactly after a t 2 delay after the negative edge of !clkd.
Due to the practical advantages of the master-slave based SEFF, we focus on this design for the rest of this paper to derive equations and use it in the design problems. Similar equations and discussions hold for other SEFF designs.
A. SEFF Timing Characteristics
To optimally select the transparency window of the SEFFs, we must accurately account for the effect of the transparency window on SEFF's power consumption and its timing characteristics, i.e., setup time, hold time, clock-to-Q delay and D-to-Q delay. The setup time, t s , and hold time, t h of a SEFF may be modeled as linear functions of the transparency window size
where w denotes the transparency window size and a 0 through b 1 are technology-specific and design-specific coefficients. Experimental SEFF characterization data provided in Fig. 5 confirm the linear model for SEFF timing characteristics.
The clock-to-Q delay, t cq , of SEFF is practically independent of the transparency window width. It is defined as the delay between the positive edge of clock and the time that output is valid when input data arrives before the transparency window.
We define the term D-to-Q delay of a SEFF, t dq , to denote the input to output propagation delay of data when it is transparent. t dq is also independent of transparency window width (see Fig. 5 ).
If the supply voltage of the flip-flop can be adjusted to a new voltage level, v j , then the coefficients of linear models of setup and hold time as well as values of t cq and t dq will 
Timing characteristics of the SEFF are measured by HSPICE simulations (sweeping voltage) to determine voltage-dependent values and coefficients through linear regression.
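The regression step can be illustrated with an ordinary least-squares line fit. The sketch below is a generic fit written with the standard library only; the sample points are hypothetical stand-ins for simulation measurements, not data from this paper, and the returned pair plays the role of the intercept and slope of a linear timing model at one voltage.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y ~ intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical characterization points: window size (ps) vs. setup time (ps).
windows = [0, 20, 40, 60, 80]
setup_times = [30, 20, 10, 0, -10]
intercept, slope = fit_line(windows, setup_times)
```

Repeating the fit at each supply voltage of interest would yield one set of coefficients per voltage setting.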
B. Soft-Pipeline Timing Constraints
Introduction of a transparency window to a flip-flop not only modifies the timing characteristics of a SEFF, but also changes the timing constraints imposed on the pipeline due to the implementation of time borrowing. Inequality (1) for the setup time constraint ignores the time borrowing effect between stages. However, the hold time constraint does not change in the case of time borrowing; note that t cq,(i−1) is in fact the window-independent t cq,j for all of the stages. Fig. 7 illustrates the setup time constraint fundamentals of a time borrowing operation among three consecutive stages, in which stage i uses the timing slack of stage i + 1, and stage i + 1 uses that of stage i + 2. In this figure, D i and Q i represent the input and output of the FF-sets of stage i, respectively. In this case, the following timing constraint sets for time borrowing between stages i and i + 1 [26] should be met
Inequality (13) is in fact the same setup time constraint as (1) for a single stage, which ensures that the delay of the ith stage is able to meet the setup time of its destination SEFF with time borrowing enabled. Inequality (14) assumes that stage i may borrow time from stage i + 1, but the accumulated delay of these two stages (plus the setup time and clock-to-Q delay of the SEFFs) should not exceed two clock periods. Note that in (14), for the SEFF-set i, data arrives within the transparency window and propagates to the output only after a delay of t dq .
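The single-stage and two-stage conditions can be checked together by accumulating delay along chains of consecutive stages. The sketch below assumes a chain starting at stage i and spanning m + 1 stages must satisfy t_cq + (sum of the stage delays) + m·t_dq + t_s of the last FF-set ≤ (m + 1)·T_clk. This form is inferred from the description of (13) and (14) and is not a transcription of the paper's equations; the names are illustrative.

```python
def soft_setup_ok(d, t_s, t_cq, t_dq, T_clk, max_depth=None):
    """Check setup constraints for every chain of consecutive stages.

    d[i]   : maximum combinational delay of stage i
    t_s[i] : setup time of the FF-set capturing stage i (window-dependent,
             possibly negative for a SEFF)
    max_depth : largest number of extra stages involved in borrowing
    """
    n = len(d)
    if max_depth is None:
        max_depth = n - 1
    for i in range(n):
        total = t_cq  # launch from the FF-set feeding stage i
        for m in range(min(max_depth, n - 1 - i) + 1):
            if m > 0:
                total += t_dq  # data passes through a transparent SEFF
            total += d[i + m]
            if total + t_s[i + m] > (m + 1) * T_clk:
                return False
    return True
```

With hypothetical numbers, `soft_setup_ok([480, 400], [30, 30], 20, 20, 500)` fails because the first stage misses a hard edge, while giving the first FF-set an effective setup time of −10 ps makes the same pipeline pass.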
In general, setup time constraints corresponding to an N-stage soft-pipeline under voltage state j can be written as follows:
Inequality set (15) describes setup time constraints applied to single stages and multiple stages involved in time borrowing. The parameter m denotes the depth of time borrowing in this equation. If m = 0, the inequality represents the setup time constraint within a single pipeline stage, and larger values of m produce the setup timing condition on accumulative delays of multiple consecutive pipeline stages. Also in the statistical framework, setup constraint violation probability may be written as follows:
As mentioned in Section II-C, the effect of variability on the flip-flop timing characteristics is negligible, and the random variables in (16) are d ij , which are correlated [20] , [27] . Let ρ ik denote the correlation between the maximum stage delays of stage i and k. Given the CDF of all d ij and ρ ik , we can estimate the CDF of summation of d ij , by assuming that it follows the same distribution function as any of d ij , with corresponding mean and variance calculated as follows:
Note that we assume the circuits that our proposed algorithms optimize are fully synthesized and mapped circuits and standard SSTA timing analysis has been performed on each pipeline stage. Such tools do account for various sources of variability and certainly consider the effect of spatial process variations and/or reconvergent fanout paths in their calculations.
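The mean and variance of a sum of correlated stage delays follow the standard identities: the means add, and the variance is the sum of the individual variances plus twice the pairwise covariance terms. A minimal sketch, with illustrative names and with `rho[i][k]` holding the correlation between stages i and k:

```python
import math

def sum_mean_std(mu, sigma, rho):
    """Mean and standard deviation of the sum of correlated delays.

    mu[i], sigma[i] : mean and standard deviation of stage i
    rho[i][k]       : correlation coefficient between stages i and k
    """
    n = len(mu)
    mean = sum(mu)
    var = sum(s * s for s in sigma)
    for i in range(n):
        for k in range(i + 1, n):
            var += 2.0 * rho[i][k] * sigma[i] * sigma[k]
    return mean, math.sqrt(var)
```

Positive correlation widens the distribution of the accumulated delay, so ignoring it would underestimate the violation probability of multi-stage borrowing chains.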
C. SEFF Power Consumption Model
Power consumption of a SEFF is generally an increasing function of its window size, w. This is due to the fact that increasing the window size is performed by resizing and/or increasing the number of inverters in the delayed clock path; both methods result in an increase in the dynamic and leakage power consumption of the SEFF. Fig. 8 illustrates the total power consumption of a master-slave SEFF as a function of its window size, at two voltage values for a fixed clock period. The discontinuities (jumps) in the curve are due to a change in the number of inverters in delay path.
From Fig. 8 , one can conclude that power dissipation of the SEFF may be approximated as a linear function of the transparency window width, for a fixed clock period. To approximate effect of both dynamic and leakage power consumption for any window size and any clock period in the SEFF circuit, its power consumption may be calculated as follows:
where v denotes the supply voltage level, and k 0 (v) through k 3 (v) are voltage-dependent and technology-dependent coefficients which can be determined through HSPICE circuit simulation. In (19), the two T clk dependent terms correspond to dynamic power consumption while the other terms correspond to leakage power.
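One form consistent with this description is P = (k3·w + k2)/T_clk + k1·w + k0, with the first two terms representing dynamic power and the last two leakage. Which coefficient index pairs with which term cannot be recovered here, so the pairing below is an assumption, and the coefficient values used in the usage note are hypothetical.

```python
def seff_power(w, T_clk, k0, k1, k2, k3):
    """Assumed SEFF power model: dynamic terms scale with 1/T_clk."""
    dynamic = (k3 * w + k2) / T_clk
    leakage = k1 * w + k0
    return dynamic + leakage

def seff_energy_per_cycle(w, T_clk, k0, k1, k2, k3):
    """Power times clock period; bilinear in (w, T_clk) under the model."""
    return k3 * w + k2 + (k1 * w + k0) * T_clk
```

Multiplying the power by the clock period removes the division by T_clk, which is why a power-delay objective built from such a model contains only linear terms and products of two variables. For instance, `seff_power(50, 500, 2, 0.5, 1000, 10)` evaluates to 30.0 in arbitrary units.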
IV. Power-Delay Optimization in a Pipeline
Due to the significance of both performance and power efficiency in pipelined circuits, we chose the power-delay product as the cost metric to optimize the design of such circuits. Note that in the power-delay product, delay is not simply the inverse of the clock frequency; rather, as will be seen next, it is defined to also probabilistically account for the error correction timing overheads of potential setup time problems in an over-clocked pipeline. In this way, we are able to exploit the case where the increase in setup time violations and the corresponding timing overhead is compensated by the decrease in power dissipation.
In this section, we solve the problem of power-delay optimization in a linear pipeline using SEFF. We formulate the problem for three scenarios.
1) The stage delays are captured by the worst case delay estimates.
2) Statistical timing analysis is used to model the stage delays, and no timing violation is allowed.
3) The stage delays are still computed by statistical timing models, but timing failures are allowed to exist and are automatically detected and fixed.
In scenario 1, we deal with deterministic values of the worst case combinational circuit delays, which are the maximum observed values of the combinational circuit's delay over all possible input combinations and under any possible operating conditions (different PVT corners). Satisfying the timing constraints of (1) and (2) for these conservative delay values results in error-free operation of the pipeline. On the other hand, in scenario 2, we consider the path delays as random variables, use statistical timing equations, and find the optimum solution for a limited error rate. Under scenario 3, we allow a few timing violations to occur and adopt an error detection mechanism to guarantee correct functionality of the pipeline. In this framework, our solution considers the tradeoff between aggressively scaling the pipeline frequency to improve delay and the power and delay penalties due to error detection and correction.
The key motivation for using SEFFs in a pipeline circuit is that some positive slack may be available in one or more stages of the pipeline. Utilizing SEFFs allows passing this slack to more timing-critical stages and utilizing it for power optimization by voltage scaling.
A. An Illustrative Example
As an example, consider the three-stage pipelined circuit of (1), the minimum clock period is 500 ps, and no slack is available to the first stage of the pipeline. However, if FF1 is replaced with a SEFF with a transparency window of 50 ps, the available slack at the second stage is passed to the first stage, providing the first stage with 50 ps of borrowed time. Now since positive slacks are available in all stages of the pipeline, the circuit can be operated at higher clock frequency and/or a smaller voltage in order to reduce the power consumption, and possibly the power-delay metric (ideally, V DD may be reduced by approximately 10%, resulting in roughly 19% power saving).
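The quoted saving is consistent with power scaling quadratically with the supply voltage at a fixed clock frequency, since (1 − 0.10)² = 0.81. The short sketch below reproduces that arithmetic; the quadratic dependence is the assumption being illustrated, and leakage and delay effects are ignored.

```python
def quadratic_power_saving(voltage_reduction):
    """Fractional power saving if power scales with the square of V_DD."""
    return 1.0 - (1.0 - voltage_reduction) ** 2

saving = quadratic_power_saving(0.10)
```

Because the dependence is quadratic, each additional percent of voltage reduction buys slightly less than two percent of power saving.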
B. Delay Elements
From (2), one can see that increasing the window size of the ith soft-edge FF-set puts a more stringent constraint on the hold time condition for the ith stage of pipeline. Therefore, if needed, delay elements may be utilized in the minimum-delay path(s) to alleviate the hold time constraint violation. Insertion of a delay element with a delay magnitude of z i would change (9) as follows:
Delay elements are indeed created by utilizing some inverters and appropriately sizing them in order to meet the desired delay lower bound while incurring minimum power loss. The power overhead of a delay element is denoted as follows:
where z is the desired delay and h 2 (v) and h 1 (v) are voltage dependent parameters to be determined by HSPICE simulations. Fig. 10 illustrates the linear model fitting on the measured data. Note that the delay elements are created by means of a buffer chain; to get larger delay, more buffers or larger loads are needed. This causes power dissipation increase with increased delay as shown in Fig. 10 , with discontinuity points due to change in the number of buffers.
C. Power-Delay Optimal Soft Pipeline (OSP)
The problem of power-delay optimal soft pipeline (OSP) design is defined as that of finding optimal values of the global supply voltage level, pipeline clock period, and the transparency windows of the individual soft-edge FF-sets in the design so as to minimize the total power-delay product of an N-stage pipeline circuit subject to setup and hold time constraints. From (19) , (6) and (21), total power consumption of pipeline is as follows: (20) where all terms with subscript j correspond to their value under supply voltage v j , i.e., k 3j = k 3 (v j ) and so on.
Delay of the pipeline (system delay) on the other hand is calculated by (10) . Since no errors are allowed in the pipeline, the delay is equal to the pipeline clock period (and thus, power-delay product is essentially equivalent to energy dissipation here). Hence, the problem of power-delay OSP may be formulated as follows:
The first and second sets of inequalities in (23) Referring back to Fig. 1 , for the sake of consistency with the input and output environments and to avoid imposing constraints on the sender or receiver of data for the linear pipeline circuit in question, we impose the boundary condition that the first and last FF-sets in the pipeline are composed of hard-edge FF whereas intervening FF-sets may be SEFFs.
To solve the problem stated in (23) efficiently, we enumerate all possible values for v, and for each fixed v we solve a quadratic program (i.e., we minimize a quadratic cost function subject to linear inequality constraints), which can be solved optimally in polynomial time. In the fixed supply voltage OSP problem formulation, P leak,i term drops out of the cost function, the last constraint disappears, and all others become only dependent on w i , z i and T clk variables. We refer to this version of the problem as OSP-FV, OSP with fixed voltage
Note that in the OSP-FV problem, all the voltage-dependent coefficients, i.e., k 0 through k 3 in the P SEFF equation and h 2 , h 1 in the P DE equation, as well as the coefficients in t s,i , t h,i , t cq , and t dq , are recalculated for the voltage under test. Also, E dyn , P leak , d i and δ i are given window-size-independent inputs [generated by profiling or given by (4)-(6)] for each voltage.
Lemma 1: In the optimal solution of OSP-FV design problem, the transparency window of the ith SEFF-set is equal to the time borrowed by combinational logic in the ith stage.
Proof: According to the discussion in Section III-C and Fig. 8, the power consumption of a SEFF is a monotonically increasing function of the transparency window size while its setup time is a decreasing function of the same. Now, from the OSP-FV problem formulation of (23), a minimum decrease in the setup time of the ith SEFF-set, t s,i , which meets the long-path constraint in the ith stage of the pipeline, will produce the minimum increase in the power dissipation of the ith SEFF-set, P SEFF,i . Therefore, the optimal solution is achieved by utilizing the smallest possible window sizes which prevent setup time violations.
Lemma 2: In the optimal solution of OSP-FV design problem, the delay element inserted in the ith stage of the pipeline is equal to the minimum extra time needed to meet the hold time constraint at the ith soft-edge FF-set.
Proof: According to the discussion in Section IV-B, the power consumption of a delay element is a monotonically increasing function of the target delay value, while the hold time of a SEFF is an increasing function of its transparency window size. Now, from the second inequality (hold time condition) in the OSP-FV problem formulation of (23), a minimum delay value z i added to the ith stage of the linear pipeline which meets the short-path constraint for that stage will produce the minimum increase in the power of the combinational logic in the ith stage, P DE (z i , v). Hence, the optimal solution is achieved by utilizing the smallest possible delay elements which prevent hold time violations.
Note that although SEFFs are custom-designed and their transparency windows are set only once at design time, implementing the exact optimal transparency window of a SEFF may not be practical because, for instance, the device (transistor) sizes, and hence the delay of the window generation circuitry of the SEFF, cannot take arbitrary values. Therefore, we round the optimal sizing solution up to its closest larger-sized match that is implementable. Since the realized SEFF has a minimally larger transparency window, it does not violate any setup time constraint, while increasing the power consumption as little as possible. However, if hold time constraints are violated by this adjustment, delay elements may be added to the violating short paths to solve the problem, with negligible impact on the power-delay metric of the pipeline.
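Both post-processing rules are easy to state in code. The sketch below assumes the hold condition with a delay element reads t_cq + δ_i + z_i ≥ t_h,i, so the smallest admissible z_i is the shortfall clipped at zero, and it assumes the implementable window sizes are available as a finite list. The helper names and the example sizes are hypothetical.

```python
import bisect

def min_delay_element(t_h, t_cq, delta):
    """Smallest extra delay that satisfies the assumed hold condition."""
    return max(0.0, t_h - t_cq - delta)

def round_up_window(w_opt, implementable_sizes):
    """Smallest implementable window that is not below the optimal one."""
    sizes = sorted(implementable_sizes)
    idx = bisect.bisect_left(sizes, w_opt)
    if idx == len(sizes):
        raise ValueError("no implementable window is large enough")
    return sizes[idx]
```

After rounding a window up, the hold time of that FF-set should be re-evaluated and `min_delay_element` applied again, since a larger window tightens the short-path condition.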
The pseudo-code presented in Fig. 11 summarizes the steps in OSP algorithm.
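The outer structure of the approach, enumerating supply voltages and solving a fixed-voltage subproblem for each, can be sketched as follows. This is not a reproduction of Fig. 11: the per-voltage solver is passed in as a callback, and `toy_solver` is a hypothetical stand-in for the quadratic program, used only so that the sketch runs.

```python
def enumerate_voltages(voltages, solve_fixed_voltage):
    """Return (cost, voltage, solution) with the lowest cost, or None.

    solve_fixed_voltage(v) must return (cost, solution), or None when the
    fixed-voltage problem is infeasible at voltage v.
    """
    best = None
    for v in voltages:
        result = solve_fixed_voltage(v)
        if result is None:
            continue  # skip voltages with no feasible design
        cost, solution = result
        if best is None or cost < best[0]:
            best = (cost, v, solution)
    return best

def toy_solver(v):
    """Hypothetical stand-in for the fixed-voltage quadratic program."""
    if v < 0.8:
        return None  # pretend timing cannot be met below 0.8 V
    t_clk = 400.0 / v      # clock period grows as voltage drops
    power = 10.0 * v * v   # power shrinks quadratically with voltage
    return power * t_clk, {"T_clk": t_clk}

best = enumerate_voltages([0.7, 0.8, 0.9, 1.0], toy_solver)
```

Since the number of available voltage levels is small and each subproblem is solved independently, the enumeration adds only a constant factor to the cost of one fixed-voltage solve.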
D. Statistical Power-Delay Optimal Soft Pipeline (SOSP)
In Section IV-C, we followed the conventional static timing analysis framework in which deterministic values of worst case circuit delays are used to specify the circuit timing. However, due to process and environmental variations in integrated circuits, the path delays may vary from one die to next and from one operating condition to the other. Consequently, the path delays may be modeled by random variables [15] . Therefore, we will replace the deterministic timing constraints with the probability of timing violations in a pipeline as given by (8) and (9) .
The problem of statistical power-delay optimal soft pipeline (SOSP) design is defined as that of finding optimal values of the operating voltage and frequency and the transparency window sizes of the individual soft-edge FF-sets in the pipeline so as to minimize the total power-delay metric in a soft pipeline circuit with N pipeline stages and S voltage states. As mentioned earlier, SEFF enables opportunistic time borrowing across adjacent stages of the pipeline in order to provide timing-critical stages with more time to complete their computations and thereby, reduces the probability of timing errors at a particular frequency.
Let q setup,ij and q hold,ij denote the probabilities of setup time and hold time violations at stage i of the pipeline under supply voltage v j , as given in (17) and (20). Assuming that the probability of encountering an error in a specific combinational circuit stage is independent of the other stages, the probability of having a timing error in the entire pipeline, q pipeline,j , is calculated by (25). This probability should be limited to an extremely small value (e.g., 10e-12) to make failure of the pipeline virtually impossible.
SOSP can now be formulated as (26). It minimizes the power-delay product of the pipeline, subject to an upper bound on the error probability, denoted by ε.
Note that even though the circuit delay is modeled as a random variable due to process variations, the power consumption is not. It is known that the effect of V t or L eff variation on dynamic power consumption is negligible [28]. On the other hand, since we do not make any modifications to the combinational circuit part (e.g., we do not perform gate sizing or logic re-synthesis), the leakage power of the logic gates is not affected by our optimization. We therefore only consider the fixed worst case (maximum) value of the leakage power consumption of the combinational circuit.
Next, we approximate q pipeline,j , which is given by (25), with a convex function to simplify the problem statements. Expanding (25) yields a summation of all q setup,ij and q hold,ij terms and their mutual products of second and higher order. Since all error probabilities, q setup,ij and q hold,ij , are relatively small (e.g., on the order of 1e-3 or 1e-4), the product of any two (or more) of them is negligible compared to the summation of the first-order terms and can be ignored. Therefore, q pipeline,j may be written as a simple summation of q setup,ij and q hold,ij
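To give a feel for how tight the first-order approximation is, the following Python sketch compares it with the exact pipeline error probability under the independence assumption. The exact form used here, one minus the product of the per-stage success probabilities, is our reading of the description of (25) and should be treated as an assumption; the probability values are hypothetical.

```python
def pipeline_error_exact(q_setup, q_hold):
    """Error probability of the whole pipeline when all setup and hold
    violations are independent: 1 - prod((1 - q_setup_i) * (1 - q_hold_i))."""
    ok = 1.0
    for qs, qh in zip(q_setup, q_hold):
        ok *= (1.0 - qs) * (1.0 - qh)
    return 1.0 - ok

def pipeline_error_first_order(q_setup, q_hold):
    """First-order (additive) approximation: drop all product terms."""
    return sum(q_setup) + sum(q_hold)

# Hypothetical per-stage violation probabilities for a three-stage pipeline.
q_setup = [1e-3, 2e-4, 5e-4]
q_hold = [1e-4, 1e-4, 3e-4]

exact = pipeline_error_exact(q_setup, q_hold)
approx = pipeline_error_first_order(q_setup, q_hold)
rel_gap = (approx - exact) / exact
```

For probabilities of this magnitude the additive form overestimates the exact value only slightly, so using it as the bounded quantity is conservative.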
Furthermore, to conveniently formulate the problems as quadratic programs, we approximate q setup,ij and q hold,ij as first order polynomial functions of SEFF characteristics and T clk
where qsT j , qsw j , qhd j , qhw j are coefficients (of T clk , window size, delay element and window size in q setup,ij and q hold,ij , respectively) corresponding to voltage setting j, and qs j (i) and qh j (i) are voltage and stage-delay dependent fixed terms. As a preprocessing step, we linearize the CDF of any max (min) stage delay around its μ + 3σ(μ − 3σ) point, i.e., for any x within a boundary around such point, F ij (x) ≈ α ij .x + β ij . Hence, (16) and (20) can be approximated as follows, and all coefficients, q * j , can be determined accordingly
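The preprocessing step can be sketched in Python as follows. This is only an illustration: it assumes a Gaussian stage delay and uses the tangent line at the expansion point, whereas the text does not state how the linear fit is obtained for a general CDF. The function names and the numerical values are hypothetical.

```python
import math

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def linearize_cdf(mu, sigma, k=3.0):
    """Tangent-line coefficients (alpha, beta) of the CDF at x0 = mu + k*sigma,
    so that F(x) ~ alpha * x + beta near x0. Use k = -3 for min delays."""
    x0 = mu + k * sigma
    alpha = normal_pdf(x0, mu, sigma)              # slope of the CDF at x0
    beta = normal_cdf(x0, mu, sigma) - alpha * x0  # intercept
    return alpha, beta

mu, sigma = 1000.0, 50.0          # hypothetical max stage delay statistics (ps)
alpha, beta = linearize_cdf(mu, sigma)

# Probability that the delay exceeds x: linearized versus exact.
x = mu + 3.1 * sigma
q_lin = 1.0 - (alpha * x + beta)
q_true = 1.0 - normal_cdf(x, mu, sigma)
```

Because the linearized tail probability is affine in x, a violation probability evaluated at a point that depends linearly on T clk, the window sizes, and the delay elements remains a first-order polynomial in those variables.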
Again, using Theorem 1, we derive a similar algorithm to solve the SOSP problem presented in (26): we enumerate all possible values of v and solve a quadratic program for each v. We refer to this version as SOSP-FV (SOSP with fixed voltage), in which the only variables are the transparency window sizes, the pipeline clock period, and the delay elements
Theorem 2: The SOSP-FV problem is a convex problem, and its optimal solution (if the feasible region is not empty) minimizes the objective function.
Proof: In general, the product or ratio of two convex functions is not necessarily convex [29], and hence we use the additive approximation in (27) for q pipeline,j instead of (25). Therefore, the objective function of the SOSP-FV problem is a quadratic function of its variables (the transparency window sizes, delay elements, and clock period), while the constraints are linear. The resulting convex optimization problem can be solved efficiently using any commercial mathematical optimization tool. Of course, once a solution is obtained, we must verify that the conditions for the validity of the approximations hold; this has always been the case in our experimental results.
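The enumeration over supply voltages can be sketched in Python as below. The outer loop mirrors the described procedure; the per-voltage solver is a toy stand-in for the quadratic program, and its power and delay models (power proportional to v squared, clock period proportional to 1/(v − 0.5)) are hypothetical choices made only so that the example runs.

```python
def solve_by_voltage_enumeration(voltages, solve_fixed_voltage):
    """Solve the fixed-voltage problem for every voltage level and keep the
    best result. solve_fixed_voltage(v) returns (cost, solution), or None
    when the fixed-voltage problem is infeasible."""
    best = None
    for v in voltages:
        result = solve_fixed_voltage(v)
        if result is None:
            continue
        cost, solution = result
        if best is None or cost < best[0]:
            best = (cost, v, solution)
    return best

def toy_fixed_voltage_solver(v):
    """Toy stand-in for the per-voltage quadratic program (assumed models)."""
    if v <= 0.5:
        return None                  # treated as infeasible in this toy model
    power = v * v                    # crude dynamic power model
    t_clk = 1.0 / (v - 0.5)          # crude delay model
    return power * t_clk, {"T_clk": t_clk}

voltage_levels = [0.8, 0.9, 1.0, 1.1, 1.2]   # volts
best_cost, best_v, best_solution = solve_by_voltage_enumeration(
    voltage_levels, toy_fixed_voltage_solver)
```

Since the number of voltage states S is small, the cost of the outer loop is S times that of one convex program.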
E. Error-Tolerant Statistical Power-Delay Optimal Soft Pipeline (ESOSP)
The problem formulations presented in Sections IV-C and IV-D conservatively calculate the pipeline clock period to avoid timing violations that cause pipeline errors. However, the critical path is sensitized only for some specific combinations of inputs, and therefore, these formulations result in a pessimistic clock period. Instead, the error-tolerant statistical power-delay optimal soft pipeline (ESOSP) algorithm, in addition to using the SOSP techniques, aggressively decreases the clock period to improve performance, while implementing a mechanism to capture and fix any timing violations caused by this over-clocking. The proposed algorithm explores the tradeoff between the delay improvement and the increase in power, as well as the power and delay penalties caused by timing errors.
An error handling mechanism is incorporated in our design to guarantee correct functionality under all conditions. Error detection and correction can be fully implemented in SEFF circuit, as described in the Appendix. In another method, error detection is built in SEFF circuit while error correction mechanism is supported by the architecture (through data/instruction flushing and replaying the same data/instructions this time under a transitory operating condition which is more conservative, e.g., lower frequency) (see the Appendix). If the error rate is relatively low, area and power overhead of FF design with built-in error detection circuit will be negligible, compared to FF with built-in error correction circuit.
For simplicity, we focus on the fixed voltage version of ESOSP problem, and generate the solution to original problem of ESOSP by combining the solutions to multiple instances of ESOSP-FV based on Theorem 1. Let P j denote the average total power consumption of pipeline under supply voltage v j , and P p,j denote the average power overhead when encountering an error at same voltage v j (this overhead includes the power consumed for computing erroneous data as well as flushing it and its following data units). Also, let γ denote the average delay (in clock cycles) including error detection and correction (such as flushing) delays. Given an error probability of q j under some voltage v j , the expected value of power-delay objective function may be written as follows:
In fact, the error probability, q j , is a decreasing function of T clk . This is the source of the tradeoff between the power-delay metric of error-free and erroneous operation of the pipeline. Decreasing T clk reduces the power-delay for error-free operation [the first term in (34)], but increases q j and, as a result, the error correction overhead [the second term in (34)].
Implementing time borrowing across adjacent stages of the pipeline effectively reduces the probability of timing errors, q j , and avoids the subsequent power and delay penalties of the error correction step for any T clk . Increasing the transparency window size, however, increases the total power consumption. Fortunately, the power saving gained tends to more than compensate for this increase.
Recall that P j in (34) denotes the sum of the power consumptions of the combinational logic blocks (which also include the delay elements and hard-edge FFs) and the SEFFs, without encountering an error. P j is a function of the voltage, the SEFF window sizes, and the delay elements, and (22) can be rearranged as P j = A j + B j /T clk (33), with A j and B j representing all the terms corresponding to constant values and coefficients of 1/T clk , respectively. For simplicity, let us assume that the power overhead of error correction is β times the power of producing a data value without encountering an error, i.e., P p,j = β P j (the value of parameter β is obtained from micro-architectural and circuit simulations).
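The tradeoff can be explored numerically with the short Python sketch below. The functional form is an assumption, not a reproduction of (34): the expected power-delay is taken as an error-free term weighted by (1 − q j) plus a recovery term weighted by q j, with recovery power β P j and recovery delay γ clock cycles, P j of the form A j + B j /T clk, and q j decreasing linearly in T clk. All numerical values are in arbitrary units and are hypothetical.

```python
def expected_pdp(t_clk, a, b, beta, gamma, q_of_t):
    """Assumed expected power-delay: (1 - q) * P * T + q * (beta * P) * (gamma * T)."""
    p = a + b / t_clk                        # P = A + B / T_clk
    q = q_of_t(t_clk)                        # error probability at this clock period
    error_free = (1.0 - q) * p * t_clk
    recovery = q * (beta * p) * (gamma * t_clk)
    return error_free + recovery

def linear_q(t_safe, slope):
    """Error probability that is zero for T_clk >= t_safe and grows linearly
    as the clock period shrinks below t_safe (clipped to [0, 1])."""
    return lambda t: min(1.0, max(0.0, slope * (t_safe - t)))

def best_clock(grid, **kw):
    """Grid search for the clock period with the lowest expected power-delay."""
    return min(grid, key=lambda t: expected_pdp(t, **kw))

grid = [0.5 + 0.01 * i for i in range(101)]          # candidate clock periods
params = dict(a=2.0, b=1.0, beta=2.0, gamma=5.0)

t_mild = best_clock(grid, q_of_t=linear_q(1.0, 0.02), **params)   # errors grow slowly
t_steep = best_clock(grid, q_of_t=linear_q(1.0, 1.0), **params)   # errors grow quickly
```

In this toy setting, over-clocking pays off when the error probability grows slowly as T clk shrinks, and the error-free clock period is preferred when it grows quickly.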
The ESOSP-FV problem is defined as finding optimum w i s, z i and T clk in the following formulation:
Note that, with the proposed linear approximations for q j , the objective function of (36) is a third-order polynomial, and the problem can be solved using general convex optimization tools [30], [31]. In Section IV-G, we introduce another constraint, which bounds the undetected error probability and should be added to (36).
F. ESOSP for Profiled Operation
Dynamic voltage and frequency scaling (DVFS) is widely used to minimize the power consumption in microprocessors. The entire pipeline should meet timing constraints in every circuit state (also known as DVFS setting). A circuit state is uniquely identified by a supply voltage level which is simultaneously applied to all stages of the pipeline. Changing the voltage to bring about a new circuit state affects the power consumption of pipeline as well as combinational path delay and time budget of combinational circuit.
Consider a scenario whereby based on the system-level power management policy, it has been determined that the circuit will operate in each of its circuit states according to some probability distribution. We present another formulation to minimize the average expected power-delay product over all DVFS circuit states. More precisely, given the probability values for being in various circuit states during the active mode of pipeline operation, we attempt to minimize the power-delay product averaged over all such states.
Let π j denote the probability of being in circuit state s j (characterized for a given voltage level v j ). Then, the weighted cost function is defined as follows:
The ESOSP-profiled problem is thus formulated as follows:
ESOSP thus tries to minimize the power-delay product of the pipeline by finding the optimum set of clock periods, T clk,j (j = 1, . . . , S), one for each circuit state, the set of optimum window sizes, w i (i = 1, ..., N − 1), one for each FF-set, and the optimum delay element of each stage, z i (i = 1, ..., N). Hence, for S circuit states and N pipeline stages, there are S + 2N − 1 optimization variables; in each circuit state, we apply the calculated optimum frequency to all pipeline stages. Notice that the optimum window size of each soft-edge FF-set (recall that the first and last FF-sets always use hard-edge FFs) and the delay elements are design-time decisions, and these size assignments are independent of the circuit state.
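The weighting over circuit states and the variable count can be written as a brief Python sketch. The cost function below is a plain probability-weighted sum of per-state power-delay values, which is our reading of the weighted cost function; the function names and numbers are hypothetical.

```python
def profiled_cost(state_probs, state_pdps):
    """Probability-weighted power-delay over all DVFS circuit states."""
    if abs(sum(state_probs) - 1.0) > 1e-9:
        raise ValueError("state probabilities must sum to 1")
    return sum(p * c for p, c in zip(state_probs, state_pdps))

def num_variables(n_stages, n_states):
    """Count the optimization variables of the profiled problem."""
    clock_periods = n_states        # one T_clk per circuit state
    window_sizes = n_stages - 1     # first and last FF-sets are hard-edge
    delay_elements = n_stages       # one delay element per stage
    return clock_periods + window_sizes + delay_elements

# Hypothetical profile: three circuit states and their power-delay values.
cost = profiled_cost([0.5, 0.3, 0.2], [10.0, 8.0, 6.0])
n_vars = num_variables(n_stages=5, n_states=3)
```

Only the clock periods are per-state quantities; the window sizes and delay elements are shared by all states, which is why the count grows as S + 2N − 1 rather than S(2N).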
G. Bounding the Probability of Undetected Errors
An undetected error in the pipeline can occur due to a very long path that violates the internal timing of the SEFF. Normally, in a SEFF with built-in error handling mechanisms, the input data is re-sampled at a later time by utilizing a phase-shifted global clock signal, PS (see the Appendix). The undetected error probability is the probability of data arriving after T clk + PS, which is calculated by (39). Notice that this equation is similar to (8), except that we have replaced T clk with T clk + PS, because an undetected error occurs only when the arrival time of the correct data is later than the triggering edge of the PS clock in the current cycle. Consequently, given the CDF for max stage delays, the probability of an undetected error in pipeline stage i and supply voltage v j is as follows:
The overall rate of undetected errors for all voltage levels is as follows:
To impose an upper bound on the undetected-error probability, we add PS as a new optimization variable to the problem formulations that have the error detection technique enabled, along with the following constraint, where ε UpperBound is user provided [typically of the same order as ε in (33), e.g., 1e-6 to 1e-10].
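As a rough illustration of this constraint for a single stage, the Python sketch below computes the probability that data arrives after T clk + PS and searches for the smallest phase shift that meets a given bound. It assumes a Gaussian max stage delay and ignores flip-flop overheads and window terms that appear in the full expression; the function names and values are hypothetical.

```python
import math

def tail_prob(x, mu, sigma):
    """P(delay > x) for a Gaussian delay, i.e., 1 - F(x)."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))

def undetected_prob(t_clk, ps, mu, sigma):
    """Probability that data arrives after the PS clock edge at T_clk + PS."""
    return tail_prob(t_clk + ps, mu, sigma)

def min_phase_shift(t_clk, mu, sigma, eps, ps_max, tol=1e-6):
    """Smallest PS in [0, ps_max] with undetected_prob <= eps (bisection),
    or None if even ps_max does not meet the bound."""
    if undetected_prob(t_clk, ps_max, mu, sigma) > eps:
        return None
    lo, hi = 0.0, ps_max
    if undetected_prob(t_clk, lo, mu, sigma) <= eps:
        return lo
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if undetected_prob(t_clk, mid, mu, sigma) <= eps:
            hi = mid        # mid meets the bound: tighten from above
        else:
            lo = mid        # mid violates the bound: tighten from below
    return hi

# Hypothetical stage: mean 1000 ps, sigma 50 ps, clocked at 1100 ps.
ps = min_phase_shift(t_clk=1100.0, mu=1000.0, sigma=50.0, eps=1e-6, ps_max=300.0)
```

In the full formulation PS is chosen jointly with the other variables, and it is also limited from above by the short-path condition discussed in the Appendix, which this single-stage sketch does not model.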
V. Experimental Results

A. Simulation Setup
To extract the parameters used in the optimization problem, we performed transistor-level simulations on soft-edge flip-flops using HSPICE [32]. We used a 90 nm technology model [33] with a nominal supply voltage of 1.2 V. Simulations were conducted at a die temperature of 85°C. In all experiments, the set of available voltage levels is {0.8 V, 0.9 V, 1 V, 1.1 V, 1.2 V}. We synthesized a number of linear pipelines, including some modified ISCAS89 benchmarks (denoted by TBx) and datapath and processor circuits, to construct a set of benchmarks. The SIS [34] and Synopsys Design Compiler packages were used for synthesizing the benchmarks. We then performed timing simulations and used Synopsys PrimeTime to extract the static values of the longest and shortest path delays of each pipeline stage under each voltage setting.
Next, we considered the max and min stage delays of a pipeline to have probability density functions. For this, we ran Monte Carlo simulations on fully synthesized and mapped logic circuits to generate the max/min stage delay distributions by monitoring the top 100 critical paths of each stage (identified using the Synopsys PrimeTime timing analysis tool) affected by variations. We assumed a σ/μ ratio of 5% for the sources of variation, i.e., threshold voltage and channel length, similar to [21], and applied it to the circuit simulations. Finally, we formulated the different algorithms given all the coefficients and parameters needed. To solve the mathematical problems developed in this paper, MATLAB [30] and the TOMLAB toolbox [31] have been used. The algorithms calculate the optimal values of the operating supply voltage and frequency and the transparency window sizes of the individual soft-edge FF-sets in the design that minimize the total power-delay in the soft pipeline circuit.
B. Linear Approximation of General Stage Delay CDF
Given the delay distribution of all stages of the pipeline, we apply the linear approximation of (32)-(34) where the error rate is below 5%. Fig. 12 illustrates the linear and piecewise linear estimates of a sample CDF. The overall mean square relative error of the linear model was 1e-4 and that of the piecewise linear approximation was 4.5e-6. In our simulations, we used a piecewise linear approximation with two regions intersecting at the 99th percentile of the CDF; T clk determines the region of estimation for each stage. For estimating multistage delays, we use the average of the coefficients of the linear models of the involved stage delays. For all testbenches, the error of this linear approximation (single stage and multistage) remained below 2e-4 for the linear model and under 1e-5 for the piecewise linear model, which is acceptable and does not have a high impact on the results of our solutions.
C. OSP Simulation Results
In order to evaluate the performance of the proposed OSP algorithm, we used two conventional FF-based approaches as the baselines for comparison. Baseline implements a conventional pipeline (which contains only conventional hard-edge FFs) and always runs at the nominal voltage of 1.2 V. The second method, Base+VS, adds support for voltage scaling to Baseline. Both baselines operate at the minimum clock cycle time of the pipeline circuit. This clock period was calculated for each of the test pipeline circuits listed in Table I using the standard timing equations of (1) and (2) (for regular FFs), and the power dissipation of the pipeline was subsequently computed. Next, OSP was run on each circuit, exploiting time borrowing across different stages to save power. The percentage improvement in power-delay product achieved by OSP with respect to Baseline and Base+VS on these benchmarks is provided in Table I. The first entry in this table is the name of the benchmark. The specifications of the benchmark, i.e., the max and min delays of each pipeline stage at nominal voltage, are reported in the second through sixth columns of the table. The next five columns report the optimum supply voltage (V*) and clock period (T * clk ) for Baseline (which runs at nominal voltage), Base+VS, and OSP. The last two columns show the percentage reduction in power-delay achieved by OSP (compared to the Baseline and Base+VS algorithms), which is also depicted in Fig. 13. As can be observed, OSP achieves an average power-delay saving of 32% compared to Baseline by applying voltage scaling and time borrowing, and a saving of 10% compared to Base+VS by time borrowing alone. In the case of tb4, the saving is negative compared to Base+VS, since tb4 has the same logic circuit duplicated in each pipeline stage (balanced stages). As expected, there is no room for time borrowing in it; hence, the power overhead of the added circuitry causes a PDP loss. Note that by balanced, we refer to (nearly) equal stage delays.
An interesting observation in the results of Table I is that the optimum clock period calculated by OSP or Base+VS is much larger than that of Baseline. This is because the objective of these two algorithms is the power-delay product (PDP), and in many cases, PDP is reduced when the supply voltage is reduced and, consequently, T clk is increased. However, if the operating frequency of the circuit is an important design criterion, a minimum frequency limit, f min , may be imposed by adding a linear constraint of the form T clk < 1/f min to the OSP problem formulation (and to the other formulations). For instance, we enforced f min to be higher than 85% of the Baseline frequency for tb2 and tb4. In the case of tb4, there is no change since the result is already in that range. However, in the case of tb2, the PDP saving of OSP (compared to Baseline) was reduced to about 38%, while its optimum operating voltage and clock period were found to be 1 V and 914 ps, respectively. Here, limiting the minimum frequency of the circuit limits the benefit of voltage scaling, but time borrowing is still useful in minimizing the clock cycle time.
To provide more insight into the results, we studied how SEFFs are used in a soft pipeline by solving OSP-FV. In this set of experiments, the supply voltage of each pipeline was set at the nominal value and OSP-FV was invoked to find the minimum values of T clk . Table II shows the optimum clock 
D. SOSP Simulation Results
Next, we considered the randomness and variability of the longest and shortest delays of the pipeline stages (calculated as described in Section V-A). We then set up SOSP as the quadratic program presented in (26), with the mentioned linear approximation for q pipeline,j , and solved it using the TOMLAB optimization toolbox. It calculated the optimal values of the operating supply voltage and frequency and the transparency window sizes of the individual soft-edge FF-sets in the design that minimize the total PDP. By setting ε equal to the inverse of the total number of critical paths, we avoid violation of timing constraints.
For purpose of performance comparison, we used two baseline methods similar to the case of OSP, i.e., Baseline is limited to the nominal voltage while Base+VS can also change the supply voltage. The baselines determined the maximum clock frequency of the circuits based on a statistical analysis similar to SOSP, except for utilizing hard-edge FFs in the pipeline circuit. 
E. ESOSP Simulation Results
Next, we measured the error penalties of error detection and correction in a pipeline by micro-architectural simulations. Then we set up and solved ESOSP problem as formulated in (38) , and next compared it to Baseline described in Section V-D, which calculates the optimum frequency of a conventional pipeline under nominal voltage. Since ESOSP benefits from voltage scaling, time borrowing and error tolerance, we studied the portion of total expected power-delay saving due to each of these techniques in the statistical framework.
Table IV summarizes the percentage improvement in power-delay product of three techniques with respect to the Baseline algorithm described above. The first is the Base+VS algorithm, which implements only voltage scaling (denoted by VS). The second is our proposed SOSP, which combines voltage scaling and time borrowing (denoted by VS+TB). The third is ESOSP, which adds error tolerance to SOSP. Table IV also reports the optimum operating point of the soft pipeline for each testbench, i.e., the optimum voltage and clock period, along with the optimum overall error rate of the pipeline, q total . Fig. 14 illustrates the share of each technique in the overall power-delay improvement with respect to Baseline.
Finally, we compared our ESOSP algorithm to an advanced baseline, Base+CS, which adopts the useful clock skew technique on top of Baseline. In this method, the pipeline stages are made balanced (by up to four FO4 inverter delays) by adjusting the clock skew of each individual stage. In contrast, ESOSP reduces the imbalance of the pipeline by means of time borrowing. The results of this comparison show an average PDP saving of 38% for ESOSP over all testbenches. Compared to the 42.7% average PDP saving of ESOSP with respect to Baseline, one can conclude that the share of PDP saving due to time borrowing is reduced by about 5%. The reason is that these two methods have almost the same effect on balancing the stage delays, and hence, the clock period reduction gained by using SEFFs with respect to Base+CS is lower. However, using SEFFs enables dynamic (variable) time borrowing, while clock skew is a static (fixed) method for balancing path delays across different pipeline stages.
As far as the overhead of our proposed techniques (including OSP, SOSP, and ESOSP) is concerned, the area overhead of a SEFF compared to a normal FF is only the internal delay circuitry, which is small compared to the area of the original FF. In addition, compared to the size of the rest of the pipeline, the area overhead of the SEFFs and extra buffers is minuscule. Finally, as far as the runtimes of our proposed algorithms are concerned, for all benchmarks, it takes less than 2 s on a 2.4 GHz Xeon Pentium 4 PC (with 2 GB of memory) to run any of these algorithms in the MATLAB/TOMLAB toolbox.
VI. Related Work
A. Soft-Edge Flip Flops
Soft-edge flip-flops have been used for minimizing the effect of clock skew on static and dynamic circuits [6] , [7] . Recently, the authors of [9] proposed an interesting approach to utilize SEFFs in sequential circuits in order to minimize the effect of process variation on yield. They formulated the problem of statistically aware SEFF assignment which maximizes the gain in timing yield as an integer linear program and proposed a heuristic algorithm to solve the problem. Also, SEFF has been utilized to reduce combinational circuit's soft error rate (SER) [36] by leveraging the effect of temporal masking caused by introduction of transparency window to SEFF circuit design. It is more delay and power efficient compared to circuit redundancy based techniques [36] .
B. Time Borrowing
The authors of [37] proposed an architectural framework, called ReCycle, which adopts clock-skew-based time borrowing to compensate for process variation in a pipeline's latching elements. It solves a linear program to determine the optimum clock skews of the pipeline stages that improve the maximum attainable frequency. It enables the pipeline to tolerate process variation after fabrication.
In a recent work [38], the authors optimized the pipeline clock frequency by replacing flip-flops with pulsed latches to enable time borrowing, as well as by skewing the clock. Introducing clock skew to an edge-triggered flip-flop has an effect similar to circuit retiming in VLSI timing optimization, i.e., the movement of flip-flops across combinational logic module boundaries [39]. Although it achieves time borrowing as SEFF does, it requires modification to the standard tools, and it is a static solution that cannot account for circuit variability and other sources of uncertainty in the environment or input. It has been shown to be ineffective for addressing process variation and circuit imbalance [9]. Moreover, a SEFF can pass data at any time during its transparency window, while a FF with a skewed clock passes the data only at the shifted clock edge. Obviously, adjusting the clock for each individual flip-flop lifts this limitation at the cost of a complex design effort.
C. Integrated Error Handling Mechanisms
The Razor flip-flop design introduced in [5] obtains a significant power reduction by adopting a smart opportunistic voltage scaling scheme, which adjusts the supply voltage based on the detection of timing errors in the pipeline. It equips a pipeline with delay error detection capability as well as an error correction mechanism. In a later work, the authors of [40] proposed two local tuning mechanisms in the context of Razor dynamic voltage scaling: per-stage voltage control and per-stage clock skew adjustment. Its drawbacks are the complexity of providing separate voltage supplies for each pipeline stage in the physical implementation, plus the disadvantages of the clock skewing technique mentioned earlier. In a recent work, the Razor architecture has been revisited and Razor II has been proposed, which provides both low-power operation and SER tolerance [41]. Its power saving is achieved by performing only error detection in the FF, while correction is performed through architectural replay. This also allows a significant reduction in the complexity and size of the FF. Our work efficiently combines the power-saving integrated error handling mechanism of Razor with the performance-enhancing time borrowing technique. Similar to Razor, the MicroFix architecture [42] uses delay errors as the indicator of the required DVFS action. It handles errors in a prediction-based manner [42].
VII. Conclusion
We presented and solved the problem of minimizing the power-delay product metric in a linear pipeline by utilizing soft-edge flip-flops to perform time borrowing between consecutive stages of the pipeline. We formulated the problem of optimally selecting the transparency window sizes of the SEFFs and the clock frequency of the pipeline so as to optimize the power-delay product of the entire pipeline, in three different scenarios that assume deterministic worst case path delays or probabilistic random delays for the pipeline stages. Also, by over-clocking the pipeline, allowing timing violations to occur, and then recovering from the errors, our proposed ESOSP algorithm exploited the tradeoff between performance and power saving to further minimize the expected power-delay product of a pipeline. Our experimental results demonstrated that the proposed technique is quite effective in reducing the expected power-delay of a pipeline.
Appendix SEFF with Built-In Error Detection/Correction
In a SEFF with built-in error detection, a secondary latch, called the shadow latch, is added to the original flip-flop circuit (similar to the Razor FF [5]; however, Razor also integrates error correction circuitry, which increases the flip-flop delay). This shadow latch re-samples the input data at a later time by utilizing a PS global clock signal. If there is a setup time violation in a pipeline stage, comparison of the two data values (sampled at the main and the PS clock edges) detects the error. Fig. 15 illustrates the internal architecture of a master-slave SEFF with built-in error detection, and Fig. 16 illustrates its operation. In this figure, data unit D1 is available early enough to be correctly latched in the main FF at time t 1 and in the secondary latch at t 2 . On the other hand, data D2 misses the latching window and cannot be captured at time t 3 ; instead, D1 is stored in the FF. However, at t 4 , the error detection unit re-samples the data and captures D2; the XNOR of the contents of the two latches indicates an error.
As illustrated in Fig. 17 , in a SEFF with built-in error correction (see Razor FF [5] ), a multiplexer selects between the data sampled at the main clock edge and PS clock edge, which is the correct data in case of any error. Compared to micro-architecture-based error correction mechanisms (e.g., flushing), this approach has less overall performance overhead, but higher power and area overheads.
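The detection and correction behavior described above can be mimicked with a small behavioral model in Python. This is an abstraction rather than the circuits of Figs. 15 and 17: it ignores the transparency window (t_main stands for the latest instant at which the main FF can still capture data), and the function and field names are hypothetical.

```python
def seff_capture(arrival, t_main, ps, new_value, old_value):
    """Behavioral model of one capture cycle.
    The main FF captures new_value only if it arrives by t_main; the shadow
    latch re-samples at t_main + ps. A mismatch between the two flags an
    error, and the multiplexer then selects the shadow (correct) value."""
    main = new_value if arrival <= t_main else old_value
    shadow = new_value if arrival <= t_main + ps else old_value
    error = main != shadow
    output = shadow if error else main
    # Data later than the PS clock edge is missed by both samplers.
    undetected = arrival > t_main + ps and new_value != old_value
    return {"main": main, "shadow": shadow, "error": error,
            "output": output, "undetected": undetected}

on_time = seff_capture(arrival=0.9, t_main=1.0, ps=0.2, new_value="D2", old_value="D1")
late = seff_capture(arrival=1.1, t_main=1.0, ps=0.2, new_value="D2", old_value="D1")
too_late = seff_capture(arrival=1.3, t_main=1.0, ps=0.2, new_value="D2", old_value="D1")
```

The third case shows why the phase shift must be bounded from below: data that also misses the PS clock edge leaves both samplers holding the stale value, so no error is flagged.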
Introducing the PS clock, clkp, to the design requires meeting an additional timing constraint to avoid undetected errors or short path violations in the following scenarios: 1) when the maximum delay of the preceding logic block is so large that the signal misses the triggering edges of both the main and PS clocks, and 2) when the minimum delay of the combinational logic succeeding a flip-flop is so short that new data overwrites the valid data at the PS clock edge, flagging an error where there is none. Imposing the following constraint on the amount of phase shift avoids these scenarios:
