Abstract-In order to achieve high tolerance against process, voltage, and temperature variations in ultralow voltage (ULV) circuits, in situ error detection and correction (EDAC) techniques have been proposed. However, adding error detection capability to circuits incurs large hardware overhead, especially in ULV due to larger delay variability. In this paper, we analyze the hardware overhead of error detection techniques in pipelines based on three different sequential elements: flip-flops, two-phase latches, and pulsed latches. By exploiting the cycle-borrowing ability, we propose a technique called sparse insertion of error detecting registers on the two-phase latch-based and pulsed-latch-based pipelines to reduce the sequential logic area. Furthermore, we propose a delay-padding methodology using a multi-V t cell library in ULV circuits to reduce EDAC hardware overhead. The proposed techniques are applied on a benchmark six-stage pipeline operating at 0.35 V in a 65-nm CMOS. The analysis results show that our proposed techniques can reduce the total area by 26%-33% and the error detecting register count by 2.9-4.3× compared with conventional EDAC techniques.
I. INTRODUCTION

ULTRALOW voltage (ULV) operation has gained a significant amount of attention for highly energy-efficient digital integrated circuits (ICs). The supply voltage (V DD ) of ICs can be scaled down to near or below the transistor threshold voltage (V t ) to increase the energy efficiency, prolong the battery lifetime, and miniaturize systems. Those benefits can enable a range of exciting applications such as medical implants, environment sensors, and microrobots [18].
W. Jin is with Columbia University, New York, NY 10027 USA, and also with Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: kings2005@sjtu.edu.cn).
S. Kim
One of the most critical challenges in designing ULV digital ICs is to mitigate the delay variability. In the ULV regime, device current becomes exponentially sensitive to process, voltage, and temperature (PVT) variations. The large variability forces designers to add an excessive amount of margin to ensure correct operation under the worst case PVT condition. Such a margin, however, can severely limit the performance and energy efficiency of the ICs when they operate under the nominal or the best condition. It is shown that the worst case margin can force a chip to operate at only 10% of its potential performance, despite the fact that the chance to experience the worst case condition is slim [1]. Error detection and correction (EDAC) techniques [2]-[9], [19]-[22] have been proposed to eliminate such margins while still ensuring correct operation across PVT variations. Conventional EDAC techniques use special pipeline registers with error detection ability. Those error detecting registers, which are employed as the receiving pipeline registers for critical and near-critical paths, capture incoming data at two phases: 1) at a clock edge and 2) during a detection window that is often the high phase of the clock. If the data captured at these two phases differ, it is interpreted as a timing error. In those detection techniques, the signals that propagate through short paths may be captured in the detection window, causing false error detection. To avoid this false error detection, delay buffers are inserted into the short paths so that the delays of all the paths become longer than the length of the error detection window.
Dynamic voltage scaling (DVS) or dynamic voltage frequency scaling (DVFS) is often employed along with EDAC techniques [2]-[6]. The controller for DVS/DVFS can take the error rate from error detecting registers and modulate the operating conditions, i.e., V DD or cycle time (T CLK ), so that the circuits operate at the edge of failure. Such closed-loop systems allow us to remove the margin for static (e.g., process variations) and slow-varying variations (e.g., ambient temperature changes) without any postsilicon calibrations. Fast dynamic variations, such as voltage droops, local hot/cold spots, and coupling noise, can be detected and corrected by EDAC techniques. This way, the margin for almost all the variations can be removed.
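The closed-loop behavior described above can be sketched as a simple step controller. This is only an illustration under stated assumptions, not the controller of [2]-[6]: the function name, the target error rate, the step size, and the voltage bounds are all hypothetical values chosen for the example.

```python
def next_vdd(vdd, error_rate, target_rate=0.01, step=0.01,
             vdd_min=0.30, vdd_max=1.00):
    """Return the supply voltage (in volts) for the next control interval.

    Hypothetical policy: raise V_DD when the observed error rate exceeds the
    target, lower it when the error rate is below the target, hold otherwise.
    """
    if error_rate > target_rate:
        vdd += step   # too many timing errors: back away from the edge of failure
    elif error_rate < target_rate:
        vdd -= step   # margin is left: scale down to save energy
    # Clamp to the assumed operating range and remove floating-point noise.
    return min(vdd_max, max(vdd_min, round(vdd, 3)))
```

A frequency-scaling variant would adjust T CLK in the same way instead of V DD.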
Adding the capability of error detection to pipelined designs, however, can incur a large amount of hardware overhead. First, the error detecting registers often have 1.3-3.2× more transistors than regular registers [2]-[7], which can significantly increase the sequential logic overhead. Second, the combinational logic area will be significantly increased because of short-path padding. To make things worse, the overhead of ULV designs is expected to further increase due to the large delay variability. For example, we need to insert more error detecting registers because the number of critical paths that can potentially violate T CLK becomes larger. In addition, more delay elements should be deployed to ensure that the delay of short paths is longer than the length of the error detection window under the worst case PVT condition. This paper starts with analyzing the area overhead for three representative EDAC designs, one using flip-flops (or flops) [2], another using two-phase latches (or latches) [3], [17], and the other using single-phase pulsed latches [14], [15], in the context of ULV pipelines. The conceptual schematics of the three designs are shown in Fig. 1(a)-(c), respectively.
All of the designs are found to cause a large amount of hardware overhead. The flop-based technique [2] suffers from short-path padding, exhibiting a 2.2× increase in the combinational logic area and a 2.1× increase in total area. The insertion rate of error detecting registers (i.e., error detecting registers out of total registers) in flop-based designs also increases from 19% at nominal 1 V to 44% at 0.35 V.
On the other hand, the two-phase latch-based technique has an advantage that it needs no short-path padding and has a timing-borrowing ability. However, it has an inherently larger sequential overhead than flop-based design. In addition, the logic depth per latch stage is reduced by half, which can exacerbate the impact of local variations because the averaging effects through logic paths are reduced.
The pulsed-latch-based design can have the smallest sequential logic area (while ignoring pulse generation and distribution overhead). It also retains the timing-borrowing ability. However, the pulsed-latch-based pipeline has the tradeoff between cycle-borrowing ability and short-path padding; increasing cycle-borrowing ability requires more short-path padding, causing hardware overhead. It is also more difficult to distribute short pulses than regular clocks having 50% duty cycle because a narrow pulsewidth can undergo severe distortion in the ULV regime.
To reduce the overhead associated with the in situ error detection capability, we propose a technique to sparsely insert error detecting registers across pipeline stages instead of at every stage (as done in [3]). We find that while the conventional techniques often provide an error detection window of 50% of T CLK per stage, more than half of the window is not utilized even under the worst case delay variation induced by dynamic variations. Therefore, in two-phase latch [Fig. 1(d)] [17] and pulsed-latch pipelines [Fig. 1(e)], timing violations produced in a stage do not need to be detected and corrected as long as they can be passed to the following stage via cycle borrowing. The passed timing violations either disappear when propagating through noncritical paths or are again cycle-borrowed to the further subsequent stages without causing explicit timing errors. After signals propagate through several stages, however, delay increases may be accumulated to be close to 50% of T CLK . This is the point where we place error detectors. Note that the locations of error detectors are set based on the worst case assumption that signals travel only the critical paths across stages. The proposed technique, therefore, can maintain the same error detection coverage as the existing techniques.
In addition, we propose to perform short-path padding using a multi-V t standard cell library for the sparse pulsed-latch EDAC. The high-V t cells have a longer delay and less power dissipation than the regular-V t and low-V t cells, and thus can reduce the area overhead of padding. While this is well known for super-V t circuit design, we find that the use of the multi-V t library for short-path padding is very effective particularly for near- and sub-V t circuits since high-V t cells are exponentially slower than regular- and low-V t cells. With the multi-V t library, we can have a large amount of padding at low area and power overhead, allowing the use of wider pulses. This also alleviates the overhead associated with pulse generation and distribution.
In our benchmark circuits [multiplier-based, T CLK = 40 fanout-of-4 (FO4) delays, and V DD = 0.35 V], we find that for the two-phase latch-based design, error detection is needed only every six latch stages, i.e., the insertion sparseness of two-phase latch pipelines (N TL ) is six, reducing the number of error detecting registers by 4.3× compared with the conventional latch-based techniques [3] . Here N TL is defined as the number of latch stages that can be skipped between error detecting registers. This reduction and short-path padding-free property achieves a total area overhead of 27.5% compared with the conventional flop-based pipeline having no EDAC.
For the pulsed-latch-based design, we experiment with multi-V t short-path padding as well as the sparse insertion technique. We find that the insertion sparseness of pulsed-latch pipelines (N PL ) is two, which reduces the number of error detecting registers by 2× compared with the conventional pulsed-latch-based techniques [14], [15]. The total area overhead is reduced down to only 15% using pulsed-latch-based design compared with the conventional flop-based pipelines having no EDAC.
In addition, the proposed techniques can significantly reduce the error rate by 5.5 to 37× compared with the conventional techniques because many of the variation-induced delay increases are self-healed when signals travel through multiple stages via cycle borrowing. The substantial reductions in area overhead and error rate without compromising detection coverage make the proposed technique an attractive option to enable error-free ULV pipeline designs without imposing the worst case margin.
The remainder of this paper is organized as follows. In Section II, the overhead of three existing error detecting techniques is analyzed. We propose our sparse error detecting register insertion technique in Section III. Section IV describes our EDAC technique using sparse error detection and multi-V t short-path padding for pulsed-latch pipelines. In Section V, we compare our proposed EDAC technique with the conventional techniques using a six-stage pipeline benchmark. Finally, we conclude this paper in Section VI.
II. OVERHEAD ANALYSIS OF THE EXISTING ERROR DETECTION TECHNIQUES IN THE ULV REGIME
In this section, we analyze the overhead of the three representative EDAC designs, flop-based [2], two-phase latch-based [3], and pulsed-latch-based [4], in near- and sub-V t operation.
A. Flop-Based Error Detection
1) Short-Path Padding:
The existing flop-based EDAC technique [2] suffers severely from the area overhead caused by the short-path padding. According to this technique, any data arriving in the error detection window is regarded as a timing error. However, the signals that propagate through short paths can also arrive in this window, causing false error detection. To filter the correct signals arriving through short paths from actual timing errors, delay elements (e.g., buffers or inverters) are typically inserted to ensure that the delays of the short paths are longer than the length of the error detection window.
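The distinction between a false error from a short path and a real timing error can be illustrated with a small timing model. This is a simplified sketch, not the circuit of [2]: it assumes an ideal clock, ignores setup and hold times, measures every path delay from the launching clock edge, and uses the hypothetical function name `classify_path`.

```python
def classify_path(delay, t_clk, window):
    """Classify a path by its delay, measured from the launching clock edge.

    Simplified model: the receiving register samples at the next clock edge
    (time t_clk) and watches for data changes for `window` after every edge.
    """
    if delay <= window:
        # Short path: the new data reaches the receiver while the detection
        # window of the launching edge is still open.
        return "false_error"
    if delay <= t_clk:
        return "ok"            # arrives before the capturing edge
    if delay <= t_clk + window:
        return "timing_error"  # late arrival, caught inside the window
    return "undetected"        # later than the window can cover
```

With T CLK = 40 and a 20-unit window, a path of delay 5 is flagged falsely, which is exactly the case that short-path padding removes by pushing the delay above the window length.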
In fact, the delay of short paths must exceed the length of the detection window even under the worst case PVT condition. We perform an experiment using a single-stage 16-bit multiplier synthesized at 40-FO4 delays in a 65-nm CMOS to explore the relationship between minimal short path delay, minimal monitoring path delay, and V DD when considering a 3σ delay variation incurred by local process variations. Fig. 2 shows the results. As shown in Fig. 2(a), short paths should be padded such that their delays are longer than 61% of T CLK at 0.35 V when considering a 3σ delay variation. In addition, more paths need padding at lower V DD , again due to the larger delay variability in the ULV regime. All of these require inserting a large number of delay buffers, causing a 2.2× increase in the combinational logic area compared with the baseline design having no error detection capability. To reduce the overhead incurred by short-path padding, some previous works have proposed to reduce the duty cycle of the clock below 50% [4], [5], [8] or to generate an internal detection window in each error detector [6]. Whereas those approaches can relax the requirement of the short-path padding, they also reduce the length of the error detection window. This is undesirable in ULV design due to the larger delay variability. In addition, the large delay variability makes it difficult to generate and distribute such a clock signal with a skewed duty cycle. It is also nontrivial to locally and robustly generate a pulse with a fixed pulsewidth defined by delay elements, as the quality of pulses may suffer from large variability.
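To make the padding requirement concrete, the following sketch counts how many delay buffers a short path needs so that its delay becomes strictly longer than a given fraction of T CLK (61% at 0.35 V in the experiment above). The model is an assumption made for illustration: every buffer adds the same fixed delay `buffer_delay`, the variability of the buffers themselves is ignored, and the function name is hypothetical.

```python
import math

def buffers_needed(path_delay, t_clk, min_fraction, buffer_delay):
    """Number of identical buffers needed so that the padded path delay is
    strictly longer than min_fraction * t_clk (all delays in the same unit)."""
    target = min_fraction * t_clk
    if path_delay > target:
        return 0  # already long enough, no padding required
    # Smallest n with path_delay + n * buffer_delay > target.
    return math.floor((target - path_delay) / buffer_delay) + 1
```

For example, with T CLK = 40 FO4, a 10-FO4 path, and an assumed 2-FO4 buffer, `buffers_needed(10, 40, 0.61, 2)` evaluates to 8.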
2) Error Detector Insertion Rate: Another major source of area overhead in the conventional flop-based EDAC technique comes from the error detecting registers. Typically, the delays of critical and near-critical paths are estimated under the worst case dynamic variation. Those flops that receive data from those paths that can potentially violate T CLK are replaced with error detecting registers. In the ULV regime, more paths are likely to violate T CLK due to the higher sensitivity to the ranges of dynamic variations. A few notable examples of dynamic variations include IR drops, particularly in designs employing distributed power gating switches [10] , [11] , and coupling noise [12] , particularly in designs using multiple V DD s and V t s where the strength difference between aggressors and victims is large.
We investigate the number of critical and near-critical paths requiring error detection using the single-stage 16-bit multiplier across V DD s from 1 down to 0.3 V. The relationship is shown in Fig. 2(b) . Any path whose delay is longer than 76% of T CLK should be monitored at 0.35 V, whereas only the paths longer than 92% of T CLK need to be monitored at 1 V. The increased number of critical and near-critical paths requires that 44% of the total flops need to be replaced with error detecting registers compared with only 19% of the total flops at 1 V. Note that some conventional works targeting nominal V DD operation [2] , [5] , [6] achieve the lower replacement rates of 7%-17%, partly due to the imbalanced delays among stages. It is, however, common to find designs having similar delays among stages, and therefore such opportunistic savings in the replacement rate can be limited.
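The replacement rate follows directly from the monitoring threshold: a register is replaced when its worst incoming path is longer than the monitored fraction of T CLK (76% at 0.35 V, 92% at 1 V). The sketch below shows the counting step only. The delay list is made-up illustrative data, not the multiplier's actual path distribution, and the function name is hypothetical.

```python
def insertion_rate(endpoint_delays, t_clk, monitor_fraction):
    """Fraction of registers whose worst incoming path delay exceeds
    monitor_fraction * t_clk and therefore needs an error detecting register."""
    threshold = monitor_fraction * t_clk
    monitored = sum(1 for d in endpoint_delays if d > threshold)
    return monitored / len(endpoint_delays)

# Hypothetical worst-path delays (in FO4) at eight register inputs.
delays = [20, 28, 31, 33, 35, 37, 38, 39]
rate_ulv = insertion_rate(delays, 40, 0.76)      # threshold used at 0.35 V
rate_nominal = insertion_rate(delays, 40, 0.92)  # threshold used at 1 V
```

On this made-up data, lowering the threshold from 92% to 76% of T CLK doubles the share of monitored registers, which mirrors the trend reported above.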
B. Two-Phase Latch-Based Error Detection
An EDAC technique based on two-phase latch sequencing has been recently proposed, primarily focusing on reducing the architectural invasiveness [3]. An additional benefit of using latch-based sequencing for EDAC techniques is the elimination of the false error detection induced by short paths. Because each consecutive latch stage becomes transparent at an opposite phase of the clock signal, no new data from the previous latch stage are launched during the transparent phase of the current latch stage, inherently eliminating false error detection. This technique uses the cycle-borrowing window of latch stages as the error detection window. If time borrowing occurs, it is interpreted as a timing error.
1) Sequential Overhead: Transforming a flop-based design into a two-phase latch-based design can increase the sequential overhead. In our experiment, such a transformation performed on a 16-bit multiplier can increase the sequential logic area by 2.6× and the total area by 18%. This is because 1) a pair of latches has a larger area than a single flop and 2) the total number of latches is more than twice the number of flops, i.e., 16 flops (roughly equivalent to 32 latches) in the original design are transformed into 39 latches.
2) Error Detector Insertion Rate: Applying an EDAC technique to latch-pipelined designs can significantly increase the number of error detecting registers. This is because a latch-pipelined design has more sequential elements than a flop-based design. In addition, the delay of one latch stage is shorter (close to half of that of one flop stage), which can aggravate the impact of local variations. In our multiplier test circuits, a latch stage has 1.7× higher variability than that of a flop stage. As a result, 23 out of 39 latches (58%) need to be replaced with error detecting ones, while 7 out of 16 flops (44%) are replaced in the flop-based design.
C. Pulsed-Latch-Based Error Detection
Pulsed-latch-based pipelines have advantages in sequential area and timing borrowing. Unlike two-phase latch-based design, pulsed-latch-based design does not need to divide the logic of every pipeline stage into two latch stages, and it has a timing-borrowing window as long as the pulsewidth (T pw ). EDAC can be applied on pulsed-latch-based pipelines [14], [15]. In this case, we can use the timing-borrowing window as the error detection window. However, pulsed-latch-based EDAC suffers from the area overhead caused by the short-path padding like the flop-based EDAC. To avoid false error detection from short paths, we need to increase the delay of the shortest path in pipelines to be larger than T pw .
1) Area Overhead of Pulsed-Latch-Based Design:
The area overhead for short-path padding corresponds to T pw . Having a larger T pw enables larger detection window and more cycle borrowing, but requires us to add more delay elements to short paths. In order to understand this tradeoff, we perform an experiment where we pad short paths in the automatic place and route (APR) phase using single-V t cells. We use a 16-bit multiplier with T CLK = 40-FO4 delays and V DD = 0.35 V. As shown in Fig. 3 , the area overhead can be as large as 30.9% if we want to use T pw of 1/3 T CLK . This is a significant amount of overhead that must be addressed.
2) Pulse Distribution Overhead: In addition to the short-path padding, another source of overhead in pulsed-latch-based EDAC is pulse clock distribution. It requires strong buffers in the clock network to meet slew time constraints.
Slew time should be short enough to ensure sufficiently long time of pulses being high. We set the high phase time to 5-FO4 delays for meeting latch setup time and allowing some timing borrowing. Also, slew time needs to be short enough not to adversely affect C-to-Q delay of latches, and therefore we set a constraint that slew time needs to be < 3-FO4 delays.
With those constraints, we design and simulate the required buffer size to distribute pulses while sweeping T pw from 6-FO4 to 15-FO4 delays at 0.35 V. We use a clock network based on the merged buffer scheme to minimize skew and its variability [16]. The network drives 400 pulsed latches. As shown in Fig. 4, we cannot reliably distribute pulses whose T pw is <6-FO4 delays since wire resistance prevents us from achieving the required slew time and thus the required high-phase duration of 5-FO4 delays. The experiment shows that we need buffer sizes of 72X, 36X, and 18X for distributing pulses whose T pw 's are 6-, 7-, and 8-FO4 delays, respectively. For T pw > 8-FO4 delays, we can use the same slew time and thus the same buffer strength of 18X. Overall, from the pulse distribution overhead perspective, it is beneficial to use wider pulses, in particular wider than 8-FO4 delays in this experiment. Distributing narrow pulses can cause up to 4× overhead in buffer sizes in the network.
III. SPARSE ERROR DETECTING REGISTER INSERTION
A. Concept
EDAC and DVFS techniques can track and compensate for static and slow-varying variations, which include systematic interdie, intradie, and random process variations, package and die supply voltage fluctuations, and ambient temperature variations, contributing a large portion of the entire variability of a chip. The remaining dynamic and fast-varying variations such as local V DD variations, coupling noise, and hot and cold spots need to be handled solely by EDAC due to the limited response time of the EDAC and DVFS closed loop.
While the conventional techniques often provide an error detection window of 50% of T CLK , we find that more than half of the window is not utilized even under the worst case delay variation induced by dynamic variations. Such underutilization should be optimized since it can directly reduce the hardware overhead associated with adding error detection capability.
To better utilize the error detection window, we propose to sparsely insert error detecting registers in pipeline circuits. The conceptual schematics of them are shown in Fig. 1(d) and (e). This strategy, applied onto two-phase latch and pulsed-latch pipelines, can place error detecting registers only every N stages, instead of every stage like in the conventional EDAC design based on two-phase latch and pulsed-latch sequencing as shown in Fig. 1(b) and (c) .
In the proposed sparse insertion of error detecting registers, we do not intend to detect the timing violations (or delay surplus) produced in every stage as long as they can be passed over to the next stage via cycle borrowing. The delay increases can disappear when propagating through noncritical paths, or they are cycle-borrowed again to the next stage. Eventually, a delay surplus can be accumulated and become larger than the size of the cycle-borrowing window. We insert error detecting registers before the accumulated delay surplus is expected to exceed the size of the cycle-borrowing window. Note that we set the insertion locations under the worst case condition that signals travel only the critical paths across pipeline stages. This allows our insertion technique to have the same error detection coverage as the conventional EDAC techniques.
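The placement rule can be sketched as a greedy pass over the pipeline. This is an illustrative reading of the rule, not the authors' tool flow. It assumes that `stage_surplus[i]` is the worst case delay surplus added by stage i when the signal takes the critical path, that surpluses add up across stages, and that a detected error is corrected so the accumulated surplus restarts from zero after a detector. The function name and the numbers are hypothetical.

```python
def place_detectors(stage_surplus, window):
    """Return the indices of the stages that receive error detecting registers.

    A detector is placed at the last stage whose accumulated worst case
    surplus still fits in the cycle-borrowing window.
    """
    detectors = []
    accumulated = 0
    for i, surplus in enumerate(stage_surplus):
        if surplus > window:
            raise ValueError("a single stage exceeds the cycle-borrowing window")
        if accumulated + surplus > window:
            # Passing stage i would overflow the window, so the surplus
            # accumulated so far must be detected at the previous stage.
            detectors.append(i - 1)
            accumulated = 0  # assumed: correction restores nominal timing
        accumulated += surplus
    return detectors

# 14 stages, each adding a worst case surplus of 3 units, window of 20 units.
print(place_detectors([3] * 14, 20))  # prints [5, 11]
```

In this example six stages fit between detectors, because six surpluses of 3 units stay within the 20-unit window while a seventh would not.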
Another significant benefit of our sparse insertion technique is the reduced error rate. A critical path in one stage may not directly feed another critical path in the next stage [13] . Therefore, some of the delay surplus produced in a stage can disappear in the next stage via cycle borrowing without explicitly flagging timing errors. This self-healing effect can reduce the overhead associated with detecting and correcting errors, saving energy, and improving throughput even under operating conditions with large variability.
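The self-healing effect can be illustrated by tracking the borrowed time from stage to stage. The sketch below is a simplified model under stated assumptions: `deltas[i]` is the delay of the exercised path in stage i minus the nominal stage budget (negative for noncritical paths), borrowed time never drops below zero, and a flagged error is corrected so that borrowing restarts from zero. The function name and the example numbers are hypothetical.

```python
def count_errors(deltas, detector_stages, window):
    """Count flagged timing errors for a given set of detector locations."""
    borrowed = 0
    errors = 0
    for i, delta in enumerate(deltas):
        # Slack on a noncritical path absorbs previously borrowed time.
        borrowed = max(0, borrowed + delta)
        if borrowed > window:
            raise ValueError("surplus exceeds the cycle-borrowing window")
        if i in detector_stages and borrowed > 0:
            errors += 1   # any borrowing at a detector is flagged as an error
            borrowed = 0  # assumed: correction restores nominal timing
    return errors

deltas = [4, -6, 3, 2, -10]
every_stage = count_errors(deltas, {0, 1, 2, 3, 4}, 20)  # conventional insertion
sparse = count_errors(deltas, {4}, 20)                   # detector at last stage only
```

Here the conventional insertion flags three errors, whereas the sparse insertion flags none because the noncritical paths in stages 1 and 4 absorb the borrowed time.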
B. Inverter Chain Study-Simulation Setup
To evaluate the effectiveness and robustness of the proposed sparse error detection technique, we perform SPICE-level simulations using 20 latch stage circuits where each stage has a 25-FO4 long inverter chain. In this section, we focus on two-phase latch pipelines. The application of the sparse insertion technique on pulsed-latch pipelines will be discussed in Section IV. First, we determine the minimum T CLK by measuring the delay of two adjacent latch stages that include the delays of an inverter chain and a pair of latches. A minimal margin of an FO4 delay is added to T CLK in order to account for input and clock uncertainties. Second, Monte-Carlo simulations with local process variations are performed. Across the simulation, the data arrival time in each latch stage is observed to determine whether the data are properly captured.
In this paper, the 6σ worst case delay variability from local process variations is used to account for all the dynamic variations. In ULV operation, a small amount of driving current and a relatively slow clock frequency can reduce the concern for inductive noise. In addition, the device current has a positive temperature coefficient, i.e., the current increases with the higher temperature, in the near-and sub-V t regimes, which can relieve the concern for temperature hot spots. To precisely estimate the amount of dynamic variation is a design-specific task and beyond the scope of this paper.
C. Inverter Chain Study-Conventional Case (N TL = 1)
First, we analyze the conventional case where error detecting registers are inserted in every stage, i.e., insertion sparseness or N TL is one. The required window is defined as the minimum amount of window needed to capture the 6σ worst case delay from the Monte-Carlo simulation. As shown in Fig. 5 , the simulation results indicate that the required amount of the error detection window increases as V DD is scaled down because the variability grows. The results, however, also show that even at 0.3 V, a small error detection window of 19% of T CLK is sufficient for the worst case dynamic variation, making the remaining error detection window of 31% of T CLK redundant.
In addition, we experiment with different length latch stages, particularly because the delay variability can become worse at the fine-grained pipeline stages, demanding wider error detection window. At 0.35 V, as shown in Fig. 6 , we observe that the required size of the window increases at finer-grained latch stages due to the diminishing amount of averaging effects. However, again, even for a 10-FO4 long latch stage (a 20-FO4 long stage in flop-based or pulsed-latch-based design), only 25% of T CLK is sufficient for error detection window when error detecting registers are inserted in every stage (i.e., N TL =1). This underutilized detection window is a motivating point for the proposed sparse insertion of error detecting registers.
D. Inverter Chain Study-Sparseness Optimization
The underutilization of the error detection window motivates us to investigate how to sparsely insert error detecting registers in the two kinds of latch-based sequencing. This way we can accumulate delay surplus across stages without explicitly causing timing errors, and the sparsely placed detection stage can utilize the entire error detection window. For the two-phase latch-based design, this error detection window is 50% of T CLK . For the pulsed-latch-based design, the error detection window is the same as T pw . A significant number of latch stages can be skipped before error detecting registers are inserted. At V DD > 0.4 V, the optimal sparseness is estimated to be larger than 20 latch stages.
First, two-phase latch-based design is analyzed. We use the test circuits as mentioned above at V DD s ranging from 1 to 0.3 V. The relationship between the required window and number of latch stages at different V DD s is shown in Fig. 7. The required window almost linearly increases as the number of latch stages increases in our 6σ worst case Monte-Carlo simulations. This is because we assume that every signal goes through critical paths in every stage. Self-healing happens when a signal goes through noncritical paths. However, when we determine the sparseness, we consider the worst case scenario without self-healing so that we can robustly detect timing errors. At 0.35 V, up to seven latch stages can be skipped without placing error detecting registers under the 6σ worst case dynamic variation. A notable observation is that the required error detection window is considerably small at V DD > 0.4 V. This is because the large cycle-borrowing window (50% of T CLK ) and the added 1-FO4 (2.5% of T CLK in this case) margin, coupled with a smaller amount of delay variability, are sufficient to absorb delay variability from all the dynamic variations. N TL , therefore, can be larger than 20 latch stages (ten stages in flop-based or pulsed-latch-based pipeline), and its exact value is not found in our simulation.
We also investigate the optimal sparseness across different lengths of latch stages from 10-to 100-FO4 delays at 0.35 V, which is shown in Fig. 8 . According to Fig. 8 , the optimal sparseness increases as the latch stages become longer due to the larger amount of averaging effects. When latch stage is 100-FO4 long, the optimal sparseness is 14. But for the very aggressive latch stage of 10-FO4 delays, the optimal sparseness is reduced to 4 (i.e., two stages in the flop-based or pulsed-latch-based pipeline), thus providing smaller benefits.
E. Optimal Sparseness N TL for General Pipelines
In large-scale pipeline designs, it is not trivial to perform a brute-force search for N TL as we did for inverter chains in Section III-D. A strictly nonoptimal yet effort-saving approach is to find the N TL for the top several longest paths among all the stages. Critical paths can be found using commercial tools for static timing analysis and automatic test pattern generation. For the found critical paths and near-critical paths, we can run a Monte-Carlo simulation with local process variations to estimate the mean (μ) and the standard deviation (σ). This μ and σ can then be used conservatively for all the other stages. Finally, based on the law of the sum of independent random variables, we can estimate the optimal N TL that can fully utilize the detection window, but still cover the found worst case dynamic variation (i.e., the 6σ value) from

$$6\sqrt{N_{TL}\,\sigma^{2}} < \text{Detection Window} \tag{1}$$

where N TL is the sparseness of two-phase latch pipelines and σ is the standard deviation of path delay. For two-phase latch-based design, because the detection window is half of the clock cycle, N TL can be found from

$$6\sqrt{N_{TL}\,\sigma^{2}} < \frac{T_{CLK}}{2} \tag{2}$$

where T CLK is the clock cycle.
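The sparseness bound can be evaluated numerically once σ is known. The sketch below solves the condition that the 6σ variation accumulated over N independent stages, 6·sqrt(N)·σ, stays below the detection window, and returns the largest integer N that satisfies it. The function name and the σ values in the example are hypothetical and are not taken from the paper's measurements.

```python
import math

def max_sparseness(sigma, window, k=6.0):
    """Largest integer N such that k * sqrt(N) * sigma < window.

    sigma:  per-stage standard deviation of path delay
    window: detection window, in the same unit as sigma
    k:      number of standard deviations to cover (6 in the text)
    """
    if sigma <= 0 or window <= 0:
        raise ValueError("sigma and window must be positive")
    bound = (window / (k * sigma)) ** 2  # N must be strictly below this value
    return max(math.ceil(bound) - 1, 0)

# Two-phase latch case: the window is half of T_CLK (here 40 FO4 / 2 = 20 FO4).
print(max_sparseness(sigma=1.0, window=20.0))  # prints 11
```

Because the bound scales with the square of window/σ, a 1.5× larger σ (1.5 FO4 in this example) reduces the result from 11 to 4, which is consistent with the smaller sparseness reported for shorter, more variable stages.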
IV. SPARSE ERROR DETECTION FOR PULSED-LATCH-BASED PIPELINES
In this section, we discuss the application of the sparse insertion technique on pulsed-latch pipelines. We also propose a multi-V t-based design methodology for enabling long yet area/energy-efficient short-path padding, which is critical for pulsed-latch pipelines.
A. Short-Path Padding Using a Multi-V t Library
As shown in Section II-C, it is critical to enable the use of wider pulses in pulsed-latch pipelines since it can allow more cycle borrowing, longer error detection window, and also increase sparseness (N PL ) in the proposed sparse insertion. To use wider pulses, however, we must perform more short-path padding. Therefore, it is critical to minimize the area and power overhead of such short-path padding.
Since high-V t cells have longer delay and less power dissipation than regular-V t and low-V t cells, we propose an automatic flow of short-path padding using a multi-V t cell library. Fig. 9 shows the flowchart of the proposed methodology. It starts with a logic synthesis tool (we use Synopsys Design Compiler) with constraints on both critical paths and short paths. The synthesis tool uses a multi-V t library, which is characterized at V DD = 0.35 V using Cadence Encounter Library Characterization in this paper. For a gate-level netlist, we can also perform short-path padding during the APR phase using an APR tool (we use Cadence Encounter). This step can consider the impact of parasitics on timing. Then we perform static timing analysis to determine if the timing constraints are met.
Using the flow, we experiment with various amounts of short-path padding using either single-V t or multi-V t libraries. For the test circuits, we use a 16-bit multiplier and perform padding during the APR phase. As shown in Fig. 3, in order to pad short paths to 13-FO4 delays, the single-V t -based padding causes 30.9% area overhead, while the multi-V t -based one exhibits 3.6× smaller overhead of 8.7%, confirming the efficacy of the proposed padding.
As a summary, the use of multi-V t library for padding enables significant reduction in padding overhead. This large degree of improvement is unique to only near-/sub-V t circuits since high-V t cells become much slower relative to regularand low-V t cells only in near-/sub-V t circuits. 
B. Sparse Insertion on Pulsed-Latch Pipelines
We apply the sparse insertion technique to pulsed-latch pipelines. We use test circuits similar to those used in Section III, but modify them to have 20 pulsed-latch stages, each 50-FO4 delays long. We first investigate the relationship between the maximum sparseness of pulsed-latch pipelines N PL and T pw . Since the size of the detection window is equal to T pw , based on (1), N PL can be found from
Equation (3) shows that we can achieve a larger N PL by increasing T pw , which differs from the case of two-phase latch pipelines, whose detection window is typically fixed to 0.5 · T CLK . Since a larger T pw requires more short-path padding and the related area/power overhead, this poses a tradeoff. We assume that short paths are padded to support T pw = 13-FO4 delays and perform SPICE-level simulations to find N PL over V DD s from 0.3 to 1 V. The relationship between the required window and the number of latch stages is shown in Fig. 10 . Similar to Fig. 7 , the required window increases almost linearly with the number of latch stages in our 6σ worst case Monte-Carlo simulations. As shown in Fig. 10 , N PL is found to be 3 at 0.35 V under the 6σ worst case dynamic variation. Note that N PL = 3 is roughly equivalent to N TL = 6 in two-phase latch pipelines. At higher V DD s, N PL increases. At V DD > 0.4 V, the optimal sparseness is found to be more than 20 stages. Pulsed-latch pipelines exhibit sparseness similar to that of two-phase latch pipelines, except that N PL is moderately smaller than N TL after accounting for the difference between latch and pulsed-latch stages, since the considered pulsed-latch pipeline has a smaller cycle-borrowing window.
We also explore the required T pw for different N PL 's at a fixed V DD of 0.35 V. As shown in Fig. 10 , we need larger T pw in order to increase N PL . For example, if we want to insert error detection latches every three pulsed-latch stages (N PL = 3), T pw should be >47% of T CLK . 
V. CASE STUDY WITH SIX-STAGE PIPELINE CIRCUITS
In this section, we apply the proposed techniques to a more realistic benchmark circuit, a six-stage pipeline design based on six 16-bit multipliers. We set V DD to 0.35 V and T CLK to 40-FO4 delays. The flop-based [ Fig. 11(a) ], two-phase latch-based [ Fig. 11(b) ], and pulsed-latch-based [ Fig. 11(d) ] six-stage pipeline circuits are shown. Fig. 11(c) and (e) gives the error detector locations across different sparsenesses for two-phase latch and pulsed-latch pipelines, respectively.
A. Two-Phase Latch-Based Sequencing
Test circuits using two-phase latches are designed by industrial CAD tools and custom scripts. Six-stage pipeline flop-based circuits are retimed into 12 latch stage ones [ Fig. 11(b) ]. Retiming is performed using half the T CLK (20-FO4 delays) of the original flop-based pipeline. We reserve the cycle-borrowing window for use when we add the proposed error detection ability. The latches, therefore, are treated as flip-flops during retiming, and the timing closure step becomes the same as that of the conventional flop-based design. Finally, custom automatic scripts are created to replace the flops in the odd stages with transparent-low latches and those in the even stages with transparent-high latches.
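The final replacement step can be sketched as follows. This is only an illustration of the odd/even rule stated above, not the custom scripts used in this work; the register names and the dictionary-based netlist representation are hypothetical.

```python
def assign_latch_types(stage_of_flop):
    """Map each retimed flop to a latch type by the parity of its stage.

    stage_of_flop: dict from register name to its 1-based latch-stage index
    (1..12 after retiming the six-stage pipeline).
    Odd stages become transparent-low latches, even stages transparent-high.
    """
    return {
        name: "transparent_low" if stage % 2 == 1 else "transparent_high"
        for name, stage in stage_of_flop.items()
    }

# Hypothetical registers taken from three of the 12 retimed stages.
example = {"reg_a": 1, "reg_b": 2, "reg_c": 12}
latch_types = assign_latch_types(example)
print(latch_types)
```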
Before being equipped with error detection capability, the latch-based design exhibits an 18% larger area and a 2.4× larger clock load than the flop-based one, which translates into approximately 24% energy overhead in the test circuits. This is because a pair of latches has a larger clock load than a single flop, and also because the total number of latches is larger than twice the number of flops for the same pipeline circuits. The flop-based pipeline before retiming has 96 flops, and the latch-based pipeline after retiming has 234 latches (96 transparent-high and 138 transparent-low latches). This is an inherent overhead of two-phase latch-based pipelines. The gains from: 1) cycle borrowing and 2) the elimination of short-path padding when using the proposed technique, however, largely outweigh this intrinsic overhead, as we will see in the next sections.
We then perform sparse insertion on the two-phase latch-based pipeline. We replace some of the latches with error detecting ones based on an algorithm with several user-defined constraints, as shown in Fig. 12 . The replacement process starts by finding the μ and σ of the critical path as we discussed in Section IV-E. In the test circuits, the longest path appears in the first half stage (the stage having transparent-low latches at the end), which has μ = 20-FO4 and σ = 1.4-FO4 delays (Table I) . Next, we find N TL,optimal based on T CLK and (2), which is found to be 6 at 0.35 V. We still explore several N TL values from one to six to verify the nonlinearity caused by the circuit structures. For a given N TL , the algorithm then finds the size of the required window (RW), which represents the required size of the error detection window for the given insertion sparseness. The algorithm then finds the pipeline latches that receive data from paths with a slack smaller than the RW. For example, for the latch-based design with N TL = 1, the RW is found to be 9.2-FO4 delays (23% of T CLK ). The latches that receive data from paths whose delays are longer than 10.8 FO4 (i.e., 20 − 9.2) need to be replaced with error detecting latches.
We also perform the same error detector insertion process for the flop-based pipeline. However, sparseness is set to one because it supports no cycle borrowing. Similar to the two-phase latch-based design, we extract the critical path to determine μ and σ (which are 40-FO4 and 1.6-FO4 delays) at 0.35 V. Using μ and σ , the RW is calculated to be 24% of T CLK (9.6 FO4). We find the flops that receive data from paths having delays longer than 30.4 FO4 (i.e., 40 − 9.6), and replace those with error detecting ones. Table I summarizes the results of the insertion process. For the two-phase latch pipeline, N TL,optimal is found to be six, which is in fact similar to the estimation using inverter chains in Section IV. At N TL = 6, the total number of error detecting registers is only 32, whereas at N TL = 1, as in the conventional two-phase latch-based EDAC techniques [3] , 138 out of the 234 latches are replaced with error detecting registers according to the algorithm. A notable observation is that the number of error detecting registers for N TL = 3 is larger than that for N TL = 2. This is because the stage width in the middle of logic circuits is typically wider than that in the input or output parts of the logic circuits. In addition, the paths ending in the transparent-low latch stage were more critical than those ending in the transparent-high latch stage. This implies that there is an additional overhead-reducing opportunity in placing an error detecting stage at a location of the pipeline where the bit width is small.
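The replacement rule used for both the two-phase latch-based and flop-based pipelines can be sketched as follows. This is a minimal illustration rather than the algorithm of Fig. 12: the register names and arrival delays are hypothetical, and the RW values are taken as given from the text instead of being derived from μ and σ.

```python
import math

def select_error_detecting_registers(worst_arrival_fo4, stage_delay_fo4, rw_fo4):
    """Return registers whose worst incoming path leaves slack smaller than RW.

    worst_arrival_fo4: dict from register name to the delay (in FO4) of the
    longest path ending at that register.
    """
    threshold = stage_delay_fo4 - rw_fo4
    return sorted(name for name, delay in worst_arrival_fo4.items()
                  if delay > threshold)

# Thresholds quoted in the text.
latch_threshold = 20.0 - 9.2   # two-phase latch design with N TL = 1
flop_threshold = 40.0 - 9.6    # flop-based design

# RW expressed as a percentage of T CLK = 40 FO4.
latch_rw_percent = round(9.2 / 40.0 * 100)
flop_rw_percent = round(9.6 / 40.0 * 100)

# Hypothetical worst-case arrival delays for four latches in one half stage.
arrivals = {"l0": 8.0, "l1": 12.5, "l2": 19.9, "l3": 10.0}
selected = select_error_detecting_registers(arrivals, 20.0, 9.2)
print(selected, latch_rw_percent, flop_rw_percent)
```

Only the registers fed by paths longer than the threshold are replaced, so a larger sparseness changes the result only through the RW passed to the function.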
B. Pulsed-Latch-Based Sequencing
In order to obtain the pulsed-latch-based pipeline, we first replace the flops in the six-stage flop-based pipeline with pulsed latches [ Fig. 11(d) ]. There is no need to divide each pipeline stage into two stages. However, additional short-path padding is required. We use a target T pw of 13-FO4 delays, which is about 1/3 of T CLK of the test circuits. Note that T pw has a strong impact on N PL . We pad the short paths using industrial CAD tools and custom scripts with a multi-V t cell library. Fig. 13 shows the path delay distribution before and after the short-path padding. After padding, the shortest path delay is 16-FO4 delays, which is 11-FO4 delays longer than before padding, while the critical path delays remain the same. The area overhead of this padding is 6.6%. For comparison, if we pad the short paths using a single-V t library, the area overhead will be 30.9%, which is 4.7× larger.
We then perform sparse insertion on the pulsed-latch pipeline. Fig. 12 shows the algorithm for insertion, which is similar to that for two-phase latch pipelines. The main difference is the step to set N PL . Since we use a T pw of 34% of T CLK , N PL is two. We could choose a T pw of 42% of T CLK to obtain an N PL of three. However, this would cause more area overhead (see Fig. 3 ), which may offset the savings from the larger N PL . The following steps are to find the RW and insert error detecting pulsed latches. The benchmark pipeline exhibits a stage delay of μ = 40-FO4 and σ = 1.6-FO4 delays. After the process, we find that the total number of error detecting latches is 48. The results of the insertion process are summarized in Table I .
C. Comparisons of Error Detection Techniques
Finally, we compare six design techniques by applying them to the same benchmark pipeline circuits: 1) no error detection technique; 2) the conventional error detection technique based on flop-based sequencing [2] ; 3) the conventional error detection technique based on two-phase latch-based sequencing [3] ; 4) the conventional pulsed-latch-based error detection technique (N PL = 1); 5) the proposed sparse insertion technique based on two-phase latches (N TL = 6); and 6) the proposed sparse insertion technique based on pulsed latches (N PL = 2). The area and error detecting register count are investigated. For 2), we use the well-known error detecting register circuits having a main flip-flop, a shadow latch, an XOR gate, and a meta-stability detector [2] . For techniques 3)-6), we use an error detecting register that has a main latch, a shadow latch with an opposite phase, and an XOR gate [3] . For techniques 4) and 6), we use the multi-V t cell library for short-path padding.
The comparison results of these six design techniques for combinational logic area, sequential logic area, and total number of error detectors are summarized in Fig. 14 . Fig. 14(a) shows that technique 2) can incur more than 72% area overhead in combinational logic due to the excessive short-path padding requirement. Technique 3) uses two-phase latch-based sequencing and incurs little increase in logic area because no short-path padding is necessary. The sequential logic area of technique 2) is increased by 1.8× [as shown in Fig. 14(b) ] as 42 out of 96 flip-flops (44%) are replaced with error detecting registers. The sequential logic area and the total area of technique 3) are increased by 4.1× and 52%, respectively, compared with the design without error detection capability, i.e., technique 1).
The total area of the proposed technique 5) is reduced by 26% and 15% over conventional techniques 2) and 3), respectively. Compared with the baseline design having no error detection capability, the total area overhead is 27.5%. In addition, as shown in Fig. 14(c) , technique 5) significantly reduces the number of error detecting registers by 1.3× and 4.3× compared with techniques 2) and 3), respectively.
The total area of the proposed technique 6) is reduced by 33% and 23% over conventional techniques 2) and 3), respectively. Compared with the pulsed-latch-based technique 4) with N PL = 1, the proposed technique 6) reduces the sequential logic area by 29.7% and the total area by 9.3%. The proposed technique 6) reduces the sequential logic area by 41% and the total area by 10% compared with technique 5). On the other hand, unlike technique 6), technique 5) has no need to perform short-path padding. Compared with the baseline design with no error detection capability, the total area overhead of technique 6) is only 15%.
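The relative total-area reductions quoted above can be cross-checked from the overheads reported against the baseline with no error detection. The sketch below is only a consistency check under two assumptions: the 72% figure for the conventional flop-based EDAC pipeline is taken to refer to total area, and all inputs are the rounded percentages quoted in this paper, so the results can differ from the reported reductions by about one percentage point.

```python
def relative_reduction(overhead_new, overhead_ref):
    """Fractional total-area reduction of one design relative to another.

    Both arguments are area overheads relative to the same baseline,
    e.g. 0.15 for a design that is 15% larger than the baseline.
    """
    return 1.0 - (1.0 + overhead_new) / (1.0 + overhead_ref)

# Rounded total-area overheads versus the baseline without error detection.
TOTAL_AREA_OVERHEAD = {
    "flop_conventional": 0.72,       # assumed to be a total-area figure
    "two_phase_conventional": 0.52,
    "two_phase_sparse": 0.275,
    "pulsed_sparse": 0.15,
}

def reduction(new, ref):
    return relative_reduction(TOTAL_AREA_OVERHEAD[new], TOTAL_AREA_OVERHEAD[ref])

for new, ref in [
    ("two_phase_sparse", "flop_conventional"),
    ("two_phase_sparse", "two_phase_conventional"),
    ("pulsed_sparse", "flop_conventional"),
    ("pulsed_sparse", "two_phase_conventional"),
    ("pulsed_sparse", "two_phase_sparse"),
]:
    print(f"{new} vs {ref}: {100 * reduction(new, ref):.1f}% smaller")
```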
In addition to reducing the area overhead, the proposed techniques can significantly reduce the error rate, since many of the potential timing violations (i.e., delay surpluses induced by variations) can disappear as signals propagate through noncritical paths across multiple stages. Fig. 15(a) shows the timing violation rates of the conventional error detection technique 3) and the proposed technique 5) across different temperatures. Fig. 15(b) shows the number of timing violations of technique 3) and technique 5) across different V DD s. The conventional flop-based design 2) is excluded, as late arriving data cannot be propagated correctly to the next stage without considering correction schemes. The timing violation rates are simulated by running the pipeline circuits with 300 random vectors at the fixed T CLK under 10 °C temperature variations and −30% V DD variations. In the proposed design, a large fraction of delay increases are absorbed via cycle borrowing before they cause timing errors in the detection stage. In contrast, the conventional design exhibits a timing violation rate that is up to 37× higher across temperature variations and up to 7× higher across V DD variations, since any delay increase in a stage contributes to timing errors. The smaller timing violation rate is critical to reducing the energy and throughput penalty associated with correction processes.
Moreover, the proposed techniques will detect all the timing errors and will not incur any accuracy loss in terms of computation results (e.g., multiplication results). This is because we determine the sparseness under the worst case assumption that no self-healing effects occur. For those detected errors, existing error-correction techniques such as instruction replay as well as our own correction technique based on local voltage boosting [20] can be used.
Finally, although this paper focuses on small-scale pipeline circuits, i.e., multipliers, we have successfully used some of our techniques in designing small- and medium-scale microcontrollers and neural-network accelerators [20] , [21] . We will further explore the applicability of the proposed techniques to larger-scale complex digital systems. The major challenge would be the higher analysis cost of finding the optimal locations for error detectors.
VI. CONCLUSION
In this paper, we analyze the area overhead of applying EDAC to three different pipelines, based on flip-flops, two-phase latches, and pulsed latches, respectively. To reduce the area overhead, we propose a sparse error detecting register insertion technique based on the cycle-borrowing ability of two-phase latch and pulsed-latch pipelines. We also propose a multi-V t -based short-path padding methodology for mitigating the overhead of delay buffers in pulsed-latch-based designs. Experiments with six-stage benchmark circuits show that the proposed two-phase latch EDAC pipeline achieves an area overhead of 27.5% and the proposed pulsed-latch EDAC pipeline achieves an area overhead of 15%, compared with the flop-based baseline having no EDAC ability. These are significantly smaller than the area overhead that the conventional flip-flop EDAC pipeline can cause, which is found to be 72% in our experiment. These results make the proposed pipelines attractive for mitigating variability in near- and sub-V t digital circuits.
