Abstract-Aggressive technology scaling dramatically impacts the parametric yield, life span, and reliability of circuits fabricated in advanced nanometric nodes. These issues may become showstoppers when scaling deeper into the sub-10-nm domain. To mitigate them, various approaches have been proposed, including increasing guard bands, fault-tolerant design, and canary circuits. Each of them is subject to several of the following drawbacks: large area, power, or performance penalty; false positives; false negatives; and insufficient coverage of the failures encountered in the deep nanometric domain. This paper presents various double-sampling architectures, which mitigate all these failures at low area and performance penalties and also enable significant power reduction.
I. INTRODUCTION
AGGRESSIVE technology scaling has a dramatic impact on: (1) process, voltage, and temperature (PVT) variations; (2) circuit aging and wearout induced by failure mechanisms such as NBTI and HCI; (3) clock skews; (4) sensitivity to EMI (e.g., cross-talk and ground bounce); (5) sensitivity to radiation-induced single-event effects (SEUs, SETs); and (6) power dissipation and thermal constraints. The resulting high defect levels, heterogeneous behavior of identical circuit nodes, circuit degradation over time, and integrated-circuit complexity adversely affect fabrication yield and reliability.
Sensors monitoring the electrical characteristics of transistors (like ION sensors), as well as replica-based canary circuits mimicking critical-path delays of the functional circuit [1]-[4], can be used to detect circuit degradation induced by aging or timing degradation induced by PVT variations, and to activate the regulation of circuit operating parameters (like clock frequency, voltage, or body bias) in response to this detection. However, these approaches cannot address certain failure mechanisms like single-event effects and EMI. Furthermore, these approaches monitor dedicated test structures distributed over the die, which are not part of the operating circuit. Thus, as performance degradation induced by aging is a function of various design and operation parameters, the test structures may age differently from the transistors of the operating circuit. Also, the random sources of process variations may affect test structures and operating circuits differently, and this is also true for voltage and temperature variations. Thus, monitoring the electrical parameters of dedicated test structures may result in false positives (i.e., the monitors indicate circuit degradation while this may not be the case for the operating circuit), or false negatives (i.e., circuit degradation has reached the threshold of failure but this is not detected by the monitors). Thus, monitoring the operating circuit itself is appropriate.
The author is with TIMA Laboratory, CNRS, UJF, Grenoble INP, 38000 Grenoble, France (e-mail: michael.nicolaidis@imag.fr).
Digital Object Identifier 10.1109/TDMR.2014.2388358
Monitoring the electrical characteristics of each individual transistor of a circuit would result in very large area and power penalties. The alternative solution is to check the impact of the failure mechanisms on the operation of the functional circuit concurrently with the execution of the application (concurrent error detection). Traditionally, this is done by the so-called DMR (double modular redundancy) scheme, which duplicates the operating circuit and compares the outputs of the two copies. However, its area and power penalties exceed 100% and are unacceptable for a large majority of applications. Furthermore, DMR relies on the assumption that only one circuit copy is faulty, which may not be valid for failures induced by variability and aging mechanisms. In particular, aging will have a similar impact on two identical circuits performing identical operations.
Thus, there is a need for new low-cost error detecting schemes. This goal was accomplished by the double-sampling scheme introduced in [5], [6]. Instead of using hardware duplication, this scheme observes the outputs of the pipeline stages at two different instants. Thus, it allows detecting temporary faults (timing faults, transients, upsets) at very low cost. In a recent paper [17] we presented a survey of this scheme, including its basic architecture and its various uses, its implementations and evaluations by industry leaders, as well as several improved implementations enabling drastic area and power reduction and/or fault-coverage increase. The present paper is an extension of [17], in which we: analyze in detail the fault coverage achieved by the various architectures; derive the durations of detectable faults and the opportunity windows of undetectable ones; present circuit modifications for maximizing the fault coverage; and establish the fundamental constraints that have to be satisfied by the new low-cost architectures.
II. DOUBLE-SAMPLING ARCHITECTURE AND APPLICATIONS

A. Basic Architecture
As discussed earlier, the traditional concurrent error detection scheme, DMR, duplicates the circuit and induces unacceptable area and power penalties. In addition, its basic assumption (only one of the two circuit copies is faulty) is not compatible with certain aging mechanisms. To cope with these drawbacks, the concurrent error detection scheme proposed in [5], [6] avoids hardware replication. Instead, it detects erroneous values by observing the output signals of each pipeline stage at two different instants (double-sampling). This is done, as shown in Fig. 1(a), by:
- Adding a redundant sampling element to each output of the combinational logic.
- Driving the redundant sampling element by means of a delayed clock signal. That is, the regular flip-flop (Output FF) is clocked by clock signal Ck and the redundant sampling element is clocked by clock signal Ck + δ, which represents the signal Ck delayed by a delay δ.
- Using a comparator to check the state of the regular flip-flops against the state of the redundant sampling elements. Though this comparator checks a plurality of pairs of regular flip-flops and redundant sampling elements, for the sake of simplicity Fig. 1(a) and the subsequent figures show just one of these pairs.
- Capturing the output of the comparator by an error detection flip-flop (FF-E in Fig. 1(a)), using the rising edge of the clock as latching event, and providing on its output the global error detection signal GE.
Fig. 1. a) The double-sampling scheme [5], [6]. b) The same scheme extended to perform error correction [8], [9].
The delayed clock signal Ck + δ can be generated by adding delay elements on Ck. These elements have to be added locally in the regular and redundant sampling elements, as adding them on the global clock signal requires implementing two separate clock trees, which will increase area and power penalties and will also induce clock skews between the clock signals delivered by these trees. Another option proposed in [5] , [6] consists in using the rising edge of the clock as latching event for the regular flip-flop and its falling edge as latching event for the redundant sampling element. In this case, δ will be equal to the duration of the high level of the clock. The latter option is more advantageous as it eliminates the large number of delay elements required by the former option.
Short Path Constraint:
The double-sampling architecture shown in Fig. 1 is a time redundancy scheme, which enables checking the data captured by the regular flip-flop at the rising edge of the clock against the data captured by the redundant sampling element at an instant occurring at a time δ after this rising edge. Interestingly, this time redundancy does not induce a speed penalty. This is achieved by enforcing the short-path constraint $\delta < D_{min} - t_{hold}$, where $t_{hold}$ is the hold time of the redundant sampling element and $D_{min}$ is the minimum delay in any pipeline stage.
If this constraint is not enforced in a pipeline stage, then, the new data captured by the regular flip-flops of the preceding pipeline stage at the rising edge of the clock, can be propagated through a short path and reach the input of the redundant sampling element before the end of its hold time. On the other hand, if this constraint is satisfied, then, the input of each redundant sampling element is not affected until the end of the hold time, avoiding hold time violations.
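To make the short-path constraint concrete, the following Python sketch checks a set of path delays against δ + t_hold and reports how much delay each violating path lacks. It is only an illustration under stated assumptions: the function names, the path labels, and the picosecond values are hypothetical and do not come from the paper.

```python
def max_delta(path_delays_ps, t_hold_ps):
    """Exclusive upper bound on delta allowed by delta < D_min - t_hold."""
    return min(path_delays_ps.values()) - t_hold_ps


def short_path_shortfall(path_delays_ps, delta_ps, t_hold_ps):
    """Map each violating path to the delay it lacks to reach delta + t_hold.

    A path satisfies the constraint only if its delay is strictly larger
    than delta + t_hold, so the padding added must exceed the shortfall.
    """
    required = delta_ps + t_hold_ps
    return {
        name: required - delay
        for name, delay in path_delays_ps.items()
        if delay <= required
    }


# Hypothetical path delays of one pipeline stage, in picoseconds.
paths = {"p0": 120, "p1": 300, "p2": 450}
print(max_delta(paths, t_hold_ps=30))
print(short_path_shortfall(paths, delta_ps=250, t_hold_ps=30))
```

In this example only the path labelled p0 would need buffer padding, which mirrors the cost discussed later for enforcing the constraint.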
Concerning the Detection Efficiency of This Scheme:
- As the redundant sampling element captures its input at a time δ after the regular flip-flop, delay faults of duration lower than δ will affect only the value captured by the regular flip-flop and will be detected.
- If the clock of the regular flip-flop and the redundant sampling element is advanced by δ with respect to the input flip-flops of the pipeline stage, an error can occur in the regular flip-flop due to a setup-time violation. However, the redundant sampling element will not be affected, as it captures its input at a time δ after the regular flip-flop. Thus, the error will be detected. Hence, the error detection capability is also good for clock skews.
The double-sampling scheme enables error detection. To also perform error correction, the double-sampling scheme can be combined with a retry scheme, which upon error detection repeats the latest operation [5], [6]. The double-sampling scheme and the retry mechanism can be employed in various manners to achieve various goals.
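The detection behavior for delay faults described above can be sketched with a small idealized timing model. The sketch below assumes that setup time, hold time, and metastability are ignored, that the stale value differs from the correct one, and that all names and numbers are hypothetical.

```python
def classify_delay_fault(arrival_ps, t_ck_ps, delta_ps):
    """Classify a late-arriving value in an idealized double-sampling stage.

    arrival_ps: instant at which the correct value reaches the stage output,
                measured from the launching clock edge.
    The regular flip-flop samples at t_ck_ps, the redundant element at
    t_ck_ps + delta_ps.
    """
    extra = arrival_ps - t_ck_ps  # extra delay beyond the clock period
    if extra <= 0:
        return "no error"         # both elements capture the correct value
    if extra < delta_ps:
        return "detected"         # only the regular flip-flop is wrong
    return "not guaranteed"       # both elements may capture the stale value


for arrival in (900, 1100, 1300):
    print(arrival, classify_delay_fault(arrival, t_ck_ps=1000, delta_ps=200))
```

The third case illustrates why the value of δ bounds the duration of detectable faults.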
B. Correction of Soft Errors and Timing Faults
As soft errors (SEUs and SETs) are not repetitive, repeating the latest operation will correct them. However, if the target faults also include timing faults, a different approach is needed. Timing faults will be reactivated and reproduce the error during retry if the repeated operation comprises more than one clock cycle [5] . Thus, [5] , [6] propose the following correction procedure for timing faults:
- Upon error detection, the latest operation is repeated (operation retry) to correct the error.
- To avoid the occurrence of the same error, the clock frequency is reduced during retry (e.g., divided by 2 by blocking the high clock level in one of every two clock cycles).
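The two-step procedure above can be sketched as follows. This is a simplified behavioral model, not the mechanism of [5], [6]: an operation is reduced to a single path delay, the retry simply doubles the clock period, and every name and value is a hypothetical choice made for the example.

```python
def run_with_retry(path_delay_ps, t_ck_ps, delta_ps, slowdown=2):
    """Return (clock period used to complete the operation, retried?)."""
    extra = path_delay_ps - t_ck_ps
    if extra <= 0:
        return t_ck_ps, False  # no timing error at nominal frequency
    if extra >= delta_ps:
        # Outside the window covered by the second sampling instant.
        raise ValueError("delay fault outside the detectable window")
    # Error detected: repeat the operation with the clock frequency reduced.
    slow_period = t_ck_ps * slowdown
    if path_delay_ps > slow_period:
        raise RuntimeError("fault still active at the reduced frequency")
    return slow_period, True


print(run_with_retry(900, t_ck_ps=1000, delta_ps=200))
print(run_with_retry(1100, t_ck_ps=1000, delta_ps=200))
```

Because the retry runs with a longer period, the timing fault that caused the first error is not reactivated in this model.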
C. Reliability Improvement
References [5] , [6] propose as a principal application of the double-sampling scheme the improvement of reliability against both soft errors (SEUs, SETs) and timing faults. Then, as PVT variations, NBTI, HCI, cross-talk and ground bounce affect the circuit delays, the double-sampling scheme can cover all the five failure modes listed in the beginning of Section I, by selecting an adequate value for δ.
D. Self-Calibration
According to [5], the double-sampling scheme can be used:
- In the field: for adapting the clock frequency of the circuit to the degradation of the circuit delays (due to fabrication-induced timing faults not detected by the fabrication tests, and new timing faults occurring in the field). This is done by reducing the clock frequency when the errors detected by the double-sampling scheme become very frequent [5]. It provides a self-calibration scheme, which is becoming increasingly useful since circuit degradation over time is worsening as we pursue deep nanometric scaling.
- After fabrication: for adapting the operating frequency of the circuit to the actual circuit delays. This is done by using the double-sampling scheme to detect timing errors activated by the fabrication test [5], and adapting the clock frequency accordingly. This becomes increasingly useful in a context where the variations of process parameters increasingly affect parametric yield.
Fig. 2. a) Alternative double-sampling scheme [5]. b) The same scheme used for timing-fault prediction [11].
E. Speed Increase
Another application of the double-sampling scheme proposed in [5] consists in: increasing the operating frequency of a circuit by using a clock period shorter than the one allowed by the delays of the longest circuit paths; exploiting the double-sampling scheme for detecting infrequent errors when one of the longest paths is activated; and correcting them by means of a retry mechanism.
F. Power Reduction
A very important use of the double-sampling scheme was introduced later [8] , [9] , consisting in reducing power dissipation by aggressively reducing the supply voltage. This is possible because:
i. Reducing the voltage increases the circuit delays.
ii. As proposed in [5] (and reported above in Section II-E), the double-sampling scheme together with an error correction mechanism can be exploited to operate the circuit at a clock period shorter than the one allowed by the delays of the circuit.
To achieve power reduction, [8], [9] combine the double-sampling scheme with an error correction scheme, which modifies the double-sampling circuitry to add a local error correction mechanism. The principle of this mechanism is illustrated in Fig. 1(b) by means of a multiplexer, which upon error detection uses the content of the redundant sampling element (shadow latch in Fig. 1(b)) to replace the content of the regular flip-flop (Output FF). To reduce speed and power penalties, this multiplexer is inserted in the feedback loop of the master latch of the Output FF rather than in the input of this flip-flop. The detailed implementation can be found in [8], [9]. With this design, the error can be corrected locally by using the contents of the shadow latch, which is not affected by the timing error. However, this correction takes one extra clock cycle, breaking the temporal coherence with the other pipeline stages. Thus, temporal coherence is enforced [8], [9] by using clock gating to stall all pipeline stages for one clock cycle. However, as the delay of clock-gating propagation to all pipeline stages may not be compatible with very fast designs, an alternative technique using counterflow pipelining is also proposed [8], [9].
Using this implementation, up to 44% power dissipation reduction was achieved [10] by reducing the supply voltage to a subcritical level implying a mean rate of 1 error for every 10,000 clock cycles.
G. Failure Prediction
Reference [5] proposes a second implementation of the double-sampling scheme, shown in Fig. 2(a). This scheme adds the delay δ on the data input of the redundant sampling element rather than on its clock signal. Thanks to this delay, a second sampling instant is created that precedes by δ the sampling instant of the regular sampling element (Output FF). As this scheme increases by δ the delays experienced by the redundant sampling element, it requires increasing the clock cycle by a timing margin equal to δ [5], [7]. This is not the case for the schemes of Fig. 1, which do not affect the clock cycle. A very interesting application using this scheme was proposed later [11]. It employs the implementation of Fig. 2(b), where (similarly to the scheme of Fig. 2(a) proposed in [5]) a delay $T_g$ is added on the input D of the redundant FF. As explained above, adding the delay $T_g$ requires increasing the clock period by a time $T_g$. The circuit does not detect any anomaly as long as $D_{max} + t_{su} + T_g < T_{CK}$, where $D_{max}$ is the maximum circuit delay, $t_{su}$ is the flip-flop setup time, and $T_{CK}$ is the clock period. However, if, due to aging-induced circuit degradation, the delays of the circuit increase to an extent that violates the condition $D_{max} + t_{su} + T_g < T_{CK}$, the output of the XOR gate will go to "1". Thus, this implementation acts as an aging detector, and can be used to reduce the clock frequency for adapting the circuit to aging-induced degradations.
H. Failure Prediction Versus Error Detection
A significant advantage of the failure prediction scheme, with respect to the error detection scheme, is that it does not require implementing a retry mechanism. Indeed, considering that failure mechanisms are slow and circuit delays degrade gradually, condition $D_{max} + t_{su} + T_g < T_{CK}$ will cease to be satisfied sufficiently long before condition $D_{max} + t_{su} < T_{CK}$ ceases to be satisfied and the circuit starts producing errors. Thus, the aging sensor will produce detections before the circuit starts to fail. These detections will activate clock frequency reduction or supply voltage increase, to prevent the occurrence of system failures.
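The two timing conditions used by the failure prediction scheme can be compared directly. The sketch below is a minimal illustration of them; the function name, the status labels, and the picosecond values are assumptions introduced for the example.

```python
def aging_status(d_max_ps, t_su_ps, t_g_ps, t_ck_ps):
    """Classify a stage from the two conditions of the prediction scheme."""
    if d_max_ps + t_su_ps + t_g_ps < t_ck_ps:
        return "ok"          # guard band T_g intact, XOR output stays at 0
    if d_max_ps + t_su_ps < t_ck_ps:
        return "warning"     # guard band violated, circuit still correct
    return "failure"         # the functional flip-flop itself misses timing


# Hypothetical aging trajectory: D_max grows while the other terms stay fixed.
for d_max in (700, 880, 960):
    print(d_max, aging_status(d_max, t_su_ps=50, t_g_ps=100, t_ck_ps=1000))
```

The intermediate "warning" state is the interval in which the sensor fires before any error occurs, which is why no retry mechanism is needed.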
However, the failure prediction scheme does not have the potential for the high power savings (or, equivalently, speed increase) achieved by the error detection double-sampling scheme. Indeed, PVT variations and circuit aging induce circuit delay increases that may produce infrequent errors during circuit operation. Though the frequency of these errors can be quite low, they will result in unacceptable reliability. Thus, the supply voltage must be increased (or the clock frequency reduced) significantly to eliminate these errors. However, as the error detection double-sampling scheme together with the retry mechanism can be used to detect and correct these infrequent errors, we can avoid increasing the supply voltage and achieve drastic power reduction (or avoid reducing the clock frequency and achieve a speed increase). For instance, [14] demonstrates 52% energy savings at 1 GHz operation for an ARM ISA processor.
On the other hand, the failure prediction scheme detects the violations of the guard band, but does not detect the errors produced when the delay of a path exceeds the clock period. Thus, it should be operated at a voltage higher than the Point of First Failure. Hence, the failure prediction scheme cannot achieve the large power savings (or speed increase) achieved by the error detection scheme. As power reduction is the most stringent requirement for the upcoming process nodes, the error detection scheme is best suited for supporting deep nanometric scaling. Thus, the subsequent sections concentrate on its limitations and improvements.
III. LIMITATIONS
The basic advantages of the double-sampling scheme are:
- The avoidance of massive redundancy, which drastically reduces the area and power penalties with respect to the traditional error detecting schemes.
- Its capability to detect the timing faults whose duration does not exceed δ, whatever their multiplicity, while the traditional (DMR) error detection scheme may be inefficient against multiple timing faults, as these faults invalidate the assumption that only one of the two circuit copies is faulty.
Though the double-sampling schemes presented in Section II are much more efficient than the traditional error detection schemes, they have several limitations, which are presented in the rest of the paper together with new implementations allowing their improvement.
A. Metastability
As noted in [8], [9], the double-sampling scheme can be affected by metastability. This may happen if the timing fault is such that the setup or hold time constraint of the regular sampling element (Output FF in Fig. 1(a)) is violated. Then, the voltage at the output of the regular sampling element can be at a level between logic 0 and logic 1. Thus, there is a nonzero probability that this voltage level is interpreted as the incorrect value by the next pipeline stage and as the correct value by the comparator, resulting in an undetected error. To cope with this issue, we may need to add a metastability detector on the output of the Output FF, which will add extra area and power penalties. However, as the probability of metastability occurrence is very low, the reduction of the error detection efficiency of the double-sampling scheme will be very low too. The probability that the flip-flop output is interpreted as the incorrect value by the next pipeline stage and as the correct value by the comparator can be further reduced at low cost by adding at the output of the flip-flop an inverter whose logic threshold lies outside the metastability voltage region of the flip-flop output. Thus, the double-sampling scheme will offer very high reliability improvements even if we do not use metastability detectors.
Nevertheless, if the double-sampling scheme is used for aggressively reducing the supply voltage (for power reduction purposes), as proposed in [8], [9], the frequency of timing errors could deliberately become very high (e.g., one error every 10,000 cycles [10]). Then, to maintain a high reliability level under such a high error occurrence frequency, the detection efficiency of the double-sampling scheme should be extremely high. To cope with this issue, references [8], [9] use metastability detectors, which induce area and power penalties. To avoid using these detectors, a metastability-free double-sampling implementation is proposed in [13].
This implementation (referred to as double-sampling time-borrowing, DSTB) exploits the fact that, as explained in Section III-C, the delays of all circuit paths in the double-sampling scheme of Fig. 1 exceed the time duration of the high level of the clock. Thus, the data on the combinational block outputs are stable during the high level of the clock. Thanks to this property, the output flip-flop of Fig. 1 was replaced by a latch, as shown in Fig. 3. This latch is transparent during the high level of the clock (as allowed by the fact that its input is stable during this level). Thus, at the rising edge of the clock, it propagates to its output the value present on its input, enabling the next pipeline stage to start computation at this instant. Furthermore, for any timing fault whose extra delay does not exceed the time duration of the high level of the clock, there is no risk of metastability on the latch output, as the timing fault will disappear before the latching event of the latch (i.e., the falling edge of the clock). Metastability is, however, possible on the error detection path, but the probability that an erroneous value goes undetected due to this problem is very low [13].
Finally, concerning error correction, the authors do not adopt the approach of [8], [9], which exploits the state of the redundant sampling element. Instead, they use instruction replay combined with clock frequency reduction, in line with the retry-at-reduced-clock-frequency approach proposed in [5], [6].
B. Miscorrections
The double-sampling scheme of Fig. 1(b) (coined Razor [8], [9]) produces miscorrections in the case of soft errors (SEUs and SETs) [12]. This is because these faults may alter without distinction the Output FF and the shadow latch. Then, if an SEU or an SET affects the shadow latch of Fig. 1(b), this error will be detected and the erroneous content of this latch will be used to replace the correct content of the Output FF. A similar situation can occur for certain clock skews. On the other hand, this situation will not occur for delay faults. Hence, the double-sampling scheme of Fig. 1(b) is suitable for delay faults but not for soft errors and clock skews. Due to these problems, and also to relax the timing constraints on the error-recovery path, which may be unacceptable in high-performance processors, Razor II [14] abandons the approach of Fig. 1(b) (using the state of the shadow latch for error recovery). Instead, instruction replay combined with clock frequency reduction is adopted, in line with the retry-at-reduced-clock-frequency approach proposed in [5], [6]. Note also that, similarly to [13], Razor II reduces the metastability risk by replacing the Output FF by a transparent latch.
C. Short-Path Constraint and Redundant Sampling Elements Related Costs
In Fig. 1(a), new data are captured by the Output FF and supplied to the subsequent pipeline stage at the rising edge of the clock. Thus, as noted in [5], [6], if the delay of some circuit path is shorter than $\delta + t_{hold}$ (where $t_{hold}$ is the hold time of the redundant sampling element), we have a hold time violation resulting in false positives. To avoid them, all path delays in the pipeline stages must be larger than $\delta + t_{hold}$ (short-path constraint) [5], [6], leading to the following limitations:
i. Enforcing the short-path constraint requires adding buffers to increase the delay of certain paths [5], [6], resulting in additional area and power penalties. Also, as process variations are worsening, an increasing cost is required to enforce the short-path constraint for situations in which the short-path delays are reduced due to process variations.
ii. Also, due to the cost required for enforcing the short-path constraint, we are obliged to use moderate values for the delay δ, limiting the duration of detectable faults.
These problems affect the three double-sampling schemes shown in Figs. 1(a), 1(b) , and 3.
Another source of area and power penalties in the double-sampling schemes of Figs. 1(a), 1(b), and 3 is the use of redundant sampling elements. Although this redundancy is drastically lower than in the case of the conventional DMR-based error detection, it may still be undesirable in certain applications (e.g., in low-power applications, as sampling elements are power hungry).
IV. GRAAL ARCHITECTURE
The reason why the schemes in Figs. 1(a), 1(b), and 3 cannot cope with large delay faults and also require a redundant sampling element is that all the pipeline stages of a flip-flop-based design compute their new values at the same time, and as soon as the data are captured by the flip-flops a new computation cycle starts. The consequence is that the outputs of the combinational logic are stable for a short duration, leaving little time that we can exploit for time-redundancy-based error checking.
A flip-flop-based design is illustrated in Fig. 4(a). In flip-flop-based designs, when the master part of a flip-flop is transparent the slave part is in memorization, and vice versa. Thus, if we transform a flip-flop-based design (Fig. 4(a)) into its equivalent latch-based design (Fig. 4(b)) by moving the slave latches from the outputs of the master latches to the middle of the combinational circuits (where Ck and Ckb are replaced by non-overlapping clocks Φ1 and Φ2), we obtain a latch-based design in which each combinational logic CCi of the original flip-flop-based design is split into two stages, each having roughly half the delay of CCi. Therefore, the time separating the rising edge of Φ1 from the rising edge of Φ2 (and vice versa) can be equal to the half period $T_{Ck}/2$ of Ck, and we can operate the latch-based design of Fig. 4(b) at the same frequency as the original flip-flop-based design of Fig. 4(a): at the rising edge of Φ2, data are ready on the outputs of the first half of CCi, and as latches S become transparent at this instant, these data are applied to the inputs of the second half of CCi. Thus, as the delay of this second half does not exceed the half clock cycle, its outputs are ready at the rising edge of Φ1. A similar scenario holds for the following pair of half-stages.
The double-sampling implementation for such designs is illustrated in Fig. 5 (GRAAL architecture [12]). In this figure, odd latch-stages (L1, L3, . . .) capture the outputs of odd combinational circuit stages (CC1, CC3, . . .) and are clocked by Φ1; even latch-stages (L2, . . .) capture the outputs of even combinational circuit stages (CC2, . . .) and are clocked by Φ2. Furthermore, each latch-stage is blocked during the low level of its clock and is transparent during the high level of its clock. This implies that the inputs of even latch-stages are guaranteed to be stable until the end of the low level of Φ1, and the inputs of odd latch-stages are guaranteed to be stable until the end of the low level of Φ2. Thus, we have plenty of time for comparing the inputs of the latches against their outputs, to detect faults of large duration without adding redundant sampling elements. Hence, the only cost for implementing the double-sampling scheme is the cost of the comparators (one XOR gate per latch, plus two OR trees, one compacting the outputs of the XOR gates of odd latch-stages into a single error-detection signal, and a second one serving the same purpose for the even latch-stages). Two flip-flops are also used, capturing the error signals generated by the two OR trees (error-detection flip-flops, not shown in the figure).
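The comparator structure just described (one XOR per latch feeding an OR tree) can be sketched behaviorally as follows. The gate count assumes OR trees built from two-input gates, which is an assumption of this example rather than something stated in the paper, and all function names are hypothetical.

```python
def error_flag(latch_inputs, latch_outputs):
    """OR of the per-latch XOR comparisons for one group of latch-stages."""
    if len(latch_inputs) != len(latch_outputs):
        raise ValueError("one input and one output bit are needed per latch")
    return any(d != q for d, q in zip(latch_inputs, latch_outputs))


def comparator_cost(n_odd_latches, n_even_latches):
    """Gate count, assuming OR trees made of two-input OR gates."""
    def or2_gates(n_inputs):
        return max(n_inputs - 1, 0)

    return {
        "xor": n_odd_latches + n_even_latches,
        "or2": or2_gates(n_odd_latches) + or2_gates(n_even_latches),
        "error_ffs": 2,  # one error-detection flip-flop per OR tree
    }


print(error_flag([1, 0, 1], [1, 0, 1]))   # latches hold what their inputs show
print(error_flag([1, 0, 1], [1, 1, 1]))   # second latch disagrees with its input
print(comparator_cost(4, 3))
```

No redundant sampling element appears in this model: the comparison is made between the input and the output of the functional latch itself.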
Thus, a first important advantage of the GRAAL architecture is that it does not use redundant sampling elements, reducing area and more drastically power penalty.
A second important advantage is that the above-mentioned stability of the latch inputs does not depend on short-path delays. Thus, we do not need to insert buffers in the combinational logic for enforcing the short-path constraint, which significantly reduces area and power penalties. In fact, let $t_{R2}$ be the instant of the rising edge of clock Φ2, $D_{OSP}$ be the delay of the shortest path in the odd pipeline stages, and $D_{OCMP}$ be the delay of the odd comparator. Then, as the values of even latch-stages are stable until the instant $t_{R2}$, the inputs of the odd latch-stages are guaranteed to be stable until the instant $t_{R2} + D_{OSP}$, and the input of the odd error-detection flip-flop will be stable until the instant $t_{R2} + D_{OSP} + D_{OCMP}$. Then, considering $t_{hold} < D_{OSP} + D_{OCMP}$ (which is easily the case even for $D_{OSP} = 0$ and very small OR trees), the input of the odd error-detection flip-flop will be stable even after $t_{R2} + t_{hold}$, regardless of the value of $D_{OSP}$. Thus, if we select the rising edge of Φ2 as the latching event of the odd error-detection flip-flops, we are not subject to short-path constraints.
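The stability argument above reduces to a single inequality, which the following sketch evaluates. The function name and the numeric values are illustrative assumptions; the margin is the time between the end of the hold window at t_R2 + t_hold and the earliest instant t_R2 + D_OSP + D_OCMP at which the flip-flop input may change.

```python
def odd_flag_hold_margin(d_osp_ps, d_ocmp_ps, t_hold_ps):
    """Hold margin of the odd error-detection flip-flop clocked at t_R2.

    A positive value means the input stays stable past the hold window,
    i.e. t_hold < D_OSP + D_OCMP is satisfied.
    """
    return d_osp_ps + d_ocmp_ps - t_hold_ps


# Even with a zero-delay shortest path, the comparator delay covers the hold time.
print(odd_flag_hold_margin(d_osp_ps=0, d_ocmp_ps=60, t_hold_ps=20))
```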
A third important advantage is that, as the duration of detectable faults is not limited by short-path constraints, we have ample time for detecting timing faults of large duration, and we can also freely increase the duration of the clock cycle to detect faults of any duration.
A. Fault-Coverage
To determine the duration of detectable delay faults, we first need to determine the actual instant of comparison of the latch inputs against their outputs. This instant is equal to the instant of the latching event of the error-detection flip-flop minus the delay separating this flip-flop from the compared signals. Thus, if we use the instant $t_{R2}$ of the rising edge of the clock Φ2 as the latching event of the odd error-detection flip-flop, then the actual comparison instant $t_{CO}$ for the odd latch-stages is $t_{CO} = t_{R2} - D_{OCMP}$, where $D_{OCMP}$ is the delay of the odd comparator. As we have seen earlier, the latch-based implementation can operate at roughly the same clock frequency as its equivalent flip-flop-based design, and at this clock frequency the outputs of odd combinational logic stages are ready at the rising edge $t_{R1}$ of Φ1. Thus, the duration of detectable delay faults is equal to $t_{CO} - t_{R1} = t_{R2} - D_{OCMP} - t_{R1} = T_{Ck}/2 - D_{OCMP}$.
Similarly, the duration of detectable delay faults for even stages is $T_{Ck}/2 - D_{ECMP}$. This represents a significant part of the maximum delay of the pipeline stages (which does not exceed $T_{Ck}/2$). This duration can be further increased by adding a delay $D_O = D_{OCMP} - t_{hold}$ on the clock terminal of the odd error-detection flip-flop. In this case, the latching instant of this flip-flop becomes $t_{R2} + D_{OCMP} - t_{hold}$, resulting in $t_{CO} = t_{R2} - t_{hold}$ and a duration of detectable faults equal to $t_{CO} - t_{R1} = t_{R2} - t_{hold} - t_{R1} = T_{Ck}/2 - t_{hold}$. This is possible without introducing short-path constraints, since we found earlier that the input of the odd error-detection flip-flop is stable until the instant $t_{R2} + D_{OSP} + D_{OCMP}$, which is not earlier than the end $t_{R2} + D_{OCMP}$ of the hold time of the delayed latching event.
Note also that, due to the time-borrowing property of latch-based designs, delay faults may not result in errors and error detections. Indeed, if under a delay fault the correct value on the input of a latch is not ready at the rising edge of its clock, but is ready before the falling edge of this clock, the latch will capture the correct value at the falling edge of the clock, and no error will be detected by the double-sampling scheme. However, in this case the beginning of the computation of the next pipeline stage will be delayed. If, due to this delay, some latch of this stage captures a wrong value, the error will be detected by the double-sampling circuitry of this stage. If not, the same situation will be passed on to the next pipeline stage, and so on, until either an error is captured by some latch and detected by the scheme, or no error ever occurs.
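The two expressions for the duration of detectable delay faults, T_Ck/2 minus the comparator delay and T_Ck/2 minus t_hold when the error-detection flip-flop clock is delayed, can be evaluated with a short sketch. The function name, the keyword argument, and the picosecond values are assumptions made for the example.

```python
def detectable_delay_ps(t_ck_ps, d_cmp_ps, t_hold_ps, delayed_flag_clock=False):
    """Duration of detectable delay faults for one group of latch-stages.

    Without the extra clock delay the comparison instant is t_R2 - D_CMP;
    delaying the error-detection flip-flop clock by D_CMP - t_hold moves it
    to t_R2 - t_hold.
    """
    half_period = t_ck_ps / 2
    if delayed_flag_clock:
        return half_period - t_hold_ps
    return half_period - d_cmp_ps


print(detectable_delay_ps(1000, d_cmp_ps=80, t_hold_ps=20))
print(detectable_delay_ps(1000, d_cmp_ps=80, t_hold_ps=20, delayed_flag_clock=True))
```

With these hypothetical numbers the delayed clock recovers the comparator delay minus the hold time, i.e. 60 ps of additional coverage.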
Also, a clock skew altering the relative positions of Φ1 and Φ2 is either tolerated (thanks to the time-borrowing properties of latch-based design), or detected if it induces an error.
Furthermore, we find that in odd pipeline stages any SET whose duration does not exceed $t_{CO} - t_{su} - t_{F1}$ is detected. With the approach proposed above for extending the duration of detectable faults we have $t_{CO} = t_{R2} - t_{hold}$, resulting in a duration of detectable SETs equal to $t_{R2} - t_{F1} - t_{su} - t_{hold} = T_{Ck}/4 - t_{su} - t_{hold}$ (considering $t_{R2} - t_{F1} = T_{Ck}/4$). A similar result holds for even latch-stages. This duration is ample for detecting SETs in ground applications. However, in applications where very large SETs can be encountered, it may not be sufficient. Then, as this architecture is not subject to short-path constraints, we can increase the clock period at will to make the duration $t_{R2} - t_{F1} - t_{su} - t_{hold}$ of detectable faults sufficient.
The detection capability for SEUs is also high. An SEU affecting an odd latch can escape detection only if the following two conditions are realized:
- The SEU occurs after the instant $t_{CO} - t_{su}$ (thus it escapes detection because the comparator checks the content of the latch before the occurrence of the SEU). With $t_{CO} = t_{R2} - t_{hold}$, this instant is $t_{R2} - t_{hold} - t_{su}$.
- The SEU occurs before the instant $t_{F2} - t_{su} - D_{MPP}$, where $D_{MPP}$ is the minimum delay of the sensitized paths connecting the affected latch with the subsequent stage of latches (thus the erroneous value will be captured by at least one of these latches, as it will reach it before the falling edge of Φ2 minus the setup time).

Thus, the opportunity window for the SEU to go undetected and create errors is $[t_{R2} - t_{hold} - t_{su},\ t_{F2} - t_{su} - D_{MPP}]$, and its duration is $t_{F2} - t_{R2} + t_{hold} - D_{MPP} = H_W + t_{hold} - D_{MPP}$. These conditions are constraining and make the occurrence probability of such events small. To further reduce this probability we can shorten this window by reducing the duration $H_W$ of the high level of the clock. To make this probability null in all circumstances, we need to make null the value $H_W + t_{hold} - D_{SP}$ (where $D_{SP}$ is the delay of the shortest path of any pipeline stage). If this cannot be achieved by reducing $H_W$ alone, we also need to insert buffers in the short paths of the circuit to increase $D_{SP}$.
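The timing figures quoted above for the latch-based scheme reduce to a few subtractions. The Python sketch below only illustrates them: the function names, the integer picosecond units, and the numeric values are assumptions made for the example, not part of the original design.

```python
def delay_fault_duration(t_ck, d_cmp):
    """Detectable delay-fault duration of a latch stage: T_Ck/2 - D_CMP."""
    return t_ck / 2 - d_cmp


def delay_fault_duration_extended(t_ck, t_hold):
    """Same duration when the clock of the error-detection flip-flop is
    delayed by D_CMP - t_hold: T_Ck/2 - t_hold."""
    return t_ck / 2 - t_hold


def set_duration_extended(t_ck, t_su, t_hold):
    """Detectable SET duration with t_CO = t_R2 - t_hold, taking
    t_R2 - t_F1 = T_Ck/4 as in the text: T_Ck/4 - t_su - t_hold."""
    return t_ck / 4 - t_su - t_hold


def seu_escape_window(h_w, t_hold, d_mpp):
    """Length of the window in which an SEU can escape detection and still
    corrupt the next latch stage: H_W + t_hold - D_MPP, never below zero."""
    return max(0, h_w + t_hold - d_mpp)


# Hypothetical numbers, in picoseconds.
print(delay_fault_duration(1000, 120))
print(delay_fault_duration_extended(1000, 20))
print(set_duration_extended(1000, 30, 20))
print(seu_escape_window(250, 20, 120))
```

Shrinking `h_w` until `seu_escape_window` returns zero corresponds to the first remedy mentioned above; when that is not feasible, the shortest path delay has to be raised with buffers instead.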
To conclude, the GRAAL architecture achieves high detection capability (especially for delay faults and clock skews) at low area cost and even lower power cost. These claims were validated by implementing this architecture in the 32-bit icyflex1 low-power DSP/MCU processor [15]. Nevertheless, although the GRAAL architecture offers many advantages in terms of area, power, and detection efficiency, the latch-based design style is less popular than flip-flop-based design. Hence, improved double-sampling architectures for flip-flop-based designs are desirable. Such architectures are presented next.
V. DOUBLE-SAMPLING ARCHITECTURE FOR LARGE FAULTS AND VARIOUS APPLICATION REQUIREMENTS
In the double-sampling architectures for flip-flop-based designs presented in Figs. 1 and 3, the duration of detectable delay faults is equal to the difference between the falling edge and the rising edge of the clock (i.e., the duration of the high level of the clock). Also, to avoid false error detections, these architectures require that the high level of the clock be shorter than the shortest circuit delay. Hence the duration of detectable faults is necessarily shorter than the delays of the shortest circuit paths. Increasing these delays to allow detecting faults of large duration would induce large area and power penalties.
To overcome this limitation we show that, if we modify the duty cycle of the clock in a dedicated manner, the circuit enters a different operating mode in which it is able to detect faults of large duration (like the large SETs encountered in space applications), or to perform early failure prediction instead of error detection. More precisely, we find that there are 3 duty cycle zones, each involving a different circuit behaviour:
Duty Cycle Zone 1: The duration Hw of the high level of the clock is shorter than the shortest circuit delay.
This duty cycle zone corresponds to the short-path constraint used in the existing double-sampling schemes of Figs. 1 and 3, for which the circuit operation is well known. To recall this operation before describing two other duty cycle zones, let us consider the new values captured by the regular flip-flops of Fig. 1 at the rising edge of a clock cycle i. These values become the new inputs of the combinational logic, which produces its new output values before the rising edge of clock cycle i + 1. Thus, the regular flip-flops capture these values at the rising edge of clock cycle i + 1. As Hw is shorter than the shortest path delays, these values are still present on the combinational logic outputs at the falling edge of clock cycle i + 1, and are captured by the redundant sampling elements.
Duty Cycle Zone 2: Hw is shorter than the largest circuit delay and larger than the shortest circuit delay.
Let us consider the new values captured by the regular flip-flops at the rising edge of any clock cycle i, which become the new inputs of the combinational logic. As Hw is shorter than the largest circuit delay, the combinational logic has no time to produce its new output values before the falling edge of clock cycle i. Also, as Hw is larger than the shortest circuit delay, at the falling edge of clock cycle i + 1 some outputs of the combinational logic can be affected by the values captured by the regular flip-flops at the rising edge of clock cycle i + 1. Thus, there is no clock cycle in which the redundant sampling elements are guaranteed to capture the correct values produced by the combinational logic. Hence, this duty cycle zone cannot be used for implementing double-sampling.
Duty Cycle Zone 3: Hw is larger than the largest circuit delay.
In this zone the circuit enters a new operating mode not considered in the known double-sampling implementations discussed in the previous sections. In fact, even if Hw does not obey the short-path constraint, no false error detections are produced. Indeed, at the rising edge of any clock cycle i new values are captured by the regular flip-flops and become the new inputs of the combinational logic. As Hw is larger than the largest circuit delay, the combinational logic will produce before the falling edge of the clock cycle i its output values corresponding to these inputs. Thus, at the falling edge of clock cycle i, the redundant sampling elements capture these output values (instead of clock cycle i + 1 in duty cycle zone 1). These output values are also captured by the regular flip-flops at the rising edge of clock cycle i + 1. Therefore, comparing the values captured by the redundant sampling elements at the falling edge of clock cycle i against the values captured by the regular flip-flop at the rising edge of clock cycle i + 1 (e.g., by using a comparator whose output is captured by an error detection flip-flop at the falling edge of the clock), will enable detecting faults of duration up to the duration Lw of the low level of the clock (which corresponds to the time difference between the falling edge of clock cycle i and the rising edge of the clock cycle i + 1). As in this mode there are no constraints concerning the duration of Lw, its duration can be adapted at will to detect faults of any duration. This zone is therefore important for covering large SETs in space applications. In these applications we also need high coverage of SEUs. According to the application requirements, the designer can decide to implement any of the above modes. For instance, if the application requires detecting faults of large duration, the circuit can be designed to operate in duty cycle zone 3. 
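The three duty cycle zones can be summarised as a simple classification of Hw against the shortest and largest circuit delays. The sketch below is illustrative only: the function names and the treatment of the boundary cases (Hw exactly equal to a circuit delay is conservatively placed in zone 2) are assumptions, as the text only states strict inequalities.

```python
def duty_cycle_zone(h_w, d_min, d_max):
    """Return the duty cycle zone (1, 2 or 3) for a high-level duration h_w,
    given the shortest (d_min) and largest (d_max) circuit delays."""
    if h_w < d_min:
        return 1          # classic double-sampling with short-path constraint
    if h_w > d_max:
        return 3          # redundant elements sample first, at the falling edge
    return 2              # unusable for double-sampling (boundaries included)


def max_detectable_fault(t_ck, h_w, d_min, d_max):
    """Largest detectable fault duration: Hw in zone 1, Lw = T_CK - Hw in
    zone 3, and None in zone 2 where double-sampling cannot be implemented."""
    zone = duty_cycle_zone(h_w, d_min, d_max)
    if zone == 1:
        return h_w
    if zone == 3:
        return t_ck - h_w
    return None


# Hypothetical delays in picoseconds.
for h_w, t_ck in ((100, 1000), (500, 1000), (950, 1500)):
    print(h_w, duty_cycle_zone(h_w, 150, 900),
          max_detectable_fault(t_ck, h_w, 150, 900))
```

In zone 3 the detectable duration grows with the clock period, which is why that zone suits applications facing large SETs.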
However, if a design should be adaptive to various application requirements (detection of faults of moderate duration, detection of faults of large duration, or failure prediction), we can design the circuit in a manner that the application can operate it adaptively in any of the following 3 modes, by controlling the duty cycle of the clock:
Mode 1: Hw is less than the shortest circuit delay.
Here the circuit is operated at its maximal frequency and detects faults of moderate duration (less than the shortest circuit delay).
Mode 2: Hw is larger than the largest circuit delay and Lw is larger than the target (large) fault duration.
Here error detection capabilities are traded against speed: the circuit is operated at a speed lower than permitted by its longest paths, but is able to detect faults of any duration by adapting the duration Lw of the low level of the clock to be larger than the target duration of detectable faults.
Mode 3: Hw is larger than the largest circuit delay and Lw is equal to a target timing-margin used for failure prediction.
Here the circuit is operated at a speed slightly lower than its maximum speed. This mode requires careful design as Lw can be quite small (e.g., less than 10% of clock period).
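As a rough illustration, the conditions defining the three modes can be checked for a candidate clock waveform as follows. The function name and arguments are hypothetical, and the requirement that the clock period exceed the largest circuit delay in mode 1 is made explicit here as an assumption drawn from ordinary long-path timing.

```python
def mode_settings_ok(mode, h_w, l_w, d_min, d_max, target=None):
    """Check a clock waveform (high level h_w, low level l_w) against the
    conditions stated for modes 1, 2 and 3.

    target is the target fault duration (mode 2) or the target timing
    margin used for failure prediction (mode 3)."""
    if mode == 1:
        # Hw below the shortest delay; the period must still cover d_max.
        return h_w < d_min and h_w + l_w > d_max
    if mode == 2:
        return target is not None and h_w > d_max and l_w > target
    if mode == 3:
        return target is not None and h_w > d_max and l_w == target
    raise ValueError("mode must be 1, 2 or 3")


# Hypothetical delays in picoseconds: d_min = 150, d_max = 900.
print(mode_settings_ok(1, 100, 900, 150, 900))
print(mode_settings_ok(2, 950, 2000, 150, 900, target=1500))
print(mode_settings_ok(3, 950, 100, 150, 900, target=100))
```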
As these modes are determined by controlling the duty cycle of the clock, no modification of the double-sampling architecture is needed for changing from one mode to another, except for the flip-flops capturing the error-detection signal(s). In fact, the outputs of the XOR gates comparing the contents of the regular and redundant sampling elements have to be compacted by a multi-input OR-tree to create a global error-detection signal (GE). If the number of sampling elements is very large, the OR-tree has to be pipelined, as shown in Fig. 6, where flip-flops capturing partial error detection signals PE have been inserted in the OR-tree.

In mode 1, the regular sampling elements first capture the values present on the outputs of the combinational logic at the rising edge of the clock; then, the redundant sampling elements capture the same values at the subsequent falling edge of the clock. Thus, the flip-flops used in the OR-tree will have to capture the values provided by this tree at the subsequent edge of the clock, which is its rising edge. On the other hand, in modes 2 and 3, the redundant sampling elements first capture the values present on the outputs of the combinational logic at the falling edge of the clock; then, the regular sampling elements capture the same values at the subsequent rising edge of the clock. Thus, the flip-flops used in the OR-tree will have to capture the values provided by this tree at the subsequent edge of the clock, which is its falling edge. As a consequence, in mode 1 the flip-flops of the OR-tree must use the rising edge of the clock as latching event, while in modes 2 and 3 they must use its falling edge. Thus, in Fig. 6, a multiplexer controlled by the Mode signal provides the required clock polarity on the clock signal MCK used in the flip-flops of the OR-tree.
Concerning the pipelined OR-tree, in modes 2 and 3 the delay of the first pipeline stage in Fig. 6 should not exceed the time separating the latching event of the flip-flops of the OR-tree from the latching event of the regular sampling elements. This time is equal to the duration of the high level of the clock, which in modes 2 and 3 is larger than the largest circuit delay. On the other hand, in mode 1 the delay of the first pipeline stage in Fig. 6 should not exceed the time separating the latching event of the flip-flops of the OR-tree from the latching event of the redundant sampling elements. This time is equal to the duration Lw of the low level of the clock used in mode 1. As in mode 1 this duration is shorter than the largest circuit delay, it will also be shorter than the high level of the clock used in modes 2 and 3. Thus, to accommodate all the above constraints, the first pipeline stage in Fig. 6 will be implemented to have a delay shorter than the shortest duration Lw used across all uses of mode 1.
If the enforcement of this constraint requires implementing the pipeline of Fig. 6 with more than one stage, we also have to determine the maximum delay that can be used in all but the first pipeline stage. As in each operating mode the flip-flops of all stages of this pipeline use the same edge of the clock as latching event, we just need to ensure that the delay of these pipeline stages does not exceed the clock period. As in all operating modes the clock period is larger than the largest delay of the circuit, the above constraint can be enforced by implementing these pipeline stages to have a delay not exceeding the largest circuit delay.
The above discussion concerns the implementation of all three operating modes in the same design, which allows satisfying various application requirements. On the other hand, if a design needs to support only mode 1, or only modes 2 and/or 3, then the MUX in Fig. 6 is not required: in implementations targeting mode 1 the flip-flops of the pipeline in Fig. 6 will use the rising edge of the clock as latching event, while in implementations targeting modes 2 and/or 3 they will use the falling edge of the clock. Furthermore, in implementations targeting mode 1 the first pipeline stage in Fig. 6 will be implemented to have a delay shorter than the duration Lw of the low level of the clock, while in implementations targeting modes 2 and/or 3, this delay has to be shorter than the largest circuit delay Dmax.
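The rules above for clocking the pipelined OR-tree can be collected in one small helper. This is a sketch under stated assumptions: the function name, the dictionary keys, and the argument `min_l_w_mode1` (the shortest low-level duration used whenever mode 1 is employed) are hypothetical.

```python
def or_tree_clocking(modes, d_max, min_l_w_mode1=None):
    """Derive the latching edge of the OR-tree flip-flops, the need for the
    MCK multiplexer, and the delay bounds of the OR-tree pipeline stages."""
    modes = set(modes)
    if not modes or not modes <= {1, 2, 3}:
        raise ValueError("modes must be a non-empty subset of {1, 2, 3}")
    uses_mode1 = 1 in modes
    uses_mode23 = bool(modes & {2, 3})
    if uses_mode1 and min_l_w_mode1 is None:
        raise ValueError("mode 1 needs the shortest Lw used in that mode")

    # Mode 1 latches on the rising edge, modes 2 and 3 on the falling edge.
    edges = {m: ("rising" if m == 1 else "falling") for m in sorted(modes)}

    # First stage: below Lw for mode 1, below Dmax for modes 2 and 3.
    bounds = []
    if uses_mode1:
        bounds.append(min_l_w_mode1)
    if uses_mode23:
        bounds.append(d_max)

    return {
        "edges": edges,
        "needs_mux": uses_mode1 and uses_mode23,
        "first_stage_max_delay": min(bounds),
        "other_stage_max_delay": d_max,
    }


print(or_tree_clocking({1, 2, 3}, d_max=900, min_l_w_mode1=700))
print(or_tree_clocking({2}, d_max=900))
```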
VI. PARTIAL ELIMINATION OF SHORT-PATH CONSTRAINTS
A cause of area and power penalties for Figs. 1 and 3 (and mode 1) is the enforcement of the short-paths constraint, which requires adding buffers in the short paths. To reduce this cost we can eliminate the short-paths constraint from certain paths [7] as described next.
The regular flip-flops can be partitioned into 3 groups. The first group includes the flip-flops fed by paths that all have delays larger than the duration $H_W$ of the high level of the clock. This is illustrated in Fig. 7 by the circuit cone with $D_{min} > H_W$. This group satisfies the short-paths constraint and does not require adding buffers.
The second group includes each flip-flop that is fed by paths having delays larger than $H_W$ as well as by paths having delays shorter than $H_W$. This is illustrated in Fig. 7 by the circuit cone with $D_{min} < H_W$ and $D_{max} > H_W$. For this group, buffers have to be inserted in the paths having delays shorter than $H_W$, to enforce the short-paths constraint.
The third group includes the flip-flops fed by paths that all have delays shorter than $H_W$. This is illustrated in Fig. 7 by the cone with $D_{max} < H_W$ (the maximum delay of the cone is smaller than $H_W$). As all delays of the third group are short, this group is not sensitive to delay faults. Thus, double-sampling implementations targeting only delay faults may leave these flip-flops unprotected to avoid the related cost. However, if we also wish to cover SEUs, SETs, and clock skews, we need to protect them. Using the double-sampling schemes of Figs. 1 and 3 would require enforcing the short-path constraint. To avoid this cost, we can exploit the duty cycle zone 3 presented in Section V. Indeed, as the third group includes the flip-flops that are fed only by paths having delays shorter than $H_W$, these flip-flops satisfy the conditions of duty cycle zone 3. To also satisfy the requirements concerning the latching event of the error detection flip-flop, we use two separate comparators. The first comparator provides an error indication signal for the first and second flip-flop groups, and the second comparator provides an error indication signal for the third group. Then, a flip-flop captures the output of the second comparator at the falling edge of the clock. The output of this flip-flop is ORed with the output of the first comparator, and the resulting signal is captured by the global error detection flip-flop at the rising edge of the clock, providing the global error detection signal GE. Thanks to this implementation, we can protect the third group of flip-flops without paying the cost of enforcing the short-paths constraint. Note also that, if either of these comparators checks a large number of signals and has a large delay, its OR-tree can be pipelined following the implementation described in Section V.
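The partition into three groups depends only on the minimum and maximum delays of the logic cone feeding each flip-flop. A minimal sketch follows, with hypothetical flip-flop names and delays, and with the assumption that a cone delay exactly equal to the high-level duration is conservatively assigned to group 2:

```python
def flip_flop_group(cone_d_min, cone_d_max, h_w):
    """Group 1: all feeding paths longer than h_w (no buffers needed).
    Group 2: paths on both sides of h_w (buffers on the short ones).
    Group 3: all feeding paths shorter than h_w (zone 3 checking)."""
    if cone_d_min > h_w:
        return 1
    if cone_d_max < h_w:
        return 3
    return 2


def partition(cones, h_w):
    """cones maps a flip-flop name to (d_min, d_max) of its input cone."""
    groups = {1: [], 2: [], 3: []}
    for name, (d_min, d_max) in sorted(cones.items()):
        groups[flip_flop_group(d_min, d_max, h_w)].append(name)
    return groups


# Hypothetical cone delays in picoseconds, with a 300 ps high level.
cones = {"ff_a": (450, 900), "ff_b": (120, 800), "ff_c": (80, 250)}
print(partition(cones, 300))
```

Group 1 and group 2 flip-flops would feed the first comparator, and group 3 flip-flops the second one, whose output is captured at the falling edge of the clock.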
VII. ARCHITECTURE FOR ELIMINATION OF THE REDUNDANT SAMPLING ELEMENTS
Using a redundant sampling element for checking a regular flip-flop induces undesirable area and power penalties. This is particularly true for the power penalty, because sampling elements are power hungry [13] and power constraints are very tight in modern ICs. Thus, reducing the number of redundant sampling elements, as proposed in [7], is highly desirable. To reduce it we observe that:
i. The outputs of the XOR gates comparing the outputs of regular flip-flops against the outputs of redundant sampling elements feed an OR-tree that produces the global error detection signal, which is captured by a flip-flop.
ii. As we do not use the individual error signals produced by the XOR gates, we can: remove the redundant sampling elements; use an XOR gate to compare the input and the output of each regular flip-flop; use an OR-tree to compact the outputs of the XOR gates into a global error detection signal; and use a flip-flop FF-E (error detection flip-flop) to capture the global error detection signal.

The implementation described in point ii is illustrated in Fig. 8 (please ignore for the moment the two small blocks drawn with dashed lines). This figure illustrates the case where the comparator is not pipelined. If we have to check a very large number of signals, the delay of the comparator may become very large, and the comparator may need to be pipelined. In the following we analyze the non-pipelined case. However, this analysis is also valid for the pipelined case by considering the flip-flops of the first pipeline stage instead of the flip-flop FF-E. The case of the other pipeline stages is trivial: we just need to ensure that their delay does not exceed the clock period.
A. Operation and Constraints
The architecture of Fig. 8 is not conventional, as First, to avoid setup time violations, the following long-path constraints must be satisfied.
- Data starting from FF1 at the rising edge of a clock cycle i (latching event of FF1) should reach the error detection flip-flop FF-E earlier than a time $t_{su}$ before the rising edge of clock cycle i + 2 (latching event of FF-E).
- Data starting from FF2 at the rising edge of clock cycle i + 1 (latching event of FF2) should reach FF-E earlier than a time $t_{su}$ before the rising edge of clock cycle i + 2.

These conditions are written as:
$$D_{max} + D_{CMP} < 2T_{CK} - t_{su} \tag{A}$$
$$D_{CMP} < T_{CK} - t_{su} \tag{B}$$
with $T_{CK}$ the clock period, $D_{CMP}$ the delay of the comparator, $D_{max}$ the maximum delay of the combinational logic, and $t_{su}$ the setup time of FF-E. As $D_{max} < T_{CK}$, (B) implies (A). To avoid hold time violations, Fig. 8 must also satisfy the following short-path constraint: data captured by FF1 at the rising edge of clock cycle i + 1 should not reach the input of FF-E before the end of its hold time in cycle i + 2. This condition is the short-path constraint for Fig. 8 and is written as:
$$D_{min} + D_{CMP} > T_{CK} + t_{hold} \tag{C}$$
with $D_{min}$ the minimum delay in any pipeline stage, and $t_{hold}$ the hold time of FF-E.

Let us consider three clock cycles i, i + 1, and i + 2. The propagation of the data captured by flip-flops FF1 at the rising edge of clock cycle i (instant $t_{r_i}$) is illustrated in Fig. 9 by green-colored lines. At a time $D_{min}$ after $t_{r_i}$, the green data can reach some inputs of flip-flops FF2 through short paths, but the values of these inputs are not yet stabilized. Then, at instant $t_{r_i} + D_{max}$ these values are stabilized. They will remain stable until the instant at which the new values (illustrated in Fig. 9 by red-colored lines) captured by flip-flops FF1 at the rising edge of clock cycle i + 1 (instant $t_{r_{i+1}}$) start to influence the inputs of flip-flops FF2. This will happen at a time $D_{min}$ after $t_{r_{i+1}}$. Thus, the propagation of the green data creates stable values on the inputs of flip-flops FF2 in the time interval $[t_{r_i} + D_{max},\ t_{r_{i+1}} + D_{min}]$. These values are captured by flip-flops FF2 at instant $t_{r_{i+1}}$, and they will remain stable on the outputs of FF2 until the rising edge of clock cycle i + 2 (instant $t_{r_{i+2}}$).
The outcome is that the green values coming from the propagation of the data captured by flip-flops FF1 are stable on the inputs of flip-flops FF2 during the time interval $[t_{r_i} + D_{max},\ t_{r_{i+1}} + D_{min}]$, and the same values are stable on the outputs of FF2 during the time interval $[t_{r_{i+1}},\ t_{r_{i+2}}]$. The interval
$$[t_{r_{i+2}} - D_{CMP} - t_{su},\ t_{r_{i+2}} - D_{CMP} + t_{hold}]$$
is within both these intervals. This is due to the relationships obtained below:

As $t_{r_{i+2}} - t_{r_{i+1}} = T_{CK}$, (B) gives $t_{r_{i+1}} < t_{r_{i+2}} - D_{CMP} - t_{su}$.

As the maximum delay of the combinational logic is shorter than the clock period (i.e., $D_{max} < T_{CK}$), then $t_{r_i} + D_{max} < t_{r_{i+1}}$, and from the above relationship we also have $t_{r_i} + D_{max} < t_{r_{i+2}} - D_{CMP} - t_{su}$.

As $t_{r_{i+2}} - t_{r_{i+1}} = T_{CK}$, (C) gives $t_{r_{i+1}} + D_{min} > t_{r_{i+2}} - D_{CMP} + t_{hold}$.

As $T_{CK} > D_{min}$, we have $t_{r_{i+2}} > t_{r_{i+1}} + D_{min}$, and from the above relationship we also have $t_{r_{i+2}} > t_{r_{i+2}} - D_{CMP} + t_{hold}$.

Therefore, the green values coming from the propagation of the data captured by flip-flops FF1 at the rising edge of clock cycle i are stable on the inputs and the outputs of flip-flops FF2 (which are also the inputs of the comparator) during the time interval $[t_{r_{i+2}} - D_{CMP} - t_{su},\ t_{r_{i+2}} - D_{CMP} + t_{hold}]$. Thus, the comparator compares these equal values and provides the result on the inputs of FF-E after a time $D_{CMP}$. The interval $[t_{r_{i+2}} - D_{CMP} - t_{su},\ t_{r_{i+2}} - D_{CMP} + t_{hold}]$ of stable values on the inputs of the comparator translates into the interval $[t_{r_{i+2}} - t_{su},\ t_{r_{i+2}} + t_{hold}]$ of stable values on the inputs of FF-E, which satisfies the setup and hold times of FF-E, resulting in a valid comparison.
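The argument above can be checked numerically. The sketch below assumes the long-path constraints in the forms $D_{max} + D_{CMP} < 2T_{CK} - t_{su}$ (A) and $D_{CMP} < T_{CK} - t_{su}$ (B), and the short-path constraint in the form $D_{min} + D_{CMP} > T_{CK} + t_{hold}$ (C), as the surrounding derivation implies; the function names and the delay values (in picoseconds) are hypothetical.

```python
def fig8_constraints(t_ck, d_min, d_max, d_cmp, t_su, t_hold):
    """Evaluate the long-path (A, B) and short-path (C) constraints."""
    return {
        "A": d_max + d_cmp < 2 * t_ck - t_su,
        "B": d_cmp < t_ck - t_su,
        "C": d_min + d_cmp > t_ck + t_hold,
    }


def comparison_window_is_stable(t_ck, d_min, d_max, d_cmp, t_su, t_hold):
    """Check that [t_ri+2 - D_CMP - t_su, t_ri+2 - D_CMP + t_hold] lies inside
    both the stable interval of the FF2 inputs and that of the FF2 outputs.
    The rising edge of cycle i is taken as time zero."""
    t_ri, t_ri1, t_ri2 = 0, t_ck, 2 * t_ck
    lo = t_ri2 - d_cmp - t_su
    hi = t_ri2 - d_cmp + t_hold
    inputs_stable = (t_ri + d_max, t_ri1 + d_min)
    outputs_stable = (t_ri1, t_ri2)
    return all(a <= lo and hi <= b for a, b in (inputs_stable, outputs_stable))


params = dict(t_ck=1000, d_max=900, d_cmp=700, t_su=30, t_hold=20)
print(fig8_constraints(d_min=400, **params))
print(comparison_window_is_stable(d_min=400, **params))
print(comparison_window_is_stable(d_min=100, **params))  # short path too fast
```

With a shortest path of 100 ps the short-path constraint fails and the comparison window is no longer covered by stable comparator inputs, which is the situation the buffers are meant to prevent.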
B. Duration of Detectable Faults
The instant $t_C$ of the actual comparison of the inputs and outputs of the regular flip-flops is:
$$t_C = t_{EI} - D_{CMP} \tag{1}$$
where $t_{EI}$ is the instant of the latching event of the error detection flip-flop. At any clock cycle i + 1, the duration δ of detectable delay faults is equal to the time separating the latching event of the regular flip-flops (i.e., the rising edge $t_{r_{i+1}}$ of clock cycle i + 1) from the subsequent instant of comparison $t_C$. That is:
$$\delta = t_C - t_{r_{i+1}} \tag{2}$$
As the latching event of the error detection flip-flop FF-E is the rising edge of the clock, then, for the data captured by FF2 at clock cycle i + 1, the result of the comparison is captured by FF-E at the rising edge of clock cycle i + 2. Thus, in (1) $t_{EI} = t_{r_{i+2}}$, giving the comparison instant $t_C = t_{r_{i+2}} - D_{CMP}$. Setting this value in (2) gives:
$$\delta = t_{r_{i+2}} - D_{CMP} - t_{r_{i+1}} = T_{CK} - D_{CMP} \tag{D}$$
where $T_{CK} = t_{r_{i+2}} - t_{r_{i+1}}$ is the clock period.
C. Constraints Enforcement
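From the above analysis we have to satisfy (B), (C), and (D):
$$D_{CMP} < T_{CK} - t_{su} \tag{B}$$
$$D_{min} + D_{CMP} > T_{CK} + t_{hold} \tag{C}$$
$$\delta = T_{CK} - D_{CMP} \tag{D}$$
(B) and (D) imply
$$\delta > t_{su} \tag{E}$$
To satisfy (E) we will always select a duration of detectable faults larger than the setup time $t_{su}$ of the flip-flops. This is not constraining because the setup time has a small duration. Thus, practical durations δ of detectable delay faults should in any case be considerably larger than $t_{su}$. (C) and (D) imply
$$\delta < D_{min} - t_{hold} \tag{F}$$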
We observe that condition (F) is identical to the short-path constraint for the standard double-sampling schemes of Fig. 1 and Fig. 3. Thus, if condition (F) is not satisfied for the target duration δ of detectable delay faults, then, similarly to the standard double-sampling scheme, we will insert buffers in the circuit to increase $D_{min}$ to a value $D_{min} = \delta + t_{hold} + D_{marg}$, where the value of $D_{marg}$ is selected to enforce (F) with sufficient timing margins. This enforcement implies the same area and power penalties as in the standard double-sampling schemes of Fig. 1 and Fig. 3.

If the target duration $\delta_{trg}$ of detectable faults is larger than the value of δ given by (D) (i.e., $\delta_{trg} > T_{CK} - D_{CMP}$), we can add a delay $D_X$ on the clock input of the error detection flip-flop. Then, the latching event of FF-E occurs at the instant $t_{EI} = t_{r_{i+2}} + D_X$, and from (1) and (2) we find
$$\delta = T_{CK} + D_X - D_{CMP} \tag{D'}$$
Thus, by adding a delay $D_X = \delta_{trg} + D_{CMP} - T_{CK}$ on the clock terminal of FF-E, the value δ obtained from (D') becomes equal to the target duration $\delta_{trg}$ of detectable faults.
As adding the delay $D_X$ modifies the instant of the latching event of FF-E, the conditions (B) and (C) are also modified. We obtain instead:
$$D_{CMP} < T_{CK} + D_X - t_{su} \tag{B'}$$
$$D_{min} + D_{CMP} > T_{CK} + D_X + t_{hold} \tag{C'}$$
Condition (B') gives the long-path constraint for the implementation using a delay $D_X$ on the clock terminal of FF-E, and (C') gives the short-path constraint for this implementation. From (D') we find $D_{CMP} = D_X + T_{CK} - t_{su} - (\delta - t_{su})$, which guarantees (B') with a timing margin $(\delta - t_{su})$. As the setup time of the flip-flops is a relatively small value, in practice the target duration $\delta = \delta_{trg}$ of detectable delay faults will be much larger than $t_{su}$. Thus, (B') will be guaranteed with comfortable timing margins without imposing any circuit modification. In the very improbable case where a more comfortable margin is required, we may need to slightly increase $D_X$ to a value larger than the value $D_X = \delta_{trg} - T_{CK} + D_{CMP}$ required by the target duration $\delta_{trg}$ of detectable faults. This will increase the timing margin in the enforcement of the long-path constraint (B'), but it will also increase the duration of detectable delay faults to a value $\delta > \delta_{trg}$.
It remains to enforce (C'). Combining (C') and (D') gives the same constraint as without $D_X$:
$$\delta < D_{min} - t_{hold} \tag{F}$$
As previously, we will enforce (F) by adding buffers to the short paths, in order to increase $D_{min}$ to a value larger than $\delta + t_{hold}$, i.e.,
$$D_{min} = \delta + t_{hold} + D_{marg} \tag{G'}$$
where the value of $D_{marg}$ is selected to create sufficient timing margin in enforcing (F). Combining (G') and (D') gives $D_{min} + D_{CMP} = D_X + T_{CK} + t_{hold} + D_{marg}$, which guarantees the short-path constraint (C') with a timing margin $D_{marg}$.
Let us now consider the case where the target duration $\delta_{trg}$ of detectable faults is smaller than the value of δ given by (D) (i.e., $\delta_{trg} < T_{CK} - D_{CMP}$). From the fault coverage point of view this is acceptable, as the circuit will detect even faults of duration larger than the target one. However, a large value of δ will require a higher cost for enforcing condition (F). To avoid this extra cost, we can add a delay $D_X$ on the data input of the error detection flip-flop. Then, the delay of the comparator is increased by $D_X$. Thus, in all relations we have to replace $D_{CMP}$ by $D_{CMP} + D_X$, and (D) becomes
$$\delta = T_{CK} - D_{CMP} - D_X \tag{D''}$$
Thus, by adding a delay $D_X = T_{CK} - D_{CMP} - \delta_{trg}$ on the data input of FF-E, the value δ obtained from (D'') becomes equal to the target duration $\delta_{trg}$ of detectable faults. Also, replacing $D_{CMP}$ by $D_{CMP} + D_X$ in (B) and (C) gives
$$D_{CMP} + D_X < T_{CK} - t_{su} \tag{B''}$$
$$D_{min} + D_{CMP} + D_X > T_{CK} + t_{hold} \tag{C''}$$
Condition (B'') gives the long-path constraint for the implementation using a delay $D_X$ on the data input of FF-E, and (C'') gives the short-path constraint. From (D'') we find $D_{CMP} + D_X = T_{CK} - t_{su} - (\delta - t_{su})$, which guarantees (B'') with a timing margin $(\delta - t_{su})$. As the setup time of the flip-flops is a relatively small value, in practice the target duration $\delta = \delta_{trg}$ of detectable faults will be much larger than $t_{su}$. Thus, (B'') will be guaranteed with comfortable timing margins without imposing any circuit modification. In the very improbable case where a more comfortable margin is required, we may need to slightly reduce $D_X$ to a value smaller than the value $D_X = T_{CK} - D_{CMP} - \delta_{trg}$ required by the target duration $\delta_{trg}$ of detectable faults. This will increase the timing margin in the enforcement of the long-path constraint (B''), but it will also increase the duration of detectable faults to a value $\delta > \delta_{trg}$.
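Then, it remains to enforce (C''). Combining (C'') and (D'') gives again
$$\delta < D_{min} - t_{hold} \tag{F}$$
As previously, we enforce (F) by adding buffers to the short paths, in order to increase $D_{min}$ to the value
$$D_{min} = \delta + t_{hold} + D_{marg} \tag{G''}$$
Combining (G'') and (D'') gives $D_{min} + D_{CMP} + D_X = T_{CK} + t_{hold} + D_{marg}$, which guarantees the short-path constraint (C'') with a timing margin $D_{marg}$.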
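The choice between a delay on the clock terminal and a delay on the data input of FF-E, together with the resulting constraints, can be sketched as a small planning routine. It assumes the relations in the forms $\delta = T_{CK} + D_X - D_{CMP}$ for the clock-side delay and $\delta = T_{CK} - D_{CMP} - D_X$ for the data-side delay, with the matching long-path and short-path conditions; the function name, the returned keys, and the picosecond values are hypothetical.

```python
def plan_error_ff_delay(t_ck, d_cmp, delta_trg, t_su, t_hold, d_marg):
    """Pick the side and value of D_X for a target detectable duration, then
    report the required D_min and whether both constraints hold."""
    natural = t_ck - d_cmp                    # duration without any D_X
    required_d_min = delta_trg + t_hold + d_marg
    if delta_trg > natural:                   # delay the clock of FF-E
        side, d_x = "clock", delta_trg + d_cmp - t_ck
        long_ok = d_cmp < t_ck + d_x - t_su
        short_ok = required_d_min + d_cmp > t_ck + d_x + t_hold
    elif delta_trg < natural:                 # delay the data input of FF-E
        side, d_x = "data", t_ck - d_cmp - delta_trg
        long_ok = d_cmp + d_x < t_ck - t_su
        short_ok = required_d_min + d_cmp + d_x > t_ck + t_hold
    else:                                     # target already met
        side, d_x = "none", 0
        long_ok = d_cmp < t_ck - t_su
        short_ok = required_d_min + d_cmp > t_ck + t_hold
    return {"side": side, "d_x": d_x, "required_d_min": required_d_min,
            "long_path_ok": long_ok, "short_path_ok": short_ok}


# T_CK = 1000 ps and D_CMP = 700 ps give a natural duration of 300 ps.
print(plan_error_ff_delay(1000, 700, 400, t_su=30, t_hold=20, d_marg=50))
print(plan_error_ff_delay(1000, 700, 200, t_su=30, t_hold=20, d_marg=50))
```

In both calls the short-path condition holds with exactly the margin `d_marg`, and the long-path condition holds as long as the target duration exceeds the setup time.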
D. Comparisons and Adaptivity to Aging and Variations
From the above discussion, enforcing the short-path and long-path constraints and also enabling the detection of timing faults up to a target duration δ = δ trg , requires:
i. Adding buffers in the short paths to satisfy the constraint $\delta < D_{min} - t_{hold}$.
ii. Adding a delay $D_X = \delta_{trg} + D_{CMP} - T_{CK}$ on the clock terminal of the error detection flip-flop FF-E, if the target duration $\delta_{trg}$ of detectable faults is larger than $T_{CK} - D_{CMP}$; or adding a delay $D_X = T_{CK} - D_{CMP} - \delta_{trg}$ on the data input of FF-E, if the target duration $\delta_{trg}$ of detectable faults is smaller than $T_{CK} - D_{CMP}$.

We observe that the circuit modification described in point i is identical to the enforcement of the short-path constraint in the standard double-sampling implementations of Fig. 1 and Fig. 3. Thus, the area and power costs of this enforcement are similar. On the other hand, the architecture proposed in Fig. 8 requires insignificant additional cost (i.e., the cost of one delay element $D_X$, as described in point ii). Thus, it removes the redundant sampling elements used in the standard double-sampling scheme by adding just a delay block. Hence, this architecture enables a drastic reduction of area and power penalties.
Note that if the comparator is pipelined, one delay element $D_X$ has to be added to each flip-flop of the first stage of the pipeline. However, the number of these flip-flops will be very low with respect to the total number of flip-flops of the circuit (e.g., if the first pipeline stage of the comparator comprises 10 levels of two-input OR gates, the number of flip-flops required to pipeline the comparator will be about 0.1% of the number of checked flip-flops). Thus, the added delay elements will represent a very small fraction of the circuit area.
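The 0.1% figure follows from the fan-in of a balanced OR-tree: each first-stage flip-flop collects the outputs of $2^{levels}$ XOR gates. A small sketch, with hypothetical function names and a hypothetical count of checked flip-flops:

```python
def first_stage_flip_flops(n_checked, levels):
    """Number of OR-tree flip-flops after `levels` levels of two-input OR
    gates, i.e. ceil(n_checked / 2**levels), using integer arithmetic."""
    fan_in = 2 ** levels
    return -(-n_checked // fan_in)


def overhead_fraction(levels):
    """Ratio of first-stage OR-tree flip-flops to checked flip-flops for a
    fully balanced tree."""
    return 1 / 2 ** levels


print(first_stage_flip_flops(100_000, 10))
print(f"{overhead_fraction(10):.4%}")
```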
So far, we have found that the architecture of Fig. 8 removes the redundant flip-flops at insignificant cost, resulting in a double-sampling architecture inducing very low area penalty and even lower power penalty. However, it is worth asking whether these gains come without any hidden drawbacks. To answer this question we need to carefully examine relations (B'), (C'), and (D'). This examination reveals a certain weak point for future technologies that could be highly prone to random variations. With such variations, it is possible that certain circuit delays increase while others decrease. It is then important to provide the design with self-adaptive capabilities:
i. If, due to such variations, the delay of some circuit paths exceeds the clock period, inducing frequent detections by the double-sampling scheme, the circuit should reduce its clock frequency. This change increases T_CK and thus reinforces constraint (B ). However, it weakens and may even violate constraint (C ). The solution is to implement a programmable delay line on the clock input of FF-E, so that when we increase T_CK we can compensate by reducing D_X by the same amount, keeping the sum D_X + T_CK constant in relations (B ), (C ), and (D ).
ii. If D_CMP increases to an extent that (B ) is violated, we can increase D_X to re-establish it. This action increases the duration of detectable faults (relation (D )), but weakens constraint (C ). However, since D_CMP has increased, the net effect on (C ) can be nil. Nevertheless, care is needed because the comparator has multiple inputs, and the increase of its delay may affect only a subset of them. For the remaining inputs, (C ) could still be weakened; if the increase of D_X exceeds the timing margin D_marg, (C ) will be violated for these inputs. Hence, as future technologies become more sensitive to process variations and aging-induced failures, a larger margin D_marg has to be used during the design phase to avoid this violation.
iii. If D_CMP decreases to an extent that violates (C ), we can decrease D_X to re-establish it. If D_CMP decreases only for a subset of its paths, this action will decrease the duration of detectable faults (relation (D )) and will also weaken constraint (B ). However, the larger margin D_marg used in increasingly sensitive technologies, as required in point ii, will avoid the violation of (C ), and no decrease of D_X will be required.
iv. If D_min decreases to an extent that violates (C ), we can decrease D_X to re-establish it. The effect will be similar to point ii. However, similarly to point iii, thanks to the increased D_marg, the violation of (C ) will be avoided.
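The compensation described in point i can be illustrated with a minimal Python sketch. It assumes a tapped delay line with a uniform step, which the text does not specify; the names DelayLine and rebalance_dx and all numeric values are hypothetical. Because the delay line is quantized, the sum D_X + T_CK is preserved only to within half a step.

```python
from dataclasses import dataclass


@dataclass
class DelayLine:
    """Hypothetical programmable delay line on the clock input of FF-E."""
    step_ps: int   # delay added per tap (assumed uniform)
    num_taps: int  # highest selectable tap index

    def delay_ps(self, tap: int) -> int:
        """Delay D_X produced by the given tap setting."""
        return tap * self.step_ps


def rebalance_dx(line: DelayLine, tap: int, old_tck_ps: int, new_tck_ps: int) -> int:
    """Return the tap that keeps D_X + T_CK closest to its previous value.

    Raises ValueError when the delay line cannot absorb the change in T_CK.
    """
    target_sum = line.delay_ps(tap) + old_tck_ps  # sum to be held constant
    wanted_dx = target_sum - new_tck_ps           # D_X that restores the sum
    new_tap = round(wanted_dx / line.step_ps)     # nearest available setting
    if not 0 <= new_tap <= line.num_taps:
        raise ValueError("requested D_X is outside the delay-line range")
    return new_tap


# Illustrative numbers only: 10 ps taps, D_X = 200 ps, T_CK raised from 1000 to 1050 ps.
line = DelayLine(step_ps=10, num_taps=31)
new_tap = rebalance_dx(line, tap=20, old_tck_ps=1000, new_tck_ps=1050)
```

In this example the clock period grows by 50 ps, so the sketch selects a tap that shortens D_X by 50 ps. A request that would need a negative or out-of-range tap is rejected rather than silently clamped, since clamping would change the sum that the relations depend on.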
The above discussion concerns the case of double sampling using the relations (B ), (C ) and (D ). The analysis for the case of double sampling using the relations (B ), (C ) and (D ) is similar and leads to similar means (adapting the clock frequency, controlling the value of D_X, and using a larger margin D_marg) for implementing highly adaptive double sampling.
To enable changing the value of D_X, as required in cases i and ii, this delay has to be programmable. A programmable D_X is also useful during retry. Indeed, as mentioned in Section II-B, the clock frequency must be reduced during retry in order to correct timing faults. Thus, to enforce the short-path constraint, a higher value of D_X should be supplied. Also, in designs using dynamic voltage and frequency scaling (DVFS), different voltage-frequency pairs may require different values of D_X to enforce the long-path and short-path conditions. Thus, a programmable D_X is also useful for implementing DVFS.
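One possible way to organize a programmable D_X under DVFS is a per-operating-point lookup, sketched below in Python. This is an assumption about how a controller might be structured, not a description of the paper's implementation. The names OperatingPoint, DX_TAP_TABLE, and dx_tap_for are hypothetical, and the table entries are placeholders rather than characterized values.

```python
from typing import Dict, NamedTuple


class OperatingPoint(NamedTuple):
    """A voltage-frequency pair used by a hypothetical DVFS controller."""
    vdd_mv: int
    freq_mhz: int


# Placeholder settings only: real values would come from timing characterization
# of the long-path and short-path conditions at each operating point.
DX_TAP_TABLE: Dict[OperatingPoint, int] = {
    OperatingPoint(900, 1000): 12,
    OperatingPoint(800, 750): 16,
    OperatingPoint(700, 500): 22,
}


def dx_tap_for(point: OperatingPoint,
               table: Dict[OperatingPoint, int] = DX_TAP_TABLE) -> int:
    """Return the delay-line setting for D_X at the given operating point."""
    try:
        return table[point]
    except KeyError:
        # Refuse to guess: an uncharacterized point has no safe D_X setting.
        raise KeyError(f"no characterized D_X setting for {point}") from None


tap = dx_tap_for(OperatingPoint(800, 750))
```

The same table could hold an extra entry for the reduced-frequency retry mode, so that the higher D_X needed during retry is selected in the same way as a DVFS operating point.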
In comparison, the adaptivity of the standard double-sampling scheme of Fig. 1 is achieved by modifying the clock frequency when the long-path constraint is violated, and the high level of the clock when the short-path constraint is violated. Changing the high level of the clock is more complex and costly than changing the delay D_X. Thus, in this regard, the double-sampling scheme of Fig. 8 is more advantageous. However, this scheme also requires a more comfortable margin D_marg, as the sensitivity to variations and aging-induced failures will increase in future technologies. Therefore, the scheme of Fig. 8 eliminates the redundant sampling elements, but achieving high adaptivity to increasing variations and accelerated aging in future technologies requires a larger margin D_marg, which increases the value of D_min and the related cost. Most often, the cost of the redundant sampling elements (in particular, the power cost) will be much lower than the cost of implementing a higher value of D_min. In some rare cases, however, this may not hold. Thus, the choice between the architectures of Fig. 1 and Fig. 8 offers the designer tradeoffs for minimizing cost.
VIII. CONCLUSION
Double-sampling architectures provide a low-cost solution for mitigating numerous issues affecting nanometric technologies, such as PVT variations, accelerated circuit aging, EMI, soft errors, power dissipation, and thermal issues. As a consequence, they have attracted keen interest from academic research and industrial R&D teams.
In this paper, we describe the basic double-sampling architectures; their various applications in error detection, failure prediction, self-calibration, reliability, yield improvement, power reduction, and speed increase; their evaluation by industrial teams; and their limitations. We then present a compendium of alternative implementations and their detailed analysis. These implementations enable important improvements in area cost, power cost, and fault coverage, and offer designers a wide space of solutions for meeting their goals at minimum cost. They should gain significant importance in the future, as PVT variations, accelerated aging, and power densities increasingly affect deep nanometric technologies.
Note also that an alternative to double sampling employs transition detectors that check the stability of flip-flop inputs [16]. This approach has its own advantages and drawbacks and was successfully used in [11], [13], and [14]. However, due to space limitations, we discussed only the double-sampling scheme.
