Low overhead Soft Error Mitigation techniques for high-performance and aggressive systems by Avirneni, Naga Durga Prasad et al.
Electrical and Computer Engineering 
Conference Papers, Posters and Presentations Electrical and Computer Engineering 
2009 
Low overhead Soft Error Mitigation techniques for high-
performance and aggressive systems 
Naga Durga Prasad Avirneni 
Iowa State University 
Viswanathan Subramanian 
Iowa State University 
Arun K. Somani 
Iowa State University, arun@iastate.edu 
Recommended Citation 
Avirneni, Naga Durga Prasad; Subramanian, Viswanathan; and Somani, Arun K., "Low overhead Soft Error 
Mitigation techniques for high-performance and aggressive systems" (2009). Electrical and Computer 
Engineering Conference Papers, Posters and Presentations. 127. 
https://lib.dr.iastate.edu/ece_conf/127 
Abstract 
The threat of soft error induced system failure in high performance computing systems has become more 
prominent, as we adopt ultra-deep submicron process technologies. In this paper, we propose two 
techniques, namely soft error mitigation (SEM) and soft and timing error mitigation (STEM), for protecting 
combinational logic blocks from soft errors. Our first technique (SEM), based on distributed and temporal 
voting of three registers, unloads the soft error detection overhead from the critical path of the systems. 
Our second technique (STEM) adds timing error detection capability to guarantee reliable execution in 
aggressively clocked designs that enhance system performance by operating beyond worst-case clock 
frequency. We also present a specialized low overhead clock generation scheme that ably supports our 
proposed techniques. Timing annotated gate level simulations, using 45 nm libraries, of a pipelined adder-
multiplier and DLX processor show that both our techniques achieve near 100% fault coverage. For DLX 
processor, even under severe fault injection campaigns, SEM achieves an average performance 
improvement of 26.58% over a conventional triple modular redundancy voter based soft error mitigation 
scheme, while STEM outperforms SEM by 27.42%. 
Keywords 
Parameter Variations, Soft Error, Dependable and Adaptive Systems, Overclocking 
Disciplines 
Computer and Systems Architecture | Systems and Communications 
Comments 
This is a manuscript of a proceeding published as Avirneni, Naga Durga Prasad, Viswanathan 
Subramanian, and Arun K. Somani. "Low overhead Soft Error Mitigation techniques for high-performance 
and aggressive systems." In 2009 IEEE/IFIP International Conference on Dependable Systems & Networks 
(2009): 185-194. DOI: 10.1109/DSN.2009.5270340. Posted with permission. 
Low Overhead Soft Error Mitigation Techniques for
High-Performance and Aggressive Systems∗
Naga Durga Prasad Avirneni, Viswanathan Subramanian, and Arun K. Somani
Dependable Computing and Networking Laboratory
Iowa State University, Ames, IA, USA
{avirneni,visu,arun}@iastate.edu
1. Introduction
Nano-sized transistors, coupled with deployment in hazardous
environments, have magnified the reliability concerns
plaguing modern computing systems. Rapid enhancements
in VLSI technology have fueled the increasing apprehension
of system hardware being susceptible to a myriad of
faults. Many fault tolerance techniques have been proposed at
different levels of the design hierarchy, starting from the design of
hardened latches to system-level fault tolerant architectures
[1, 2, 8, 11]. All these techniques strive to provide high degrees
of fault coverage by providing redundancy in the
information, spatial or temporal domain.
∗The research reported in this paper is partially supported by NSF grant
number 0311061, Information Infrastructure Institute (iCUBE) and the
Jerry R. Junkins Endowment at Iowa State University.
In the past, single event upsets (SEUs) were a major con-
cern only in space applications, creating hard threats like
loss of control, and resulting in catastrophic failures. An
SEU is induced when a high energy particle, either from
cosmic radiation or decaying radioactive materials, strikes
the silicon substrate. If enough charge is deposited by the
strike, it causes a bit flip in the memory cell, or a transient
pulse in the combinational logic. The latter is referred to
as a Single Event Transient (SET). Radiation induced SETs
have widths in the range of 500ps to 900ps in the 90nm
process, as compared to 400ps to 700ps in the 130nm pro-
cess [13]. As a result, terrestrial applications also require
fault tolerance techniques to ensure their dependability.
In current and future technologies, the problem of soft
errors in combinational circuits is becoming comparable to
that of unprotected memory elements [14]. Providing fault
tolerance capabilities for random and complex logic is expensive,
both in terms of area and power. Techniques such as
duplication and comparison, and temporal triple modular
redundancy (TMR) and majority voting, have been proposed
to mitigate the soft error rate (SER) in logic circuits [10].
These approaches incur performance overhead, even dur-
ing error-free operation. Also at this juncture, when static
power is comparable to dynamic power, logic replication is
not a viable alternative.
Increasing system-wide integration forces designers to
adopt worst-case design methodologies, which require
safety margins to be added to individual system components
to address parameter variations that include intra-die and
inter-die process variations, and environmental variations,
which include temperature and voltage variations [4, 5].
These additional guard bands are becoming non-negligible
in nanometer technologies. Designers conservatively add
these safety margins to salvage chips from timing failures
and shortened lifetime. Most systems are characterized to
operate safely within vendor specified operating frequency.
When they are operated beyond this rated frequency, timing
errors that lead to system failure may happen.
Overclocking, as a means to improve performance, is a
popular technique among high-performance enthusiasts [6].
Microprocessor vendors are even introducing overclocking
capabilities in their chipsets; examples being AMD’s Over-
drive and Advanced Clock Calibration techniques. Circuits
exhibit worst-case delay only when their longest delay paths
are exercised by the inputs. However, these worst-case de-
lay inducing inputs and operating conditions are rare, lead-
ing to room for performance improvement that overclock-
ers exploit [3]. The problem is that timing errors occur
at overclocked speeds and may lead to unpredictable sys-
tem behavior and loss of data. Aggressive, but reliable, de-
sign methodologies employ relevant timing error detection
and recovery schemes to prevent erroneous data from be-
ing used [7]. In [16], it has been shown that the operating
frequency can be increased reliably beyond the worst-case
limit; allowing systems to operate at an optimal overclocked
frequency, by adapting to the current set of instructions and
environmental conditions. Moreover, many systems operate
at an overclocked frequency, which is 15-20% higher than
worst-case frequency, without increasing the error rate be-
yond 1% [17].
Safety critical systems with hard real-time constraints
require wide fault coverage with no compromise in per-
formance. An interesting capability in nanometer design
space, we believe, is to provide soft error tolerant reliable
execution for high performance aggressive designs. In this
paper, we propose new ways of designing fault tolerant and
reliably overclocked register cells that enable systems to im-
prove both their performance and dependability.
1.1. Our Contribution
In this paper, we address the issue of soft errors in ran-
dom logic and develop solutions that provide fault tolerance
capabilities without requiring logic duplication. We pro-
pose two techniques that have low area and performance
overhead. Our first technique, SEM, replaces register ele-
ments in a circuit with Soft Error Mitigation (SEM) regis-
ter cells. SEM allows systems to operate without the over-
head of soft error detection circuitry. Our second technique,
STEM, concurrently detects and corrects soft and timing er-
rors using Soft and Timing Error Mitigation (STEM) reg-
ister cells. STEM cells have soft error mitigation capabil-
ities comparable to those of SEM cells, and they also sup-
port reliable overclocking. Both of our techniques employ a
distributed and temporal voting scheme that enables in-situ
error detection and fast recovery. For error detection and
correction, our temporal sampling mechanisms sample data
at three different time intervals. In both SEM and STEM
techniques, we support circuit level speculation. We allow
data to move forward speculatively, and when an error hap-
pens we void the computation and perform re-computation.
Both SEM and STEM cells require three clocks for
proper operation. Clock distribution and routing are sig-
nificant challenges in nanoscale technologies. Clock dis-
tribution network (CDN) consumes a significant portion of
the power, area and metal resources in an integrated circuit.
As a consequence, a specialized clock generation, distribu-
tion and routing scheme that minimizes the clock distribu-
tion overhead incurred by our fault mitigation techniques
is important. Also, to support reliable dynamic overclock-
ing, as discussed in [16], it is important to precisely control
the relative phase shifts of clock signals at high frequen-
cies. Therefore, we focus on developing an efficient local
clock manager (LCM), which helps in generating the re-
quired clock signals, with the desired phase shifts, locally.
The clocks, so generated and distributed, satisfy the timing
constraints required for proper working of our techniques.
We also analyze the area overhead incurred for developing
such LCMs.
For our initial experimental study, we integrated our data
sampling mechanisms into a two stage pipeline consisting
of an adder and a multiplier. Our results show that, with
STEM cells, performance of this system can be increased
by 55.93% over conventional TMR schemes, while provid-
ing near 100% fault coverage. In order to fully understand
the performance improvement and fault coverage that our
schemes can provide to a microprocessor, we experimented
with three micro-benchmark applications on a DLX proces-
sor. For the processor, our results show that SEM technique
achieves an average performance improvement of 26.58%
over the TMR scheme and STEM outperforms SEM by
27.42%, while providing near 100% fault coverage.
The remainder of this paper is organized as follows. In
Section 2, we describe our soft error mitigation technique
and recovery mechanism. Section 3 describes how both
timing error and soft error are concurrently detected and
corrected. In Section 4, we discuss the issues in designing
a pipeline system with our proposed soft/timing error miti-
gation techniques. Section 5 discusses the implementation
and area overheads of implementing a local clock manager.
We present our results in Section 6. Section 7 presents the
related work and Section 8 concludes the paper.
2. Soft Error Mitigation
Prior soft error mitigation techniques at the circuit level
are either based on temporal redundancy, spatial redun-
dancy or a combination of both. These techniques achieve
a high degree of fault coverage, while degrading or trading
performance, silicon area and other resources. For exam-
ple, in [10], a specific design of a voting mechanism based
on temporal triple modular redundancy is discussed, which
mitigates all single event upsets. However, the overhead
incurred is very high, as the operating frequency of a sys-
tem built with such fault mitigation scheme must include
the delays of combinational logic blocks, phase shifts of the
clocks and the delay incurred by the voter. In this section,
we present a variant of this scheme, and show that with a
combination of local and global recovery, we can remove
the additional overhead imposed, by the fault mitigation
scheme, on the system operating frequency.
The intent of our scheme is to make systems operate at
frequencies same as that of non fault tolerant designs, by un-
Figure 1: SEM Cell
case timing delay estimation. To keep the overhead of er-
ror detection and recovery off the critical path, we present
the following redundancy organization using our Soft Er-
ror Mitigation (SEM) cells. Figure 1(a) shows a gate-level
embodiment of a SEM cell. It consists of three registers
R1, R2 and R3, clocked by clock signals CLK1, CLK2 and
CLK3, respectively. Data is sampled at three different time
intervals T1, T2 and T3, and is stored in registers R1, R2
and R3, respectively.
TIMING CONSTRAINTS : Figure 1(b) shows the timing
relationship between the clock signals and the data sam-
pling intervals. Clock signals, CLK1, CLK2 and CLK3, have
the same frequency, but they are out-of-phase by an amount
governed by the timing constraints, explained below. Data
is stored in registers at the rising edge of the clock signals,
and strict timing constraints are required for efficient mit-
igation of soft errors. Contamination delay (TCD) is the
minimum amount of time beginning from when the input
to a logic becomes stable and valid to the time that the out-
put of that logic begins to change. Propagation delay (TPD)
refers to the maximum delay of the circuit, under worst-case
conditions. TPW is the soft error/noise pulse width.
Equations (1) and (2) ensure that registers R1, R2 and R3
are not corrupted by the same soft error. Since the system
is running at CLK1 frequency, data is forwarded specula-
tively to subsequent stages after latching in register R1, and
subsequent stages start their computation immediately.
Φ1 = T2 − T1 ≥ TPW (1)
Φ2 = T3 − T2 ≥ TPW (2)
Short paths present in the combinational circuit may cor-
rupt the data before it gets latched in registers R2 and R3.
Consequently, it is required to constrain short paths so that
the same data registered in R1 is also latched in registers
R2 and R3, during error-free operation. Equation (3) ensures
that this condition is met, by increasing the contamination
delay above the desired combined phase shift values,
given by Φ1 and Φ2. Equation (4) makes sure that temporal
sampling happens only after the computation by the combi-
national logic is done. Our technique is capable of detecting
all SEUs happening on registers, and all SETs having pulse
duration less than TPW .
TCD ≥ Φ1 + Φ2 (3)
T ≥ TPD (4)
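Equations (1) to (4) can be checked mechanically for a candidate clocking plan. The Python sketch below is illustrative only: the function name, the argument names and the numeric values are assumptions, and T is taken to be the clock period, which the text implies but does not state explicitly.

```python
def check_sem_timing(t1, t2, t3, period, t_pw, t_cd, t_pd):
    """Return a dict mapping each SEM timing constraint to True/False.

    All times share one unit (for example picoseconds).
    t1, t2, t3 : sampling instants of R1, R2 and R3 within a cycle
    period     : clock period T (assumed meaning of T in Equation 4)
    t_pw       : widest soft error pulse to be tolerated (TPW)
    t_cd, t_pd : contamination and propagation delay of the logic
    """
    phi1 = t2 - t1
    phi2 = t3 - t2
    return {
        "eq1_phi1_ge_tpw": phi1 >= t_pw,
        "eq2_phi2_ge_tpw": phi2 >= t_pw,
        "eq3_tcd_ge_phi_sum": t_cd >= phi1 + phi2,
        "eq4_period_ge_tpd": period >= t_pd,
    }


# Hypothetical numbers, chosen only to exercise the checks.
ok = check_sem_timing(t1=0, t2=900, t3=1800, period=5000,
                      t_pw=900, t_cd=2000, t_pd=4800)
# Same plan, but with short paths that are too fast for Equation (3).
bad = check_sem_timing(t1=0, t2=900, t3=1800, period=5000,
                       t_pw=900, t_cd=1500, t_pd=4800)
```

In the second call only the short-path constraint fails, which corresponds to the situation where R2 and R3 could latch data from the next computation.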
Table 1: Possible Soft Error Scenarios
CASE  R1  R2  R3  ERROR  BENIGN  RECOVERY
I     √   √   √   0      0       No Recovery
II    ×   √   √   1      0       Load R3 into R1
III   √   ×   √   1      1       No Recovery
IV    √   √   ×   0      1       No Recovery
SOFT ERROR DETECTION AND RECOVERY : Table 1
presents the possible soft error scenarios that the SEM technique
is capable of detecting and recovering from. The table
also lists the corresponding recovery mechanisms used.
Once the data is latched in registers R1, R2 and R3, they
are compared with each other as shown in Figure 1(a) to
produce ERROR and BENIGN signals. This comparison op-
eration completes the voting process required to detect soft
errors. On error detection, a single cycle system stall is all
that is required for complete recovery. Below, we explain
the different possible scenarios and the recovery mechanism
used when an error happens.
• CASE I : No soft error occurs. Data latched in all three
registers are correct. Both ERROR and BENIGN sig-
nals stay low, and no recovery mechanism is triggered.
System operation continues without any interruption.
• CASE II : A soft error corrupts the data latched in
register R1. ERROR signal goes high after the data
is latched in R2. Since the next stage speculatively
uses the data forwarded from R1, re-computation is re-
quired next cycle to ensure functional correctness. The
data stored in registers R2 and R3 are unaffected by the
soft error. During the next cycle, value stored in R2 or
R3 is loaded back into register R1 with the help of the
control signal LBKUP, completing the local recovery
process. Figure 1 (a) shows R3 being loaded into R1.
Global recovery, in the form of a stall signal sent to all
other SEM cells that are unaffected by the soft error, is
initiated and completed in one cycle.
• CASE III : A soft error corrupts the data latched in
register R2. Both the signals, ERROR and BENIGN, go
high once temporal data sampling is completed. This
is a false positive scenario. No recovery is required
as data forwarded to the next stage is correct. System
operation is not interrupted.
• CASE IV : This represents a case where register R3 is
corrupted with a soft error. In this case ERROR signal
stays low, while BENIGN signal is asserted high. No
recovery or interruption is required in this case too,
as BENIGN signal is high.
As can be seen, our scheme does not trigger error recov-
ery for false positive scenarios. Also, since the data latched
in R1 is speculatively used by the succeeding stages, as
soon as it is available, the error detection overhead is not
incurred during normal system operation. This is also a low
overhead solution, as it avoids the need for checkpointing
at regular time intervals. Thus, we enable systems to miti-
gate soft errors, using SEM cells, without any loss of per-
formance, compared to a non fault tolerant design.
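The four cases can be summarized as a small behavioral model. This is a sketch under stated assumptions, not the gate-level design: it assumes that ERROR flags a mismatch between R1 and R2 and that BENIGN flags a mismatch between R2 and R3, a wiring inferred from the case descriptions. The function names are hypothetical.

```python
def sem_signals(r1, r2, r3):
    """Derive (error, benign) from the three temporal samples.

    Assumption: ERROR flags R1 != R2 and BENIGN flags R2 != R3.
    """
    error = int(r1 != r2)
    benign = int(r2 != r3)
    return error, benign


def sem_step(r1, r2, r3):
    """Return (value_used_downstream, stall_cycles) for one cycle."""
    error, benign = sem_signals(r1, r2, r3)
    if error and not benign:
        # Case II: R1 was hit. Reload R1 from R3 (LBKUP) and stall one cycle.
        return r3, 1
    # Cases I, III and IV: the value forwarded from R1 is already correct.
    return r1, 0


good = 0b1010
flipped = good ^ 0b0100  # a single-bit upset in one sample
cases = {
    "I": sem_step(good, good, good),
    "II": sem_step(flipped, good, good),
    "III": sem_step(good, flipped, good),
    "IV": sem_step(good, good, flipped),
}
```

Only Case II costs a stall cycle in this model; in every case the value that ends up being used downstream is the uncorrupted one.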
FAULT TOLERANCE ANALYSIS : The SEM technique
detects and recovers from all possible soft error scenar-
ios involving both SEUs and SETs. This scheme is well
suited for fast transient pulses. Since fast transients typi-
cally correspond to soft errors with high strike rate proba-
bilities, SEM cells have near 100% transient fault mitiga-
tion capability. Our scheme offers protection for pulses of
widths less than the phase shifts provided between the clock
signals. Any noise signal, whose pulse width exceeds this
limit, cannot be detected by our scheme.
3. Soft Error Mitigation in Aggressive Designs
Aggressive designs are based on the philosophy that it
is possible to go beyond worst-case limits to achieve best
performance by not avoiding, but detecting and correcting
a modest number of timing errors. In this section, we further
investigate the solution presented in the previous section
for soft error mitigation, and explain how it can be modified
for soft error mitigation in aggressive designs, which
use reliable overclocking techniques for improving system
performance. With a conventional voter design, to detect
and correct n errors simultaneously, we need to have up to
2n + 1 data samples. In our case, we have n = 2, since we
need to detect and correct both soft and timing errors. For
this analysis, we consider soft errors to be of only type SET.
A traditional fault tolerance technique requires five different
data values for guaranteeing both soft error and timing er-
ror detection and correction. The overhead incurred by this
approach is very high as it increases the number of regis-
ters by four times, and requires five different clocks to sam-
ple data at five different times. Our goal is to develop a
soft and timing error mitigation scheme that incurs minimal
overhead. The proposed Soft and Timing Error Mitigation
(STEM) cell is similar to the SEM cell in area complex-
ity. However, the error detection and recovery mechanism
is significantly different to address the requirements of con-
current soft and timing error mitigation.
ERROR DETECTION : Figure 2 shows a gate-level em-
bodiment of a STEM cell, which acts as an on-line-fault
monitor for soft and timing error mitigation. The working
of a STEM cell is as follows:
Once the data is latched in registers R1 and R2, they are
compared with each other. This comparison operation com-
Figure 2: STEM Cell
safe [7, 16]. But in the presence of soft errors, this compar-
ison operation presents an ambiguous situation, as it is not
possible to distinguish which one of these two registers is
corrupted by an erroneous value. Also, value in R2 is not to
be trusted during the error recovery process.
If the comparison between R1 and R2 flags a mismatch,
register R3 is shielded from the incoming data value, and its
content is used to recover the system state. This is done be-
cause any soft error that happens after comparing R1 and
R2 has the potential to corrupt R3 and push the system
into an unrecoverable state. Only when there is no mismatch
between registers R1 and R2 is register R3 allowed
to latch the data safely. However, we have not yet ascer-
tained whether R3 is free from soft error. Therefore, we
perform another comparison operation to complete the error
detection process. After register R3 is updated, we compare
it with register R2, to detect any error happening in register
R3. If there is no mismatch, register R3 is trusted for error
recovery purposes. If they mismatch, then that represents a
case where register R3 is corrupted by a soft error. At this
point, it is possible to say that data latched in registers R1
and R2 are uncorrupted. The system is stalled for one cycle
for flushing out the erroneous value from R3, and loading
either R1 or R2 value into R3.
TIMING CONSTRAINTS : As is the case with SEM cells,
STEM cells also require strict timing constraints, to detect
and correct soft and timing errors. STEM cells must sat-
isfy Equations (1), (2) and (3). Equation (4) is modified as
shown in Equation (5) for STEM cells. Equations (1) and
(2) ensure that registers present in a STEM cell are not cor-
rupted by the same SET. Equations (3) and (5) ensure that
data latched in registers R2 and R3 are timing correct, i.e.
free from timing errors. The timing relationships shown in
Figure 1(b) still hold, with the caveat that Φ1 also includes
the extent of overclocking that is possible every cycle.
T + Φ1 ≥ TPD (5)
ERROR RECOVERY : Table 2 lists all possible error sce-
narios with corresponding recovery mechanisms. In the ta-
ble, NE represents No Error; SE represents Soft Error and
TE represents Timing Error. In the following discussion,
we explain the various possible events that take place in the
STEM cell, and the associated recovery mechanism that is
used in case of an error. It employs either a single cycle or
three cycle fast local recovery based on the values of ERROR
and PANIC signals, shown in Figure 2.
Table 2: Possible Error Scenarios
CASE R1 R2 R3 ERROR PANIC RECOVERY
I NE NE NE 0 0 No Recovery
II SE NE NE 1 0 Load R3 into R1, R2
III NE SE NE 1 0 Load R3 into R1, R2
IV NE NE SE 0 1 Load R2 into R3
V TE NE NE 1 0 Load R3 into R1, R2
VI TE SE NE 1 0 Load R3 into R1, R2
• CASE I : No error case. Both signals, ERROR and
PANIC, stay low. System operation is not interrupted.
• CASE II, III, V, VI : This represents a case where one
of the registers R1 or R2 is corrupted. In this case,
ERROR = 1 and PANIC = 0. In this scenario R3 is not
updated, and the system recovers by loading R3 into
R1 and R2, triggering re-computation. A three cycle
global recovery process is initiated, which includes:
one cycle stall for loading data back into the registers
R1 and R2, using LBKUP signal, and two cycles for
re-computation. This two cycle re-computation is re-
quired, as the error might have occurred because of
overclocking, and this error will repeat in R1, if suf-
ficient time is not given for re-computation. This pre-
vents recurrent system failures.
• CASE IV : Only R3 is corrupted. In this case,
ERROR = 0 and PANIC = 1. No re-computation is re-
quired. However, it is necessary to flush the erroneous
data from R3, to facilitate error recovery in subsequent
cycles. As data in only R3 is corrupted, “golden” data
present in R2 is loaded into R3. This requires a single
cycle system stall, during which all STEM cells per-
form a local correction, using LPANIC signal.
FAULT TOLERANCE ANALYSIS : As is seen, the STEM
technique detects and recovers from all possible soft and
timing error scenarios, wherein the soft error is only of type
SET. Also, the case where ERROR = 1 and PANIC = 1
never happens by design.
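The detection order described above (compare R1 with R2 first, let R3 update only on a match, then compare R3 with R2) can be captured in a short behavioral model. This is a sketch, not the circuit of Figure 2; the function name, the argument names and the tuple layout are assumptions made for illustration.

```python
def stem_cycle(r1, r2, r3_incoming, r3_golden):
    """Model one STEM cell evaluation.

    r1, r2       : values sampled by R1 and R2 this cycle
    r3_incoming  : value R3 would sample if it is allowed to update
    r3_golden    : value held in R3 from the previous cycle
    Returns (error, panic, r1, r2, r3, stall_cycles) after recovery.
    """
    error = int(r1 != r2)
    if error:
        # Cases II, III, V, VI: R3 is shielded, so roll back to it.
        # One stall cycle for LBKUP plus two cycles for re-computation.
        return 1, 0, r3_golden, r3_golden, r3_golden, 3
    panic = int(r3_incoming != r2)
    if panic:
        # Case IV: only R3 is corrupted; refresh it from R2 (LPANIC).
        return 0, 1, r1, r2, r2, 1
    # Case I: no error, R3 keeps the freshly latched value.
    return 0, 0, r1, r2, r3_incoming, 0


previous, current, corrupted = 7, 9, 13  # hypothetical register values
case_i = stem_cycle(current, current, current, previous)
case_ii = stem_cycle(corrupted, current, current, previous)
case_iv = stem_cycle(current, current, corrupted, previous)
```

Because PANIC is evaluated only when R1 and R2 agree, the combination ERROR = 1 and PANIC = 1 cannot be produced by this model, matching the statement above.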
Our technique leads to silent data corruption, if an SEU
happens in R3. However, since register R3 is only used as
a checkpointing register, a corrupted R3 value may lead to
failure, only if an error occurs in R1 or R2 in the next cycle.
Consequently, the possibility of a system failure because of
an SEU in R3 is heavily mitigated.
For Case VI, we expect that a TE or SE affects several
STEM cells, and the possibility of all cells having a TE in
R1 and SE in R2 is insignificant. Hence, we expect at least one of the
STEM cells will have the error signal triggered, preventing
R3 of all STEM cells from being loaded. If ERROR = 1,
then we do not look at PANIC signal. The fault coverage is
similar to that of the SEM technique, except that in case of
false positives, we still need to take appropriate corrective
action. In case of the SEM scheme, this value will be over-
written, as R3 is used only for error detection. However, the
STEM technique allows reliable overclocking, achieving
higher performance than those systems incorporated with
SEM cells.
4. Pipeline Design
The basic step in using SEM or STEM cells in a pipeline
is to replace all pipeline registers with either one of them.
Input clocks are to be constrained in a way, so as to provide
fault tolerance capabilities to the pipeline from soft error,
as well as, timing error when STEM is the cell of choice.
In this section, our discussion is based on the use of STEM
cells in place of pipeline registers. The use of SEM cells follows
in a straightforward manner.
Figure 3 illustrates how STEM cells are integrated into a
processor pipeline. The figure depicts the data and control
flow for a five-stage pipeline processor. To the last stage
of the pipeline, which is writeback (WB), an extra write
buffer is added. This is to ensure that data written to the
register file or memory is always free from timing errors.
Every pipeline stage register is replaced with STEM cells,
except for the write buffer registers. All error signals from
a pipeline stage are logically OR-ed to generate the stage
error for that pipeline stage. The global error signal, GERROR,
is generated from all pipeline stage error signals by combining
them using another OR function. Similarly, the global
LPANIC signal is generated from the individual PANIC signals
from all STEM cells. Timing errors may occur once the
operating frequency exceeds the worst-case frequency esti-
mate. As explained in the previous section, our data latch-
ing scheme of STEM cell guarantees sufficient time be-
fore latching values in registers R2 and R3. However, data
latched in all three registers are susceptible to soft errors
that are uniformly distributed in time and space.
Here, we explain the pipeline operation for ERROR = 1
and PANIC = 0 (Case II, III, V, VI), as this is the most
complicated case. Once an error is detected in any one of
the pipeline stages, the global error signal is asserted, and in
every stage of the pipeline, registers R3 of the STEM cells
are not updated with the incoming data. In the next clock
cycle, the load backup signal, LBKUP, is asserted, and in
each STEM cell, the content of register R3 is loaded into
corresponding R1 and R2 registers. After this, the clock to
the pipeline is stalled for two cycles, completing the error
recovery process.
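The signal aggregation and the resulting recovery sequence can be sketched as follows. The data layout (a list of stages, each holding one (error, panic) pair per STEM cell) and the function names are assumptions for illustration; the priority of ERROR over PANIC follows the rule that PANIC is not examined when ERROR = 1.

```python
def global_signals(stage_cells):
    """Combine per-cell signals into pipeline-wide signals.

    stage_cells: list of stages, each a list of (error, panic) pairs,
    one pair per STEM cell. Returns (stage_errors, gerror, lpanic).
    """
    stage_errors = [any(e for e, _ in cells) for cells in stage_cells]
    gerror = any(stage_errors)
    # ERROR takes priority: PANIC is ignored whenever an error is flagged.
    lpanic = (not gerror) and any(p for cells in stage_cells for _, p in cells)
    return stage_errors, gerror, lpanic


def recovery_schedule(gerror, lpanic):
    """List the per-cycle actions that follow detection."""
    if gerror:
        return ["assert LBKUP: load R3 into R1 and R2",
                "re-computation, cycle 1 of 2",
                "re-computation, cycle 2 of 2"]
    if lpanic:
        return ["assert LPANIC: load R2 into R3"]
    return []


# Hypothetical three-stage snapshot: one cell in stage 1 flags ERROR,
# one cell in stage 2 flags PANIC.
snapshot = [[(0, 0), (0, 0)], [(1, 0), (0, 0)], [(0, 1)]]
stage_errors, gerror, lpanic = global_signals(snapshot)
plan = recovery_schedule(gerror, lpanic)
```

In this snapshot the error in stage 1 dominates, so the three cycle recovery is scheduled and the PANIC raised in stage 2 is not acted upon.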
ERROR RECOVERY : In the following discussion, we
present the error recovery scheme in detail for a pipeline
using STEM cells. Various events involved in the recovery
process are illustrated with the help of a timing diagram.
Figure 4 shows how our global error recovery scheme res-
Figure 3: Pipeline Design with STEM Cells
picts the timing relationship between various control sig-
nals. As mentioned, the global recovery takes three clock
cycles and the following description explains the events that
Figure 4: Timing Diagrams
Figure 4 shows a set of clock signals, CLK1G, CLK2G
and CLK3G, that are generated from the main clock signal,
CLK, using an LCM. Next, it shows a set of clock signals,
CLK1P, CLK2P and CLK3P, that are routed to the pipeline.
These clock signals, which are gated versions of CLK1G,
CLK2G and CLK3G respectively, are stalled in a manner
that enables the pipeline to recover from different error scenarios.
Signal ERRORN indicates an error happening in
the pipeline stage N. Error signals from all the pipeline
stages are OR-ed together to generate the global error sig-
nal, GERROR, which is latched in the clock control unit.
Once an error is detected, the very next clock edge of clock
signal CLK3G is gated and in the next cycle, LBKUP sig-
nal is asserted high for one clock cycle. In the same clock
cycle, using CLK1P and CLK2P, recovery data from register
R3 is loaded back into registers R1 and R2. During the
next cycle, all clock signals, CLK1G, CLK2G and CLK3G,
are clock gated to give the pipeline sufficient time for re-
computation. Clock gating is achieved through control sig-
nals CLKSTALL12 and CLKSTALL3, which are generated
by the clock control unit.
To illustrate our error recovery mechanism, an error
occurrence is highlighted in Cycle 3. The error occurs during
the execution of INST 1 in pipeline stage N. This event
triggers the error recovery mechanism, which spans Cycle 4 to
Cycle 6. During Cycle 4, data is loaded into registers R1 and
R2 from the golden register R3 of the corresponding stage. The
pipeline is allowed to perform the computation during Cycles 5
and 6. Results are checked again at the end of Cycle 6. Since
no error is detected in this cycle, normal pipeline operation
resumes. From the waveforms, we can see that on an error
detection, the entire pipeline goes back by one instruction.
Similar recovery actions are performed for a panic situation,
which involves stalling the clock signals CLK1G and CLK2G for
just one cycle. In this case, the pipeline does not roll back
and only the R3 register of the corresponding stage is updated.
DYNAMIC FREQUENCY SCALING : In the following
discussion, we derive the limits of frequency scaling within
which a system integrated with STEM cells operates reliably.
The pipeline starts execution with the minimal phase shift
required between the clocks, and the clock frequency is
gradually increased while satisfying the error rate constraint.
To support reliable dynamic overclocking, certain governing
conditions need to be met at all times during pipeline
operation. Let us assume that the pipeline operates reliably
between the clock frequencies FMIN and FMAX, which correspond
to the time periods TMAX and TMIN respectively. TMAX is
determined by the worst-case design settings, and is equal
to the worst-case clock period, TWC. The following clocking
constraints decide TMIN. Under overclocking conditions,
these constraints must be satisfied for proper error
detection and recovery.
Let D1 represent the phase shift that needs to be pro-
vided for CLK2, with respect to CLK1, for soft and timing
error mitigation, when the system is clocked with clock pe-
riod TMIN . Let D2 represent the phase shift that needs
to be provided for CLK3, with respect to CLK1, for proper
error recovery, when the system is clocked with clock pe-
riod TMIN . Value of TMIN , satisfying Equation (6), cor-
responds to the maximum frequency at which a system can























Figure 5: Dynamic Frequency Scaling
$T_{MIN} + D_1 \geq T_{PD}$ (6)
$D_2 - D_1 \geq T_{PW}$ (7)
$T_{CD} \geq D_2$ (8)
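The three clocking constraints can be checked mechanically. The following is a minimal sketch, not the authors' tooling. It assumes that TPD is the worst-case propagation delay, TPW is the widest transient pulse to be tolerated, and TCD is the contamination delay. The class name, method names, and numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StemTiming:
    """Delays in ns. Field names mirror the symbols in Equations (6)-(8)."""
    t_pd: float  # assumed: worst-case propagation delay of the logic
    t_pw: float  # assumed: widest transient pulse to be tolerated
    t_cd: float  # contamination (shortest-path) delay of the logic
    d1: float    # phase shift of CLK2 relative to CLK1
    d2: float    # phase shift of CLK3 relative to CLK1

    def min_period(self) -> float:
        """Smallest clock period allowed by Equation (6)."""
        return self.t_pd - self.d1

    def is_reliable(self, period: float) -> bool:
        """True if all three clocking constraints hold at this period."""
        return (
            period + self.d1 >= self.t_pd       # Equation (6)
            and self.d2 - self.d1 >= self.t_pw  # Equation (7)
            and self.t_cd >= self.d2            # Equation (8)
        )


# Hypothetical values, chosen only for illustration.
timing = StemTiming(t_pd=9.0, t_pw=0.9, t_cd=3.0, d1=2.0, d2=3.0)
print(timing.min_period())      # 7.0
print(timing.is_reliable(7.0))  # True
print(timing.is_reliable(6.5))  # False
```

Under these assumed values, Equation (6) sets the shortest usable period, while Equations (7) and (8) bound the phase shifts independently of the period.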
CLOCK CONTROL : Clock control monitors the error
rate of the pipeline and communicates this information
to the clock generator for frequency tuning. The clock
generator is connected to the pipeline in a feedback loop. It
compares the pipeline error rate against a programmable
target rate. Generating a new frequency takes up to 10 µs,
depending on how quickly the phase-locked loop (PLL)
produces a new stable clock signal. Depending on the clock
control scheme and error rate sampling scheme chosen, the
clock frequency is adjusted so that the pipeline operates
below the target error rate. Once the new clock is generated,
the main clock signal, CLK, is switched to that frequency,
and the other clock signals, CLK1G, CLK2G and CLK3G, are
generated by applying the necessary phase shifts to CLK.
4.1. Performance Analysis
A key factor that limits frequency scaling is error rate.
As frequency is scaled higher, the number of input com-
binations that result in delays greater than the new clock
period also increases. The impact of error rate on frequency
scaling is analyzed as follows:
Let twc denote the worst-case clock period. Let tov de-
note the clock period after overclocking the circuit. Let n be
the number of cycles needed to recover from an error. Let
us assume that a particular application takes N clock cycles
to execute, under normal conditions. Let tdiff be the time
difference between the original clock period and the new
clock period. Then the total execution time is reduced by
tdiff ×N , if there is no error. Let us assume that the appli-
cation runs at the overclocked frequency of period tov with
an error rate of k%. To achieve any performance improve-
ment at this frequency, Equation (9) must be satisfied. It
states that even after accounting for error recovery penalty,
execution time required is still less than that required for
worst-case frequency operation.





For the STEM technique, an error can happen in five
different scenarios, as mentioned in Table 2, and the error
recovery penalty is not the same in all cases. If we assume
that all these error scenarios are equally likely, then the
average error penalty in cycles is n = (4 × 3 + 1 × 1)/5 = 2.6.
According to Equation (10), for a frequency increase
of 15%, the error rate must not exceed 5.76% for the STEM
technique to yield a performance improvement. For error
rates less than 1%, a frequency increase of 2.6% is enough
for the STEM scheme to have a performance improvement
over non-fault-tolerant designs.
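The two figures quoted above can be reproduced with a short calculation. This sketch assumes the break-even condition has the form N × tov × (1 + k × n) < N × twc, where k is the error rate as a fraction of cycles. That form is inferred from the definitions in this section and is an assumption, not a quotation of the equation. The function names are hypothetical.

```python
def average_penalty(penalties_in_cycles):
    """Mean recovery penalty when all error scenarios are equally likely."""
    return sum(penalties_in_cycles) / len(penalties_in_cycles)


def break_even_error_rate(freq_increase, n):
    """Error rate (fraction of cycles) at which overclocking gains vanish.

    Assumed model: N * t_ov * (1 + k * n) < N * t_wc, with
    t_wc / t_ov = 1 + freq_increase, so k must stay below freq_increase / n.
    """
    return freq_increase / n


def break_even_freq_increase(error_rate, n):
    """Smallest fractional frequency increase that pays for the recoveries."""
    return error_rate * n


# Four scenarios cost 3 cycles and one costs 1 cycle, as in the text.
n = average_penalty([3, 3, 3, 3, 1])
print(n)                                  # 2.6
print(break_even_error_rate(0.15, n))     # about 0.0577
print(break_even_freq_increase(0.01, n))  # about 0.026
```

Under this model, a 15% frequency increase breaks even at an error rate just under 5.77%, and a 1% error rate is paid for by a 2.6% frequency increase, in line with the values stated in the text.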
OVERHEADS : One of the main overheads incurred by
our schemes is fixing the circuit contamination delay to
a required value. Increasing this delay involves a rapid
increase in silicon area, as buffers need to be inserted in
the short-delay paths of the circuit. This problem has to be
addressed from different design perspectives, including the
development of new synthesis algorithms and delay buffer
designs with minimal area consumption. Both SEM and STEM
cells require metastability mitigation circuits, as
flip-flops may enter a metastable state when overclocked, or
when a soft error reaches the registers during the latching
window. We envisage the incorporation of a metastability
detection circuit, similar to the one developed in [7].
5. Local Clock Generation
A reliable dynamic overclocking technique was proposed
earlier, in [16], to improve system performance by tuning
the clock frequency beyond the limit set by the conservative
worst-case clock period. It requires a dynamic phase shift
(DPS) between the clock signals to support aggressive dynamic
clock tuning. At higher frequencies, controlling the phase
shift precisely is a challenge, and this often restricts the
possible operating frequency configurations. To avoid a
dynamic phase shift between the clock signals, we incorporate
a constant phase shift (CPS) between the clocks, which are
configured to run between the frequencies corresponding to
the time periods TMAX and TMIN.
Let us consider a case where TMAX = 10ns, TMIN =
6ns and TCD = 4ns. With a dynamic phase shift between the
clock signals, scaling the system clock period down to 8ns
requires a phase shift of 2ns. Similarly, a phase shift of
3ns is required for a 7ns clock period. Since the circuit
contamination delay is increased to 4ns to aggressively
clock the system, computed data will remain stable for
(T + 4)ns, where T is the current clock period of the
system. Instead of requiring a dynamic phase shift along
with frequency scaling, we provide a constant phase shift
of at most TCD at all times.
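The example can be tabulated with a small sketch. It assumes that the dynamic phase shift equals the gap between the worst-case period and the current period, which matches the 2ns and 3ns values above. It also assumes that a constant shift equal to TCD suffices whenever the period plus the shift reaches TMAX, which is inferred from Equation (6). Both function names are hypothetical.

```python
def dynamic_phase_shift(t_max_ns, period_ns):
    """Phase shift needed at a given period under the DPS scheme.

    Matches the worked numbers in the text: the shift makes up the
    difference between the worst-case period and the current period.
    """
    return t_max_ns - period_ns


def cps_supports(period_ns, t_max_ns, t_cd_ns):
    """True if a constant shift of t_cd_ns covers this clock period."""
    return period_ns + t_cd_ns >= t_max_ns


T_MAX, T_MIN, T_CD = 10, 6, 4  # ns, from the example above

for period in (10, 8, 7, 6):
    print(period, dynamic_phase_shift(T_MAX, period),
          cps_supports(period, T_MAX, T_CD))
```

With these numbers, the single 4ns shift covers every period from TMAX down to TMIN, so no per-frequency adjustment of the phase shift is needed.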
Processor pipelines occupy only a specific portion of the
chip area. Local clock managers (LCMs), as shown in Figure 6,
are placed only in the segments of the chip where the
processor pipeline is present. Employing CPS, delay values
D1 and D2 are set to constant values to satisfy the timing
constraints explained earlier in Section 4. This approach
reduces the clocking resources required for the SEM and STEM
schemes, and also increases the number of possible operating
frequencies available for a given system. With the CPS
scheme, the clock signals required are derived from a single
clock distribution network. Figure 6 shows















Figure 6: Local clock generation with single clock routing
CASE STUDY : Local Clock Generation using buffers.
To generate the clock signals required by the SEM and STEM
schemes locally, we present a possible implementation using
buffers. We perform this study using 45nm SPICE models
distributed by Nangate Technologies [12]. Post-layout SPICE
models containing parasitic information are used. The area
overhead incurred for generating constant phase shift clocks
is analyzed by applying a load of 128 STEM cells. From this
study, we observe that, even for a 2.5ns phase shift, only
14 clock buffers are needed. This overhead is much lower
than that of having second and third clock tree networks.
Study results are summarized in Table 3.
Table 3: Number of buffers vs Delay
DELAY(ns)   BUFFERS
1.0         6
1.5         8
2.0         11
2.5         14
6. Experiments & Results
In this section, we present our results based on the ex-
periments conducted on a two stage arithmetic pipeline
and a five stage DLX in-order pipeline processor, wherein
pipeline registers are augmented with our fault detection
and correction circuitry.
EXPERIMENTAL METHODOLOGY : To estimate the
performance gains and fault tolerant capabilities offered by
SEM and STEM techniques, simulations are carried out on
a two stage arithmetic pipeline. This circuit performs a 64-
bit addition in the first stage and a 32-bit multiplication in
the second stage. Adder output is fed to the multiplier as
multiplicand and multiplier. RTL level models are devel-
oped for both the circuits, and are synthesized using the
45nm OSU standard cell library [15]. Timing-annotated
gate level simulations are then carried out by extracting tim-
ing information in standard delay format (SDF), and back









Figure 7: Fault Injector Framework
Figure 7 illustrates our fault injection methodology. The
fault injector works as follows. A total of 2^N (N being 7
in our experiments) fault injection test nodes that are
spread uniformly across the area of the logic circuit are
selected. To make sure that our injected fault has indeed
produced a SET, we modified the circuit netlist to insert
XOR gates at all selected nodes, as shown in Figure 7. If a
location i is chosen for fault injection, Inject_i is made
high to invert the signal A driven by the fault injection
node i. Out of the 2^N locations, one location is chosen
randomly for fault injection at a time, using the output of
an N-bit random number generator. For our experiments, we
used a linear feedback shift register (LFSR) to generate the
N-bit random number. The final fault location is then
selected with the help of an N:2^N decoder.
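The selection path described above (LFSR, decoder, XOR gates) can be modelled in software. The following is a minimal behavioural sketch under stated assumptions. The LFSR feedback taps and the seed are not given in the text and are chosen here for illustration, and all function names are hypothetical.

```python
N = 7  # number of selector bits; 2**N candidate fault locations


def lfsr_step(state):
    """Advance a 7-bit Fibonacci LFSR by one step.

    The feedback taps are an assumption (the source does not give them);
    this choice cycles through all 127 non-zero states.
    """
    feedback = (state ^ (state >> 1)) & 1
    return (state >> 1) | (feedback << (N - 1))


def decode(index):
    """N:2**N decoder: one-hot Inject vector with bit `index` set."""
    return [1 if i == index else 0 for i in range(2 ** N)]


def inject(node_values, inject_lines):
    """Model the XOR gates: node i is inverted when Inject_i is high."""
    return [a ^ inj for a, inj in zip(node_values, inject_lines)]


state = 0b1010101           # arbitrary non-zero seed
state = lfsr_step(state)
nodes = [0] * (2 ** N)      # fault-free values at the test nodes
faulty = inject(nodes, decode(state))
print(state, sum(faulty))   # exactly one node is flipped
```

One property of this model is that an LFSR never reaches the all-zero state, so location 0 would never be selected here. How the actual experimental setup treats that location is not stated in the text.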
RESULTS FOR ARITHMETIC PIPELINE : For the arith-
metic pipeline, from static timing analysis reports, we es-
timated the value of TMAX to be 9ns. For aggressively
clocking the design, we increased the contamination delay
to 3ns. Area of the circuit is increased by 38%, for fixing
the contamination delay to 3ns. Pulses of varying widths
ranging from 500ps to 900ps are injected in the unit un-
der test (UUT). Each cycle, results are checked for correct-
ness after the computation is over to ensure that the recov-
ery mechanism works. Whenever recovery is triggered, we
logged the occurrence of an error.
For evaluating STEM technique, we performed our ex-
periments for a set error rate target of 1% over 10000 cycles.
During run time, the number of errors that happened during
a sampling interval is communicated to the clock control-
ling unit at the end of each interval. The clock controlling
unit makes a decision based on the error rate, during the
previous sampling interval, and the set target error rate. We
considered a linear control scheme for switching clock fre-
quency between the worst-case clock frequency, FMIN and
the overclocked frequency, FMAX . For our design, TMIN
is set at 7ns. This range is divided into 32 steps, and if the
Table 4: Fault Injection Results for Arithmetic Pipeline
(Injected and Detected columns count transient faults)
      STEM(MAXOC)               STEM(DYNOC)               STEM(NOOC)            SEM                   TMR
      TE  Injected  Detected    TE  Injected  Detected    Injected  Detected    Injected  Detected    Injected  Detected
RUN1  14  2031      432         14  2033      421         2030      325         2031      334         2026      256
RUN2  14  2031      450         12  2033      414         2025      315         2028      323         2026      268
RUN3  14  2031      449         15  2032      424         2025      307         2030      311         2034      273
error rate is less than 1%, clock frequency is increased by
one step size, otherwise it is decreased. Our fault injection
results for the arithmetic pipeline are presented in Table 4.
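The linear control scheme described above can be sketched as a small controller. This is an illustrative model, not the authors' clock control unit. It assumes the 32 steps divide the frequency range evenly (rather than the period range) and that one update is made per sampling interval. The class name, field names, and the error counts in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LinearClockController:
    t_max_ns: float = 9.0             # worst-case clock period
    t_min_ns: float = 7.0             # shortest period allowed
    steps: int = 32                   # number of frequency steps in the range
    target_error_rate: float = 0.01   # 1% target from the text
    level: int = 0                    # 0 = worst-case frequency, steps = FMAX

    def frequency_ghz(self) -> float:
        """Current clock frequency for the selected step."""
        f_min = 1.0 / self.t_max_ns
        f_max = 1.0 / self.t_min_ns
        return f_min + (f_max - f_min) * self.level / self.steps

    def update(self, errors: int, cycles: int) -> float:
        """Step up if the last interval met the target, otherwise step down."""
        if errors / cycles < self.target_error_rate:
            self.level = min(self.level + 1, self.steps)
        else:
            self.level = max(self.level - 1, 0)
        return self.frequency_ghz()


ctrl = LinearClockController()
for errors in (0, 12, 40, 150, 30):   # hypothetical error counts per interval
    f = ctrl.update(errors, cycles=10_000)
    print(f"{errors:4d} errors -> level {ctrl.level}, period {1.0 / f:.3f} ns")
```

The clamping at both ends keeps the clock between the worst-case frequency and the overclocked limit, so a long error-free run cannot push the period below TMIN.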
We initialized the LFSR with different seeds, and the fault
























Figure 8: Normalized Arithmetic Pipeline Execution time
We configured the arithmetic pipeline designed with
STEM cells to operate in three different modes. They are no
overclocking (NOOC), wherein TMAX = TMIN = 9ns,
maximum overclocking (MAXOC), wherein TMAX =
TMIN = 7ns, and dynamic overclocking (DYNOC),
wherein TMAX = 9ns and TMIN = 7ns. For DYNOC
mode, we started with a low frequency setting. For the TMR
system, the worst-case clock period, TMAX, is set at 11ns. We
evaluate the SEM scheme at a constant clock period of 9ns.
Performance improvements offered by both SEM scheme
and different modes of STEM are shown in Figure 8. From
this, we can see that DYNOC mode offers 49% improve-
ment over TMR, while MAXOC mode offers 55% improve-
ment. Performance of NOOC mode is comparable to that
of SEM and SEM offers 23% performance improvement
over TMR. From Table 4, we can see that the fault masking
rate is higher in the TMR design than in the SEM and STEM
designs. This is because its clock period includes the phase
shifts of the clocks and the voter delay. Hence, TMR operates
with a longer clock period than SEM and STEM, resulting in
more SET pulses attenuating before reaching the latching
window.
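The masking comparison can be read directly from the Table 4 counts. The sketch below treats every injected fault that was not detected as masked. This is an assumption made for illustration, not a definition given in the text, and the function name is hypothetical.

```python
# (injected, detected) transient faults per run, copied from Table 4.
table4 = {
    "STEM(MAXOC)": [(2031, 432), (2031, 450), (2031, 449)],
    "STEM(DYNOC)": [(2033, 421), (2033, 414), (2032, 424)],
    "STEM(NOOC)": [(2030, 325), (2025, 315), (2025, 307)],
    "SEM": [(2031, 334), (2028, 323), (2030, 311)],
    "TMR": [(2026, 256), (2026, 268), (2034, 273)],
}


def undetected_fraction(runs):
    """Share of injected faults that never triggered detection."""
    injected = sum(i for i, _ in runs)
    detected = sum(d for _, d in runs)
    return 1 - detected / injected


for design, runs in table4.items():
    print(f"{design:12s} {undetected_fraction(runs):.3f}")
```

Under this reading, TMR has the largest undetected share of the five configurations, and the share shrinks as the clock period gets shorter.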
RESULTS FOR DLX PROCESSOR : We also simulated
three different micro benchmarks to evaluate the perfor-
mance improvement and fault coverage of both SEM and
STEM (DYNOC mode) schemes on a five stage in-order
pipelined processor. This processor, implemented in 45nm
technology, is based on the DLX instruction set
architecture. The first application, RandGen, performs a
simple random number generation that produces a number
between 0 and 255. The MatrixMult application multiplies two
50x50 integer matrices, and the BubbleSort program implements
the bubble sort algorithm on 5,000 half-word variables. Here,
we followed the same fault injection strategy and clock
control used for the two-stage arithmetic pipeline. For each benchmark,
processor state is checked to verify the correctness of the
computed results after simulation. From timing reports, the
worst-case clock period, TMAX , is estimated as 6ns. Con-
tamination delay is increased by 2ns and the system oper-
ates at an optimal clock period of 4ns. Area overhead in-
curred is less than 15% for the processor because significant
area consumption of the system comes from the memory
system. The results for the three different benchmarks are
presented in Figure 9, showing relative execution times for
conventional TMR, SEM and STEM schemes. From this,
we found that SEM offers 26.58% performance improve-























Figure 9: DLX Execution time for various benchmarks
7. Related Work
In the past, many hardware fault tolerance architectures
have been developed by the research community. These
schemes incur performance overhead even during error free
operation and do not support aggressive clocking. LEON-
FT processor [8] uses TMR approach and triplicates every
flip flop in the processor and incurs a 100% area overhead.
Redundant multi-threading based schemes exploit instruc-
tion level parallelism to provide fault tolerance [19]. These
approaches trade performance and power for achieving fault
tolerance capabilities. Systems designed with SEM cells
improve reliability and do not incur any performance loss
during normal operation.
Brute-force overclocking does not guarantee reliable ex-
ecution. TEATIME [18] adjusts the system frequency dy-
namically, based on process and environmental variations,
by employing timing error avoidance techniques. System
performance can be enhanced further, by allowing a sys-
tem to operate at a frequency that allows timing errors to
happen. At such overclocked frequencies, relevant tim-
ing error detection and correction schemes can be used to
guarantee functional correctness and to avoid any abnormal
execution in the system. Prior works such as Razor [7] and
SPRIT3E [16] employ timing error tolerance techniques
to operate beyond worst-case limits. While Razor focuses
on achieving lower energy consumption by reducing supply
voltage in each pipeline stage, SPRIT3E improves perfor-
mance of a superscalar processor by reliably overclocking
the pipeline. Other closely related works are Paceline [9]
and CPipe [17]. Paceline employs leader-checker config-
uration in a chip multiprocessor system and tolerates both
timing and soft errors. CPipe architecture enables reliable
overclocking and enhances system reliability through core
replication and conjoining them. Systems designed with
STEM cells improve the reliability and performance of the
system without logic duplication. Table 5 summarizes how
our schemes differ from previously proposed techniques.
Table 5: Comparing with other schemes in terms of Logic Du-
plication (LD), Soft Error Protection (SEP), Aggressive Clock-
ing (AC) and Energy Savings (ES)
DESIGN LD SEP AC ES


















In this work, we developed two efficient soft error mit-
igation schemes that remove the error detection overhead
from the circuit critical path. One of our schemes allows
overclocking and is capable of tolerating timing errors as
well. These specialized register cells provide near 100%
fault tolerance against transient faults. Our schemes toler-
ate fast transient noise pulses, which is the principal charac-
teristic of SETs. Both our schemes have no significant per-
formance overhead during error-free operation. SEM cells
are capable of ignoring false positives. One of the salient
features of our approach lies in the capability to trigger re-
covery immediately on error detection, without requiring
any checkpointing. Another key feature is that our scheme
generates clocks locally with constant phase shift values,
increasing the possible frequency settings for aggressively
clocked designs. Also, our local clock generation and dis-
tribution minimizes the clock routing overhead incurred. In
the future, we will implement our fault mitigation schemes
in complex pipelined systems, and evaluate the fault cover-
age and performance for more representative benchmarks.
References
[1] M. Alam. Reliability-and process-variation aware design of
integrated circuits. Microelectronics Reliability, 2008.
[2] L. Anghel and M. Nicolaidis. Cost reduction and evaluation
of a temporary faults detecting technique. In Proceedings
of the conference on Design, automation and test in Europe,
pages 591–598. Springer, 2000.
[3] T. Austin, V. Bertacco, D. Blaauw, and T. Mudge. Oppor-
tunities and challenges for better than worst-case design. In
ASP-DAC, volume 1, pages 2–7, January 2005.
[4] S. Borkar et al. Parameter variations and impact on cir-
cuits and microarchitecture. In DAC ’03: Proceedings of
the 40th conference on Design automation, pages 338–342,
New York, NY, USA, 2003. ACM.
[5] K. Bowman et al. Impact of die-to-die and within-die param-
eter variations on the throughput distribution of multi-core
processors. In Proceedings of the 2007 international sym-
posium on Low power electronics and design, pages 50–55.
ACM New York, NY, USA, 2007.
[6] B. Colwell. The zen of overclocking. IEEE Computer,
37(3):9–12, March 2004.
[7] D. Ernst et al. Razor: A low-power pipeline based on circuit-
level timing speculation. In IEEE Micro, pages 7–18, 2003.
[8] J. Gaisler. A portable and fault-tolerant microprocessor
based on the SPARC V8 architecture. pages 409–415, 2002.
[9] B. Greskamp and J. Torrellas. Paceline: Improving
single-thread performance in nanoscale CMPs through core
overclocking. In PACT, pages 213–224, September 2007.
[10] D. Mavis and P. Eaton. Soft error rate mitigation techniques
for modern microcircuits. Reliability Physics Symposium
Proceedings, 2002. 40th Annual, pages 216–225, 2002.
[11] A. Meixner, M. E. Bauer, and D. Sorin. Argus: Low-cost,
comprehensive error detection in simple cores. In MICRO
’07, pages 210–222. IEEE Computer Society, 2007.
[12] Nangate. http://www.nangate.com.
[13] B. Narasimham et al. Characterization of digital single event
transient pulse-widths in 130-nm and 90-nm CMOS
technologies. Nuclear Science, IEEE Trans. on, 54(6):2506–2511,
Dec. 2007.
[14] P. Shivakumar et al. Modeling the effect of technology
trends on the soft error rate of combinational logic. In DSN,
pages 389–398, June 2002.
[15] J. Stine et al. FreePDK: An Open-Source Variation-Aware
Design Kit. In Proc. of the 2007 IEEE Intl Conference on
Microelectronic Systems Education, pages 173–174, 2007.
[16] V. Subramanian, M. Bezdek, N. D. Avirneni, and A. Somani.
Superscalar processor performance enhancement through
reliable dynamic clock frequency tuning. In DSN, pages
196–205, June 2007.
[17] V. Subramanian and A. Somani. Conjoined pipeline:
Enhancing hardware reliability and performance through
organized pipeline redundancy. In PRDC, Dec 2008.
[18] A. K. Uht. Uniprocessor performance enhancement through
adaptive clock frequency control. IEEE Transactions on
Computers, 54(2):132–140, February 2005.
[19] T. Vijaykumar, I. Pomeranz, and K. Cheng. Transient-fault
recovery using simultaneous multithreading. pages 87–98,
2002.
