Error Estimation and Error Reduction with Input-Vector Profiling for Timing Speculation in Digital Circuits by Wang, Xiaowen
Error Estimation and Error Reduction with Input-Vector Profiling for Timing Speculation 
in Digital Circuits 
 
By 
 
Xiaowen Wang 
 
Dissertation 
Submitted to the Faculty of the 
Graduate School of Vanderbilt University 
in partial fulfillment of the requirements  
for the degree of 
 
DOCTOR OF PHILOSOPHY 
in 
 
Electrical Engineering 
May 10, 2019 
Nashville, Tennessee 
 
Approved: 
William H. Robinson, Ph.D. 
Bharat Bhuva, Ph.D. 
Daniel Loveless, Ph.D. 
Marcus H. Mendenhall, Ph.D. 
Aniruddha Gokhale, Ph.D. 
 
 
ACKNOWLEDGEMENTS  
 
 This has been a long journey. I am glad that I am finally here after all the difficulties and obstacles 
of these years. For this important achievement in my life, the first and foremost person 
I would like to thank is my dear advisor, Dr. William H. Robinson, for his continuous 
guidance and support, not only in academics but also in my life. He always encouraged 
me to follow my curiosity and was there when I needed help. I also want to thank all of my 
committee members, Dr. Bhuva, Dr. Mendenhall, Dr. Loveless, and Dr. Gokhale, for their 
insightful comments. I thank Vanderbilt University and the National Science Foundation for 
providing resources and financial support. 
 Along this journey, I am so grateful to my husband, Zhengyu, for his heartfelt 
encouragement during my down times. Without his love and full support, I am not sure 
I would have had the strength to make it to this point.  
 Lastly, thanks to my parents for always believing in me; you guys are awesome! 
 To my son, Marvin, and my grandparents, I love you! 
  
TABLE OF CONTENTS 
Page 
ACKNOWLEDGEMENTS ................................................................................................... ii 
LIST OF FIGURES ................................................................................................................ v 
LIST OF TABLES ................................................................................................................ vi 
CHAPTER 
I. INTRODUCTION ...................................................................................................... 1 
Better-Than-Worst-Case design .............................................................................. 1 
Variations that impact timing .................................................................................. 4 
Motivation ............................................................................................................... 6 
Research contributions ............................................................................................ 7 
 
II.  BACKGROUND AND RELATED WORKS ........................................................... 9 
Timing analysis ....................................................................................................... 9 
Path activation probability analysis ....................................................................... 12 
Timing speculation methodologies and error resilience in BTWC design ........... 15 
Evaluation Methods for BTWC designs ............................................................... 24 
Multi-threshold technology in VLSI designs ........................................................ 27 
Timing speculation vs. Instruction speculation ..................................................... 29 
 
III. RESEARCH METHODOLOGY AND BTWC DESIGN FLOW ........................... 30 
General EDA design flow and the customized design flow in this work ............. 30 
Value change dump files ....................................................................................... 33 
IV.  ALL-CLOCK-FREQUENCY ERROR-ESTIMATION .......................................... 36 
Obtaining outputs settling behavior ...................................................................... 36 
Categories of primary outputs ............................................................................... 39 
Error count estimation and error rate calculation .................................................. 43 
Error estimation results discussion ........................................................................ 47 
Conclusion ............................................................................................................. 53 
V.  OFF-LINE ERROR-CHECKING METHOD .......................................................... 55 
General error checking and off-line error checking methodology comparison .... 55 
Reformatting .vcd file for off-line error-checking  ............................................... 58 
Implementing the off-line error-checker ............................................................... 60 
Error estimation vs. error checking results comparison  ....................................... 61 
Conclusion ............................................................................................................. 63 
VI. DUAL-THRESHOLD VOLTAGE APPROACH FOR TIMING ERROR 
REDUCTION ....................................................................................................................... 64 
Dual-threshold voltage approach for re-timing ..................................................... 64 
Identification of critical cells ................................................................................. 65 
Error reduction results comparison and discussion ............................................... 68 
Conclusion ............................................................................................................. 73 
 
VII. SUMMARY AND FUTURE WORKS .................................................................... 74 
APPENDIX .......................................................................................................................... 76 
REFERENCES	 .................................................................................................................... 84	
  
LIST OF TABLES 
Page 
Table 1: Summary of several EDAC methodologies ........................................................... 24 
Table 2: Overview of the circuits used in the analysis ......................................................... 30 
Table 3: Comparison of C1908 static delay, switching activity rate and active cycles rate.  
Output N2899 has longest delay and Output N2891 is the greatest error-contributor. 41 
Table 4: Benchmark circuit static propagation delay of all error-possible outputs .............. 48 
Table 5: Comparison of error composition from critical path and greatest error-contributing 
PO other than the critical path. ..................................................................................... 52 
Table 6: Total error numbers and the Dynamic Replacement improvement ....................... 72 
Table 7: Low-Vt cell usage comparison between Full Path Replacement (FPR) and Selected 
Cells Replacement (SCR) ............................................................................................. 73 
Table 8: Leakage power (µW) comparison of baseline, Full Path replacement (FPR) and 
Selected Cells Replacement (SCR) .............................................................................. 74 
 
LIST OF FIGURES 
Page 
Figure 1: Block diagram of BTWC design general structure ................................................. 2	
Figure 2: An example of critical path delay distributions: (a) Before the variations (b) After 
the variations [7]. ............................................................................................................ 3	
Figure 3: An example to illustrate the relationship of the timing error probability with 
circuit performance (a) Timing error probability versus clock frequency, (b) Circuit 
performance versus clock frequency [8]. ....................................................................... 4	
Figure 4: Dynamic behavior curve of two paths with the same static delay time. [17] ....... 13 	
Figure 5: Block diagram of Razor logic. [22] ...................................................................... 17	
Figure 6: The circuit-level schematic of the shadow latch used in Figure 5. [22] ............... 17	
Figure 7: The pipeline recovery using global clock gating. (a) The pipeline structure. (b) 
The pipeline operation timing. [28] .............................................................................. 18	
Figure 8: The pipeline recovery using counterflow pipelining. (a) The pipeline structure. (b) 
The pipeline operation timing.[28] ............................................................................... 18	
Figure 9: Circuit-level schematic of Razor II flip-flop. (a) Flip-flop schematic. (b) 
Transition detector schematic. (c) Detection clock generator. [29] ............................. 19	
Figure 10: Different ways to implement Razor flip-flop to detect timing errors. [32] ........ 20	
Figure 11: Schematic of TIMBER flip-flop. (a) Main flip-flop part. (b) Clock signal control 
and generating part.[33] ............................................................................................... 21	
Figure 12: TIMBER latch schematic. (a) Main latch part. (b) Clock signal control and 
generating part.[33] ...................................................................................................... 22	
Figure 13: The customized EDA flow with Synopsys tools. ............................................... 31	
Figure 14: The proposed design flow chart. ......................................................................... 33	
Figure 15: An example of value change dump file. (a) is the header part, and (b) is the body 
part. ............................................................................................................................... 34	
Figure 16: Algorithm flow chart of switching time stamp extraction for a specific node. .. 38	
Figure 17: The total active cycles out of 1 million cycles of error-possible outputs. 
Benchmark circuits (a) C432, (b) C880, (c) C1908 and (d) C6288. ............................ 40	
Figure 18: Benchmark C1908 outputs with the average active cycles out of 10,000 cycles 
(using 100 simulation trials), and the total error counts for 1 million cycles at clock 
period of 1.7 ns. ............................................................................................................ 42	
Figure 19: Benchmark C1908 settling time histogram of each error-possible PO. The x-axis 
is the settling time in picoseconds, and the y-axis is the accumulated number within 
each bin. ........................................................................................................................ 44	
Figure 20: Benchmark C1908 stabilization probability of each error-possible PO. .............. 45
Figure 21: Benchmark C1908 settling time histogram and stabilization probability density 
function of outputs N2891, N2911, N2892. ................................................................. 46	
Figure 22: Outputs settling time histogram with 1 million random input vectors of four 
benchmark circuits C432, C880, C1908 and C6288. ................................................... 49	
Figure 23: Zoomed settling time histogram from 70% of original clock to the error-free 
clock of four benchmark circuits .................................................................................. 50	
Figure 24: Estimated error count of each error possible outputs of (a) C432, (b) C880, (c) 
C1908, and (d) C6288, from 70% of original clock period to the error-free clock 
period. ........................................................................................................................... 51	
Figure 25: The stabilization probability of four tested benchmark circuits for the given 
workload from start to end of the clock period. ........................................................... 54	
Figure 26: The general structure of (a) transition detection method .................................... 56	
Figure 27: The general structure of (b) duplication module/path method ........................... 56	
Figure 28: The general structure of (c) proposed off-line error checking method. .............. 57	
Figure 29: Algorithm flow chart of data preparation script. ................................................ 59	
Figure 30: The example of partial activity file extracted from .vcd file. ............................. 60	
Figure 31: The flow chart of the Extract and Compare script’s algorithm. ......................... 61	
Figure 32: The comparison between simulated results and total error estimation trends of 
four tested benchmark circuits. .................................................................................... 63	
Figure 33: A partial circuitry to differentiate three types of critical cells that are going to 
be replaced in this work. ............................................................................................... 67	
Figure 34: Benchmark C1908 critical path’s cells activity .................................................. 68	
Figure 35: Error counts of each error-possible PO before and after error reduction method 
Full Path Replacement and Selected Cell Replacement. The operating clock period of 
C432 is 1.7 ns (70% of 2.41 ns), C880 is 1.4 ns (70% of 2.01 ns), C1908 is 1.5 ns 
(68% of 2.2 ns), and C6288 is 3.4 ns (70% of 4.82 ns). .............................................. 70	
Figure 36: Total error reduction improvement from Full Path Replacement to Selected 
Cells Replacement, when operating at 70% of original clock period. ......................... 72	
Figure 37: Error Free speed up comparison of Full Path Replacement method and Selected 
Cells Replacement method. .......................................................................................... 73
CHAPTER I 
INTRODUCTION  
 With nanometer fabrication technologies, system-on-a-chip (SoC) design makes it 
possible for billions of transistors to implement a wide range of functionality. But 
along with the size scaling, transistors are becoming more sensitive to environmental 
conditions [1], within-die variations [2][3], and even input workload variations. Design 
reliability has become a greater concern for integrated circuits (ICs) [4][5]. The design 
corners are analyzed to determine the worst-case delay. Based on these design 
corners, designers include additional timing margins as a guard band on critical paths. 
However, using the worst-case design plus guard bands can be very pessimistic, which 
translates into a loss of performance while executing real applications [6].  
 
Better-Than-Worst-Case Design 
To avoid the performance loss caused by infrequently occurring worst-case 
scenarios, Better-Than-Worst-Case (BTWC) design was introduced to bridge the gap. It 
is a design style that was first introduced by Bob Colwell, architect of the Intel Pentium 
Pro and Pentium IV processors. A traditional design methodology sacrifices performance 
to contain the extreme cases, so as to ensure an error-free design. The essential idea of 
BTWC design is to optimize for the average case. BTWC design improves 
performance by allowing certain timing errors to occur during normal operation, 
while preserving correct operation by adding error detection and correction to the 
design. By controlling the probability of the timing errors to a desired level, a trade-off 
can be made that results in an overall net performance gain with the error correction 
penalty.  
Figure 1 illustrates a general approach for BTWC design that includes a core 
computational component coupled with a checker mechanism that validates the semantics 
of the core’s operations [8]. The additional circuitry of a BTWC design will consume 
extra power and take time to correct any errors. Thus, several features of the 
Checker/Corrector must be considered: (1) small area, (2) power efficiency, and (3) fast 
correction capability. 
 
 
 
Figure 1: Block diagram of BTWC design general structure 
 
Figure 2(a) is an example of a critical path delay distribution of a circuit under a 
certain workload. Figure 2(b) shows the change of the delay distribution of the same path 
after considering all types of variation. Instead of using clock l’ as the operating clock 
frequency, BTWC design will select clock l to operate at an ultra-high speed, as shown in 
Figure 2(b). The gray area represents the error probability. Traditional designs will 
require a slower clock frequency (i.e., a longer clock period) to avoid errors caused by the 
probability within the gray area. The advantage of BTWC design is the capability to 
derive additional performance based upon typical operation cases, and use the error 
detection and correction (EDAC) circuitry to handle the errors that occur occasionally. 
However, a BTWC design will always need a checker/corrector module as long as errors 
remain possible. The extra circuitry of the EDAC module will consume extra power 
and take time to correct any errors. Thus, making the EDAC module small and efficient 
is important, but selecting a well-balanced operating clock frequency to keep the errors at 
an ideal level is also key to maximizing the performance gain.  
 
 
 
Figure 2: An example of critical path delay distributions: (a) Before the variations and  
(b) After the variations [7]. 
 
The operating clock frequency can be categorized into three regions, as illustrated 
in Figure 3. Region One is the error-free zone, where the clock frequency is usually 
selected, as shown at point a. Region Two is where applying timing speculation yields a 
positive performance gain. Timing errors begin to appear beyond point b. Point c is the 
optimized clock frequency that maximizes performance. Region Three is where the 
performance gain becomes negative when applying timing speculation. BTWC designs 
work within Region Two; therefore, identifying point c is the ultimate goal of BTWC 
design. 
 
Figure 3: An example to illustrate the relationship of the timing error probability with 
circuit performance (a) Timing error probability versus clock frequency, (b) Circuit 
performance versus clock frequency [8]. 
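
The three regions can be made concrete with a small numerical sketch. It assumes a simple throughput model in which each timing error costs a fixed number of recovery cycles, so that performance is f / (1 + PE(f) * rp). The cubic error model, the limit frequency F0, and every parameter value below are hypothetical and chosen only for illustration; they are not taken from this work.

```python
def error_probability(f, f0, width):
    """Hypothetical error model: no errors up to f0, then cubic growth capped at 1."""
    if f <= f0:
        return 0.0
    return min(1.0, ((f - f0) / width) ** 3)


def performance(f, f0, width, recovery_penalty):
    """Useful throughput: raw clock rate discounted by cycles spent on recovery."""
    pe = error_probability(f, f0, width)
    return f / (1.0 + pe * recovery_penalty)


def best_frequency(freqs, f0, width, recovery_penalty):
    """Return the swept frequency with the highest modeled performance (point c)."""
    return max(freqs, key=lambda f: performance(f, f0, width, recovery_penalty))


# Illustrative values only (assumed): f0 = 1.0 GHz, 20-cycle recovery penalty.
F0, WIDTH, PENALTY = 1.0, 0.5, 20
sweep = [0.8 + 0.005 * i for i in range(141)]  # 0.8 GHz to 1.5 GHz
point_c = best_frequency(sweep, F0, WIDTH, PENALTY)
print(f"modeled optimum near {point_c:.3f} GHz")
```

Under these assumed parameters the modeled optimum falls slightly above F0 (Region Two), and performance at the top of the sweep drops below the error-free value (Region Three).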
 
 
Variations That Impact Timing 
The guard-band is the traditional design approach that tries to contain timing 
uncertainties with extra design space. Those timing uncertainties are mostly caused by 
variations of many kinds that influence circuit timing. They can be categorized into 
static and dynamic sources. 
 
Static variation sources: 
Static variation does not change with time and depends on physical factors, such 
as internal connections and device dimensions. The geometry of the circuit’s layout and 
structure determines the operational parameters. Physical parameter variations (e.g., 
critical dimension, oxide thickness, channel doping, wire width, wire thickness) lead to 
electrical parameter variation (e.g., saturation current, gate capacitance, threshold 
voltage, wire resistance, wire capacitance), and electrical variations result in delay 
variation. The inability to control precisely the fabrication parameters during 
manufacturing is called process variation [9]. The partially correlated process variations 
make the problem complicated to solve. A model of process variation and a model of 
timing errors for a processor’s microarchitecture were described in [7]; together they predict the 
failure rate of micro-architectural blocks as a function of clock frequency and the amount 
of variation. A novel approach was proposed in [10] that isolated the failing path to avoid 
timing errors caused by process variation for fabricated chips. 
Dynamic variation sources: 
Dynamic variations, in contrast, are time-related and depend upon the 
operating conditions, such as: (i) Vcc droops, (ii) temperature, (iii) 
transistor drain current, (iv) cross-coupling capacitance, and (v) multiple-input switching 
(MIS) in logic gates [11]. Vcc droops are induced by the internal switching activity, and 
lead to current transients. Dynamic voltage (IR) drop under real switching activity was 
analyzed in [12]. Temperature depends on input workload, environmental conditions, and 
heat-control methods. An adaptive system was discussed in [1] that accurately estimates 
the temperature-induced delay variation to avoid an overly conservative design. 
Transistor drain current aging is related to the gate bias and temperature. A change in 
cross-coupling capacitance caused by switching on adjacent wires changes the RC delay of the wire. 
MIS is related to the input workload, which affects the circuit’s internal activity and the settling 
time of the outputs. Paths with the same static propagation delay can have dramatically 
different settling-time distributions because of input workload variation. 
Studies such as [13] and [14] have examined the circuit behavior curve under 
input workload.  
 
Motivation 
If the worst static propagation delay of an output is longer than the operating 
clock period, then errors may be observed at this output. However, a timing error 
occurs only when the output settling time extends later than the specified clock period. 
Moreover, the error probability does not increase linearly as the operating clock 
frequency increases. Estimating error probability based only on the circuit’s static delay 
information may lead to a severe misunderstanding of the circuit’s behavior. The error 
probability is closely related to the output settling time and the operating clock period. 
As introduced earlier, many factors, including input workload variation, 
can influence an output’s settling time. Most of these influences are subtle. It is the input 
vectors that determine the path usage. Together with the circuit’s previous state, they form 
the basic shape of the circuit’s dynamic activity curve.   
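
The relationship between settling time and error rate can be sketched as follows. This is a minimal illustration rather than the script used in this work: the function name, the picosecond units, the convention that an inactive cycle has a settling time of 0, and the sample values are all assumptions. A cycle counts as an error only when the output settles strictly later than the clock period.

```python
from bisect import bisect_right


def error_rates(settling_times_ps, clock_periods_ps):
    """Map each candidate clock period to the fraction of cycles that settle too late."""
    if not settling_times_ps:
        raise ValueError("at least one settling time is required")
    ordered = sorted(settling_times_ps)
    total = len(ordered)
    rates = {}
    for period in clock_periods_ps:
        # Number of cycles whose output has settled at or before the clock edge.
        on_time = bisect_right(ordered, period)
        rates[period] = (total - on_time) / total
    return rates


# Hypothetical per-cycle settling times for one output (0 = output did not switch).
sample = [500, 900, 1200, 1500, 1700, 1700, 2100, 0, 0, 0]
rates = error_rates(sample, [2200, 1700, 1500])
for period, rate in rates.items():
    print(f"clock period {period} ps: error rate {rate:.2f}")
```

Sorting once and using a binary search per clock period keeps a sweep over many candidate periods inexpensive, which matters when the settling-time record covers a large number of cycles.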
BTWC design is all about finding the optimal operating clock frequency. Timing 
speculation is associated with errors, but a well-balanced operating clock frequency will 
contain the error rate at a desired level, so as to realize performance improvement. An 
accurate error estimation method based on the circuit’s dynamic behavior gives BTWC 
designers insight into how errors trend as the clock frequency increases.  
The error rate from each output may vary with different input workload. When a 
circuit operates at an ultra-high clock frequency, an understanding of the dynamic 
activity for each circuit path will help to identify the most error-prone output, thereby 
making the error reduction more efficient. BTWC design requires error estimation for all 
potential clock frequencies for the given input workload. Therefore, a design flow for 
BTWC design has been developed to extend the capability of commercial electronic 
design automation (EDA) tools for dynamic path activity and output settling analysis. 
The design flow utilizes customized scripts that process standard output files from the 
commercial EDA tools. 
 
Research Contributions 
This research focused on performance improvement in digital integrated circuits 
(ICs) by considering path activity behavior under a given input workload. Timing 
analysis and timing closure are critical steps in digital circuit design. Many factors affect 
the delay distribution of the outputs, but the input workload determines the basic shape of 
the distribution. The longest static delay path(s) may not be very active for a certain input 
workload, and therefore would not frequently generate timing errors. Obtaining the actual 
delay distribution of the outputs for a given workload could help designers estimate the 
error rate for each output so as to select a well-balanced operating clock frequency, 
which is the fundamental challenge for BTWC design.  
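
The point that the slowest output is not necessarily the one producing the most errors can be illustrated with a short sketch. The output names, the settling-time samples, and the use of the longest observed settling time as a stand-in for static delay are all hypothetical and serve only as an example; they are not data from this work.

```python
def error_counts_per_output(settling_by_output, clock_period_ps):
    """Count, for each output, the cycles that settle later than the clock period."""
    return {
        name: sum(1 for t in times if t > clock_period_ps)
        for name, times in settling_by_output.items()
    }


def greatest_error_contributor(settling_by_output, clock_period_ps):
    """Return the output with the highest error count at the given clock period."""
    counts = error_counts_per_output(settling_by_output, clock_period_ps)
    return max(counts, key=counts.get)


def longest_delay_output(settling_by_output):
    """Return the output with the longest observed settling time (a proxy for static delay)."""
    return max(settling_by_output, key=lambda name: max(settling_by_output[name], default=0))


# Hypothetical settling times in picoseconds for two outputs.
observed = {
    "out_a": [2100, 300, 400, 200],     # slowest single event, but rarely late
    "out_b": [1800, 1750, 1900, 600],   # shorter worst case, but often late
}
print(longest_delay_output(observed))              # output with the slowest event
print(greatest_error_contributor(observed, 1700))  # output with the most late cycles
```

In this made-up data set the two functions return different outputs, which is the situation the paragraph above describes.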
This work contributes the following: (1) An off-line error-checking method that 
enables detailed statistical analysis of the activity of selected cells. Design and verification 
do not require a test bench that compares models during system operation. (2) An 
all-clock-frequency error-estimation method that predicts the error rate of each primary 
output (PO) for all possible operating frequencies. It characterizes the settling curve of each 
PO under a given input workload and identifies the typical-case error contributors 
according to the circuit’s internal activity. (3) A dual-Vt method for effective error 
reduction on selected cells. The selection focuses on the fan-in cone of the identified error 
contributors, using weights determined by the circuit activity level [15].  
The rest of this dissertation is organized as follows. Chapter II describes related 
methodologies and background, including: (1) static timing analysis (STA) and 
statistical static timing analysis (SSTA) methods, (2) path stabilization probability 
analysis for a given input workload, and (3) general EDA design flow. Chapter III provides 
information about the research methodology, simulation setup environment, and the 
overall design flow used in this work. Chapters IV and V describe the error-estimation 
method and the off-line error-checking method used in this research, which aided in 
identifying the greatest error-contributing primary outputs (POs) and the critical standard 
library cells that contribute to the propagation delay for the given input workload. Chapter VI 
describes the error reduction method of using multi-threshold standard cells and analyzes 
the error-reduction results. Chapter VII summarizes the work and offers some future 
extensions of this work. The customized Python scripts for error checking and error 
estimation are provided in the Appendix. 
CHAPTER II 
BACKGROUND AND RELATED WORK 
 Device reliability is an important concern for the operation of integrated circuits 
(ICs). Design margins are incorporated into the final design to ensure an error-free IC. 
BTWC design can be used with timing speculation to reclaim the lost performance when 
incorporating the design margins associated with corner cases. Timing analysis and path 
activation probability analysis are necessary to prepare for BTWC design. Understanding 
the circuit behavior under typical cases is the key to success. The timing error rate of a 
BTWC design must be held at a level that improves performance without letting the 
error-correction penalty negate the performance gain. 
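This trade-off can be illustrated with a deliberately simple model, offered only as a sketch: assume that every timing error costs a fixed number of recovery cycles, so that the average time per operation is the clock period multiplied by (1 + error rate x penalty). The penalty of five cycles and all other numbers are assumptions.

```python
def effective_time_per_op(period_ns: float, error_rate: float,
                          penalty_cycles: int) -> float:
    """Average time per operation when each error costs penalty_cycles extra cycles."""
    return period_ns * (1.0 + error_rate * penalty_cycles)

# Worst-case clock: 1.0 ns period, no timing errors.
baseline = effective_time_per_op(1.0, 0.0, 5)
# Speculative 0.8 ns clock with a 1% error rate: faster on average.
low_err = effective_time_per_op(0.8, 0.01, 5)
# Same clock with a 10% error rate: the recovery penalty negates the gain.
high_err = effective_time_per_op(0.8, 0.10, 5)
```

Under this assumed model, the speculative clock only pays off while the error rate stays below the point at which the recovery cycles consume the time saved by the shorter period.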
 This chapter provides background on techniques for timing analysis. It also 
discusses the analysis of path activation probability, which is used to determine how the 
input workload can affect the visible errors at the primary outputs. The chapter also 
reviews techniques for timing speculation with error detection and correction (EDAC) 
circuitry. 
 
Timing Analysis 
 The original Static-Timing Analysis (STA) was brought into very-large-scale-integration 
(VLSI) chip design in the early 1990s, and it has become one of the most successful 
and mature tools in digital circuit design [16]. It was widely accepted because its 
runtime is linear in the circuit size, and its results are relatively safe for traditional 
digital circuit design. The original STA tools are deterministic and calculate circuit 
delay with one specified corner case to represent the design boundary. However, 
transistor scaling has amplified the impact of process variation, and the deterministic 
nature of traditional STA therefore causes inaccuracies for digital circuits.  
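The linear runtime follows from visiting every gate once in topological order. A minimal sketch of deterministic, single-corner arrival-time propagation is given below; the netlist, the node names, and the delay values are hypothetical, and interconnect delay is ignored.

```python
from graphlib import TopologicalSorter
from typing import Dict, List

def arrival_times(fanin: Dict[str, List[str]],
                  delay: Dict[str, float]) -> Dict[str, float]:
    """Latest arrival time at every node of a combinational DAG."""
    arrival: Dict[str, float] = {}
    # static_order() yields each node only after all of its fan-in nodes.
    for node in TopologicalSorter(fanin).static_order():
        latest_input = max((arrival[src] for src in fanin.get(node, [])),
                           default=0.0)
        arrival[node] = latest_input + delay.get(node, 0.0)
    return arrival

# Hypothetical netlist: primary inputs a, b, c; gates g1, g2; output gate out.
fanin = {"g1": ["a", "b"], "g2": ["g1", "c"], "out": ["g1", "g2"]}
delay = {"g1": 2.0, "g2": 3.0, "out": 1.0}   # gate delays in arbitrary units
at = arrival_times(fanin, delay)
longest_path_delay = max(at.values())
```

The result depends only on the circuit structure and the chosen delay corner, not on any input vector.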
The fundamental weakness of traditional STA is that there is no statistically 
rigorous method for modeling multiple corner files. Therefore, Statistical-Static-Timing-Analysis 
(SSTA) emerged to improve the timing analysis method. Rather than giving a 
single result, SSTA evaluates the timing of gates and interconnects with probability 
distributions. Over the past decade, hundreds of papers have been published in this field, 
and D. Blaauw et al. [16] discussed the evolution from STA to SSTA. The basic goal of 
traditional STA is to find the delay of the longest path in the circuit, whereas SSTA aims 
to find the latest arrival-time distribution of the output. SSTA methods can generally be 
categorized into three approaches: 
1. Numerical-Integration Method:  The delay distribution of a set of paths that 
approach the maximum delay can be expressed as a function of physical 
parameters in order to select a certain region under a specific circuit delay. 
That region is then integrated numerically to explore the possible physical 
parameter space and to compute the circuit’s statistical timing results. 
This approach is generic and can include different types of models to 
account for process variation, but the significant amount of computation time 
is a problem, especially for a well-balanced circuit with a large set of paths 
that approach the maximum delay. 
2. Monte Carlo simulation method:  The basic idea behind this approach is to 
draw a sufficient number of independent samples from the probability distribution 
function (PDF) of the physical parameters and to evaluate the circuit delay of each 
sample with traditional STA. The circuit delay distribution can be found by sweeping the 
timing constraint. Like numerical integration, this method is completely 
general. Because the traditional STA methods are mature, Monte Carlo 
simulation is faster than the numerical integration method. However, because STA 
runs in the inner loop of the simulation, the run times are 
still significant. A second weakness of the Monte Carlo simulation method is 
the difficulty of performing incremental analysis. If any change is made to the 
circuit, then the whole simulation procedure must be restarted to obtain an 
updated circuit delay distribution. 
3. Probabilistic analysis method:  Unlike the previous two methods that are 
based on the sample space, this method models the gate delay and the arrival 
time of signals with random variables. There are two main approaches to 
implement probabilistic analysis.  
a. Path-based approaches: This approach selects a set of most-likely 
critical paths, and then adds the gate and interconnect delays of 
each path within the set to approximate the circuit delay distribution. 
The path selection must be done before the statistical analysis, so the 
accuracy of the approximation depends on the selection of likely high-delay 
paths. This approach, therefore, has two separate steps: first, find 
the paths, and second, calculate the path delay. The difficult task is 
finding the best set of paths.   
b. Block-based approaches: This approach is based on the traditional 
STA algorithm, and deals with the circuit graph in a topological 
manner. To compute the arrival time for each node, the edge delay is 
added to the source node arrival time for each fan-in edge, and then 
the latest arrival time is selected as the final result for each node. The 
block-based approach has a runtime advantage, especially because 
it allows incremental analysis.   
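The Monte Carlo method of item 2, with a topological arrival-time pass as its inner loop, can be sketched as follows. Independent Gaussian variation of each gate delay, the small netlist, and all numbers are assumptions chosen for illustration; practical SSTA tools model correlated physical parameters. The outer loop also shows why any circuit change forces every sample to be rerun.

```python
import random
from graphlib import TopologicalSorter
from typing import Dict, List

def longest_delay(fanin: Dict[str, List[str]], delay: Dict[str, float]) -> float:
    """Deterministic STA pass: latest arrival time over all nodes."""
    arrival: Dict[str, float] = {}
    for node in TopologicalSorter(fanin).static_order():
        latest = max((arrival[s] for s in fanin.get(node, [])), default=0.0)
        arrival[node] = latest + delay.get(node, 0.0)
    return max(arrival.values())

def monte_carlo_delays(fanin: Dict[str, List[str]], nominal: Dict[str, float],
                       sigma_frac: float, samples: int, seed: int = 0) -> List[float]:
    """Sorted circuit delays, one per random sample of the gate delays."""
    rng = random.Random(seed)
    results = []
    for _ in range(samples):
        # Assumption: each gate delay varies independently around its nominal value.
        sampled = {g: max(0.0, rng.gauss(d, sigma_frac * d))
                   for g, d in nominal.items()}
        results.append(longest_delay(fanin, sampled))   # STA in the inner loop
    return sorted(results)

fanin = {"g1": ["a", "b"], "g2": ["g1", "c"], "out": ["g1", "g2"]}
nominal = {"g1": 2.0, "g2": 3.0, "out": 1.0}
delays = monte_carlo_delays(fanin, nominal, sigma_frac=0.05, samples=200)
# Sweeping the timing constraint over the sorted samples gives the distribution.
yield_at_7 = sum(d <= 7.0 for d in delays) / len(delays)
```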
 Although SSTA makes the analysis more comprehensive, there are challenges and 
limitations associated with it. It is too complex when dealing with realistic delay 
distributions. It is also very difficult to apply within an optimization algorithm or flow. 
Both traditional STA and SSTA are designed to be independent of the 
input vectors. This is good for traditional design because including rare cases when testing 
design limitations makes the design more reliable. However, this advantage becomes an 
obstacle when using a BTWC design style that optimizes a design according to circuit 
behavior for typical input workloads.  
 
Path Activation Probability Analysis 
 Because the rare cases are not emphasized for BTWC design, information from 
regular timing analysis tools is not enough for BTWC design. Two Primary Outputs 
(POs) with the same static path delay, according to a timing analysis tool, could have 
dramatically different distributions of path stabilization (i.e., settling time) in a real 
application [17].  
 
 
Figure 4: Dynamic behavior curve of two paths with the same static delay time. [17] 
 
Figure 4 illustrates that the path delay time does not necessarily equal the path 
settling time. The labels A and B represent two outputs that have the same worst-case 
propagation delay. However, Output A and Output B have distinct probability curves for 
their settling time. With the same input workload, Output A has a 99% probability of 
settling by time t, while Output B has only a 53% probability of settling by time t. This 
means Output B is more dynamically critical and should be weighted higher when 
analyzed for errors during the BTWC design process. If the circuit has only these two 
paths, then enhancing the speed along the path to Output B alone removes most of the 
errors, and the circuit could operate at cycle time t with very little penalty for error 
correction. 
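A settling-probability curve of this kind can be approximated empirically from simulation data. The sketch below assumes that the settling time of each output has been recorded for every input vector; the sample values are invented and merely reproduce the qualitative situation of Figure 4, in which both outputs share the same worst-case time.

```python
from bisect import bisect_right
from typing import List

def settle_probability(sorted_times: List[float], t: float) -> float:
    """Empirical Ps(t): fraction of observed settling times that are <= t."""
    return bisect_right(sorted_times, t) / len(sorted_times)

# Hypothetical settling times (arbitrary units); both outputs have a worst case of 10.
times_a = sorted([2, 3, 3, 4, 10])
times_b = sorted([3, 8, 9, 10, 10])

t = 4
ps_a = settle_probability(times_a, t)
ps_b = settle_probability(times_b, t)
# The output with the lower Ps(t) is the more dynamically critical one at period t.
more_critical = "A" if ps_a < ps_b else "B"
```

Evaluating `settle_probability` over a range of t values traces the full curve for each output.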
 The circuit’s dynamic stabilization curve will be affected by many types of 
variations, but the curve’s basic shape is decided by the input workload. A rigorous 
analysis of path activation probability, which describes the typical dynamic behavior of 
the circuit’s response to a common-case workload, would provide insight that helps the 
BTWC designers to maximize the performance gain. Wan and Chen proposed several 
circuit optimization techniques for timing speculation based on the circuit’s dynamic 
activity in [17], [18], [19].  
In [18], Wan and Chen proposed a circuit-level optimization tool called 
DynaTune that combines a Timed Characteristic Function (TCF) [20], an automatic 
test pattern generation (ATPG) method, and the binary decision diagram (BDD) [21] to 
derive the circuit’s dynamic behavior curve and thereby capture the impact of the input 
workload on the circuit’s settling time. Based on that information, it selects a target 
operating clock frequency and the corresponding settling probability. It then selectively 
resynthesizes the cells along the timing-critical paths that exceed the thresholds for delay 
and activity probability so as to improve performance while mitigating errors. The timing 
speculation techniques used in DynaTune are Razor logic [22] (which is discussed in more 
detail later in the chapter) or the Telescopic Unit [23]. 
DynaTune has several drawbacks: (1) the use of a global BDD is only suitable for 
small circuits; (2) TCF analysis is sensitive to the node’s value, and it requires structural 
information about the circuit to perform the analysis; and (3) during the analysis, the input 
is set to a static probability, which is unlikely to be representative of a real application 
input workload that could have distinct phases of operation.  
Wan and Chen also proposed a method to analyze circuit-level dynamic behavior 
with a new data structure, called the timed Ternary Decision Diagram (tTDD) [17]. The tTDD 
is created based on the TDD and TCF. The Ternary Decision Diagram (TDD) is similar to the 
Binary Decision Diagram (BDD). The basic idea of the BDD is Shannon expansion; it is a 
rooted, directed, graph-based data structure that is used to represent Boolean functions. 
Bryant [24] added restrictions on the ordering of decision variables in vertices, which 
enable BDDs to manipulate representations in a more efficient manner. The ternary decision 
diagram, as discussed by Sasao in [25], has three possible outgoing branches for each 
node, which overcomes the BDD’s inability to model not-yet-settled cases. However, this method 
requires circuit partitioning, and the partitioning algorithm is crucial because it affects 
both structural correlation and calculation cost. The estimation error complexity becomes 
relatively high when dealing with larger circuits. A detailed timing model extracted from the 
standard delay format (SDF) was used in this method, but the impact of input changes on cell 
delay was not yet included.  
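The need for a third branch can be seen with three-valued gate evaluation, in which a signal is 0, 1, or not yet settled. A controlling input can settle a gate output even while the other input is still unsettled, which is why an output may stabilize well before its topological delay. The sketch below illustrates only this idea; it is not the tTDD construction of [17], and the encoding of "not settled" as `None` is an assumption.

```python
from typing import Optional

Ternary = Optional[int]   # 0, 1, or None for "not yet settled"

def and3(a: Ternary, b: Ternary) -> Ternary:
    """Three-valued AND."""
    if a == 0 or b == 0:
        return 0          # a controlling 0 settles the output early
    if a is None or b is None:
        return None       # no controlling value: the output is still unsettled
    return 1

def or3(a: Ternary, b: Ternary) -> Ternary:
    """Three-valued OR."""
    if a == 1 or b == 1:
        return 1          # a controlling 1 settles the output early
    if a is None or b is None:
        return None
    return 0

# Input b arrives late (None), yet the AND output is already known when a is 0.
early = and3(0, None)
pending = and3(1, None)
```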
CCP [19] resynthesizes a circuit in a probabilistic manner, creating 
functionally equivalent but shorter logic paths for paths with high activity, while rarely 
active paths are resynthesized with a longer delay. To identify the common cases, a 
global behavior profile is obtained by generating a set of primary input vectors according 
to given typical-case characteristics [8] [26], and then it uses the synthesis engine in ABC 
[27]. The global behavior profile can be reused for all sub-circuits. To promote the common 
case, the TCF information of common cases is used to build redundant sub-functions, 
which are merged into the original design to improve performance. 
 
Timing Speculation Methodologies And Error Resilience in BTWC Design 
A BTWC design can be separated into two main parts: a timing speculation part 
and an error resilience part. The timing speculation part improves performance (e.g., 
increases speed or reduces power usage) and can be implemented at different design 
levels. The error resilience part aims to preserve the reliability of the design. This work 
focuses on circuit-level timing speculation, for which the most popular structures are the 
Razor logic structures, such as Razor [28] and Razor II [29].  
Timing speculation methodologies: 
Researchers at the University of Michigan developed a circuit-level approach 
called Razor to implement Dynamic Voltage Scaling (DVS) processors [22][30][29]. It 
combines a circuit-level error detection mechanism with a microarchitecture-level error 
recovery technique.  
Razor [22] proposed a more aggressive but realistic approach to DVS. It tunes the 
supply voltage by monitoring the error rate during the operation. The timing error 
detection is implemented by using a delayed latch, called a shadow latch, to compare 
with the corresponding state element in the design. The value in the shadow latch is 
guaranteed to be correct since it uses the worst-case timing (Figure 5). Figure 6 shows 
how the Razor flip-flop was designed. When an input signal transitions at the same time 
as the clock, meta-stability may occur in the Razor flip-flop.  
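The detection principle can be sketched behaviorally: the main flip-flop samples at the clock edge, the shadow latch samples after an additional delay, and a mismatch between the two raises the error flag. The model below is a simplification under assumed names and timing values; it ignores meta-stability and short-path effects, and it relies on the data arriving before the delayed sampling point so that the shadow value is correct.

```python
from typing import Tuple

def razor_sample(old_value: int, new_value: int, arrival_ns: float,
                 period_ns: float, shadow_delay_ns: float) -> Tuple[int, int, bool]:
    """Return (main, shadow, error) for one cycle of a Razor-style flip-flop."""
    # Data arriving after a sampling instant leaves the old value in that element.
    main = new_value if arrival_ns <= period_ns else old_value
    shadow = new_value if arrival_ns <= period_ns + shadow_delay_ns else old_value
    error = main != shadow   # comparison of main flip-flop and shadow latch
    return main, shadow, error

# Hypothetical timing: 1.0 ns clock period, shadow latch delayed by 0.5 ns.
on_time = razor_sample(0, 1, arrival_ns=0.9, period_ns=1.0, shadow_delay_ns=0.5)
late = razor_sample(0, 1, arrival_ns=1.2, period_ns=1.0, shadow_delay_ns=0.5)
```

In the late case the main flip-flop holds the stale value while the shadow latch holds the new one, so the error flag is raised and the shadow value can be used for recovery.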
Razor relies on both the combinational circuit and the architecture for an efficient 
EDAC method. It has been applied with a pipeline structure to correct timing errors. Two 
recovery mechanisms have been proposed [28]. The mechanisms use either clock gating 
or a counterflow pipeline, as shown in Figure 7 and Figure 8 respectively. The clock 
gating mechanism simply asserts a global stall for all stages in the next cycle after the 
error flag is issued. However, global clock gating is not ideal for the clock tree, so the 
counterflow pipelining approach was introduced. When an error is detected, a bubble signal 
propagates to the next stage, and a pipeline flush is initiated from this stage back to the first 
stage. The pipeline then restarts from the first stage. 
The voltage is increased or decreased according to the error rate. A low error rate 
means that the voltage could be reduced. A high error rate suggests that the supply 
voltage violates the timing constraints too much and should be increased. A properly 
selected reference error rate is very important to maximize the performance gain. 
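A minimal sketch of such a control loop is shown below, assuming a fixed voltage step and a single reference error rate. The step size, the voltage limits, and the function name are hypothetical; the controller actually used with Razor may differ.

```python
def next_voltage(vdd: float, error_rate: float, target_rate: float,
                 step_v: float = 0.01, v_min: float = 0.8,
                 v_max: float = 1.2) -> float:
    """One update of the supply voltage based on the observed error rate."""
    if error_rate > target_rate:
        vdd += step_v      # too many timing errors: raise the supply voltage
    elif error_rate < target_rate:
        vdd -= step_v      # errors are rare: the voltage can be reduced
    return min(v_max, max(v_min, vdd))   # stay inside the allowed range

# Hypothetical operating point: 1.0 V supply, 1% reference error rate.
raised = next_voltage(1.0, error_rate=0.05, target_rate=0.01)
lowered = next_voltage(1.0, error_rate=0.001, target_rate=0.01)
```

A reference error rate that is set too low gives up potential savings, while one that is set too high lets the recovery penalty dominate.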
 
Figure 5: Block diagram of Razor logic. [22] 
 
 
Figure 6: The circuit-level schematic of the shadow latch used in Figure 5. [22] 
 
Figure 7: The pipeline recovery using global clock gating. (a) The pipeline structure. (b) 
The pipeline operation timing. [28] 
 
 
 
Figure 8: The pipeline recovery using counterflow pipelining. (a) The pipeline structure. (b) 
The pipeline operation timing. [28] 
voltage does not resolve to a definite high or low voltage, but instead
hovers near Vdd/2 [4]. The danger of meta-stability is that different
fan-out gates may interpret this indeterminate voltage level as differ-
ent logic states, or may even enter a meta-stable state themselves. It
is important to note that, since the minimum sub-critical voltage is
constrained such that the setup time of the shadow latch is always
met, the shadow latch is stable and cannot exhibit meta-stability.
However, if the main flip-flop is meta-stable, it is impossible to
determine if its latched value is correct or not using the XOR gate in
Figure 2. Hence, we include a meta-stability detector circuit in the
Razor flip-flop which detects the presence of meta-stable voltage
levels, as shown in Figure 2. A detected meta-stability event is corrected
the same way as a regular delay failure, and results in the stable
and correct data value from the shadow latch being restored in
the main flip-flop. For simplicity, the meta-stability detector in Fig-
ure 2 is constructed using two inverter gates with different skewed P/
N ratios, such that they switch at different voltage levels. If the two
inverters interpret the result differently, the flip-flop voltage is not
definitive and may be meta-stable. Note that, any suitable compara-
tor circuit could be utilized and that these meta-stability events do
not result in a failure of the system but are corrected using the exist-
ing Razor error correction infrastructure.
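The two-inverter detector described above can be illustrated with an idealized behavioral model. This is only a sketch: real inverters have analog transfer curves, and the supply voltage and the two switching thresholds below are assumed values, not taken from the Razor design. The names `inv_n` and `inv_p` follow the labels Inv_n and Inv_p in Figure 2.

```python
VDD = 1.8  # assumed supply voltage in volts


def inverter(v_in, threshold):
    """Idealized inverter: outputs logic 1 when the input is below its threshold."""
    return 1 if v_in < threshold else 0


def is_metastable(v_node, vth_low=0.35 * VDD, vth_high=0.65 * VDD):
    """Flag a node voltage as possibly meta-stable.

    Two inverters with skewed P/N ratios switch at different voltages
    (vth_low and vth_high are assumptions). If they disagree about the
    logic value of v_node, the voltage lies between the two thresholds.
    """
    inv_n = inverter(v_node, vth_low)   # switches at the lower threshold
    inv_p = inverter(v_node, vth_high)  # switches at the higher threshold
    return inv_n != inv_p


print(is_metastable(0.0), is_metastable(VDD / 2), is_metastable(VDD))
```

In this model a node sitting near Vdd/2 is flagged, while a node that has resolved to either rail is not.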
However, it is well known that complete system failure due to
meta-stability cannot be completely avoided and only its probability
of occurrence can be reduced to negligible levels [4]. In the proposed
Razor design, this manifests itself in the small but finite
probability that the error signal itself becomes meta-stable. This
could occur if the main flip-flop output voltage was near the edge of
the meta-stable voltage range and, hence, the meta-stability detector
was unable to determine if a meta-stability event occurred or not. In
this case, the error signal will not resolve to a definite voltage level
and ambiguity will exist in the logic value of the error signal, possi-
bly causing a failure in the error correction mechanism. A standard
approach to reduce the probability of such an event to negligible lev-
els is to double latch the signal. However, this would delay the
detection of an error in the main flip-flop by one cycle, complicating
the error recovery mechanism. We therefore employ at the same time
an additional mechanism to detect metastable error signals, where
the error signal is double latched using two skewed flip-flops. The
probability that the outputs of the second set of flip-flops are meta-
stable is hence reduced to a negligible level and by comparing their
output values, the presence of a meta-stable error signal one cycle
earlier can be reliably detected. Under normal operation, the error
signal will resolve to a definite voltage level and the output values of
the two skewed flip-flops will match, indicating that the performed
error correction was executed correctly. However, in the unlikely
event that the error signal is meta-stable, the outputs of the skewed
latches will differ in the subsequent clock cycle indicating that the
error correction was unsafe and could have failed. In this case, a
so-called panic signal is generated, which requires that the entire
pipeline be flushed and restarted. When this happens, guaranteed forward
progress is lost, and the supply voltage level must be raised to avoid
possible perpetual failure of the same instruction. However, the pos-
sibility of a meta-stable error signal is extremely small and does not
constitute a significant burden on the power and performance of the
processor. Also, only one set of double latches is needed for each
pipeline stage, meaning that the power overhead during error-free
operation is negligible.
2.2   Pipeline error recovery mechanisms
The pipeline error recovery mechanism must guarantee that, in
the presence of Razor errors, register and memory state is not cor-
rupted with an incorrect value. In this section, we highlight two pos-
sible approaches to implementing pipeline error recovery. The first is
a simple but slow method based on clock gating, while the second
method is a much more scalable technique based on counterflow
pipelining.
Recovery using clock gating. Figure 4(a) illustrates a simple
approach to pipeline error recovery based on global clock gating. In
the event that any stage detects a Razor error, the entire pipeline is
stalled for one cycle by gating the next global clock edge. The addi-
tional clock period allows every stage to recompute its result using
the Razor shadow latch as input. Consequently, any previously for-
warded errant values will be replaced with the correct value from the
Razor shadow latch. Since all stages re-evaluate their result with the
Razor shadow latch input, any number of errors can be tolerated in a
single cycle and forward progress is guaranteed. If all stages produce
an error each cycle, the pipeline will continue to run, but at 1/2 the
normal speed.
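The cost of clock-gating recovery can be estimated with a simple cycle count. The sketch below is an assumption-laden model, not the authors' simulator: it charges exactly one stall cycle for every cycle in which at least one stage reports a Razor error, and the function names are hypothetical.

```python
def cycles_with_clock_gating(error_flags):
    """Total cycles needed when each erroneous cycle adds one stall cycle.

    error_flags[i] is True if any stage raises a Razor error in
    useful cycle i (the number of failing stages does not matter,
    since all stages recompute during the single stall).
    """
    stalls = sum(1 for has_error in error_flags if has_error)
    return len(error_flags) + stalls


def relative_throughput(error_flags):
    """Throughput relative to an error-free pipeline (1.0 = no slowdown)."""
    return len(error_flags) / cycles_with_clock_gating(error_flags)


print(relative_throughput([True] * 100))   # error in every cycle
print(relative_throughput([False] * 100))  # error-free operation
```

With an error in every cycle the model reproduces the 1/2-speed bound stated above.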
It is imperative that errant pipeline results not be written to
architected state before they have been validated by Razor. Since validation
of Razor values takes two additional cycles (i.e., one for error
detection and one for panic detection), there must be two non-specu-
lative stages between the last Razor latch and the writeback (WB)
stage. In our design, memory accesses to the data cache are non-
speculative, hence, only one additional stage labeled ST for stabilize
is required before writeback (WB). The ST stage introduces an addi-
tional level of register bypass. Since store instructions must execute
non-speculatively, they are performed in the WB stage of the pipe-
line.
Figure 4(b) gives a pipeline timing diagram of a pipeline recov-
ery for an instruction that fails in the EX stage of the pipeline. The
first failed stage computation occurs in the 4th cycle, when the sec-
ond instruction computes an incorrect result in the EX stage of the
pipeline. This error is detected in the 5th cycle, but only after the
MEM stage has computed an incorrect result using the errant value
Figure 3. Short Paths Constraints. (Min. Path Delay > tdelay + thold, where tdelay is the delay between clock and clock_del.)
Figure 4. Pipeline recovery using global clock gating. Figure a) shows the pipeline organization, Figure b) illustrates the pipeline timing for a failure in the EX stage of the pipeline. The “*” denotes a failed stage computation.
forward from the EX stage. After the error is detected, a global clock
stall occurs in the 6th cycle, permitting the correct EX result in the
Razor shadow latch to be evaluated by the MEM stage. In the 7th
cycle, normal pipeline operation resumes.
Recovery using counterflow pipelining. In aggressively
clocked designs, it may not be possible to implement global clock
gating without significantly impacting processor cycle time. Consequently,
we have designed and implemented a fully pipelined error
recovery mechanism based on counterflow pipelining techniques
[19]. The approach, illustrated in Figure 5(a), places negligible timing
constraints on the baseline pipeline design at the expense of
extending pipeline recovery over a few cycles. When a Razor error
is detected, two specific actions must be taken. First, the errant stage
computation following the failing Razor latch must be nullified. This
action is accomplished using the bubble signal, which indicates to
the next and subsequent stages that the pipeline slot is empty. Sec-
ond, the flush train is triggered by asserting the stage ID of failing
stage.   In the following cycle, the correct value from the Razor
shadow latch data is injected back into the pipeline, allowing the
errant instruction to continue with its correct inputs. Additionally,
the flush train begins propagating the ID of the failing stage in the
opposite direction of instructions. At each stage visited by the active
flush train, the corresponding pipeline stage and the one immedi-
ately preceding are replaced with a bubble. (Two stages must be nul-
lified to account for the twice relative speed of the main pipeline.)
When the flush ID reaches the start of the pipeline, the flush control
logic restarts the pipeline at the instruction following the errant
instruction. In the event that multiple stages experience errors in the
same cycle, all will initiate recovery but only the Razor error closest
to writeback (WB) will complete. Earlier recoveries will be flushed
by later ones.
Figure 5(b) shows a pipeline timing diagram of a pipelined
recovery for an instruction that fails in the EX stage. As in the previ-
ous example, the first failed stage computation occurs in the 4th
cycle, when the second instruction computes an incorrect result in
the EX stage of the pipeline. This error is detected in the 5th cycle,
causing a bubble to be propagated out of the MEM stage and initiation
of the flush train. The instructions in the EX, ID and IF stages are
flushed in the 6th, 7th and 8th cycles, respectively. Finally, the pipe-
line is restarted after the errant instruction in cycle 9, after which
normal pipeline operation resumes.
In the event a panic signal is asserted, all pipeline state is
flushed and the pipeline is restarted immediately after the last
instruction to writeback. Panic situations complicate the guarantee
of forward progress, as the delay in detecting the situation may result
in the correct result being overwritten in the Razor shadow latch.
Consequently, after experiencing a panic, the supply voltage is reset
to a known-safe operating level, and the pipeline is restarted. Once
re-tuned, the errant instruction should complete without errors as
long as re-tuning is prohibited until after this instruction completes.
A key requirement of the pipeline recovery control is that it not
fail under even the worst operating conditions (e.g., low voltage,
high temperature and high process variation). This requirement is
met through a conservative design approach that validates the timing
of the error recovery circuits at the worst-case subcritical voltage.
2.3  Supply Voltage Control
Many of the parameters that affect voltage margin vary over
time. Temperature margins will track ambient temperatures and can
vary on-die with processing demands. Consequently, to optimize
energy conservation it is desirable to introduce a voltage control sys-
tem into the design. The voltage control system adjusts the supply
voltage based on monitored error rates. If the error rate is very low, it
could indicate circuit computation is finishing too quickly and volt-
age should be lowered. Similarly, a low error rate could indicate
changes in the ambient environment (e.g., decreasing temperature),
giving additional opportunity to lower voltage. Increasing error
rates, on the other hand, indicate circuits are not meeting clock
period constraints and voltage should be increased. The optimal
error rate depends on a number of factors including the energy cost
of error recovery and overall performance requirements, but in gen-
eral it is a small non-zero error rate.
Figure 6 illustrates the Razor voltage control system. The control
system works to maintain a constant error rate of Eref. At regular
intervals the error rate of the system is measured by resetting an
error counter which is sampled after a fixed period of time. The
computed error rate of the sample Esample is then subtracted from the
reference error rate to produce the error rate differential Ediff. Ediff is
the input to the voltage control function, which sets the target voltage
of the voltage regulator. If Ediff is negative the system is experiencing
too many errors, and voltage should be increased. If Ediff is
positive the error rate is too low and voltage should be lowered. The
magnitude of Ediff indicates the degree to which the system is “out of
tune”.
While control of this system may seem simple on the surface, it
is complicated by the slow response time of the voltage regulator.
Typical commercial voltage regulators can take 10’s of microsec-
onds to adjust supply voltage by 100 mV. Consequently, if the con-
troller reacts too fast or too abruptly, the system could become
unstable or go into oscillation. Moreover, an overly conservative
control function that is slow to react to changing system environ-
ments will reduce the overall efficiency of the design. As a starting
point, we have implemented a proportional control system [15]
which adjusts supply voltage in proportion to the sampled Ediff. To
prevent the control system from over-reacting and potentially plac-
ing the system in an unstable state, the error sample rate is roughly
equivalent to the minimum voltage step period.
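The proportional control loop described above can be sketched as follows. This is an illustration under stated assumptions rather than the controller used in the prototype: the gain `kp`, the voltage limits, and the example error rates are hypothetical, and the regulator's slow response is not modeled. The sign convention follows the text: Ediff = Eref - Esample, a negative Ediff raises the voltage, and a positive Ediff lowers it.

```python
def error_rate(error_count, cycles):
    """Sampled error rate: errors counted over a fixed number of cycles."""
    return error_count / cycles


def next_voltage(v_now, e_sample, e_ref, kp, v_min, v_max):
    """One step of a proportional voltage controller.

    kp (volts per unit of error rate), v_min, and v_max are assumed
    parameters. The result is clamped to the regulator's range.
    """
    e_diff = e_ref - e_sample        # negative: too many errors
    v_target = v_now - kp * e_diff   # negative e_diff raises the voltage
    return min(v_max, max(v_min, v_target))


# Hypothetical sample: 300 errors in 10,000 cycles against a 1% target.
e_sample = error_rate(300, 10_000)
v_up = next_voltage(1.5, e_sample, e_ref=0.01, kp=5.0, v_min=1.0, v_max=1.8)
v_down = next_voltage(1.5, 0.0, e_ref=0.01, kp=5.0, v_min=1.0, v_max=1.8)
print(v_up, v_down)
```

Calling `next_voltage` once per error-sampling interval, with that interval no shorter than the regulator's minimum step period, corresponds to the rate limiting described in the text.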
3   Experimental Evaluation
3.1  Razor Pipeline Implementation
The proposed Razor error detection and correction approach
was implemented in a 64-bit Alpha processor. The processor was
implemented using a simple in-order pipeline consisting of instruc-
tion fetch, instruction decode, execute, and memory/writeback with
8 Kbytes of I-cache and D-cache. The implementation details, as
well as a die picture, are shown below in Figure 7. The processor
was implemented using a 0.18 µm process and is expected to operate
at 200 MHz. After careful performance analysis, it was found that
only the instruction decode and execute stages were critical at the
worst-case voltage and frequency settings and hence required Razor
Figure 5. Pipeline recovery using counterflow pipelining. Figure a) shows the pipeline organization, Figure b) illustrates the pipeline timing for a failure in the EX stage of the pipeline. The “*” denotes a failed stage computation.
The original Razor design not only detects errors but also restores the correct 
results from the shadow latch. However, generating the restore signal from the pipeline 
makes it harder to implement an aggressively clocked microprocessor. Razor II [29] 
proposed a new flip-flop that only detects errors and relies on architectural 
replay to handle the correction. Because it uses architectural replay, the Razor II flip-flop 
is smaller and less complex, but it pays a higher recovery penalty, as measured by 
the throughput in Instructions Per Cycle (IPC). The advantage of architectural replay is that 
it is a mature technique used in many existing speculative processors [31]. 
 
 
Figure 9: Circuit-level schematic of Razor II flip-flop. (a) Flip-flop schematic. (b) 
Transition detector schematic. (c) Detection clock generator. [29] 
 
The Razor II flip-flop is a positive level-sensitive latch. Since it is level-sensitive, 
when the clock is high, any input change will be captured. In Razor II, any transition that 
36 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 44, NO. 1, JANUARY 2009
Fig. 4. Circuit-level schematic of the RazorII flip-flop. (a) RazorII flip-flop circuit schematic; (b) transition-detector; (c) detection clock generator.
precharged during recovery, in the event of a timing error. A
cross-coupled inverter pair is used as a latch structure to protect
the dynamic node from discharge due to leakage.
B. Impact of Intra-Die Process Variability
As explained previously, the low-pulse temporarily disables
the transition-detector thereby preventing legitimate transitions
at the latch node from being flagged as errors. For correct func-
tionality, it is required that the minimum width of the low pulse
at the DC clock is greater than the maximum CLK-Q delay of the
main latch across all PVT corners. The width of the DC pulse
is determined by the delay through the delay-chain in the DC
generator. We used conventional worst-case sizing of the tran-
sistors in the DC generator to satisfy this constraint on silicon.
Achieving this in the face of rising intra-die process variability
at 45 nanometer technology node and below, may require the
use of Monte-Carlo sampling techniques. For a 3-sigma yield
target, it is required to ensure that the 3-sigma increase of the
CLK-Q delay of the latch is still covered by the 3-sigma reduc-
tion in the DC pulse-width. The relevant timing diagram with
process variation is illustrated in Fig. 5.
In order to enable post-manufacture tuning and to account
for process-variation mismatches between the latch delay and
DC pulse-width, the delay-chain in the DC generator is made
tunable by controlling the gate voltage of the transmission gate
through the DC-TG Vdd pin. Again, tuning was not required
for the normal operation of the chip. The DC-TG Vdd pin of
individual RazorII flip-flops were routed as conventional signal
nets with an input pad serving as a common driver. The TD-TG
Fig. 5. Timing constraints with intra-die process variations.
Vdd pin was also routed in a similar manner. These pins have
relaxed timing constraints since they are only meant for post-
manufacture tuning. The analog tuning voltages (DC-TG Vdd
and TD-TG Vdd) are generated using external regulators which
form a part of the test-gig. During testing, they were set at their
default setting of 1.2 V (nominal supply voltage for the tech-
nology used).
The difference between the CLK-Q delay and the DC
pulse-width represents the duration when a transition on N goes
undetected. This allows dynamic time-borrowing in the RazorII
flip-flop wherein a critical computation gets extra time from
the next cycle to complete, without flagging a timing error.
Of course, this reduces the available time for the succeeding
happens during the latch’s transparent phase is considered an error. Figure 9 illustrates 
the circuit structure of the Razor II flip-flop.  
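The detection rule can be illustrated with a simplified timing model. The sketch below is an assumption, not the Razor II circuit: it treats the detector as ideal, measures time from the rising clock edge, and models the detection-clock (DC) low pulse as a blanking window during which the legitimate clock-to-output transition is not flagged. The function name and all numbers are hypothetical.

```python
def razor2_flags_error(t_transition_ps, dc_pulse_ps, clk_high_ps):
    """Decide whether a transition on the latch node is flagged as an error.

    t_transition_ps: time of the transition, measured from the rising clock edge.
    dc_pulse_ps: assumed width of the DC low pulse that blanks the detector.
    clk_high_ps: assumed duration of the transparent (clock-high) phase.
    """
    if t_transition_ps < 0 or t_transition_ps >= clk_high_ps:
        # Outside the transparent phase the latch is opaque,
        # so the latch node cannot change in this model.
        return False
    # Inside the transparent phase, only transitions after the
    # blanking window count as timing errors.
    return t_transition_ps >= dc_pulse_ps


print(razor2_flags_error(50, 100, 500))   # within blanking window
print(razor2_flags_error(300, 100, 500))  # late transition
```

Under these assumptions, a transition that lands inside the blanking window is accepted without a flag, which corresponds to the limited time borrowing mentioned in the Razor II description.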
Other than Razor and Razor II, there are several EDAC flip-flops that use 
transition detection with time borrowing (TDTB), as in Figure 10(b), and double sampling 
with time borrowing (DSTB), as in Figure 10(c). These techniques were proposed to 
solve the meta-stability issue that exists in the previous Razor design [32]. Figure 10(a) 
shows the regular structure of the Razor flip-flop.  
 
 
Figure 10: Different ways to implement Razor flip-flop to detect timing errors. [32] 
 
TIMBER [33] proposed two timing elements to provide online masking of timing 
errors for a pipelined structure. The authors found that timing errors due to dynamic 
variations often span only one pipeline stage on successive clock cycles and therefore can 
be masked by time borrowing. TIMBER has a flip-flop version and a latch version, which 
are illustrated in Figure 11 and Figure 12, respectively. When EN is high, the TIMBER flip-flop 
works in the time-borrowing mode. Master latches M0 and M1 are used for error checking. 
If their values differ, then the value sampled by M1 will mask the previous value after 
the delay time. The delay time is controlled by S1S0. Similarly, in the TIMBER latch, when EN 
is high, the latch is in the time-borrowing (TB) mode. The input value is latched by 
52 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 44, NO. 1, JANUARY 2009
Fig. 2. Error-detection sequential circuits: (a) Razor flip-flop (RFF) [5]–[9],
(b) transition detector with time borrowing (TDTB), and (c) double sampling
with time borrowing (DSTB). CLK is duty-cycle controlled to satisfy min-delay
requirements.
B. Transition Detector With Time Borrowing (TDTB)
In Fig. 2(b), the first proposed EDS circuit is a transition de-
tector with a time-borrowing latch (TDTB). The TDTB EDS
circuit operation is demonstrated through a simulated timing di-
agram in Fig. 3(a). The transition detector monitors input data
(D) transitions during the high clock phase. As input data transi-
tions, a pulse is always generated at the XOR output. During the
low clock phase, the output of the dynamic gate pre-charges and
the pulse does not affect the error signal (ERROR) as described
in Fig. 3(a). If input data arrives late, CLK is logically-high and
the pulse discharges the output node voltage of the dynamic
gate, thus transitioning ERROR to a logic-high as illustrated
in Fig. 3(a). As CLK transitions to a logic-low, the dynamic
gate output pre-charges, and consequently, ERROR transitions
to a logic-low. As discussed further in Section III-B, ERROR
is propagated to a set-dominant latch (SDL), where the SDL
output remains logically-high while the dynamic transition de-
tector pre-charges during the low clock phase. The SDL is trans-
parent during the high clock phase and only allows high tran-
sitions during the low clock phase. Since min-delay paths are
designed with sufficient margin as described in (2), the master
Fig. 3. Simulated timing diagrams for (a) TDTB and (b) DSTB to demonstrate
error generation from late arriving input data.
latch of a datapath flip-flop is unnecessary. The datapath latch
is identical to a pulse-latch, resulting in lower clock energy and
eliminating datapath metastability during a rising clock edge.
Datapath metastability does not occur on the falling clock edge
since the max-delay constraint in (1) is satisfied.
Although TDTB employs a datapath latch, path timing con-
straints are still based on a flip-flop design with an error-detec-
tion window as illustrated in Fig. 1 and modeled in (1). The
purpose of the transparency window in the datapath latch is to
eliminate datapath metastability while detecting timing errors.
When input data arrives late, an error signal is generated even
though the input data traverses to the latch output. The error
signal ensures that late arriving data from the path in the current
pipeline stage does not affect the max-delay constraint in (1) for
adjoining fan-out paths in subsequent pipeline stages. If ample
max-delay margin is available for the adjoining paths in the sub-
sequent pipeline stage, then a pulse-latch may replace the TDTB
EDS circuit at the current pipeline stage. This would enable tra-
ditional time borrowing between the path in the current pipeline
stage and the adjoining paths in the subsequent pipeline stage.
Although datapath metastability is removed in TDTB, the
transition-detector output can become metastable. For metasta-
bility to occur on the transition-detector output, the input data
must arrive within a tight metastability window ( 1 ps in a
65 nm technology [11]), starting slightly after the setup time
prior to a rising clock edge. For EDS circuits
with a datapath latch, is defined as the minimum
transmission gate M during the TB time interval, while transmission gate L is always on 
for the entire checking period in the time-borrowing mode. If the signal arrives before the TB 
interval ends, then the timing error will be masked without an error flag. If signal 
switching occurs during an ED interval, then the error flag will be asserted. Neither the 
TIMBER flip-flop nor the TIMBER latch has a meta-stability issue. 
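The masking and flagging behavior of the checking period can be summarized as a small classifier. This is a sketch under assumptions: it supposes the checking period starts at the clock edge with one TB interval followed by the ED intervals, that late data inside the TB interval is masked silently, and that late data inside an ED interval is masked and flagged. The function name, the return labels, the interval widths, and the "unmasked" case beyond the checking period are all hypothetical.

```python
def classify_arrival(lateness_ps, tb_ps, ed_ps, num_ed=2):
    """Classify a data arrival relative to the clock edge.

    lateness_ps: how long after the clock edge the data settles
                 (zero or negative means the data met timing).
    tb_ps: assumed width of the time-borrowing (TB) interval.
    ed_ps: assumed width of each error-detection (ED) interval.
    num_ed: number of ED intervals in the checking period.
    """
    if lateness_ps <= 0:
        return "on_time"
    if lateness_ps <= tb_ps:
        return "masked"              # borrowed time, no error flag
    if lateness_ps <= tb_ps + num_ed * ed_ps:
        return "masked_and_flagged"  # error flag asserted
    return "unmasked"                # beyond the checking period


for lateness in (-20, 30, 120, 400):
    print(lateness, classify_arrival(lateness, tb_ps=50, ed_ps=100))
```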
 
 
Figure 11: Schematic of TIMBER flip-flop. (a) Main flip-flop part. (b) Clock signal 
control and generating part. [33] 
timing violation reaches the ED interval, then a timing error flag will be asserted. 
Because L is transparent for the entire checking period, the TIMBER latch propagates 
glitches and spurious transitions. In addition, because of its level-sensitive sampling 
feature, it does not have a meta-stability issue. 
Figure 12: TIMBER latch schematic. (a) Main latch part. (b) Clock generating and 
control part. [41] 
5. TIMBER: Circuit design
This section describes two sequential circuit elements — TIM-
BER flip-flop and TIMBER latch — that implement the TIMBER
architecture for a checking period with one TB and two ED inter-
vals. TIMBER flip-flop preserves the edge-sampling property of a
master-slave flip-flop because error masking is performed by bor-
rowing discrete time intervals. As a result, the TIMBER flip-flop-
based design requires error relay logic to determine the number of
time intervals required to mask errors in successive pipeline stages.
TIMBER latch eliminates the need for error relay logic by imple-
menting time-borrowing using a level-sampling latch. However,
TIMBER latch propagates glitches and spurious transitions during
the checking period. An important feature of both TIMBER flip-
flop and TIMBER latch is that there is an enable signal that allows
the time-borrowing mechanism to be disabled for operation as a
normal master-slave flip-flop.
5.1 TIMBER flip-flop
A TIMBER flip-flop consists of two master latches, M0 and M1,
and a common slave latch as shown in Fig. 3(a). The clock control
logic for the TIMBER flip-flop is shown in Fig. 3(b). The signal R
denotes the system reset signal and the signal EN is the enable sig-
nal. Time-borrowing in a TIMBER flip-flop can be turned off by
setting EN to zero. When EN is low, P0 is CK, and P1 is high.
Thus, M0 and the slave latch together function as a conventional
master-slave flip-flop and M1 is blocked because the transmission
gate P1 is open. In a conventional master-slave flip-flop, M0 sam-
ples the value of the data signal D at the rising edge of the CK and
drives the slave latch and the output Q to the sampled value when
CK is high. When CK goes low, the transmission gate P0 is open
and the slave latch drives the output Q.
When EN is high, the TIMBER flip-flop operates in the time-
borrowing mode. The three intervals in the checking period are
encoded using the select input signals, S1S0. S1S0 = 00 is the TB
interval and S1S0 = 01, 10 are the ED intervals. On system reset,
S1S0 is set to 00. Error masking based on time-borrowing happens
as follows. The master latch M0 samples the value of the data
signal, D, on the rising edge of clock and drives the slave latch and
the output, Q, to the sampled value. The master latch M1 samples
the data signal, D, on the rising edge of the delayed clock, DCK,
after a delay δ determined by the value of the select inputs S1S0.
On the rising edge of the delayed clock, DCK, the transmission gate
P0 opens and the transmission gate P1 closes. Thus, after delay δ,
for the rest of the clock period when CK is high, the master latch
M1 drives the slave latch and the output Q to the new value sampled
by M1. If no timing error has occurred, the master latches M0 and
M1 would sample the same value. Hence, M0 drives the slave latch
and the output to the correct value on the rising edge of CK, and no
time-borrowing occurs.
If a timing error occurs at the flip-flop, the master latches M0
and M1 sample different values, and M1 masks the timing error
after delay δ as follows. Recall that error masking in a TIMBER
flip-flop occurs by borrowing discrete time units. Suppose each
interval in the checking period has a duration of 100ps, and S1S0
is 00. If a timing error occurs due to a 80ps timing violation on the
data input, then the error is masked by the master latch M1 after
a 100ps delay, i.e., 100ps is borrowed from the next stage. Note
that TIMBER flip-flop does not suffer from data-path metastability
issues because a data-path signal violating setup time on the rising
edge of clock is masked by the delayed sampling of the data-path
signal by master latch M1. To mask multi-stage timing errors, error
relay logic configures the select inputs of TIMBER flip-flops in
successive pipeline stages as follows.
Figure 3: TIMBER flip-flop (a) design and (b) clock control.
Error relay: Consider a TIMBER flip-flop, f , with m TIMBER
flip-flops g1, g2, · · · , gm in the fanin cone of f . Denote S(gi) as
the select input to gi. If no error occurs at gi, then the select output
of gi is set to 00. If an error occurs at gi, then the select input S(gi)
is incremented by 1 to obtain the select output for gi. Incrementing
S(gi) by 1 ensures that the TIMBER flip-flop f can borrow an ad-
ditional time interval if a multi-stage timing error occurs at f . The
select input for f is obtained as the maximum over all the select
outputs from g1, g2, · · · , gm. The logic for generating the select
outputs at each TIMBER flip-flop using its select inputs is omitted
from Fig. 3(a) due to space constraints. Fig. 4 is the block dia-
gram for the error relay logic. Note that the error relay logic is
different from the error consolidation logic to the central error con-
trol unit. Recall that the error signal is latched on the falling edge
of the clock. Since the error relay logic must set the select inputs
before the next rising clock edge, the error relay logic can have a
maximum delay of half of the clock period. In Sec. 6, a case-study
for an industrial processor shows that the delay of the error relay
logic is much smaller than half a clock period. This is because the
error relay for a TIMBER flip-flop must occur only from a small
number of TIMBER flip-flops in its fanin cone that are both start
and endpoints of critical paths (refer Fig. 1).
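The error relay rule above can be sketched as a small behavioral model. This is an illustration only, not the circuit from the paper: the function names are hypothetical, and saturating the select value at the last ED code (binary 10) is an assumption that the text does not state.

```python
from __future__ import annotations

# Assumption: select codes are 00 (TB), 01 and 10 (ED); saturate at 10.
MAX_SELECT = 2


def select_output(select_input: int, error: bool) -> int:
    """Select output of one fan-in flip-flop g_i."""
    if not error:
        return 0  # no error at g_i: the select output is 00
    return min(select_input + 1, MAX_SELECT)  # error: S(g_i) incremented by 1


def select_input_for(fanin: list[tuple[int, bool]]) -> int:
    """Select input of f: maximum over the select outputs of g_1..g_m.

    Each fan-in entry is (select input of g_i, whether g_i saw an error).
    """
    return max((select_output(s, e) for s, e in fanin), default=0)


if __name__ == "__main__":
    # g1 saw no error; g2 masked an error while its own select input was 00.
    print(format(select_input_for([(0, False), (0, True)]), "02b"))
```

Taking the maximum means that f is configured for the worst case among its fan-in flip-flops, so a single upstream error is enough to give f one more interval to borrow.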
Figure 4: TIMBER flip-flop error relay logic.
Fig. 5 shows SPICE waveforms for error masking when a two-
stage timing error occurs on two TIMBER flip-flops, f1 and f2,
on successive pipeline stages. The signals D1 (D2), Q1 (Q2), and
Err1 (Err2) are the data, output, and error signals for flip-flop f1
(f2). The first timing error, occurring at flip-flop f1, is masked by
borrowing one TB time interval at f1. Although the timing error
is not flagged to the central error control unit (Err1 signal is 0),
the error relay logic configures the select inputs of flip-flop f2 to
01. Thus, when a two-stage timing error occurs at flip-flop f2, the
error is masked by borrowing a TB and an ED time interval at f2.
The timing error at f2 is flagged to the central error control unit by
latching the error signal (Err2 signal goes high) on the subsequent
falling edge of CK.
Figure 5: Two-stage timing error in a TIMBER flip-flop design. (Waveforms of CK, D1, Q1, Err1, D2, Q2, and Err2, with the TB and ED intervals and the error masking in f1 and f2 marked; time axis ×10^−9.)
5.2 TIMBER latch
TIMBER latch implements time-borrowing in continuous units
using a level-based sampling of the data using a pulse-gated latch.
A TIMBER latch consists of a master and a slave latch as shown
in the circuit schematic in Fig. 6(a). The clock control logic for
a TIMBER latch is shown in Fig. 6(b). The signal R denotes the
system reset signal and the signal EN is the enable signal. Time-
borrowing in a TIMBER latch can be turned off by setting EN to
zero. When EN is low, the transmission gate L is open and the
TIMBER latch operates as a conventional master-slave flip-flop.
Figure 6: TIMBER latch (a) design and (b) clock control.
When EN is high, the TIMBER latch operates in the time-bor-
rowing mode. In this mode, the transmission gate F is open and the
master latch and slave latch operate independently as pulse-gated
latches. The checking period is divided into one TB and one ED
interval. Note that this ED interval is equivalent to the sum of the
ED intervals in the TIMBER flip-flop. The master latch is trans-
parent during the TB interval and the slave latch is transparent for
the entire checking period. A timing error is detected by compar-
ing the values stored in the master latch and the slave latch on the
falling edge of the clock. When a single-stage timing error occurs,
the timing violation of the late arriving data signal lies within the
TB time interval. The timing error is masked because the slave
latch is transparent for the entire checking period. Since the mas-
ter is also transparent for the TB interval, both the master latch
and slave latch hold the same value and hence, a timing error is
not flagged. However, if a two-stage timing error occurs such that
the timing violation of the late arriving data signal is greater than
the TB interval, then the master and slave latches sample different
values, and a timing error is detected and flagged to the central er-
ror control unit. Recall that a TIMBER latch masks timing errors
by borrowing continuous time units. Suppose the TB interval is
100ps and a timing violation of 80ps occurs at a TIMBER latch,
then the error is masked by borrowing 80ps from the next stage.
Since the slave latch is transparent for the entire checking period,
error relay logic is not required. However, TIMBER latch propa-
gates glitches and spurious transitions during the checking period.
Note that TIMBER latch does not have metastability issues because
level-sensitive sampling is used for time-borrowing.
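The difference between discrete borrowing (flip-flop) and continuous borrowing (latch) can be made concrete with the 100ps/80ps numbers used in the text. The sketch below is a simplified model with hypothetical function names. It assumes that a flip-flop with select value S samples after S + 1 whole intervals, which matches the described cases (00 borrows one TB interval, 01 borrows a TB and an ED interval) but is not stated as a general formula in the text.

```python
def ff_borrowed_ps(select: int, interval_ps: float) -> float:
    """TIMBER flip-flop: borrows whole intervals, set by the select input.

    Assumption: select value S corresponds to a delay of (S + 1) intervals.
    """
    return (select + 1) * interval_ps


def latch_borrowed_ps(violation_ps: float) -> float:
    """TIMBER latch: borrows exactly the amount of the timing violation."""
    return violation_ps


def latch_flags_error(violation_ps: float, tb_ps: float) -> bool:
    """A latch error is flagged only when the violation exceeds the TB interval."""
    return violation_ps > tb_ps


if __name__ == "__main__":
    violation, interval = 80.0, 100.0
    print("flip-flop borrows:", ff_borrowed_ps(0, interval), "ps")
    print("latch borrows:", latch_borrowed_ps(violation), "ps")
    print("latch flags error:", latch_flags_error(violation, interval))
```

For the same 80ps violation, the flip-flop passes 100ps of delay to the next stage while the latch passes only 80ps, which is why the latch needs no error relay logic but must tolerate glitches during the checking period.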
Figure 7: Two-stage timing error in a TIMBER latch design. (Waveforms of CK, D1, Q1, Err1, D2, Q2, and Err2, with the TB and ED intervals and the error masking in l1 and l2 marked; time axis ×10^−9.)
Fig. 7 shows SPICE waveforms for error masking when a two-
stage timing error occurs on two TIMBER latches, l1 and l2, on
successive pipeline stages. The signals D1 (D2), Q1 (Q2), and Err1
(Err2) are the data, output, and error signals for latch l1 (l2). The
first timing error, occurring at latch l1, can be masked by borrowing
the time unit TB. Hence, the timing error is not flagged (Err1 signal
is 0). When a two-stage timing error occurs at latch l2, the error is
masked by borrowing a TB and an ED time interval. The timing
error at latch l2 is flagged by latching the error signal (Err2 signal
goes high) on the subsequent falling edge of clock CK.
6. TIMBER case study
We present results from a case-study when TIMBER is inte-
grated into an industrial processor. Three processor performance
points — low, medium, and high — each with four checking peri-
ods of 10%, 20%, 30%, and 40% of the clock period are considered.
For a checking period equal to c% of the clock period, all flip-flops
terminating at the top c% critical paths are replaced by a TIMBER
sequential circuit element (TIMBER flip-flop or TIMBER latch).
 
Figure 12: TIMBER latch schematic. (a) Main latch part. (b) Clock signal control and 
generating part.[33] 
 
TEAtime [34] (Timing Error Avoidance) adjusts the clock frequency in situ to avoid 
operating a circuit at an unnecessarily low frequency. The longest critical path is used 
as a checker for the main circuitry to ensure correct operation, as shown in Figure 13. A 
toggle flip-flop feeds into the checker to test whether a result can propagate through the 
longest delay within the current clock period. While the checker's values remain equal, 
the counter increments, the voltage increases, and the clock frequency increases. The 
clock frequency is decreased by the same process in reverse. TEAtime uses a 
bi-directional counter, a digital-to-analog (D/A) converter, and a voltage-controlled 
oscillator (VCO). The prototype design can experience meta-stability. 
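The feedback loop described above (checker, bi-directional counter, D/A converter, VCO) can be illustrated with a toy behavioral model. Everything numeric here is hypothetical: the D/A converter and VCO are collapsed into a single linear map from counter value to frequency, and the constants are chosen only to make the behavior visible.

```python
F_MIN_MHZ = 100.0      # hypothetical VCO frequency at counter value 0
MHZ_PER_COUNT = 10.0   # hypothetical combined D/A + VCO gain
MAX_COUNT = 255        # hypothetical counter range


def frequency_mhz(counter: int) -> float:
    """Linear stand-in for the D/A converter followed by the VCO."""
    return F_MIN_MHZ + MHZ_PER_COUNT * counter


def teatime_step(counter: int, checker_delay_ns: float) -> int:
    """One update of the bi-directional counter."""
    period_ns = 1000.0 / frequency_mhz(counter)
    checker_ok = checker_delay_ns < period_ns  # toggled value arrived in time
    if checker_ok:
        return min(counter + 1, MAX_COUNT)  # count up: raise the frequency
    return max(counter - 1, 0)              # count down: lower the frequency


def settle(checker_delay_ns: float, steps: int = 200) -> float:
    """Run the loop from the lowest frequency and return the final frequency."""
    counter = 0
    for _ in range(steps):
        counter = teatime_step(counter, checker_delay_ns)
    return frequency_mhz(counter)


if __name__ == "__main__":
    print(settle(checker_delay_ns=2.0), "MHz")
```

In this model the counter climbs until the clock period no longer covers the checker delay, then dithers by one count around that limit, so the clock tracks the fastest frequency the checker path can sustain.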
Error resilience mechanisms: 
Error resilience is another part of BTWC designs. The mechanisms used by the 
methodologies can be categorized as follows: 
1. Error detection + Rollback/Instruction replay: Approaches that use this scheme 
normally include duplicated registers or a transition detection mechanism with 
a delayed clock to capture signals that violate timing. To recover from a timing 
error, the main system is suspended and restored to the correct value from either 
the duplicated register or a replay of the instruction. The usual pitfalls of this 
scheme are: (1) a limited detection window, (2) a prolonged hold time 
requirement, and (3) the issue of meta-stability.  
2. Error masking: For a given logic circuit, errors can be masked by an approximate 
logic circuit that predicts the correct value [35], [36]. For every output, the logic 
can be expressed with either a 0-implication or a 1-implication approximate 
function. These functions are used to detect 1-to-0 or 0-to-1 errors. The type of 
approximate function for an output is determined by computing its dominant 
type of error.  
 
The work for this dissertation first needed to identify the optimized operating 
clock frequency (assuming a known threshold value for the maximum error 
tolerance). Then, the design is modified to reduce errors according to the typical activity 
for the given input workload. Traditional EDAC modules cannot provide an 
estimate of the error rate at a speculative operating clock frequency unless a 
simulation is actually performed at that frequency. However, sweeping simulations through 
all-possible clock frequencies is too inefficient for a real-world design flow. 
Therefore, an all-clock-frequency error estimation method has been developed as part of 
this dissertation research. It enables accurate error prediction for each primary output at 
all-possible speculative operating clock frequencies with only one simulation 
at the original, error-free operating frequency. 
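The procedure is not spelled out at this point in the text, so the following is only one plausible reading of how a single error-free simulation could support predictions at every clock period. It assumes that the simulation records, for one primary output, the settling time of each input vector; a vector then counts as a timing error at any clock period shorter than its settling time. The function names and sample data are hypothetical.

```python
from __future__ import annotations


def error_rate(settle_times_ns: list[float], clock_period_ns: float) -> float:
    """Fraction of profiled input vectors that settle after the clock period."""
    late = sum(1 for t in settle_times_ns if t > clock_period_ns)
    return late / len(settle_times_ns)


def shortest_period(settle_times_ns: list[float], max_error_rate: float) -> float:
    """Smallest candidate period whose predicted error rate meets the threshold.

    Only the observed settling times need to be tried, because the predicted
    error rate changes only at those values.
    """
    for period in sorted(set(settle_times_ns)):
        if error_rate(settle_times_ns, period) <= max_error_rate:
            return period
    raise ValueError("no settling times were provided")


if __name__ == "__main__":
    # Hypothetical settling times (ns) for one primary output over ten vectors.
    times = [1.0, 1.2, 1.1, 2.0, 1.3, 1.0, 1.1, 1.2, 1.4, 1.9]
    print(error_rate(times, 1.5))
    print(shortest_period(times, max_error_rate=0.2))
```

Under this reading, the profile is collected once and every speculative frequency is evaluated by a simple count, which is what removes the need for a simulation sweep.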
 
Table 1: Summary of several EDAC methodologies 
 
Methodology  | Detection Methods     | EDAC type                                       | Recovery Method                           | Application Structure       | Pros and Cons
-------------|-----------------------|-------------------------------------------------|-------------------------------------------|-----------------------------|--------------------------------
Razor        | Shadow latch          | Detection                                       | Restore from register                     | Pipeline                    | Meta-stability; Complexity
Razor II     | Transition detection  | Detection                                       | Architecturally handle instruction replay | Pipeline                    | Complexity
Bubble Razor | Shadow latch          | Detection                                       | Local replay                              | Pipeline                    | Less hold time to restore
TIMBER       | Duplicate paths       | Partial Error detection; Partial Error masking  | No                                        | Standard sequential circuit | Limited functions
TEAtime      | Monitor critical path | Error masking                                   | Instruction replay                        | Standard sequential circuit | One path monitor; Meta-stability
 
 
Evaluation Methods For BTWC Design 
A circuit's performance can be evaluated from three aspects: (1) operational speed, 
(2) power consumption, and (3) operational reliability. BTWC designs attempt to either 
enhance computational efficiency or lower energy usage while maintaining a robust 
design. 
Performance: 
Operational speed can be measured by the clock frequency ($f$), but throughput is 
normally used to measure the performance of a BTWC design. The input may need 
to stall several cycles for the correction penalty when a timing error occurs in a BTWC 
design. For a traditional design, all the primary outputs (POs) are bounded by the 
desired cycle time ($T_{cycle}$), so the probability ($P$) that each PO stabilizes (i.e., settles) 
within the cycle time is 1. Therefore, assuming that one operation is completed per cycle, 
the throughput of the traditional design is [19]: 
$\mathit{Throughput} = P \cdot f = f = 1/T_{cycle}$ 
However, the idea of BTWC design is to make the highly active paths with a long delay 
settle before the cycle time, while permitting some of the less time-critical paths to 
exceed the boundary on occasion. Thus, considering the error correction penalty ($r$), the 
throughput of a BTWC design is [19]: 
$\mathit{Throughput} = P \cdot f + (1 - P) \cdot \frac{f}{r}$, 
and the energy cost of BTWC designs is inversely proportional to the throughput. 
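A short numeric sketch may help compare the two cases. It assumes that the penalty term has the form f/r, i.e., an operation that incurs a timing error occupies r cycles instead of one; the function name and the example numbers are hypothetical.

```python
def btwc_throughput(p_settle: float, f_hz: float, penalty_cycles: float) -> float:
    """Throughput = P*f + (1 - P)*f/r (the f/r penalty form is an assumption)."""
    if not 0.0 <= p_settle <= 1.0:
        raise ValueError("p_settle must be a probability")
    if penalty_cycles < 1:
        raise ValueError("penalty_cycles must be at least one cycle")
    return p_settle * f_hz + (1.0 - p_settle) * f_hz / penalty_cycles


if __name__ == "__main__":
    # Traditional design: every output settles in time (P = 1) at 1 GHz.
    baseline = btwc_throughput(1.0, 1.0e9, penalty_cycles=10)
    # Hypothetical BTWC design: 20% faster clock, 1% of operations pay 10 cycles.
    speculative = btwc_throughput(0.99, 1.2e9, penalty_cycles=10)
    print(baseline, speculative, speculative / baseline)
```

With these example numbers the faster clock more than pays for the occasional penalty; the gain shrinks as P drops or as r grows.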
Power: 
The power consumption of a conventional CMOS digital circuit can be separated into 
three types of dissipation [37][38][39][40]: (1) switching power, (2) short-circuit power, 
and (3) leakage power. Switching power is dissipated during signal transitions, when 
energy is drawn from the power supply to charge the device capacitances. Short-circuit 
power is dissipated during the moment when both the PMOS network and the NMOS 
network are simultaneously on in CMOS logic. The MOSFETs in CMOS logic normally 
have some non-zero reverse leakage and sub-threshold current, which causes leakage 
power dissipation. The sum of switching power and short-circuit power is categorized as 
dynamic power, while leakage power is also called static power dissipation [41]. Dynamic 
power is dominated by switching power, while leakage mainly comes from the 
sub-threshold leakage current. Static power increases faster than dynamic power as the 
feature size shrinks. Reducing the supply voltage is an efficient way to reduce total power 
consumption, but it may increase timing delay and cause an exponential increase in 
leakage [29][31]. Multi-threshold voltage, where a low threshold voltage is used for cells 
on critical paths and a high threshold voltage is used for the other cells, is a widely 
accepted technique to reduce power [43]. It has also been used in BTWC design to 
improve power.  
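To show why supply-voltage reduction is so effective, the sketch below uses the standard first-order model of switching power, activity × capacitance × Vdd² × frequency. This formula is textbook background rather than something stated in the text, and the function names and parameter values are hypothetical.

```python
def switching_power_w(activity: float, cap_farads: float,
                      vdd_volts: float, freq_hz: float) -> float:
    """First-order switching power: alpha * C * Vdd^2 * f."""
    return activity * cap_farads * vdd_volts ** 2 * freq_hz


def total_power_w(switching_w: float, short_circuit_w: float,
                  leakage_w: float) -> float:
    """Dynamic power (switching + short-circuit) plus static (leakage) power."""
    return (switching_w + short_circuit_w) + leakage_w


if __name__ == "__main__":
    nominal = switching_power_w(0.1, 1e-9, 1.0, 1e9)  # hypothetical values
    scaled = switching_power_w(0.1, 1e-9, 0.8, 1e9)   # supply lowered by 20%
    print(nominal, scaled, 1.0 - scaled / nominal)
```

Because the supply voltage enters quadratically, a 20% reduction removes about 36% of the switching power in this model; the delay and leakage effects mentioned above are not modeled here.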
 
Reliability: 
 The reliability of a BTWC design focuses on the detection and correction of timing 
errors. Because of the nature of BTWC design, timing errors would invalidate the results 
during operation, so an effective error detection and correction (EDAC) mechanism is 
crucial. The evaluation criteria include answering the following questions:  
• How complex is the implementation of the method?  
• What penalty is the design going to pay? 
• What is the detection/correction rate for the method? 
Timing speculation is the idea of using various methodologies to raise the 
operational speed to the point where timing errors occur, while equipping the design 
with techniques to detect and correct those timing errors [44][8]. Based on this idea, 
BTWC design allows the timing error rate to rise to the point where the performance gain 
(either in speed or in power) is effectively balanced against the penalty cost for reliability. 
BTWC design has adopted a cross-layered approach [45][46][47] from the architectural 
level down to the circuit level. 
 
Multi-threshold Technology In VLSI Designs 
Threshold voltage is the minimum voltage applied to a MOSFET gate to create a 
conducting path between the source and the drain. Ideally, the MOSFET acts like a 
switch. However, during the OFF state, mobile carriers (i.e., electrons or holes) 
still travel through the semiconductor junctions, which is called sub-threshold leakage. 
With technology scaling, leakage power consumption is now a major concern for 
the semiconductor industry, and sub-threshold leakage is the main contributor to 
the leakage current.  
The sub-threshold leakage is directly related to the threshold voltage, as it controls 
the size of the depletion region. A higher threshold voltage reduces the sub-threshold 
leakage, but it limits the cell’s response speed. On the other hand, a lower 
threshold voltage reduces the propagation delay but results in a dramatic increase of 
leakage power in such small-geometry devices. 
Many previous works studied how to use multiple threshold voltages in one 
design. Mutoh et al. [48] introduced multi-threshold technology for 0.5-μm CMOS, in 
which low-threshold MOSFETs are used to enhance speed while high-threshold MOSFETs 
are used to reduce leakage power. Wang and Vrudhula [49] introduced a heuristic 
algorithm based on circuit graph enumeration to effectively reduce the leakage power of 
CMOS digital circuits without too much impact on speed. Wei et al. [50] proposed a 
dual-threshold approach that reduces leakage power by assigning high-threshold-voltage 
cells to non-critical paths and low-threshold-voltage cells to critical paths, and 
introduced an algorithm to optimize the selection.  
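The general idea of dual-threshold assignment can be illustrated with a simple greedy pass over a gate-level graph. This is a generic sketch, not the algorithm of [49] or [50]: the function names, the 1.5x delay penalty for high-Vt cells, and the example netlist are all hypothetical.

```python
from __future__ import annotations


def arrival_times(gates: list[str], fanin: dict[str, list[str]],
                  delay: dict[str, float]) -> dict[str, float]:
    """Longest-path arrival time at each gate; gates must be in topological order."""
    at: dict[str, float] = {}
    for g in gates:
        at[g] = delay[g] + max((at[p] for p in fanin.get(g, [])), default=0.0)
    return at


def assign_dual_vt(gates: list[str], fanin: dict[str, list[str]],
                   low_vt_delay: dict[str, float], high_vt_penalty: float,
                   clock_period: float) -> dict[str, str]:
    """Start with all low-Vt cells; keep a high-Vt swap only if timing still holds."""
    delay = dict(low_vt_delay)
    vt = {g: "low" for g in gates}
    for g in gates:
        trial = dict(delay)
        trial[g] = low_vt_delay[g] * high_vt_penalty  # slower, lower-leakage cell
        if max(arrival_times(gates, fanin, trial).values()) <= clock_period:
            delay, vt[g] = trial, "high"
    return vt


if __name__ == "__main__":
    gates = ["a", "b", "c", "d"]             # topological order
    fanin = {"b": ["a"], "d": ["b", "c"]}    # a -> b -> d is the critical path
    low = {"a": 1.0, "b": 1.0, "c": 0.5, "d": 1.0}
    print(assign_dual_vt(gates, fanin, low, high_vt_penalty=1.5, clock_period=3.0))
```

In the example, only gate c has enough slack to take the slower high-Vt cell, so the critical path keeps its low-Vt cells while the off-critical gate saves leakage.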
 The tradeoffs for dual-Vt CMOS circuits have been explored by Wang and 
Vrudhula in [51]. Detailed simulations were performed to investigate the short-circuit 
power dissipation of dual-Vt technology and the impact of short-circuit current in low-Vt 
MOSFETs on gate delay. The multiple power models of dual-Vt technology also create 
challenges for EDA tool development. 
 Jayakumar and Khatri [52] prepared a pull-up circuit and a pull-down circuit with 
different-Vt standard cells for standby mode. After the traditional mapping using regular 
cells, they replace the cells with the prepared low-leakage cells according to the 
simulation results of each gate’s output. The methodology is compared with the regular 
multi-threshold CMOS methodology and shows better leakage reduction. 
 Most previous dual-Vt/multi-Vt methodologies were targeted at reducing power 
consumption or ensuring resiliency when applying dynamic voltage scaling (DVS). In 
this research, dual-Vt technology is used to adjust the timing of specific paths to 
precisely reduce timing errors during timing speculation. 
 
Timing Speculation vs. Instruction Speculation 
Speculation can have different interpretations for people from different research 
fields. One ambiguity comes from the computer architecture field, where researchers 
commonly use the term for speculative execution based on branch prediction or 
out-of-order execution. Based on the history of branch executions, speculative execution 
schemes allow instructions to be scheduled ahead, before the outcome of a conditional 
branch has been determined, in order to utilize the microarchitectural resources 
more efficiently [31]. However, this widely used optimization technique in modern 
computer architecture was shown to have security vulnerabilities in January 2018, affecting Intel 
x86 microprocessors, IBM POWER processors, and some ARM-based microprocessors. 
One of the vulnerabilities, Meltdown [53], arises between memory accesses and privilege 
checking during instruction processing. The microprocessor’s cache holds the 
unauthorized address because of the out-of-order execution, from which the data can be 
recovered. The other vulnerability, Spectre [54], uses the information leakage from 
branch predictions via cache timing as a side-channel attack to manipulate the target 
process.  
The timing speculation discussed in this dissertation is approached at the 
circuit level. The traditional clock frequency is bounded by the worst-case delays. 
The purpose of circuit-level timing speculation is to operate the circuit at a higher 
clock frequency in order to gain execution speed. This work involves no structural 
modification to the circuit and no out-of-order instruction manipulation to achieve 
circuit-level timing speculation. Therefore, it does not enable the side-channel attacks 
that have been used for Spectre and Meltdown.
CHAPTER III 
RESEARCH METHODOLOGY AND BTWC DESIGN FLOW 
This chapter introduces the details of the methodology used for the research in 
this dissertation. The chapter describes the simulation environment setup as well as the 
BTWC design flows used to evaluate the approach. Four benchmark circuits from 
ISCAS85 [48][49] were used to represent four different types of functions (Table 2).  
Table 2: Overview of the circuits used in the analysis 
Name    Function                          Input #   Output #   Cell # 
C432    27-channel interrupt controller   36        7          160 
C880    8-bit ALU                         60        26         383 
C1908   16-bit SEC/DED                    33        25         880 
C6288   16x16-bit multiplier              32        32         2406 
 
 
General EDA Flow And The Design Flow Used In This Work 
The general EDA flow with Synopsys tools [57] is shown in Figure 13 to give a 
brief introduction to the Synopsys design tool kit. Circuit functionality is verified 
during RTL simulation. Design Compiler (DC) synthesizes the benchmark source file 
with standard cell modules to generate a gate-level netlist. IC Compiler (ICC) performs 
place-and-route, which generates accurate timing information and stores it 
in the .sdf file. The post-P&R simulations are performed in VCS with back-
annotated timing information (.sdf). The .vcd file stores the simulation results with all the 
switching activity information for the circuit across the entire simulation time; this file is 
the essential raw data for the research. 
 
Figure 13: The customized EDA flow with Synopsys tools. 
 
[Figure 13 blocks: Verilog testbench file (.v); Verilog source file (.v); standard cell 
models (.v); standard cell library (.db, .fr, .tf, .map); timing and area constraints (.sdc); 
DC; gate-level netlist (.v); ICC; layout; standard delay format file (.sdf); VCS post-P&R 
simulation; .vcd file.] 
In Figure 14, the customized design flow used in this research is shown. The 
whole design flow uses the Synopsys design tool suite (Design Compiler, IC Compiler, 
PrimeTime, and VCS) with the Synopsys 32-nm library and customized Python scripts to 
implement, simulate, and analyze the designs.  
• After implementation with Design Compiler, the test circuit is placed and routed by IC 
Compiler. The delay information for the whole circuit, including wire interconnect delay, 
is saved in the standard delay format file (.sdf file). The gate-level netlist from IC 
Compiler is simulated in VCS with the timing information from the .sdf file back-annotated. 
After simulation, the value change dump file (.vcd file) is generated, which records the 
switching activity of all nodes during the simulation time. For the same given input 
workload of a test circuit, each possible operating clock frequency has its own .vcd 
file. 
• Several customized scripts process the .vcd files and perform the 
following tasks: (1) process the .vcd files to extract the important 
information for later use; (2) produce the error estimation for a specific speculative clock 
frequency based on the statistical histogram of each PO’s settling time, in order to identify the 
real error-contributing POs for the given workload; and (3) calculate the 
activity level of the internal cells to identify the critical cells. 
• According to the cell replacement rule, the test circuit is re-synthesized with low 
threshold voltage (Low-Vt) cells in place of the identified critical cells. 
• After the Dual-Vt resynthesis, the proposed error-checking method is used to measure 
the real error rate. The error-checking method uses the same activity .txt file 
processed from the .vcd files as the error estimation. The POs’ settling status in the 
speculative clock frequency’s .vcd file is compared with the golden copy’s .vcd file cycle by 
cycle, and the error count of the POs is calculated.  
The details of each step are discussed in the following chapters.  
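
The cycle-by-cycle comparison in the last step can be illustrated with a short Python sketch. This is a minimal sketch and not the scripts used in this work: it assumes that the settled value of each PO at the end of every cycle has already been extracted from the golden and the speculative .vcd files, and the names count_errors, golden, and tested are hypothetical.

```python
def count_errors(golden, tested):
    """Count per-PO mismatches between the golden and the speculative run.

    golden, tested: dict mapping a PO name to a list holding that PO's
    settled value at the end of each cycle (assumed to be extracted
    beforehand from the respective .vcd files).
    """
    errors = {}
    for po, golden_values in golden.items():
        tested_values = tested[po]
        if len(tested_values) != len(golden_values):
            raise ValueError(f"cycle count mismatch for {po}")
        # One error is counted for every cycle whose settled value differs.
        errors[po] = sum(
            1 for g, t in zip(golden_values, tested_values) if g != t
        )
    return errors


# Hypothetical four-cycle example for two POs.
golden = {"N2891": [0, 1, 1, 0], "N2892": [1, 1, 0, 0]}
tested = {"N2891": [0, 1, 0, 0], "N2892": [1, 1, 0, 0]}
errors = count_errors(golden, tested)
```

Because the comparison only reads previously saved data, it can be repeated for any speculative clock frequency without rerunning the simulation.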
 
Figure 14: The proposed design flow chart. 
 
 
Value Change Dump Files 
The .vcd file was a significant component in this work. It was used not only for the 
error checking method, but it was also the basis for error estimation and other activity 
analysis in this work. The .vcd file contains both the switching activity and the timing 
information generated during the simulation. The .vcd files used in this work were 
generated by the Synopsys simulator. Figure 15 is an example of a .vcd file from an 
actual simulation. A .vcd file records the switching activity of all the nodes or of a selected 
hierarchy. Figure 15(a) is the header part, which defines the correspondence 
between each node name and the symbol that is used later in the .vcd file. Each node has 
an assigned symbol. Figure 15(b) is the body part, which records the value changes of all nodes 
throughout the simulation time. The entries starting with the symbol “#” are timestamps. 
The subsequent entries are the nodes that switched at that time point. The first character is the 
new value of the node, and the remaining characters are the symbol of the 
node. The time unit is ps. One node may change multiple times within one cycle.  
 
Figure 15: An example of a value change dump file. (a) is the header part, and (b) is 
the body part. 

(a) Header part: 
    $scope module U100 $end
    $var wire 1 E Y $end
    $var wire 1 q A1 $end
    $var wire 1 ~ A2 $end
    $var wire 1 "! A3 $end
    $upscope $end
    $scope module U101 $end
    $var wire 1 u Y $end
    $var wire 1 Y A $end
    $upscope $end
    $scope module U102 $end
    $var wire 1 t Y $end
    $var wire 1 m A $end
    $upscope $end

(b) Body part: 
    #2424
    0t
    0"`
    0s
    #2457
    0"C
    #2463
    0"E
    0#3
    #2485
    0~
    0"f
    #2486
    0"4
    #2493
    0|
    #2539
    0"W

With the understanding of the .vcd file contents, the customized error 
estimator and error checker in this work are both based on the information contained in 
the .vcd file. More details about the error estimation and error checking methods are 
given in Chapter IV and Chapter V. 
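
To make the file layout concrete, the following Python sketch reads the two parts of a .vcd file described above. It is a minimal sketch under stated assumptions, not the parser used in this work: it handles only the constructs visible in Figure 15 ($scope, $var, $upscope, “#” timestamps, and 1-bit value changes), it keeps only one node name per symbol, and the function name parse_vcd is hypothetical.

```python
def parse_vcd(lines):
    """Return (symbol -> node name, list of (time_ps, symbol, value))."""
    symbols = {}   # header part: symbol -> hierarchical node name
    scope = []     # current module hierarchy while reading the header
    changes = []   # body part: one entry per value change
    now = 0        # most recent timestamp seen in the body
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        tokens = line.split()
        if tokens[0] == "$scope":
            scope.append(tokens[2])          # $scope module <name> $end
        elif tokens[0] == "$upscope":
            scope.pop()
        elif tokens[0] == "$var":
            # $var <type> <width> <symbol> <name> $end
            symbols[tokens[3]] = ".".join(scope + [tokens[4]])
        elif line.startswith("$"):
            continue                         # other keywords are ignored here
        elif line.startswith("#"):
            now = int(line[1:])              # timestamp in ps
        elif line[0] in "01xzXZ":
            # first character is the new value, the rest is the symbol
            changes.append((now, line[1:], line[0]))
    return symbols, changes


# A few lines taken from Figure 15.
example = [
    "$scope module U102 $end",
    "$var wire 1 t Y $end",
    "$var wire 1 m A $end",
    "$upscope $end",
    "#2424",
    "0t",
    "#2493",
    "0|",
]
symbols, changes = parse_vcd(example)
```

In this example the symbol t resolves to the output Y of cell U102, which switched to 0 at 2424 ps.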
CHAPTER IV 
ALL-CLOCK-FREQUENCY ERROR-ESTIMATION 
In traditional design, the designer normally focuses on reducing the 
propagation delay of the static critical paths in order to improve the operating speed of 
the circuit. Timing errors are observed at a PO when: (1) the PO is activated by an 
input vector during that cycle, and (2) the settling time is longer than the current 
operating clock period. The PO corresponding to the static critical path may not 
always be the largest error-contributor. Identifying the real error-
contributors helps to reduce errors more effectively. However, no existing 
commercial EDA tool directly provides internal activity analysis coupled with error 
estimation information. In this chapter, the detailed methodology of the all-clock-
frequency error-estimation is described, and error estimation results are discussed. 
 
Obtaining Outputs Settling Behavior 
As shown in Chapter III, the .vcd file contains the switching activity and the 
switching timestamps of all nodes. A timing error occurs when an output settles after the 
required clock time. For each PO, only the last switching time of each cycle is important 
for error estimation. By processing and analyzing the switching information of the output 
nodes saved in the .vcd file, it is possible to characterize the settling behavior of the 
primary outputs.  
The error estimation methodology in this work is designed to analyze each PO 
individually. The first step is to extract all of 
the switching timestamps of the selected PO node, with one entry per cycle. Then, the last 
switching timestamp is recorded for this PO in every cycle. A histogram can then be 
plotted to obtain this PO’s settling behavior for the given input workload. The histogram 
helps to predict the real error-contributing POs for the given workload. Figure 16 is the 
flow chart of the algorithm that processes the raw .vcd file in order to extract the 
timestamps and node activity. The example code for Benchmark C1908, PO N892, is 
listed in Appendix Part A. In Part A Section (1), the code prepares all transition 
timestamps of the given PO, and in Part A Section (2), the code extracts the settling 
timestamps of the given PO. The bash scripts that automate the process are not included 
in the Appendix. 
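
The second step, keeping only the last switching timestamp of each cycle, can be sketched as follows. This is an illustrative sketch rather than the Appendix code: it assumes that cycle k spans the interval from k times the clock period up to (but not including) the next cycle start, that settling times are measured from the start of the cycle, and that a cycle without any switching is recorded as 0. The function name settling_times and the sample timestamps are hypothetical.

```python
def settling_times(change_times_ps, clock_period_ps, n_cycles):
    """Last switching time of one PO in each cycle, relative to cycle start.

    change_times_ps: absolute timestamps (ps) at which the PO switched.
    A cycle with no switching is recorded as 0 (the PO was never active).
    """
    settle = [0] * n_cycles
    for t in change_times_ps:
        cycle = t // clock_period_ps
        if cycle < n_cycles:
            offset = t - cycle * clock_period_ps
            # Keep only the latest transition seen in this cycle.
            settle[cycle] = max(settle[cycle], offset)
    return settle


# Hypothetical PO that switches twice in cycle 0 and twice in cycle 2,
# simulated with a 2200 ps clock period.
settle = settling_times([450, 1820, 4600, 4650], 2200, 3)
```

The resulting list has one entry per cycle, which is the form needed for the settling time histogram.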
 
 
Figure 16: Algorithm flow chart of switching time stamp extraction for a specific 
node. 
[Figure 16 flow chart: read in one line of the .vcd file; check whether the line is a 
keyword line (starts with ’$’); check whether the line is a timestamp (starts with ’#’) and, 
if it begins a new cycle, start a new entry; check whether the line records a node status 
(does not start with ’#’) and whether it concerns the node of interest, in which case the 
current timestamp is saved; read in the next line and repeat until the last line of the 
.vcd file.] 
Categories Of Primary Outputs  
The operating speed of a circuit has traditionally been based on the longest static 
propagation delay. However, this path may not always be the most active one. As a 
result, faster but highly active paths can produce more observable errors. 
Commercial EDA tools do not directly provide the internal activity analysis for the 
various paths of the circuit. Knowing how frequently a PO is activated by the given input 
workload, and how likely the PO is to settle later than the required clock time, are the key 
factors in identifying the real error-contributor POs. This information can then be used to 
optimize a BTWC design.  
In a circuit, the POs can be categorized into several types:  
I. Safe-POs: all paths to the output have a shorter propagation delay than the 
clock period. 
II. Error-possible-POs: the worst-case path to the output has a longer propagation 
delay than the clock period. 
III. Error-prone-POs: the worst-case path to the output has a longer 
propagation delay than the clock period, and the output has high switching activity. 
Category II and Category III overlap when only the PO’s 
activity level is considered, because activity does not directly indicate the error rate. In fact, using just 
the output’s activity level to predict the real error-contributors is not accurate enough. 
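
The three categories can be expressed as a small classification routine. The sketch below is illustrative only: the function name classify_po is hypothetical, and because no numeric value for “high switching activity” is given here, the activity threshold is an assumed, optional parameter.

```python
def classify_po(static_delay_ns, clock_period_ns,
                activity_rate=None, activity_threshold=None):
    """Return the category (I, II, or III) of one primary output."""
    if static_delay_ns < clock_period_ns:
        return "I"    # safe: even the worst-case path meets the clock
    if (activity_rate is not None and activity_threshold is not None
            and activity_rate >= activity_threshold):
        return "III"  # error-prone: slow worst case and highly active
    return "II"       # error-possible: slow worst case


# Static delays (ns) of three C1908 outputs, taken from Table 4,
# classified at a 1.7 ns clock period.
delays = {"N2899": 2.20, "N2892": 1.81, "N2781": 1.67}
categories = {po: classify_po(d, 1.7) for po, d in delays.items()}
```

Since Category III is a refinement of Category II, an output is reported as Category III only when an activity threshold is supplied.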
 
 
 
Figure 17: The total active cycles out of 1 million cycles of all error-possible outputs. 
Benchmark circuits (a) C432, (b) C880, (c) C1908 and (d) C6288. Y-axis shows active 
cycles with maximum limit 1,000,000 cycles. 
Data labels recovered from Figure 17 (active cycles, in the order the labels and legend 
entries appear in the figure): 
C432:  N421 662750, N432 681107, N431 703406, N430 720091, N370 612147 
C880:  N878 587248, N866 488683, N879 574023, N880 561723, N874 631760, N863 571337 
C1908: N2899 7015, N2887 6384, N2890 5798, N2888 6061, N2889 5818, N2891 6632, 
       N2811 5437, N2892 6627 
C6288: N6288 606567, N6287 590824, N6280 711104, N6270 781367, N6260 841215, 
       N6250 884705, N6240 914844, N6230 936484 

Figure 17 shows the total active cycles of each output, which reflects the activity 
level of each output under a given input workload. An output can change multiple 
times during a cycle; the active cycle count is increased by exactly one when there are 
one or more switches within a cycle. The most active output may not be the largest 
error contributor during timing speculation.  
Taking Benchmark C1908 as an example (Table 3), PO N2891 is the greatest error-
contributor; however, neither the switching activity rate nor the active cycles rate 
indicates this. Figure 18 shows the activity level and the actual error count of each 
output of Benchmark C1908 at the clock period of 1.7 ns. Note that the original clock 
period is 2.1 ns. The outputs are listed in increasing order of static propagation delay, 
from smallest to largest. Output N2899 has the longest static delay, and N2886 is the 
most active one. However, N2891 is the largest error-contributor. 
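
The difference between the total switching activity and the active cycle count used for Figure 17 can be shown with a short sketch. It assumes the same cycle boundaries as before (cycle k starts at k times the clock period); the function name activity_counts and the sample timestamps are hypothetical.

```python
def activity_counts(change_times_ps, clock_period_ps):
    """Return (total switches, active cycles) for one output.

    Every transition counts toward the switching total, but a cycle
    counts as active only once, however many transitions it contains.
    """
    total_switches = len(change_times_ps)
    active_cycles = len({t // clock_period_ps for t in change_times_ps})
    return total_switches, active_cycles


# Hypothetical output that switches twice in cycle 0 and twice in cycle 2.
switches, cycles = activity_counts([450, 1820, 4600, 4650], 2200)
```

Dividing either number by the number of simulated cycles gives the corresponding rate.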
 
Table 3: Comparison of C1908 static delay, switching activity rate, and active cycles rate. 
Output N2899 has the longest delay, while Output N2891 is the greatest error-contributor. 

Output   Static Delay   Total Switching   Total Active   Stabilization probability 
         (ns)           Activity Rate     Cycles Rate    at CLK = 1.7 ns 
N2899    2.20           1.02%             0.7%           100% 
N2887    2.18           0.86%             0.64%          100% 
N2890    2.18           0.71%             0.58%          100% 
N2888    2.17           0.76%             0.61%          100% 
N2889    2.17           0.71%             0.58%          100% 
N2891    2.07           0.92%             0.66%          99.9987% 
N2811    1.97           0.63%             0.54%          99.9996% 
N2892    1.81           0.95%             0.66%          99.9999% 
 
 
Figure 18: Benchmark C1908 outputs with the average active cycles out of 10,000 cycles 
(using 100 simulation trials), and the total error counts for 1 million cycles at clock 
period of 1.7 ns. 
  
Determining the real error-contributing output based only upon the output’s 
activity rate is misleading, because errors only occur when the PO settles after the 
required clock period. We need to have the settling behavior of error-possible outputs to 
estimate the error rate for the given workload, and then a prediction can be made for the 
error-contributors.  
Therefore, an analysis of the settling time of each PO is necessary to identify the 
error-contributor POs; thus Category IV is added: 
IV.  Error-contributor-POs: the worst-case path to the output has a longer 
propagation delay than the clock period, and the output is highly likely to settle after 
the required clock period. 
Setting the threshold value of stabilization probability to identify Category IV POs 
depends upon the level of error tolerance of the EDAC module that will be incorporated 
into the design. The stabilization probability is directly linked with the settling time for 
each cycle.  
 
Error Count Estimation And Error Rate Calculation 
Based on the settling time histogram, one can calculate the stabilization probability 
and predict the error rate for all POs and all possible clock frequencies. This section 
describes in detail how to obtain the error count and calculate the error 
rate.  
Figure 19 shows the settling time histogram of all error-possible POs for 
Benchmark C1908. The settling time histogram of the desired outputs can be plotted with 
an appropriate bin size. In this work, the bin size is 50 ps, because it is the average 
propagation delay of a NAND gate for the Synopsys 32-nm library.  The histogram is 
plotted based on the .vcd file of the original clock period. The estimated error rate is 
calculated for each small range of clock period, as determined by the bin size. In this work, the 
error estimation essentially matches the simulation result because 50 ps is also the step 
size of the swept simulation described in Chapter V.  
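
Building the settling time histogram with a 50 ps bin can be sketched as follows. This is a minimal illustration rather than the scripts used in this work; it assumes the per-cycle settling times are already available in picoseconds, labels each bin by its lower edge, and uses the hypothetical function name settling_histogram with made-up sample values.

```python
from collections import Counter

BIN_PS = 50  # bin size used in this work


def settling_histogram(settle_ps, bin_ps=BIN_PS):
    """Map each bin's lower edge (ps) to the number of cycles settling in it."""
    return Counter((t // bin_ps) * bin_ps for t in settle_ps)


# Hypothetical settling times (ps) for six cycles; 0 marks an inactive cycle.
hist = settling_histogram([0, 0, 1820, 250, 1849, 1850])
```

Here the two inactive cycles fall into the bin at 0 ps, and 1820 ps and 1849 ps share the bin starting at 1800 ps.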
 
Figure 19: Benchmark C1908 settling time histogram of each error-possible PO. The x-
axis is the settling time in picoseconds, and the y-axis is the accumulated number within 
each bin. 
 
For Benchmark C1908, the POs settled at 0 ns in most cycles, which means that 
most POs in C1908 are not active with the given input workload during the simulation 
time. Errors only occur during those cycles in which a PO settles after the required clock period. 
Figure 20 shows the stabilization probability curve of each PO; the thick black line is for 
the whole circuit. The stabilization probability curve is the cumulative distribution 
function (CDF) of the output’s settling histogram. The error-contributing POs can be 
identified based on the stabilization probability curves.  
 
  
Figure 20: Benchmark C1908 stabilization probability of each error-possible PO. 
 
Continuing with the example of Benchmark C1908, if the operating clock period is 
1.7 ns, then outputs N2891, N2811, and N2892 are the error-contributor 
POs, because their stabilization probability has not reached 1 by 1.7 ns. Figure 21 shows the 
stabilization probability of the error-contributor POs N2891, N2811, and N2892 in full and 
zoomed-in versions. The estimated error count can be calculated by summing the histogram bins 
that settle later than the selected operating clock period. According to the 
histogram, the estimated error counts are 13 for output N2891, 4 for N2811, and 1 for N2892, 
while the error counts from simulation are 12 for N2891, 3 for N2811, and 1 for 
N2892.   
 
  
Figure 21: Benchmark C1908 settling time histogram and stabilization probability density 
function of outputs N2891, N2811, N2892. 
 
Each PO has a different error rate at the same operating clock frequency. The 
error rate is calculated as: 

    𝐸𝑟𝑟𝑜𝑟_𝑟𝑎𝑡𝑒 = 1 − 𝑆𝑡𝑎𝑏𝑖𝑙𝑖𝑧𝑎𝑡𝑖𝑜𝑛_𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦        (1) 
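
The error count and Equation (1) can be evaluated directly from a settling time histogram, as in the following sketch. It is illustrative only: it assumes that bins are labeled by their lower edge, that the clock period is a multiple of the bin size, and that a bin at or beyond the clock period counts as an error. The function names are hypothetical. The example histogram is also hypothetical in how the cycles are spread over the bins; only the totals (1 million cycles, 6632 active cycles, and 13 late cycles for N2891 at 1.7 ns) follow the numbers reported in this chapter.

```python
def estimated_error_count(hist, clock_period_ps):
    """Sum the histogram bins that settle at or after the clock period."""
    return sum(n for edge, n in hist.items() if edge >= clock_period_ps)


def stabilization_probability(hist, clock_period_ps):
    """Fraction of cycles settled before the clock period (CDF value)."""
    total = sum(hist.values())
    return (total - estimated_error_count(hist, clock_period_ps)) / total


def error_rate(hist, clock_period_ps):
    """Equation (1): error rate = 1 - stabilization probability."""
    return 1.0 - stabilization_probability(hist, clock_period_ps)


# Hypothetical bin placement for N2891: 993,368 inactive cycles,
# 6,619 cycles settling early, and 13 cycles settling after 1.7 ns.
hist_n2891 = {0: 993368, 1500: 6619, 1700: 9, 1750: 4}
count = estimated_error_count(hist_n2891, 1700)
prob = stabilization_probability(hist_n2891, 1700)
rate = error_rate(hist_n2891, 1700)
```

With these totals the stabilization probability evaluates to 99.9987%, matching the value listed for N2891 in Table 3. Sweeping the clock period over the bin edges yields the full stabilization probability curve.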
 
Error Estimation Results Discussion 
As discussed previously, the static longest path does not always determine the 
observed errors of the circuit, because it may not be active for a certain input workload. When 
evaluating the capabilities of existing commercial EDA tools, it was found that they did 
not directly provide path activity information. As part of this dissertation work, a method 
was developed to obtain each output’s settling behavior curve by analyzing the .vcd file. 
From this curve, the error count and the error rate trend of each output are predicted.  
A circuit can have multiple outputs, but only the Category II - error-possible 
outputs - have a probability of experiencing errors. Therefore, analysis was performed on 
the output settling behavior of the error-possible outputs in order to obtain the error rate 
estimation; this analysis enables the identification of the Category IV - error-contributing 
POs. Customized scripts extract the settling timestamp for every cycle of the tested 
outputs, and the histogram of the results indicates the error count of the tested outputs. The 
accuracy of this method is confirmed later with the simulation results. Each output has 
one histogram of the settling time probability.  
Table 4 shows the longest delay for each of the Category II error-possible outputs 
for C432, C880, C1908 and C6288. The static critical delays of the benchmarks are 2.41 
ns, 2.01 ns, 2.20 ns, and 4.82 ns, respectively. As an example, assume that the goal is to 
operate the circuit at 70% of the original clock period, which corresponds to clock periods of 
1.70 ns, 1.40 ns, 1.50 ns, and 3.40 ns, respectively. Any output whose propagation delay is 
longer than the operating clock period has the potential to generate errors. The maximum 
clock speculation explored in this work is 30% higher than the original, error-free design. 
Note that the desired performance improvement would ultimately be determined by the 
design team. 
 
Table 4: Benchmark circuit static propagation delay of all error-possible outputs 
C432 
  N421     N432     N431     N430     N370 
  2.41 ns  2.25 ns  2.24 ns  2.11 ns  1.75 ns 
C880 
  N878     N866     N879     N880     N874     N863 
  2.01 ns  1.89 ns  1.85 ns  1.82 ns  1.7 ns   1.48 ns 
C1908 
  N2899    N2887    N2890    N2888    N2889    N2886    N2891    N2811    N2892    N2781 
  2.20 ns  2.18 ns  2.18 ns  2.18 ns  2.17 ns  2.15 ns  2.07 ns  1.97 ns  1.81 ns  1.67 ns 
C6288 
  N6288    N6287    N6280    N6270    N6260    N6250    N6240    N6230    N6220    N6210 
  4.82 ns  4.74 ns  4.7 ns   4.57 ns  4.44 ns  4.32 ns  4.19 ns  4.06 ns  3.93 ns  3.81 ns 
 
To obtain the actual error count, the settling histogram of each output gives more 
direct and detailed information. Figure 22 shows the settling time histogram of each 
tested benchmark circuit with the original clock period. Because of the large number of 
inactive cycles for some benchmarks, the histograms in Figure 22 do not include the 
inactive cycles, and the y-axis scale differs between benchmarks to display the 
settling behavior better. The error rate calculation is based on the stabilization probability 
curve, which does include the inactive cycles. The two vertical lines on each graph in Figure 
22 mark 70% of the original clock period and the latest settling time on record. The latest 
settling time is 2.35 ns for C432, 1.90 ns for C880, 1.95 ns for C1908, and 4.00 ns for 
C6288.  
 
Figure 22: Outputs settling time histogram with 1 million random input vectors of four 
benchmark circuits C432, C880, C1908 and C6288. The x-axis is the clock period. The y-
axis is the activity count. The right vertical line represents the maximum recorded settling 
time. The left vertical line represents 70% of the original clock period. 
 

Figure 23: Zoomed settling time histogram between 70% of the original clock and the error-
free clock of four benchmark circuits. The x-axis is the clock period. The y-axis is the 
error count observed in 1 million cycles. 
Figure 23 shows the zoomed view of the area between those two marked lines. 
Each of the error possible outputs is shown in Figure 22 and Figure 23, and the legends 
are in descending order of the propagation delay from top to bottom. According to the 
data in Figure 23, the output that contained the longest path will be the first output to 
observe errors when the operating clock period is reduced. However, if the operating 
clock period is reduced further, then the dominant error contributor may change to the PO 
that settled more frequently after the selected clock period. 
 
Figure 24: Estimated error count of each error-possible output of (a) C432, (b) C880, (c) 
C1908, and (d) C6288, from 70% of the original clock period to the error-free clock period. 
 
Figure 24 displays the error count estimation out of 1 million cycles of each error-
possible PO from 70% original clock period to the error-free clock period. The largest 
error-contributing PO changes with the operating clock period. Errors may concentrate on 
certain POs, or they may be distributed evenly across different POs. For Benchmark C432, 
the static longest output, N421, is the most error-prone output for all clock periods. For 
Benchmark C880, the most error-prone output shifts from the static longest path, N878, 
to N879 when the clock period is reduced to 75% of the original clock period. The same 
observation is made for Benchmark C6288: the largest error-contributing PO shifts to a 
lower bit as the clock period is reduced. This result is caused by the ripple 
structure of the multiplier, in which results in the higher-order bits depend on the lower-
order bits.  
Table 5: Comparison of error composition from critical path and greatest 
error-contributing PO other than the critical path. 

                                         C432     C880     C1908    C6288 
Operating clock period                   1.7 ns   1.4 ns   1.5 ns   3.4 ns 
Errors from critical path                29.7%    21.9%    0%       13.7% 
Errors from the greatest error 
contributor other than critical path     23.2%    28.5%    37.3%    22.1% 
Total errors                             20,455   6,400    158      562 
 
Table 5 takes 70% of the original clock period as an example to show the errors 
from the static critical path and the total errors. Among the four tested benchmark circuits, 
only C432’s critical path is the largest error-contributor PO. Applying an error-
reduction method specifically to the identified error-contributor POs helps to reduce 
errors effectively. In this work, it is assumed that the error-tolerance threshold is 
0.1% (i.e., total error count < 1,000) for the whole system. The targeted timing 
speculation level is 70% of the original clock period. The error reduction method and 
results will be discussed in detail in the next chapter. 
 
Conclusion  
In summary, the all-clock-frequency error estimation method in this work has three 
steps: 
1. Identify POs in Category II based on the static timing delay. 
2. Extract each PO’s settling time for each cycle from the .vcd file. 
3. Plot the settling time histogram of each PO and calculate its cumulative 
distribution function to estimate each PO’s stabilization probability. 
In this work, a PO is considered an error-contributor (Category IV) if the error 
rate of the PO is twice the average error rate of the whole circuit. Therefore, if the 
operating clock period is set at a speculation level of 70% of the original clock period, 
then the steps to identify Category IV POs are: (1) find all error-possible POs 
according to the static propagation delay; (2) estimate the stabilization probability of 
each error-possible PO at the targeted speculative clock period; then (3) convert the 
error-rate criterion into a stabilization probability using the error rate formula, 
𝑆𝑡𝑎𝑏𝑖𝑙𝑖𝑧𝑎𝑡𝑖𝑜𝑛_𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦 = 1 − 𝑒𝑟𝑟𝑜𝑟_𝑟𝑎𝑡𝑒. This stabilization probability value is the threshold value for identifying 
Category IV error-contributor POs. Figure 25 shows the stabilization probability curve of 
the four benchmark circuits for the given input workload. Designers can then identify the 
error-contributing POs for a desired clock period. 
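
The Category IV criterion can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used in this work: the average error rate is taken here as the mean over the error-possible POs being compared, “twice the average” is interpreted as “at least twice”, and the function name error_contributors is hypothetical. The example uses the estimated error counts of C1908 at 1.7 ns reported in Chapter IV (13, 4, and 1 per million cycles) and zero for the five POs in Table 3 with 100% stabilization probability.

```python
def error_contributors(error_rates, factor=2.0):
    """Return the POs whose error rate is at least `factor` times the mean."""
    average = sum(error_rates.values()) / len(error_rates)
    if average == 0:
        return []  # no errors at this clock period, so no contributors
    return [po for po, r in error_rates.items() if r >= factor * average]


cycles = 1_000_000
estimated_counts = {
    "N2899": 0, "N2887": 0, "N2890": 0, "N2888": 0, "N2889": 0,
    "N2891": 13, "N2811": 4, "N2892": 1,
}
rates = {po: n / cycles for po, n in estimated_counts.items()}
contributors = error_contributors(rates)
```

Under these assumptions only N2891 exceeds twice the mean error rate; a different definition of the circuit-level average would move the threshold.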
 
 
Figure 25: The stabilization probability of four tested benchmark circuits for the given 
workload from start to end of the clock period. The x-axis is the clock period. The y-axis 
is the likelihood of settling. 
 
 
CHAPTER V 
OFF-LINE ERROR CHECKING METHOD  
As discussed in Chapter I, many kinds of error detection circuits exist, which 
require special circuitry for parallel comparison. The proposed error-checking module 
was developed to perform statistical analysis of errors after simulation in order to inform 
the mitigation approach and to confirm the error estimation results. 
 
Approaches For General Error-Checking And Off-Line Error-Checking  
The regular error-checking method requires either a golden circuit or a delay element 
in the simulation to preserve the correct results and detect errors. Therefore, a special 
wrap circuit must be modified accordingly for every DUT. Figure 26 and Figure 27 show the 
general structure of two existing error-checking methods, and Figure 28 shows the 
off-line error-checking method used in this work. 
 
 
Figure 26: The general structure of (a) transition detection method 
 
 
 
Figure 27: The general structure of (b) duplication module/path method 
 
 
[Figure 26 block diagram: input workloads, simulator, clock, DUT, delay element, two 
flip-flops (FF), XOR, error flag] 
[Figure 27 block diagram: input workloads, simulator, clock, DUT, baseline design, two 
flip-flops (FF), XOR, error flag] 
 
Figure 28: The general structure of (c) proposed off-line error checking method. 
[Figure 28 block diagram: input workloads, simulator, golden DUTs and baseline 
design/DUTs, .vcd files, error checker/analyzer, statistical error results] 
 
The off-line error-checking method uses customized Python scripts to extract 
information from the value change dump (.vcd) file and, after the simulation, compares 
the settled value of each output between the golden copy and the tested copy. To detect 
errors, each output’s settled value at the end of each cycle is extracted from the tested 
circuit’s .vcd file and compared with the golden results cycle by cycle. Both the saved 
golden .vcd file and the tested ones can be reused for different analyses and 
comparisons. The off-line method therefore enables on-demand post-simulation error 
analysis and saves run time. It does not need a customized test bench, and the cell 
activity analyses are likewise based on extracting and processing the .vcd file. 
 
Reformatting The .vcd File For Error-Checking 
The body part of the .vcd file saves the complete switching information of the 
simulation. However, only each PO’s final status in each cycle matters to the off-line 
error-checker. Reformatting the raw .vcd file information is therefore the first step in 
implementing the off-line error-checker. 
Figure 29 is the algorithm flow chart of the data preparation script. After the 
original text of the .vcd file is read in, a customized Python script restructures it into 
a format suitable for later processing. The header part is removed, and the status of the 
switching nodes in each cycle is saved in one entry without timestamps. Figure 30 shows 
an example of the extracted activity file. The Appendix Part B (1) shows example code 
that prepares the activity information from raw .vcd files as a formatted text file for 
further use. The bash scripts that automate the process are not included in the 
Appendix. 
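The reformatting step can be illustrated with a compact sketch. It is not the Appendix 
script: the function name and the sample .vcd lines are hypothetical, and it assumes that 
the .vcd file contains a timestamp at every cycle boundary (for example, because the clock 
toggles there) and that all signals are scalar. Unlike the Appendix script, value changes 
recorded at a cycle-boundary timestamp are kept and assigned to the cycle that starts 
there. 

```python
def vcd_to_cycle_entries(vcd_lines, clock_period):
    """Group the value changes of a .vcd body into one entry per clock cycle."""
    entries = []
    in_body = False
    for raw in vcd_lines:
        line = raw.strip()
        if not in_body:
            # Everything up to "$enddefinitions" belongs to the header.
            in_body = line.startswith("$enddefinitions")
            continue
        if not line or line.startswith("$"):
            continue                      # e.g. "$dumpvars" and "$end"
        if line.startswith("#"):
            if int(line[1:]) % clock_period == 0:
                entries.append([])        # a new cycle starts here
            continue
        if entries:
            entries[-1].append(line)      # value change such as "1A"
    return [" ".join(entry) for entry in entries]


# Hypothetical .vcd content with a clock period of 1000 time units.
sample = [
    "$timescale 1ps $end",
    "$var wire 1 A N1 $end",
    "$var wire 1 B N2 $end",
    "$enddefinitions $end",
    "#0", "$dumpvars", "0A", "0B", "$end",
    "#300", "1A",
    "#450", "1B",
    "#1000",
    "#1200", "0A",
    "#1500", "1A",
]
cycle_entries = vcd_to_cycle_entries(sample, 1000)
print(cycle_entries)
```

Each returned string corresponds to one line of the activity file: the value changes of 
one cycle in the order in which they occurred, without timestamps. 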
 
 
Figure 29: Algorithm flow chart of data preparation script. 
 
[Figure 29 flow chart: read in one line of the .vcd file; if the line is not in the body 
part, read in the next line; if the line is a timestamp that starts a new cycle, start a 
new entry in the output file; otherwise save the current line to the output file; repeat 
until the end of the .vcd file] 
 
Figure 30: An example of a partial activity file extracted from the .vcd file. 
 
 
Implementation Of The Off-Line Error-Checker 
After the raw .vcd file is processed, the formatted activity text files of both the 
test circuit and the golden copy are read into the Extract and Compare script. A LUT 
(Look-Up Table) is pre-defined with the selected nodes of interest and the notation used 
in the .vcd file. For each cycle, the script first identifies whether the cycle contains 
any switching activity of the selected nodes, and the switching activity is then saved 
into the LUT in sequence. Although one node could change multiple times within one 
cycle, only the final (i.e., settled) value is directly related to the correctness of 
operation. Therefore, a subsequent switching activity in the cycle always overwrites the 
previous one in the LUT. Then, the test LUT is compared with the golden LUT, and the 
error count of each PO is increased whenever its values do not match in a cycle. Figure 31 
shows the algorithm flow chart for the Extract and Compare script, which is given in 
Appendix Part B (2). The bash scripts that automate the process are not included in the 
Appendix. 
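The LUT update and comparison can be sketched as follows. This is an illustrative 
reading of the description above, not the Appendix script: the function name and the 
sample entries are hypothetical, each activity entry is assumed to be a space-separated 
list of value changes such as "1A", and both LUTs are assumed to start at "0". The sketch 
compares the LUTs right after both have been updated for a cycle. 

```python
def count_po_errors(golden_entries, test_entries, po_symbols):
    """Count, per PO, the cycles whose settled value differs from the golden one."""
    golden_lut = {symbol: "0" for symbol in po_symbols}
    test_lut = {symbol: "0" for symbol in po_symbols}
    errors = {symbol: 0 for symbol in po_symbols}

    def update(lut, entry):
        for change in entry.split():
            value, symbol = change[0], change[1:]
            if symbol in lut:
                lut[symbol] = value       # a later change overwrites an earlier one

    for golden_entry, test_entry in zip(golden_entries, test_entries):
        update(golden_lut, golden_entry)
        update(test_lut, test_entry)
        for symbol in po_symbols:
            if golden_lut[symbol] != test_lut[symbol]:
                errors[symbol] += 1
    return errors


# Hypothetical activity entries for two POs with the symbols "A" and "B".
golden = ["1A 0B", "0A 1B", ""]
tested = ["1A 0B", "0A 1A 1B", "0B"]
po_errors = count_po_errors(golden, tested, ["A", "B"])
print(po_errors)
```

Because only the last change of a node within a cycle survives in the LUT, the 
intermediate glitches of a PO never count as errors. 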
[Figure 30 content: excerpt of an activity file, one line per clock cycle] 
... 
0M 0Q 0U 0V 0W 0E 0L 0B 0F 0C 0T 0D 0G 0P 0O 0N 0H 0I 0J 0S 0K 0Z 1Y 1R 1X 
1F 1B 1E 1J 1I 1D 1K 1Z 0R 1V 1S 0Z 1T 0Y 0S 
0E 0J 0D 0K 1Q 1M 1C 1N 1R 1Y 1W 1S 1Z 1U 0Z 
0I 0N 1E 0V 0T 1O 1J 0U 1K 0W 0S 1L 0Y 0X 1X 
0F 0B 0E 0O 0J 0M 0C 1H 1G 1P 0L 1N 0X 1X 1Y 
0H 0P 0Q 1B 1E 1O 1M 1J 1I 1D 1C 1V 1T 1S 1L 1Z 0Y 0Z 0S 1W 1S 1Y 0S 1Z 
0G 0J 0D 0M 0C 1H 1F 1Q 1P 0L 0R 0Z 0W 0T 0V 1S 
0P 0B 0O 0I 1G 0K 1C 1L 1Z 1T 1V 0S 0X 0Y 0T 1S 0Z 
0G 0Q 0C 0L 1M 1K 1R 0V 0S 1X 1Y 
0H 0E 1G 0N 1P 1J 1I 0X 1V 1U 1W 1X 0R 1S 0U 0W 0X 
0G 0I 1H 1Q 1E 1L 1R 0V 0S 1X 
... 
 
 
Figure 31: The flow chart of the Extract and Compare script’s algorithm. 
 
Comparison Of Error Estimation vs. Error Checking Results  
The error estimation method introduced in Chapter IV provides estimated error counts 
for all clock frequencies. This section compares those estimates with the error counts 
obtained from simulation; the simulation data confirm the accuracy of the error 
estimation method.  
[Figure 31 flow chart: starting from the activity .txt files of the test circuit and the 
gold circuit and a LUT holding the initial status of the nodes of interest, read in one 
cycle of activity from each file; if nodes of interest are contained, modify the status 
in both the test and gold LUTs; if the status of any node differs between test and gold, 
increase the error count of those nodes in the LUT; repeat until the end of the file] 
For the evaluation, the circuits are synthesized, placed, routed, and simulated with 
Synopsys tools and the Synopsys 32-nm library. The simulation clock period is swept 
from the error-free clock period down to 70% of the original clock period (i.e., the 
static critical path delay). The simulation step size is 50 ps; however, Figure 32 uses a 
step size of 0.1 ns for a clearer view.  
Figure 32 shows the error trend as the clock period decreases and compares the 
simulated and predicted error counts. Because the error estimation is based on the 
settling time histogram of each output at the original clock period, designers who know 
each output’s settling information can select a speculative clock period that meets an 
acceptable error rate tolerance.  
For the binning procedure, the larger the bin size, the smaller the total number of 
bins. If the bin size is increased from 50 ps to 100 ps, the number of bins is halved. 
Since the sampling data stay the same, the value of each bin changes. However, whether 
the bin size affects the estimation results depends on the precision of the targeted 
clock period and on the bin size. As long as the bin size is not larger than the precision 
of the targeted clock period, the estimation results are not affected, because the 
estimate is calculated as the sum of the values of the bins whose settling times are 
above the targeted clock period. 
For example, suppose we estimate the error count for a clock period of 1.8 ns 
(precision of 100 ps). Changing the bin size from 50 ps to 100 ps does not affect the 
estimation results, since all the sampling data over 1.80 ns fall into bins that are 
counted as errors. On the other hand, if we estimate the error count for a clock period 
of 1.85 ns (precision of 50 ps), changing the bin size from 50 ps to 100 ps makes the 
samples in the bin (1.80 – 1.90) ambiguous for error estimation. For instance, an output 
that settles at 1.82 ns in one cycle (no error) and at 1.88 ns in another cycle (error) 
falls into the same bin in both cases.  
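The effect of the bin size can be reproduced with a small sketch. The function name and 
the sample settling times are hypothetical, bins are assumed to be half-open intervals 
[start, start + bin size), and a sample counts as an error only if its whole bin lies at 
or above the targeted clock period. Samples in a bin that straddles the clock period are 
reported separately as ambiguous. 

```python
def estimate_error_count(settling_times_ps, clock_period_ps, bin_size_ps):
    """Return (errors, ambiguous) for a histogram with the given bin size."""
    errors = 0
    ambiguous = 0
    for t in settling_times_ps:
        bin_start = (t // bin_size_ps) * bin_size_ps
        bin_end = bin_start + bin_size_ps
        if bin_start >= clock_period_ps:
            errors += 1                   # the whole bin is above the clock period
        elif bin_end > clock_period_ps:
            ambiguous += 1                # the bin straddles the clock period
    return errors, ambiguous


# Hypothetical settling times in ps.
samples = [1720, 1790, 1820, 1880, 1910]
for period in (1800, 1850):
    for bin_size in (50, 100):
        print(period, bin_size, estimate_error_count(samples, period, bin_size))
```

With a 1.80 ns clock period both bin sizes give the same estimate, whereas with a 
1.85 ns clock period the 100 ps bins leave the samples at 1.82 ns and 1.88 ns in one 
ambiguous bin. 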
 
 
 
Figure 32: The comparison between simulated results and total error estimation trends of 
four tested benchmark circuits. 
 
Figure 32 data (error count for 1 million cycles): 
 
c432 
Clock period   2.3 ns  2.2 ns  2.1 ns  2.0 ns  1.9 ns  1.8 ns  1.7 ns 
Simulation        199    1090    3505    6979   13866   18274   20239 
Prediction        197    1064    3491    6972   13827   19951   24702 
 
c880 
Clock period   1.7 ns  1.6 ns  1.5 ns  1.4 ns 
Simulation        208     804    2580    6400 
Prediction        206     803    2569    6416 
 
c1908 
Clock period   1.9 ns  1.8 ns  1.7 ns  1.6 ns  1.5 ns 
Simulation          1       7      16      61     156 
Prediction          1       8      18      58     160 
 
c6288 
Clock period   4.0 ns  3.9 ns  3.7 ns  3.5 ns  3.4 ns 
Simulation          1       2      39     252     562 
Prediction          1       2      44     267     588 
 
Conclusion 
The off-line error-checking method allows a designer to check errors and perform 
specific analyses after simulation, which suits the needs of this work. The Python 
implementation of the error-checking module processes the raw .vcd file, so it can be 
used on all types of circuits and does not require any test wrap circuit during 
simulation. To ensure correct error estimation, the bin size should not be larger than 
the clock period precision during the statistical analysis of the circuit behavior curve. 
 
CHAPTER VI 
DUAL-THRESHOLD VOLTAGE APPROACH FOR TIMING ERROR REDUCTION 
In this work, a dual-threshold voltage approach is applied to selected cells to 
improve the propagation delay of the identified error-contributing POs. For the given 
input workload in this work, the Category IV POs (the real error-contributing POs) were 
identified in Chapter IV. However, the fan-in cone contains multiple paths that feed into 
a PO. Replacing all the cells in the fan-in cone is impractical because Low-Vt cells 
increase leakage power. Therefore, consideration must be given to improving the error 
rate in a cost-effective manner. 
 
Dual-Threshold Voltage Approach For Re-Timing 
The Synopsys SAED_EDK 32/28_CORE digital standard cell library [57] was 
used in this work. The library includes typical combinational and sequential logic cells 
with different drive strengths. It also contains cells in different versions 
(multi-voltage, multi-threshold, etc.) for low-power designs. To support multi-threshold 
low-power techniques, High-Vt (HVT), Low-Vt (LVT), and Standard-Vt (SVT) versions of 
the library were created.  
Multi-threshold / dual-threshold technology is mostly used to reduce leakage power 
by using HVT cells whenever performance goals allow. In this work, LVT cells are used 
where necessary to meet timing.  
According to the analysis of the circuit’s typical-case timing behavior, selected cells 
are replaced with their LVT versions in the net-list file generated by synthesis. Because 
this modification of the net-list does not structurally change the circuit connections, it 
has minimal impact on the circuit activity behavior. The timing of the error-contributing 
POs improves as desired, which specifically reduces timing errors. The steps to implement 
the dual-threshold voltage approach to improve the delay of certain paths/POs are: 
1. List all error-possible POs. 
2. Identify the error-contributing POs using the error-estimation method. 
3. List the fan-in paths that have longer propagation delay than the clock period. 
4. Identify the convergence point of paths listed in Step 3. 
5. Replace cells after the convergence point with low-Vt cells. 
6. Perform cell activity analysis as described previously on the remaining cells.  
7. Weight each cell’s activity level. 
8. If the activity level is higher than 50%, then replace the cells with low-Vt cells.  
The details of critical cell identification are presented in the next section. 
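Steps 5 to 8 of the list above can be sketched as a simple selection function. The 
function name, the cell names, and the activity values are hypothetical, and the activity 
level is assumed to be available as a number between 0 and 1 for every cell in the fan-in 
cone. 

```python
def select_low_vt_cells(cone_cells, cells_after_convergence, activity, threshold=0.5):
    """Return the cells of a fan-in cone to replace with low-Vt versions."""
    selected = list(cells_after_convergence)             # step 5
    for cell in cone_cells:                               # steps 6 to 8
        if cell not in selected and activity.get(cell, 0.0) > threshold:
            selected.append(cell)
    return selected


# Hypothetical fan-in cone: U4 and U5 lie after the convergence point.
cone = ["U1", "U2", "U3", "U4", "U5"]
after_convergence = ["U4", "U5"]
activity = {"U1": 0.62, "U2": 0.31, "U3": 0.50, "U4": 0.20, "U5": 0.90}
chosen = select_low_vt_cells(cone, after_convergence, activity)
print(chosen)
```

The returned list would then drive the cell substitution in the synthesized net-list. 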
 
Identification of Critical Cells  
The goal of this work is to reduce the error rate more efficiently by shortening the 
propagation delay of the identified error-contributing POs. With the knowledge of the 
circuit behavior under a given input workload, there are two competing objectives that 
need to be met: reducing more errors while using fewer Low-Vt cells.  
Finding the right cells to replace is the key to reducing errors effectively in this 
approach. After an error-possible PO is identified, the whole fan-in cone of the PO must 
be examined to select the cells to replace. 
The critical cells as defined in this work can be categorized into three types:  
A. Stem cells (Green): the cells after the convergence point with an active rate 
greater than 50% for the given workload. 
B. Shared cells (Yellow): the cells shared by more than one branch or shared by 
other fan-in cones with an active rate greater than 50% for the given workload. 
C. Highly active branch cells (Blue): the cells only used by one fan-in cone, but 
have an active rate greater than 50% for the given workload. 
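The three definitions can be written as a small classifier. The function name and its 
boolean inputs are hypothetical, and the precedence used when a cell is both after the 
convergence point and shared (it is reported as a stem cell) is an assumption of this 
sketch. 

```python
def classify_critical_cell(active_rate, after_convergence, shared, threshold=0.5):
    """Return the critical-cell type, or None if the cell is not critical."""
    if active_rate <= threshold:
        return None                       # not active enough for the workload
    if after_convergence:
        return "stem"                     # type A (green)
    if shared:
        return "shared"                   # type B (yellow)
    return "highly active branch"         # type C (blue)


print(classify_critical_cell(0.8, after_convergence=True, shared=False))
print(classify_critical_cell(0.7, after_convergence=False, shared=True))
print(classify_critical_cell(0.6, after_convergence=False, shared=False))
print(classify_critical_cell(0.3, after_convergence=True, shared=True))
```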
 
 
 
Figure 33: A partial circuit that differentiates the three types of critical cells to be 
replaced in this work. 
[Figure 33 schematic: a fan-in cone of AND, OR, NOR, and XOR gates with a marked converge 
point; each cell is labeled with its activity rate, either < 0.5 or > 0.5] 
 
Figure 33 shows an example of the three types of critical cells in a fan-in cone to 
give a more intuitive definition. For C1908, the traditional critical path leads to the PO 
N2899 with a 2.20 ns propagation delay. The identified error-contributor POs are N2891, 
N2892, and N2811. Taking PO N2891 as an example, its 5 longest paths start from inputs 
N4, N1, N7, N13, and N19. These 5 inputs are also the start points of the longest paths 
of POs N2899, N2887, N2890, N2888, N2889, N2886, N2811, and N2892. After identifying 
the stem cells and the shared cells, the activity of each cell was analyzed to determine 
the replacement selection.  
Figure 34 shows the activity level of the cells on the traditional critical path. The 
cell activity analysis is similar to the output activity analysis. The activity of the 
selected cells is extracted from the golden .vcd file using the same algorithm that was 
used to extract the activity of POs for the circuit stabilization probability analysis in 
Chapter IV. However, the symbols that represent the selected cells must be updated inside 
the script. 
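A possible way to turn the extracted activity into a per-cell activity level is sketched 
below. The definition used here, the fraction of cycles in which the cell output switches 
at least once, is an assumption of this sketch, as are the function name and the sample 
entries; the activity entries are assumed to have the one-line-per-cycle format of the 
activity file. 

```python
def activity_rates(cycle_entries, cell_symbols):
    """Fraction of cycles in which each selected cell switches at least once."""
    counts = {symbol: 0 for symbol in cell_symbols}
    for entry in cycle_entries:
        switched = {change[1:] for change in entry.split()}
        for symbol in cell_symbols:
            if symbol in switched:
                counts[symbol] += 1
    total = len(cycle_entries)
    return {symbol: counts[symbol] / total for symbol in cell_symbols}


# Hypothetical activity entries of four cycles and three cell symbols.
entries = ["1a 0b", "0a", "1a 1b 0b", ""]
rates = activity_rates(entries, ["a", "b", "c"])
print(rates)
```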
 
 
Figure 34: Benchmark C1908 critical path’s cells activity. 
 
 
 
 
Error Reduction Results Comparison And Discussion 
To evaluate the effectiveness of the error reduction results, each circuit has three 
versions of implementation: 
1. Baseline Circuit – uses the standard threshold voltage cells for all cells.  
2. Full-Path Replacement (FPR) – replaces all cells on the static longest path with 
low-threshold voltage cells. 
3. Selected Cell Replacement (SCR) – replaces cells selectively on the fan-in cone of 
the identified error-contributing POs with low-threshold voltage cells based on 
activity level. 
The Full-Path Replacement represents the method that did not include the knowledge 
of circuit behavior for the given input workload. The Selected Cell Replacement 
represents the method that has benefited from understanding the circuit’s typical behavior.  
In this section, the baseline error rate and the improved error rate are compared at 
70% of the original clock period. The maximum error-free clock frequency speed-up of the 
two methods is also compared. 
 
 
Figure 35: Error counts of each error-possible PO before and after error reduction method 
Full Path Replacement and Selected Cell Replacement. The operating clock period of 
C432 is 1.7 ns (70% of 2.41 ns), C880 is 1.4 ns (70% of 2.01 ns), C1908 is 1.5 ns (68% 
of 2.2 ns), and C6288 is 3.4 ns (70% of 4.82 ns). 
 
Figure 35 compares the errors of the three implementations for the four tested 
circuits. For each circuit, all error-possible POs are listed in descending order of 
propagation delay, from left to right. During the analysis, we observed that some POs did 
not generate any errors even though their propagation delay was longer than the operating 
clock period. The error counts are marked on top of the bars. According to the results, 
the Selected Cell Replacement generally reduces more errors than the Full Path 
Replacement. The error reduction is more pronounced if the static critical path PO is not 
the identified error-contributing PO. Also, a PO with a higher concentration of errors 
responds more strongly to the Selected Cell Replacement method.  
Figure 35 data (error counts per error-possible PO): 
 
C432 
PO                          N421   N432   N431   N430   N370 
Baseline circuit            6075   4584   4613   4751    216 
Full path replacement       3121   1594   1439      0      0 
Selected cell replacement   2125    590    226      0      0 
 
C880 
PO                          N878   N866   N879   N880   N874   N863 
Baseline circuit            1402   1251   1826   1112    595    214 
Full path replacement         88      9    200     17      3      0 
Selected cell replacement     47      0     40     13      0      0 
 
C1908 
PO                          N2891  N2811  N2892  N2779 
Baseline circuit               59     58     38      1 
Full path replacement          32     17     37      0 
Selected cell replacement      19      7     17      0 
 
C6288 
PO                          N6288  N6287  N6280  N6270  N6260  N6250  N6240  N6230 
Baseline circuit               77     23     98    120    124     81     33      6 
Full path replacement           9      1      5      6      2      1      0      0 
Selected cell replacement       5      0      4      5      1      0      0      0 
For C432, the static critical path PO is N421, and it is also the identified 
error-contributing PO. The advantage of the Selected Cell Replacement method on output 
N421 is diminished because most of the replaced cells are the same as in the Full Path 
Replacement method. 
For C880, N879 is identified as the largest error-contributing PO, while N878 is 
the static critical path PO. With the Selected Cell Replacement method, output N879 
alone has 80% fewer errors than with the Full Path Replacement method. 
For C1908, N2891 and N2811 are identified as error-contributing POs. Their 
propagation delays are the 7th and 8th longest, respectively. The static critical path 
leads to output N2899, but it does not generate any errors for the given input workload 
(one million random vectors) at the tested operating clock period (70% of the original). 
Relative to the Full Path Replacement method, the Selected Cell Replacement method 
leaves 58.8% fewer errors for output N2811 and 40.6% fewer for output N2891.  
For C6288, the Selected Cell Replacement leaves 50% and 16.7% fewer errors than the 
Full Path Replacement for the identified error-contributing POs N6260 and N6270, and 
37.5% fewer errors in total. The gains are relatively low compared to the other circuits 
because of the special structure of this multiplier: the paths to the error-prone outputs 
N6260 and N6270 are just a subset of the critical path to output N6288. 
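As a worked check, the per-output percentages quoted above are consistent with the 
relative reduction (FPR − SCR) / FPR computed from the error counts labeled in Figure 35. 
The short Python sketch below reproduces the arithmetic; the function name is 
illustrative only. 

```python
def improvement(fpr_errors, scr_errors):
    """Relative error reduction of SCR with respect to FPR, in percent."""
    return 100.0 * (fpr_errors - scr_errors) / fpr_errors


# (FPR error count, SCR error count) read from the Figure 35 data labels.
per_po = {
    "C880 N879": (200, 40),
    "C1908 N2811": (17, 7),
    "C1908 N2891": (32, 19),
    "C6288 N6260": (2, 1),
    "C6288 N6270": (6, 5),
}
results = {name: improvement(fpr, scr) for name, (fpr, scr) in per_po.items()}
for name, value in results.items():
    print(name, round(value, 1))
```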
 
Figure 36: Total error reduction improvement from Full Path Replacement to Selected 
Cells Replacement, when operating at 70% of original clock period. 
 
Figure 36 data (total error reduction relative to the baseline design): 
 
                              c432     c880     c1908    c6288 
Static replacement method     69%      95%      45.57%   95.73% 
Dynamic replacement method    85.31%   98.43%   72.78%   97.33% 
 
Table 6: Total error numbers comparison and the Selected Cells Replacement (SCR) 
improvement versus Full Path Replacement (FPR) 
 
                              C432     C880     C1908    C6288 
Full Path Replacement         6154     317      4        24 
Selected Cells Replacement    2941     100      1        15 
Improvement                   52.2%    68.4%    75%      37.5% 
 
Figure 36 compares the error reduction results of Full Path Replacement and 
Selected Cells Replacement. The total error counts and the improvement from the Full 
Path Replacement method to the Selected Cells Replacement method are shown in Table 6. 
The Selected Cells Replacement method reduces errors more efficiently when operating at 
70% of the original clock period. Designers could select the timing speculation level 
based on the capability of the EDAC module. The maximum error-free timing speculation 
clock was also tested. Figure 37 compares the speed increase of the Full Path 
Replacement and Selected Cells Replacement methods at the maximum error-free timing 
speculation clock period. Table 7 lists the total number of Low-Vt cells used, and 
Table 8 shows the leakage power. 
 
 
Figure 37: Error-free speed-up comparison of Full Path Replacement method and 
Selected Cells Replacement method. 
 
Figure 37 data (speed-up at the maximum error-free clock period): 
 
                              c432     c880     c1908    c6288 
Full path replacement         20.50%   25.63%   15.79%   30.27% 
Selected cell replacement     33.90%   25.63%   22.21%   37.71% 
 
Table 7: Low-Vt cell usage comparison between Full Path Replacement (FPR) and 
Selected Cells Replacement (SCR)  
 
         Total cell number   Low-Vt cell number of FPR   Low-Vt cell number of SCR 
C432     81                  18                          6 
C880     167                 20                          12 
C1908    211                 21                          11 
C6288    516                 32                          16 
 
Table 8: Leakage power (µW) comparison of baseline, Full Path Replacement (FPR) and 
Selected Cells Replacement (SCR) 
 
         Baseline (µW)   FPR (µW)   SCR (µW) 
C432     0.353           0.432      0.380 
C880     0.953           1.04       1.00 
C1908    1.33            1.40       1.39 
C6288    0.546           0.832      0.717 
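 
The cost side of the comparison can be derived from Table 7 and Table 8. The sketch 
below computes, for each circuit, the share of cells replaced with Low-Vt versions and 
the leakage power increase over the baseline. These derived percentages are not reported 
in this work; they follow arithmetically from the table values, and the variable and 
function names are illustrative. 

```python
# Values copied from Table 7: total cells, Low-Vt cells (FPR), Low-Vt cells (SCR).
cells = {
    "C432": (81, 18, 6),
    "C880": (167, 20, 12),
    "C1908": (211, 21, 11),
    "C6288": (516, 32, 16),
}
# Values copied from Table 8: baseline, FPR, and SCR leakage power in uW.
leakage_uw = {
    "C432": (0.353, 0.432, 0.380),
    "C880": (0.953, 1.04, 1.00),
    "C1908": (1.33, 1.40, 1.39),
    "C6288": (0.546, 0.832, 0.717),
}


def percent(part, whole):
    return 100.0 * part / whole


summary = {}
for name in cells:
    total, fpr_cells, scr_cells = cells[name]
    base, fpr_leak, scr_leak = leakage_uw[name]
    summary[name] = {
        "fpr_low_vt_ratio": percent(fpr_cells, total),
        "scr_low_vt_ratio": percent(scr_cells, total),
        "fpr_leakage_increase": percent(fpr_leak - base, base),
        "scr_leakage_increase": percent(scr_leak - base, base),
    }

for name, row in summary.items():
    print(name, {key: round(value, 1) for key, value in row.items()})
```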
 
 
Conclusion 
This chapter compared the Selected Cells Replacement (SCR) method, a timing 
optimization based on typical-case workload behavior, with the baseline implementation 
and with Full Path Replacement (FPR), a worst-case (traditional critical path) timing 
optimization. The comparison covered several aspects: the error reduction at a given 
timing speculation level (up to 30% was tested in this work), the improvement of the 
maximum error-free operating point for all tested benchmark circuits, the Low-Vt cell 
ratio, and the leakage power dissipation. 
 
CHAPTER VII 
SUMMARY AND FUTURE WORKS 
Higher chip performance is a constant demand in the semiconductor industry. The 
traditional design methodology sets a conservative guard band according to the worst 
case to ensure correct operation. This constraint leads to lost performance. The 
better-than-worst-case (BTWC) design methodology optimizes a circuit based on the average 
case and relies on an error correction module to handle the resulting errors. Every error 
correction process has a penalty; therefore, maximizing the operating clock frequency 
while controlling the total error count leads to an overall gain. Knowledge of the 
circuit’s dynamic activity behavior for a given workload makes it possible to cover the 
penalty for error correction. Based on the circuit’s typical behavior under a given 
input workload, certain timing paths can be optimized to reduce errors effectively 
during timing speculation.  
This work introduced an error-estimation method for all-clock-periods without 
tedious simulations, and described a novel off-line error-checking method that does not 
require special test wrap circuit and simultaneous simulation. Both the error-estimation 
method and the error-checking method are based on extracted information from the 
raw .vcd file generated from simulating the circuit. The circuit’s internal cell activity 
analysis is also obtained from the .vcd file. After understanding the circuit’s dynamic 
activity under a given input workload, it can be re-synthesized with low-threshold voltage 
cells in the fan-in cones affecting the identified error-contributing outputs.  
The results demonstrated that the error estimation method is accurate, and the 
error checking method provides a convenient way to detect and analyze errors for any PO. 
This error-reduction approach reduces a majority of errors while maintaining the 
minimum usage of low Vt cells. This work demonstrated the advantage of using the 
knowledge of the circuit’s typical behavior and its impact on improving the performance 
and the error-reduction process.  
The entire design flow was based on the typical commercial approach for 
synthesizing designs with standard cell libraries; the flow was augmented with 
customized Python scripts and integrates well with commercial EDA tools: 
Synopsys Design Compiler, IC Compiler, PrimeTime, and VCS. 
The input workload used in this work consisted of purely random input vectors 
applied to the ISCAS85 benchmarks. The random vectors were generated with a Python 
library and used to modify the test bench stimuli. For a more realistic analysis, a 
typical application workload could be applied to the test bench of a given benchmark 
circuit to obtain the circuit stabilization probability curve and the error estimation, 
and the data could then be analyzed with the methodology introduced in this work.  
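A random workload of this kind could be generated as sketched below with Python’s 
standard random module. This work does not state which library function was used, so the 
function name, the seed, and the output format (one binary string per input vector) are 
assumptions; 36 is used as the input count because C432 has 36 primary inputs. 

```python
import random


def random_vectors(num_inputs, count, seed=None):
    """Return `count` random input vectors as binary strings of width `num_inputs`."""
    rng = random.Random(seed)
    width_format = "0{}b".format(num_inputs)
    return [format(rng.getrandbits(num_inputs), width_format) for _ in range(count)]


vectors = random_vectors(num_inputs=36, count=5, seed=2019)
for vector in vectors:
    print(vector)
```

Fixing the seed makes the workload reproducible, so the golden and the tested 
simulations can be driven by identical stimuli. 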
The change in timing behavior caused by input workload variation has also been 
explored by Kevin E. Murray [58], who introduced a new timing analysis formulation that 
models the circuit stabilization behavior with consideration of input combinations and 
compared the results with traditional SAT and Monte-Carlo simulation. Big data analysis 
methods could also be applied to generate the circuit stabilization curve from regular 
usage of the DUT. The stabilization curve would provide insight into timing error 
estimation. 
On the other hand, because of the reconfigurable character of field-programmable 
gate arrays (FPGAs), researchers could also explore the behavior analysis methodology 
described in this work on an FPGA board, and compare the performance improvements 
on timing, power, and errors. 
APPENDIX 
PART A:  
(1) The script to extract all transition timestamps of the given PO; each cycle has one entry. 
#!/usr/bin/env python 
# encoding: utf-8 
# Usage: python <script> <clock_period> <PO_symbol> <output_txt> <input_vcd> 
 
import sys 
 
p = sys.argv[-4]              #### Clock period, in the time unit of the .vcd timestamps 
sbl = sys.argv[-3]            #### Symbol that represents the PO of interest in the .vcd file 
ft = open(sys.argv[-2], 'a')  #### Output file: transition timestamps of the PO of interest 
f1 = open(sys.argv[-1], 'r')  #### Input file: raw .vcd file 
lines = f1.readlines()        #### Read in the raw .vcd file 
l = len(lines)                #### Number of lines in the .vcd file 
 
for i in range(0, l): 
    line = lines[i] 
    if line[0] == '$':        #### Skip the header part 
        continue 
    if line[0] == '#':        #### The current line is a timestamp 
        TimePoint = float(line[1:]) 
        residue_temp = TimePoint % int(p)   #### Time offset within the current cycle 
        if residue_temp != 0.0:             #### Not the start of a new cycle 
            inner_count = i + 1 
            #### Read the following lines until the next timestamp or the end of the file 
            while inner_count < l and lines[inner_count][0] != '#': 
                line1 = lines[inner_count] 
                if line1[1:].strip() == sbl:    #### Switching record of the PO of interest 
                    ft.write(str(int(residue_temp)) + " ")  #### Save the current timestamp 
                inner_count = inner_count + 1 
        else:                 #### A new cycle starts: start a new entry in the output file 
            ft.write("\n") 
 
ft.close() 
f1.close() 
 
(2) The script to extract the settling timestamp of every cycle for the given PO (with 
the example of benchmark C1908, PO N2892). 
#!/usr/bin/env python 
# encoding: utf-8 
 
# Input: transition timestamps of the PO, one line per cycle (output of script (1)) 
ft = open('c1908_output_N2892_rvt_2200_transition_time.txt', 'r') 
# Output: settling time of the PO, one line per cycle that has at least one transition 
fs = open('c1908_output_N2892_rvt_2200_settling_time.txt', 'w') 
 
settle_time = [] 
 
for line in ft: 
    list_ft = line.split() 
    if list_ft == []:                    # No transition in this cycle 
        continue 
    settle_time.append(list_ft[-1])      # The last transition is the settling time 
    fs.write(list_ft[-1] + "\n") 
 
print(settle_time) 
fs.close() 
ft.close() 
 
 
 
PART B:  
(1) The script to extract switched nodes for each cycle in sequence.  
#!/usr/bin/env python 
# encoding: utf-8 
# Usage: python <script> <clock_period> <output_txt> <input_vcd> 
 
import sys 
 
p = sys.argv[-3]                  #### Clock period, in the time unit of the .vcd timestamps 
tran = open(sys.argv[-2], 'w')    #### Output file: switched nodes, one line per cycle 
f0 = open(sys.argv[-1], 'r')      #### Input file: raw .vcd file 
lines = f0.readlines() 
l = len(lines) 
 
for i in range(0, l): 
    line = lines[i] 
    if line[0] == '$':            #### Skip the header part 
        continue 
    if line[0] == '#':            #### The current line is a timestamp 
        TimePoint = float(line[1:]) 
        residue_temp = TimePoint % int(p) 
        if residue_temp != 0:     #### Not the start of a new cycle 
            inner_count = i + 1 
            while inner_count < l and lines[inner_count][0] != '#': 
                tran.write(lines[inner_count][0:-1] + " ")    #### Record the switched node symbol 
                inner_count = inner_count + 1 
        else:                     #### A new cycle starts: start a new entry 
            tran.write("\n") 
 
tran.close() 
f0.close() 
 
(2) The script to detect and calculate errors. 
#!/usr/bin/env python 
# encoding: utf-8 
 
import sys 
import os 
import string 
import math 
import numpy as np 
import matplotlib.mlab as mlab 
import matplotlib.pyplot as plt 
 
#Initialize primary outputs of golden copy and test copy; po_diff is the comparison 
results for each cycle; er_output is the set showing errors for each cycle; er_count is 
the accumilative error count of output. 
po = 
['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O
','P','Q','R','S','T','U','V','W','X','Y','Z','[','\\',']',
'^','_','`']    # Define symbol used in .VCD file to represent POs. 
po_name = 
['N545','N1581','N1901','N2223','N2548','N2877','N3211','N3
552','N3895','N4241','N4591','N4946','N5308','N5672','N5971
','N6123','N6150','N6160','N6170','N6180','N6190','N6200','
 82 
N6220','N6230','N6240','N6250','N6260','N6270','N6280','N62
87','N6288']    # Define the PO name of tested Benchmark 
po_g = 
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0]    # Initial the golden copy’s LUT 
po_t = 
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0]    # Initial the tested copy’s LUT 
po_diff = []  
gold_vcd = sys.argv[-1] 
test_vcd = sys.argv[-2] 
f0 = open(sys.argv[-1],'r+') 
f1 = open(sys.argv[-2],'r+') 
lines0 = f0.readlines() 
lines1 = f1.readlines() 
 
l = len(lines0) 
i = 0 
j = 0 
k = 0 
er_count = 
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0] 
 83 
 
# compare settlized output of two copies for each cycle; text file is processed with 
another python script, which sumarized internal transitions into one line one cycle. 
for i in range (0,l):  
    line_g = lines0[i] 
    line_t = lines1[i] 
    j = 0 
    k = 0 
    po_diff = [int(po_g[n])-int(po_t[n]) for n in 
range(0,len(po_t))]  #find differences 
    er_output=[abs(po_diff[m]) for m in 
range(0,len(po_diff))]  #each output could have one error for each cycle 
er_count = [sum(x) for x in zip(er_count,er_output)] 
while j < len(line_g)-2:   # -2 because the last two symbol is '/n', 
if do not remove, it will affect the iternation. 
        if line_g[j] == ' ': 
            j = j+1 
        elif line_g[j] == '0': 
            j = j+1 
        elif line_g[j] == '1': 
            j = j+1 
        else:    #find output symbol in text file 
            g_index = po.index(line_g[j])    #find the right index 
in output list, and record current value in po_g  
            po_g[g_index] = line_g[j-1] 
 84 
            j = j+1 
 
    while k < len(line_t)-2: 
        #print j 
        #print line_g[j] 
        if line_t[k] == ' ': 
            k = k+1 
        elif line_t[k] == '0': 
            k = k+1 
        elif line_t[k] == '1': 
            k = k+1 
        else: 
            t_index = po.index(line_t[k]) 
            po_t[t_index] = line_t[k-1] 
            k = k+1 
print(er_count)
#print(list(zip(po,er_count)))
#print(list(zip(po_name,er_count)))
er_rate = float(sum(er_count))/float(10000)    # error rate: total output errors divided by 10000
print(er_rate)

# Append the per-PO error counts and the error rate to the result files.
with open('./po_data_mvt_2410.txt','a') as po_data:
    po_data.write(str(er_count)+'\n')
with open('./rate_data_mvt_2410.txt','a') as rate_data:
    rate_data.write(str(er_rate)+'\n')
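
The per-cycle comparison performed by the script above can also be written as two small functions. The following is a minimal sketch rather than the script used for the experiments: the names update_outputs and count_output_errors and the three-symbol example data are hypothetical, and it assumes that each line holds value/symbol pairs such as 1A, as the parser above expects. Unlike the listing, it compares the two copies after each line has been parsed, so the last cycle is included in the count.

```python
def update_outputs(line, symbols, values):
    """Record the latest value of every output symbol found in one line."""
    for pos in range(1, len(line)):
        ch = line[pos]
        # A symbol is preceded by its new logic value ('0' or '1').
        if ch in symbols and line[pos - 1] in "01":
            values[symbols.index(ch)] = int(line[pos - 1])


def count_output_errors(gold_lines, test_lines, symbols):
    """Count, per output, the cycles in which the two copies disagree."""
    gold = [0] * len(symbols)    # current outputs of the golden copy
    test = [0] * len(symbols)    # current outputs of the tested copy
    errors = [0] * len(symbols)  # per-output error counters
    for gold_line, test_line in zip(gold_lines, test_lines):
        update_outputs(gold_line, symbols, gold)
        update_outputs(test_line, symbols, test)
        for n in range(len(symbols)):
            if gold[n] != test[n]:
                errors[n] += 1
    return errors


# Hypothetical three-output example, one line per cycle.
symbols = ['A', 'B', 'C']
gold_lines = ["1A 0B 1C\n", "0A\n", "1B\n"]
test_lines = ["1A 0B 1C\n", "1A\n", "1B 0C\n"]
errors = count_output_errors(gold_lines, test_lines, symbols)
print(errors)
```

Dividing sum(errors) by the number of cycles then gives the error rate, as in the script.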
 
 
 86 
REFERENCES 
[1] S. Krishnamurthy, S. Paul, and S. Bhunia, “Adaptation to Temperature-Induced 
Delay Variations in Logic Circuits Using Low-Overhead Online Delay Calibration,” 
in 8th International Symposium on Quality Electronic Design (ISQED’07), 2007, pp. 
755–760. 
[2] M. Horowitz, E. Alon, D. Patil, S. Naffziger, R. Kumar, and K. Bernstein, “Scaling , 
Power , and the Future of CMOS,” in IEEE International Electron Devices Meeting 
(IEDM 2005), 2005, pp. 9–15. 
[3] S. Bhunia, S. Mukhopadhyay, and K. Roy, “Process variations and process-tolerant 
design,” in 20th International Conference on VLSI Design, 2007, pp. 699–704. 
[4] S. Borkar, “Designing reliable systems from unreliable components: the challenges 
of transistor variability and degradation,” in IEEE Micro, 2005, vol. 25, no. 6, pp. 
10–16. 
[5] S. Ghosh and K. Roy, “Parameter Variation Tolerance and Error Resiliency: New 
Design Paradigm for the Nanoscale Era,” in Proceedings of the IEEE, 2010, vol. 98, 
no. 10, pp. 1718–1751. 
[6] P. Asenov, N. a. Kamsani, D. Reid, C. Millar, S. Roy, and A. Asenov, “Combining 
process and statistical variability in the evaluation of the effectiveness of corners in 
digital circuit parametric yield analysis,” in The European Solid-State Device 
Research Conference (ESSDERC), 2010, pp. 130–133. 
[7] S. R. Sarangi, B. Greskamp, R. Teodorescu, J. Nakano, A. Tiwari, and J. Torrellas, 
“VARIUS: A Model of Process Variation and Resulting Timing Errors for 
Microarchitects,” IEEE Trans. Semicond. Manuf., vol. 21, no. 1, pp. 3–13, 2008. 
[8] B. Greskamp, L. Wan, U. R. Karpuzcu, J. J. Cook, J. Torrellas, D. Chen, and C. 
Zilles, “Blueshift: Designing processors for timing speculation from the ground up.,” 
in IEEE 15th International Symposium on High Performance Computer Architecture 
(HPCA 2009), 2009, pp. 213–224. 
[9] V. Mehrotra and D. S. Boning, “Modeling the effects of systematic process variation 
on circuit performance,” Massachusetts Institute of Technology, 2001. 
[10] L. Xie and A. Davoodi, “Post-Silicon Failing-Path Isolation Incorporating the 
Effects of Process Variations,” IEEE Trans. Comput. Des. Integr. Circuits Syst., vol. 
31, no. 7, pp. 1008–1018, Jul. 2012. 
[11] K. Bowman, J. Tschanz, C. Wilkerson, S.-L. Lu, T. Karnik, V. De, and S. Borkar, 
“Circuit techniques for dynamic variation tolerance,” 46th Annu. Des. Autom. Conf., 
pp. 4–7, 2009. 
[12] S. K. Nithin, G. Shanmugam, and S. Chandrasekar, “Dynamic voltage (IR) drop 
analysis and design closure: Issues and challenges,” in 11th International 
Symposium on Quality Electronic Design (ISQED’11), 2010, pp. 611–617. 
 87 
[13] L. Wan and D. Chen, “Analysis of circuit dynamic behavior with timed ternary 
decision diagram,” in IEEE/ACM International Conference on Computer-Aided 
Design (ICCAD), 2010, pp. 516–523. 
[14] J. A. Kumar and S. Vasudevan, “Formal Probabilistic Timing Verification in RTL,” 
in IEEE Transactions on Computer-Aided Design of Integrated Circuits and 
Systems, 2013, vol. 32, no. 5, pp. 788–801. 
[15] X. Wang and W. H. Robinson, “A Dual-Threshold Voltage Approach for Timing 
Speculation in CMOS Circuits,” in 2016 IEEE Computer Society Annual Symposium 
on VLSI (ISVLSI), 2016, pp. 691–696. 
[16] D. Blaauw, K. Chopra, A. Srivastava, and L. Scheffer, “Statistical Timing Analysis : 
From Basic Principles to State of the Art,” IEEE Trans. Comput. Des. Integr. 
Circuits Syst., vol. 27, no. 4, pp. 589–607, 2008. 
[17] L. Wan and D. Chen, “Analysis of Digital Circuit Dynamic Behavior With Timed 
Ternary Decision Diagrams for Better-Than-Worst-Case Design,” in IEEE 
Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2012, 
vol. 31, no. 5, pp. 662–675. 
[18] L. Wan and D. Chen, “DynaTune : Circuit-Level Optimization for Timing 
Speculation Considering Dynamic Path Behavior,” in 2009 International 
Conference on Computer-Aided Design (ICCAD), San Jose, CA, 2009, pp. 172–179. 
[19] L. Wan and D. Chen, “CCP: common case promotion for improved timing error 
resilience with energy efficiency,” in IEEE/ACM international symposium on Low 
Power Electronics and Design (ISLPED’12), 2012, p. 135. 
[20] Y. Kuo, Y. Chang, and S. Chang, “Efficient Boolean Characteristic Function for 
Fast Timed ATPG,” in 2006 IEEE/ACM International Conference on Computer 
Aided Design, 2006, pp. 96–99. 
[21] R. E. Bryant, “Symbolic Boolean manipulation with ordered binary-decision 
diagrams,” ACM Comput. Surv., vol. 24, pp. 293–318, 1992. 
[22] D. Ernest, S. Das, S. Lee, D. Blaauw, T. Austin, T. Mudge, N. S. Kim, and K. 
Flautner, “Razor: circuit-level correction of timing errors for low-power operation,” 
IEEE Micro, vol. 24, pp. 10–20, 2004. 
[23] L. Benini, E. Macii, M. Poncino, and G. De Micheli, “Telescopic units: a new 
paradigm for performance optimization of VLSI designs,” IEEE Trans. Comput. 
Des. Integr. Circuits Syst., vol. 17, no. 3, pp. 220–232, Mar. 1998. 
[24] R. E. Bryant, “Algorithms for Boolean Function Manipulation,” vol. C, no. 8, 1986. 
[25] T. Sasao, “Ternary decision diagrams. Survey,” in Proceedings 1997 27th 
International Symposium on Multiple- Valued Logic, 1997, pp. 241–250. 
[26] A. B. Kahng, S. Kang, R. Kumar, and J. Sartori, “Slack Redistribution for Graceful 
Degradation under Voltage Overscaling,” in Proceeding of ASP-DAC, 2010, pp. 
825–831. 
 88 
[27] Berkely Logic Synthesis and Verification Group, “ABC: A system for sequential 
synthesis and verification.” 
[28] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. 
Austin, K. Flautner, and T. Mudge, “Razor : A Low-Power Pipeline Based on 
Circuit-Level Timing Speculation,” in 36th Annual IEEE/ACM International 
Symposium on Microarchitecture (MICRO-36), 2003, pp. 7–18. 
[29] S. Das, C. Tokunaga, S. Pant, W. Ma, S. Kalaiselvan, K. Lai, D. M. Bull, and D. T. 
Blaauw, “RazorII : In Situ Error Detection and Correction for PVT and SER 
tolerance,” in IEEE Journal of Solid-State Circuits, 2009, vol. 44, no. 1, pp. 32–48. 
[30] S. Das, S. Member, D. Roberts, S. Lee, S. Pant, D. Blaauw, T. Austin, K. Flautner, 
and T. Mudge, “A Self-Tuning DVS Processor Using Delay-Error Detection and 
Correction,” J. Solid-State Circuits, vol. 41, no. 4, pp. 792–804, 2006. 
[31] P. Kocher, D. Genkin, D. Gruss, W. Haas, and M. Hambury, “Microarchitectural 
innovations: boosting microprocessor performance beyond semiconductor 
technology scaling,” Proc. IEEE, vol. 89, pp. 1560–1575, 2001. 
[32] K. A. Bowman, J. W. Tschanz, N. S. Kim, J. C. Lee, C. B. Wilkerson, S.-L. L. Lu, 
T. Karnik, and V. K. De, “Energy-Efficient and Metastability-Immune Resilient 
Circuits for Dynamic Variation Tolerance,” in IEEE Journal of Solid-State Circuits, 
2009, vol. 44, no. 1, pp. 49–63. 
[33] M. Choudhury, V. Chandra, K. Mohanram, and R. Aitken, “TIMBER: Time 
borrowing and error relaying for online timing error resilience,” in Design, 
Automation & Test in Europe Conference & Exhibition, 2010, pp. 1554–1559. 
[34] A. K. Uht, “Going Beyond Worst-Case Specs with TEAtime,” in Computer, 2004, 
vol. 37, no. 3, pp. 51–56. 
[35] M. R. Choudhury and K. Mohanram, “Approximate logic circuits for low overhead, 
non-intrusive concurrent error detection,” in Design, Automation & Test in Europe 
Conference & Exhibition (DATE ’08), 2008, no. 0, pp. 903–908. 
[36] M. R. Choudhury and K. Mohanram, “Low Cost Concurrent Error Masking Using 
Approximate Logic Circuits,” in IEEE Transactions on computer-aided design of 
integrated circuits and systems, 2013, vol. 32, no. 8, pp. 1163–1176. 
[37] M. Pedram, “Power minimization in IC design: principles and applications,” ACM 
Trans. Des. Autom. Electron. Syst., vol. 1, no. 1, pp. 3–56, 1996. 
[38] V. Venkatachalam and M. Franz, “Power reduction techniques for microprocessor 
systems,” ACM Comput. Surv., vol. 37, no. 3, pp. 195–237, 2005. 
[39] L. Chang, D. J. Frank, R. K. Montoye, S. J. Koester, B. L. Ji, P. W. Coteus, R. H. 
Dennard, and W. Haensch, “Practical strategies for power-efficient computing 
technologies,” Proc. IEEE, vol. 98, pp. 215–236, 2010. 
[40] R. G. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge, “Near-
Threshold Computing: Reclaiming Moore’s Law Through Energy Efficient 
 89 
Integrated Circuits,” Proc. IEEE, vol. 98, no. 2, pp. 253–266, Feb. 2010. 
[41] S. M. Kang and Y. Leblebici, CMOS Digital Integrated Circuits: Analysis and 
Design, 3rd ed. Mc Graw Hill, 2003. 
[42] H. Kaul, M. Anders, S. Hsu, A. Agarwal, R. Krishnamurthy, and S. Borkar, “Near-
threshold voltage (NTV) design: opportunities and challenges,” 49th 
ACM/EDAC/IEEE Des. Autom. Conf., pp. 1149–1154, 2012. 
[43] M. Keating, D. Flynn, R. Aitken, A. Gibbons, and K. Shi, Low Power Methodology 
Manual: For System-on-Chip Design. Springer Publishing Company, Incorporated, 
2007. 
[44] Y. Liu, R. Ye, F. Yuan, R. Kumar, and Q. Xu, “On Logic Synthesis for Timing 
Speculation,” pp. 591–596, 2012. 
[45] N. P. Carter, H. Naeimi, and D. S. Gardner, “Design techniques for cross-layer 
resilience,” in Design, Automation & Test in Europe Conference & Exhibition 
(DATE), 2010, pp. 1023–1028. 
[46] A. DeHon, H. M. Quinn, and N. P. Carter, “Vision for cross-layer optimization to 
address the dual challenges of energy and reliability,” in Design, Automation & Test 
in Europe Conference & Exhibition (DATE), 2010, pp. 1017–1022. 
[47] S. Mitra, K. Brelsford, and P. N. Sanda, “Cross-layer resilience challenges: Metrics 
and optimization,” in Design, Automation & Test in Europe Conference & 
Exhibition (DATE), 2010, pp. 1029–1034. 
[48] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu, and J. Yamada, “I-V 
power supply high-speed digital circuit technology with multi-threshold-voltage 
CMOS,” IEEE J. Solid-State Circuits, vol. 30, pp. 847–854. 
[49] Q. W. Q. Wang and S. B. K. Vrudhula, “Static power optimization of deep 
submicron CMOS circuits for dual VT technology,” 1998 IEEE/ACM Int. Conf. 
Comput. Des. Dig. Tech. Pap. (IEEE Cat. No.98CB36287), pp. 490–496, 1998. 
[50] L. Wei, Z. Chen, M. Johnson, and K. Roy, “Design and optimization of low voltage 
high performance dual threshold CMOS circuits,” in IEEE proceedings of Design 
Austomation Conference, 1998, pp. 535–549. 
[51] S. B. K. Vrudhula, “An investigation of power delay trade-offs for dual V/sub t/ 
CMOS circuits,” Proc. 1999 IEEE Int. Conf. Comput. Des. VLSI Comput. Process. 
(Cat. No.99CB37040), pp. 556–562, 1999. 
[52] N. Jayakumar and S. P. Khatri, “A Predictably Low-Leakage ASIC Design Style,” 
vol. 15, no. 3, pp. 276–285, 2007. 
[53] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas, S. Mangard, P. Kocher, D. 
Genkin, Y. Yarom, and M. Hamburg, “Meltdown,” ArXiv e-prints, 2018. 
[54] P. Kocher, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. 
Prescher, M. Schwarz, and Y. Yarom, “Spectre Attacks: Exploiting Speculative 
 90 
Execution,” 2018. 
[55] F. Brglez, “A neutral netlist of 10 combinational benchmark circuits and a target 
translation in FORTRAN,” IEEE Int. Symp. Circuits Syst., 1985. 
[56] M. Hansen, H. Yalcin, and J. Hayes, “Unveiling the ISCAS-85 benchmarks: a case 
study in reverse engineering,” IEEE Des. Test Comput., vol. 16, pp. 72–80, 1999. 
[57] “Synopsys. Synopsys University Program. Available: 
http://www.synopsys.com/Community/UniversityProgram/Pages/default.aspx.” . 
[58] K. E. Murray, A. Suardi, V. Betz and G. Constantinides, "Calculated Risks: 
Quantifying Timing Error Probability with Extended Static Timing Analysis," 
in IEEE Transactions on Computer-Aided Design of Integrated Circuits and 
Systems. 
 
 
 
  
