Analysis and optimization of digital circuit dynamic behavior by Wan, Lu
© 2012 Lu Wan
ANALYSIS AND OPTIMIZATION OF DIGITAL CIRCUIT DYNAMIC
BEHAVIOR
BY
LU WAN
DISSERTATION
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2012
Urbana, Illinois
Doctoral Committee:
Associate Professor Deming Chen, Chair
Assistant Professor Rakesh Kumar
Professor Josep Torrellas
Professor Martin D. F. Wong
ABSTRACT
Very-large-scale-integration (VLSI) circuit design relies heavily on
computer-aided design (CAD) tools to synthesize and optimize circuits targeting
a cycle time tclk. To account for variations, this cycle time usually includes
a conservative timing guardband to accommodate delay changes. With the
shrinking of technology, the performance of VLSI circuits has reached the
gigahertz range. However, the traditional way of designing VLSI circuits faces
great challenges, because the performance gain shrinks along with the
cycle time tclk.
Better-than-worst-case (BTWC) designs [1] are proposed to alleviate the
problem by removing the guardband and complementing a circuit with er-
ror detection and correction mechanisms. BTWC deliberately allows timing
errors for rare cases and rectifies them with error correction mechanisms.
This new design methodology can operate a circuit more efficiently than the
traditional way.
From the performance perspective, BTWC design, especially timing spec-
ulation (TS), is a promising technique for boosting VLSI circuit through-
put. In this thesis, path constraint tuning (PCT) is proposed to design
high-throughput processor components. PCT leverages a commercial design
flow to tightly constrain frequently exercised timing paths and relax the in-
frequently exercised portion of the design. PCT is our initial effort to boost
processor component performance with the BTWC design methodology.
Because commercial CAD tools are designed to work with the
traditional design methodology, they are not well suited to BTWC designs. To
address the limitations of the existing CAD tools, a novel concept of dynamic
behavior is proposed to quantify the timing error probabilities for digital
circuits. Given input static probabilities of a circuit, its error statistics can be
analyzed and represented with a dynamic behavior curve. The throughput of
the circuit can then be derived from this behavior curve. Timed characteristic
function (TCF) is proposed as a way to derive the behavior graph analytically
using binary decision diagrams (BDD).
Based on this analytical timing-error-probability framework, a logic op-
timization algorithm is developed to utilize dual threshold voltage (dualVt)
cells to optimize a circuit’s throughput in such a way that the most dynam-
ically critical gates of a circuit can be identified and optimized.
To make the circuit-level dynamic behavior analysis scalable, a technique
called timed ternary decision diagrams (tTDD) is developed to analyze circuit
dynamic behavior. It uses a divide-and-conquer method by analyzing each
partitioned sub-circuit first and then combining the analysis results. False
path pruning and random variable compaction are proposed to enable the
computation of stabilization probabilities.
Given that dynamic behavior becomes a new optimization dimension, a dy-
namic behavior driven logic synthesis technique is proposed to re-structure
the BTWC circuit. A BTWC design flow called common case promotion
(CCP) is proposed to specifically optimize circuit dynamic behavior. Aim-
ing at improving the circuit stabilization probability for common cases, this
proposed CCP consists of (1) probability-driven re-synthesis that changes a
digital circuit’s internal structure, (2) a dynamic behavior aware SAT-based
redundancy remover that reduces area overhead, and (3) a TCF based circuit
dynamic behavior analyzer that provides optimization convergence. Utilizing
the error correction capability of the BTWC design methodology, CCP can improve
a circuit’s timing error resilience by effectively manipulating the circuit’s
dynamic behavior.
To sum up, this thesis focuses on an emerging VLSI design methodology
and proposes the use of dynamic behavior as a new measure of the quality
of VLSI circuits. Based on dynamic behavior analysis, various optimization
techniques are proposed. This work provides a thorough treatment of this
new subject.
To my parents, for their love and support
ACKNOWLEDGMENTS
During the years I spent in Champaign-Urbana to pursue my PhD degree, I
was inspired and touched by many brilliant people on the campus of Univer-
sity of Illinois. From the classrooms in Siebel Center for Computer Science
to the labs in the Coordinated Science Laboratory, from Grainger Engineer-
ing Library to the Illini Union study rooms, the diligence and dedication of
professors and a generation of young engineers always gave me strength to
carry out the challenging research tasks day in and day out.
Among those who directly enlightened me during my study, I owe the most
to my PhD adviser, Dr. Deming Chen, whose advice and patient support
allowed me to see a path to the solutions to research problems and the path
to overcome life’s challenges. I also owe thanks to Dr. Josep Torrellas for the
insightful discussions throughout the BlueShift project that turned out to
shape my interest and laid out the foundation of this thesis. I also owe Dr.
Martin D. F. Wong a debt of gratitude for his lectures in design automation
that directly inspired the solutions to many of my research problems. I
also thank Dr. Naresh R. Shanbhag and Dr. Rakesh Kumar for enlightening
discussions on my thesis research topics.
I would like to also thank my colleagues Chen Dong, Brian Greskamp,
Alex Papakonstantinou, Artem Rogachev, Ulya Karpuzcu, Scott Chilstedt,
Greg Lucas, Christine Chen and Yun Heo. Their feedback through weekly
group meetings helped me to greatly improve my work.
I am also deeply grateful to my wife Jian Sun for her support throughout
the years of my PhD study. Her dedication to our daughter Evelyn enabled
me to focus on research. She gave me strength and courage in facing chal-
lenges. Finally, I would like to express my deepest gratitude to my parents. It
is their generous support and love that has enabled me to enjoy this precious
PhD journey as well as the journey of life.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
CHAPTER 2 BETTER-THAN-WORST-CASE DESIGN METHODOLOGY
2.1 Ever-tightening design constraints
2.2 Critical path wall
2.3 Foundation of BTWC design
2.4 Microarchitecture-level BTWC design
2.5 Application-specific BTWC design
2.6 Circuit-level BTWC design
2.7 New optimization dimension for BTWC designs
CHAPTER 3 USE COMMERCIAL CAD TOOLS FOR BTWC DESIGN
3.1 Microarchitectural-level timing speculation
3.2 Delay trading for timing speculation
3.3 Identifying dynamic overshooting paths
3.4 Iterative optimization flow
3.5 Improve performance with PCT
3.6 Experimental results
CHAPTER 4 OPTIMIZE THROUGHPUT WITH DYNAMIC BEHAVIOR
4.1 Overview of DynaTune optimization
4.2 Generalized throughput model
4.3 Motivative example of DynaTune optimization
4.4 DynaTune optimization
4.5 Timed characteristic function
4.6 Dynamic behavior curve
4.7 Stepwise optimization
4.8 Experimental results
CHAPTER 5 SCALABLE DYNAMIC BEHAVIOR ANALYSIS WITH TTDD
5.1 Need for circuit dynamic behavior analysis
5.2 Works related to tTDD
5.3 tTDD preliminaries and definitions
5.4 Encode stabilization conditions with tTDD
5.5 Stabilization probability calculation
5.6 Probability of one satisfying path
5.7 Put it all together
5.8 Partitioning for tTDD
5.9 Experimental results
CHAPTER 6 COMMON CASE PROMOTION
6.1 Improve timing error resilience by manipulating dynamic behavior curve
6.2 Motivation and example
6.3 Common case promotion
6.4 Validate structure changes
6.5 Overhead control
6.6 Correctness and convergence
6.7 Experimental results
CHAPTER 7 CONCLUSION
REFERENCES
LIST OF TABLES
3.1 OpenSPARC modules used to evaluate BlueShift
3.2 PCT parameters
3.3 Power consumption and switching energy per cycle for each module implementation
4.1 DynaTune terminology and abbreviations
4.2 Three function units in Leon3 synthesized with Synopsys Design Compiler
4.3 Three function units in Leon3 w/o DynaTune optimization running in RZ mode
4.4 Three function units in Leon3 w/ DynaTune optimization running in RZ mode
4.5 Three function units in Leon3 w/o DynaTune optimization running in TU mode
4.6 Three function units in Leon3 w/ DynaTune optimization running in TU mode
5.1 Possible combinations of random variables in an input status group
5.2 A complete example of false assignment pruning
5.3 A complete example of random variable compaction
5.4 Partition procedure sequence (K=5)
5.5 Dynamic behavior curve accuracy comparison
5.6 Runtime comparison for the benchmark circuits
5.7 Accuracy compared with MFFC-based partitioning
5.8 Error under various input static probabilities
5.9 Accuracy and runtime comparison for MCNC benchmarks
6.1 Circuit profiles before and after CCP
6.2 CCP optimization performance results
LIST OF FIGURES
2.1 A flip-flop example and the setup time requirement
2.2 Critical path wall forms when a design has tight timing constraints
2.3 Leader and checker in paceline microarchitecture
2.4 Algorithmic noise-tolerance (ANT)
2.5 Capture timing error with razor logic
2.6 Telescopic unit
3.1 Delay trading enhances TS by reshaping the PE(f) curve
3.2 Circuit annotated with net transition times, showing two overshooting paths for this cycle
3.3 The BlueShift optimization flow
3.4 Performance difference of razor base and razor+PCT configurations
3.5 Power difference of razor base and razor+PCT configurations
4.1 Throughput gain from timing speculation
4.2 Example circuit to illustrate problem of static optimization for timing speculation
4.3 Behavior curve lifting can provide extra throughput gain
4.4 Local TCF and zero-delay virtual gate
4.5 BDD produced from timed characteristic function
4.6 Dynamic behavior curve for a 32-bit adder
4.7 Dynamic stepwise optimization using min-cut
4.8 Register access and execution stages of Leon3 processor
4.9 Delay distribution for operand2 bypassing lane
4.10 Delay distribution for operand2 bypassing lane after DynaTune optimization
5.1 Dynamic behavior curves of two primary outputs
5.2 Behavior graph of a node n
5.3 A partitioned sub-circuit example
5.4 Encoding stabilization conditions with tTDD
5.5 Behavior graphs of a partitioned sub-circuit’s inputs
5.6 Annotate a generic tTDD with probabilities
5.7 Compare tBDD with tTDD, tTDD models the stabilization conditions accurately with the ‘U’ edges
5.8 Example of false assignment in a tTDD
5.9 An example of unifying timing tags in a tTDD
5.10 An example of unifying phase tags in a tTDD
5.11 Partitioning examples of a small circuit
5.12 Problem of high-fan-out node on the partition boundary
5.13 An example of partitioning procedure
5.14 Dynamic behavior of a 32-bit CLA adder’s carryout bit
5.15 tTDD partition parameter K’s effect on accuracy and runtime
6.1 Dynamic behavior curves of two implementations of the same Boolean function ‘o0’ in apex2
6.2 Different dynamic behaviors of two implementations
6.3 Overall flow of common case promotion
6.4 Promote the common-cases in f = (a ∧ b) ∨ (b ∧ c) ∨ (a ∧ c)
6.5 An example of using SAT to remove circuit redundancy during CCP
CHAPTER 1
INTRODUCTION
Very-large-scale-integration (VLSI) enables integrated circuits (IC) to be
manufactured in a cost-effective manner so that millions or billions of tran-
sistors can be packed in a single chip. To meet a given performance specifi-
cation, a process called timing closure is carried out throughout the design
cycle to guarantee the worst-case delay of a chip can meet a specified timing
requirement. Static timing analysis (STA) is applied during timing closure
to check whether the delay of the longest path is within a given timing con-
straint. A path with a worst-case delay equal to or larger than the required
timing constraint is called a critical path. The endpoint of such a critical
path is called a critical primary output.
With the existence of critical paths, optimization techniques, ranging from
logic synthesis to place and route, are carried out to reduce the path delay
at the cost of area and power. At the same time, technology scaling, which
decreases the dimension of transistors, allows more and more functions to be
integrated in a single chip. However, this ever-increasing complexity of chips
makes timing closure more difficult generation by generation.
In this traditional timing-closure-driven design methodology, circuit
optimization is tuned to the worst-case conditions to guarantee error-free
computation. But this optimization may also lead to very inefficient designs,
because ensuring that every timing path meets a timing requirement for the
worst-case scenario incurs significant area and power cost.
As a matter of fact, among the critical primary outputs, some can be
stabilized very quickly by input vectors, even if their topological delays from
primary inputs are very long. Exploiting this dynamic property of a circuit,
a novel design methodology called better-than-worst-case (BTWC) [1] design
has been proposed as an alternative way to operate a circuit more efficiently
by deliberately allowing timing errors for rare cases and rectifying them with
error correction mechanisms.
The BTWC design methodology can be used at the micro-architectural level
or the circuit level. Micro-architectural BTWC designs, such as paceline [2],
run a pair of CPU cores in parallel: a lead core produces results
speculatively, while a slower checker core checks the results of
the lead core to ensure the correctness of execution.
At the circuit level, razor logic is proposed [3] to use a main flip-flop to
produce speculative output which is then checked by a shadow flip-flop. As
long as the static critical paths are rarely exercised, the penalty of correcting
the erroneous speculative results is very small. Using the fast speculative re-
sults creates opportunities to operate a circuit more efficiently. The increase
in efficiency can translate to throughput performance gain or reliability im-
provement.
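The main/shadow arrangement can be illustrated with a minimal behavioral sketch. It rests on assumptions that are not stated in the text above: the shadow flip-flop is assumed to sample the same data signal a fixed `shadow_delay` after the main flip-flop, and the data signal is modeled as switching once, from `old_value` to `new_value`, at time `arrival`. All function and parameter names are hypothetical.

```python
def sample(arrival, old_value, new_value, sample_time):
    """Value seen by a flip-flop that samples the data signal at sample_time."""
    return new_value if arrival <= sample_time else old_value


def razor_cycle(arrival, old_value, new_value, t_clk, shadow_delay):
    """Return (main, shadow, error) for one clock cycle.

    Assumption: the shadow flip-flop samples shadow_delay after the main one,
    so it still sees the correct value when the data arrives slightly late.
    """
    main = sample(arrival, old_value, new_value, t_clk)
    shadow = sample(arrival, old_value, new_value, t_clk + shadow_delay)
    return main, shadow, main != shadow


# Data arrives before the clock edge: both flip-flops agree, no error.
on_time = razor_cycle(arrival=0.8, old_value=0, new_value=1,
                      t_clk=1.0, shadow_delay=0.3)

# Data arrives after the clock edge but before the shadow samples:
# the speculative value is stale and the mismatch flags a timing error.
late = razor_cycle(arrival=1.1, old_value=0, new_value=1,
                   t_clk=1.0, shadow_delay=0.3)
```

In this simplified model an error is detectable only when the data arrives within `shadow_delay` of the clock edge; anything later would be missed by both samples.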
To use it for high throughput, a BTWC-designed processor component can
be over-clocked to the point where timing errors occur. An error detection
mechanism coupled with the processor component is used to detect timing
errors due to over-clocking. Upon the detection of such a timing error, an
error correction mechanism will kick in to rectify it with some performance
penalty. As long as the error probability is low, the processor component
designed with BTWC can gain throughput performance. Exploiting high
throughput of BTWC designed components by over-clocking is called timing
speculation (TS) [4].
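The trade-off between a faster clock and the cost of error recovery can be made concrete with a small numerical sketch. The model below is an illustrative assumption and is not taken from this dissertation: each cycle suffers a timing error with probability `p_err`, and every error costs a fixed number of extra recovery cycles. The names `effective_throughput`, `p_err` and `penalty_cycles` are hypothetical.

```python
def effective_throughput(freq_ghz, p_err, penalty_cycles):
    """Useful operations per nanosecond under a simple recovery model.

    Assumption (illustrative only): each cycle has a timing error with
    probability p_err, and each error costs penalty_cycles extra cycles.
    """
    if not 0.0 <= p_err <= 1.0:
        raise ValueError("p_err must be a probability")
    avg_cycles_per_op = 1.0 + p_err * penalty_cycles
    return freq_ghz / avg_cycles_per_op


# Error-free operation at the guardbanded frequency.
baseline = effective_throughput(1.0, 0.0, 5)

# Mild over-clocking with rare errors: a net gain.
overclocked = effective_throughput(1.2, 0.01, 5)

# Aggressive over-clocking with frequent errors: a net loss.
too_aggressive = effective_throughput(1.5, 0.2, 5)
```

Under these assumed numbers, mild over-clocking beats the baseline while aggressive over-clocking falls below it, which is why the error probability at a given clock target matters.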
Another promising application of BTWC design is to use it for delay vari-
ation tolerance. Dynamic variation, such as voltage droop or temperature
fluctuation, can deteriorate the performance of a circuit [5]. As a result, tim-
ing errors may occur during normal operation by clocking a circuit with a
static cycle time tclk when dynamic variations deteriorate the activated path
delay to a point beyond tclk. Equipping a circuit with integrated error de-
tection and correction mechanism enables it to still operate correctly under
such delay variations, thus enhancing its reliability.
To effectively utilize a BTWC circuit, it is desirable to know how it re-
sponds to inputs statistically. Knowing such statistics can help us to under-
stand which portion of the input space is commonly activated, thus provid-
ing insights into how to improve a BTWC circuit’s performance for common
cases. To capture this statistical information, a new concept of dynamic
behavior is proposed in this dissertation to represent a circuit’s stochastic
response to input vectors analytically. Formal ways to derive the circuit dy-
namic behavior are provided in this study. Different approaches of deriving
dynamic behavior by using global binary decision diagrams (BDD) or lo-
cal timed ternary decision diagrams (tTDD) are compared. This study also
demonstrates how to utilize the proposed dynamic behavior to optimize for
BTWC circuit performance using a dual threshold voltage (dualVt) assign-
ment approach and a logic synthesis approach.
This dissertation is organized as follows. In Chapter 2, the design
requirements and challenges in the deep sub-micron era are summarized. The use
of the BTWC methodology to mitigate the critical path wall effect is reviewed.
Existing work on micro-architecture-level, algorithm-level and circuit-level
BTWC designs is briefly summarized. The need for innovative CAD support for
BTWC design is pointed out.
Chapter 3 presents a new approach of designing BTWC circuits where
the processor itself is designed from the ground up for timing speculation
(TS) by using a novel path constraint tuning (PCT) methodology exploit-
ing existing commercial CAD tools. The idea is to identify and optimize
the most frequently-exercised critical paths in the design, at the expense of
the rarely-exercised static critical paths, which are allowed to occasionally
experience timing errors. The experimental results show the performance
potential of TS and also suggest that innovations in CAD techniques that
optimize specifically for BTWC designs are required.
To address the CAD problems for BTWC designs, in Chapter 4 an an-
alytical approach is proposed to compute the dynamic behavior curve of a
circuit using timed characteristic function (TCF) and BDD. The dynamic
behavior of a digital circuit is captured in the form of a dynamic behavior
curve, which describes the probability of a circuit producing correct outputs
at any given clock target tclk. A novel circuit optimization technique called
DynaTune is proposed specifically for BTWC design. It optimizes a circuit
by assigning dualVt cells based on the derived dynamic behavior curves. It
spends more optimization effort on the more dynamically critical gates, thus
improving a circuit’s common-case throughput.
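To make the notion of a dynamic behavior curve more tangible, the sketch below computes one by brute force for a tiny hypothetical circuit. It does not use TCF or BDDs; it simply enumerates all input vectors. The assumptions are: all primary inputs are stable at time 0 and are independent, each with a given static probability of being 1; and a gate output stabilizes one gate delay after its earliest controlling input (0 for AND, 1 for OR), or after its latest input when no input is controlling. The netlist and all names are made up for illustration.

```python
from itertools import product

# Hypothetical netlist: name -> (gate type, fan-in names, gate delay).
# Gates are listed in topological order.
CIRCUIT = {
    "n1": ("AND", ["a", "b"], 1.0),
    "n2": ("AND", ["n1", "c"], 1.0),
    "out": ("OR", ["n2", "d"], 1.0),
}
INPUTS = ["a", "b", "c", "d"]
CONTROLLING = {"AND": 0, "OR": 1}


def evaluate(vector):
    """Return {node: (final value, stabilization time)} for one input vector."""
    state = {name: (vector[name], 0.0) for name in INPUTS}
    for name, (kind, fanin, delay) in CIRCUIT.items():
        ins = [state[f] for f in fanin]
        ctrl = CONTROLLING[kind]
        ctrl_times = [t for v, t in ins if v == ctrl]
        if ctrl_times:
            # The earliest controlling input decides the output.
            state[name] = (ctrl, min(ctrl_times) + delay)
        else:
            # No controlling input: wait for the latest input.
            state[name] = (1 - ctrl, max(t for _, t in ins) + delay)
    return state


def behavior_curve(output, p_one, t_points):
    """P(output is stable by time t) for each t, by exhaustive enumeration."""
    curve = []
    for t in t_points:
        prob = 0.0
        for bits in product([0, 1], repeat=len(INPUTS)):
            vector = dict(zip(INPUTS, bits))
            weight = 1.0
            for name, bit in vector.items():
                weight *= p_one[name] if bit else 1.0 - p_one[name]
            if evaluate(vector)[output][1] <= t:
                prob += weight
        curve.append(prob)
    return curve


p_one = {name: 0.5 for name in INPUTS}
curve = behavior_curve("out", p_one, [1.0, 2.0, 3.0])  # [0.5, 0.75, 1.0]
```

Although the topological delay of `out` is 3 time units, half of the input vectors stabilize it after 1 unit because `d = 1` controls the OR gate. Exhaustive enumeration grows exponentially with the number of inputs, which is why an analytical representation is needed for real circuits.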
DynaTune extensively uses global BDDs to capture the dynamic behav-
ior. The size of BDD heavily depends on variable ordering and the circuit
functionality. This may give rise to scalability issues. To make the behavior
curve analysis more scalable, Chapter 5 presents a novel tTDD technique to
analyze the circuit’s dynamic behavior from partitioned sub-circuits. tTDD
is used to encode stabilization conditions for primary outputs. To compute
probabilities based on a tTDD and preserve probability calculation accuracy,
false assignment pruning and random variable compaction are proposed to
take care of the temporal correlation due to reconvergent nets. A novel
partitioning algorithm is also proposed to reduce the structural correlation
induced by partitioning.
In Chapter 6, a novel logic re-synthesis flow, called common case promo-
tion (CCP) is proposed to improve a circuit’s timing error resilience utilizing
dynamic behavior curves. CCP optimizes circuit dynamic behavior as a new
optimization dimension. Aiming at improving the circuit error resilience for
common cases, CCP utilizes probability-driven re-synthesis, a SAT-based
redundancy remover and dynamic behavior analysis that provides optimiza-
tion convergence. The experimental results show the possibility of improving
circuit timing error resilience with negligible design overhead. Finally, Chap-
ter 7 concludes this dissertation.
CHAPTER 2
BETTER-THAN-WORST-CASE DESIGN
METHODOLOGY
Driven by Moore’s law, the VLSI industry has entered the gigahertz
era. VLSI designs tend to have very aggressive timing constraints and tight
power constraints, resulting in a critical path wall effect. This critical path
wall creates significant challenges to timing closure and also gives rise to
reliability issues. To mitigate the critical path wall effect, a novel design
methodology called better-than-worst-case (BTWC) design is proposed.
2.1 Ever-tightening design constraints
To correctly perform the specified function, an IC chip relies on accurate
synchronization among its sub-modules. Designing with synchronous timing
elements, such as flip-flops, is generally referred to as synchronous design. In
contrast, a design is called an asynchronous design if its timing elements are
self-timed. Synchronous design is the most widely used design style,
especially for digital circuits, and is used in all types of VLSI products,
including general-purpose CPUs, application-specific integrated circuits
(ASICs), and field-programmable gate arrays (FPGAs).
In a synchronous design, clocks play a central role in making sure that its
sub-modules interact with one another in a disciplined manner. A synchronous
design usually utilizes flip-flops as the boundary to separate combinational
logic groups. These flip-flops are synchronized by a special reference signal,
called the clock. In a traditional synchronous design, every combinational logic
group is guaranteed to finish its computation within a specified cycle time tclk.
As shown in Figure 2.1, the simplest flip-flop typically consists of two
inputs, D and CK, and an output Q. D is an input pin that receives the
incoming data. CK is an input pin that receives the clock signal. Q is
an output pin that sends the data received by D to the trailing logic when it
is activated by the CK pin. Depending on implementation, a flip-flop can
be rising-clock-edge activated or falling-clock-edge activated. Since
rising-clock-edge activated flip-flops are most commonly used in real IC
products, all flip-flops are assumed to be rising-clock-edge activated in the
rest of this dissertation. A more complicated flip-flop can also include a set
pin and/or a reset pin to initialize it to a desired value.
Figure 2.1: A flip-flop example and the setup time requirement
(Figure labels: one pipeline stage with two flip-flops, each with D, Q and CK pins, around a block of combinational logic; the longest path, tmaxdelay; the shortest path, tmindelay; CK and D waveforms over one tclk, with the regions "computation (data unstable)" and "data stable", and the intervals tmaxdelay and toverhead.)
To ensure D can capture correct data, a flip-flop and the leading com-
binational logic that feeds into its D pin need to satisfy a set of timing
requirements. To guarantee D can capture a stabilized input value, a setup
timing relationship is required as described in formula 2.1.
$$t_{maxdelay} + t_{setup} < t_{clk} - t_{skew} - t_{uncertainty} - t_{guardband} \tag{2.1}$$
where $t_{maxdelay}$ is the longest path delay of the combinational logic that feeds
into the flip-flop’s D pin, $t_{setup}$ is the minimal setup time required by the cell
library, $t_{clk}$ is the clock period, $t_{skew}$ is the maximum misalignment of clock edges
among flip-flops, $t_{uncertainty}$ is the amount of delay uncertainty at the manufacturing
stage, and $t_{guardband}$ is a reserved timing margin to account for variations that
may deteriorate delay. Formula 2.1 is a simplified setup time requirement,
which lists the most important factors. Other factors, such as clock jitter
and clock slew, play secondary roles and thus are ignored here for clarity.
In addition to the setup time requirement, the data on the D pin must also
remain unchanged for a specified minimal delay thold, called the hold time. This
hold time requirement, described in formula 2.2, ensures that transitions in
the combinational logic are not so fast that signals along the short paths
propagate through the combinational logic and overwrite the stabilized value
at the D pin before it has been captured.
$$t_{mindelay} > t_{hold} + t_{skew} + t_{uncertainty} \tag{2.2}$$
Setup time and hold time requirements are checked at design time using
a technique called static timing analysis (STA) [6]. This traditional way of
timing analysis searches for the longest timing path between pipeline stages
and uses its delay as tmaxdelay to check against the setup time requirement
in formula 2.1. For hold time requirement, STA looks for the shortest path
between pipeline stages and uses its delay as tmindelay to check against the
formula 2.2.
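The search for the longest and shortest paths can be sketched as a single pass over the gates in topological order. This is a minimal illustration under simplifying assumptions, not a description of any particular STA tool: each gate has one fixed delay, wires have zero delay, and all primary inputs arrive at time 0. The netlist and names are hypothetical.

```python
# Hypothetical combinational block between two register boundaries.
# node -> (gate delay in ps, fan-in nodes); gates are listed in topological order.
GATES = {
    "g1": (200, ["in_a", "in_b"]),
    "g2": (100, ["in_b"]),
    "g3": (300, ["g1", "g2"]),
    "g4": (100, ["g2"]),
}
PRIMARY_INPUTS = ["in_a", "in_b"]
ENDPOINTS = ["g3", "g4"]  # nodes that feed flip-flop D pins


def arrival_times(gates, primary_inputs):
    """Return (latest, earliest) arrival time of every node.

    Relies on gates being listed in topological order, so every fan-in
    has already been processed when a gate is visited.
    """
    latest = {pi: 0 for pi in primary_inputs}
    earliest = {pi: 0 for pi in primary_inputs}
    for node, (delay, fanin) in gates.items():
        latest[node] = delay + max(latest[f] for f in fanin)
        earliest[node] = delay + min(earliest[f] for f in fanin)
    return latest, earliest


latest, earliest = arrival_times(GATES, PRIMARY_INPUTS)
t_maxdelay = max(latest[e] for e in ENDPOINTS)    # longest path, for the setup check
t_mindelay = min(earliest[e] for e in ENDPOINTS)  # shortest path, for the hold check
```

Here `t_maxdelay` is 500 ps (through `g1` and `g3`) and `t_mindelay` is 200 ps (through `g2` and `g4`). The analysis is purely topological: it does not consider which paths the input vectors actually exercise.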
In the traditional design flow, the delay of the longest path determines the
minimal cycle time at which the chip can operate. The smaller tmaxdelay is, the
higher the frequency at which a chip can run. In other words, the setup time
requirement is the primary factor that determines how fast a chip can run.
Therefore, it is the focus of this study.
In the setup time requirement described in formula 2.1, tskew is determined
by the clock tree, which is inserted in the physical design stage that follows
logic design. tuncertainty is primarily determined by the process variations
and is usually suggested by the foundries. tsetup is specified in the technology
library. tguardband takes into account the other variations that may affect the
delay, including supply voltage fluctuation and temperature changes during
operation. The above timing terms are determined by manufacturing
process factors or other physical factors that are beyond the scope of logic
design and optimization.
For clarity, we can group the timing terms that cannot be controlled during
the logic design stage into a lumped term toverhead as follows:
$$t_{overhead} = t_{setup} + t_{skew} + t_{uncertainty} + t_{guardband} \tag{2.3}$$
The remaining tclk and tmaxdelay are the timing factors that can actually
be controlled at logic design and optimization stages. Substituting toverhead
into formula 2.1, we have
$$t_{maxdelay} + t_{overhead} < t_{clk} \tag{2.4}$$
To satisfy the setup timing requirement, the delay of the longest path in
combinational logic between the register boundaries plus the delay overhead
(tmaxdelay + toverhead) has to be pushed to be shorter than the cycle time tclk.
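The lumped setup requirement can be checked with a few lines of arithmetic. The sketch below follows formulas 2.3 and 2.4 directly; the function names and all numeric values (in picoseconds) are made up for illustration.

```python
def overhead(t_setup, t_skew, t_uncertainty, t_guardband):
    """Lumped non-controllable timing terms, as in formula 2.3."""
    return t_setup + t_skew + t_uncertainty + t_guardband


def setup_slack(t_clk, t_maxdelay, t_overhead):
    """Positive slack means formula 2.4 holds; zero or negative means it fails."""
    return t_clk - (t_maxdelay + t_overhead)


def meets_setup(t_clk, t_maxdelay, t_overhead):
    """Strict inequality of formula 2.4."""
    return setup_slack(t_clk, t_maxdelay, t_overhead) > 0


# Illustrative values in picoseconds (not taken from any real library).
t_ov = overhead(t_setup=50, t_skew=50, t_uncertainty=100, t_guardband=100)

slack_ok = setup_slack(t_clk=1000, t_maxdelay=650, t_overhead=t_ov)
slack_bad = setup_slack(t_clk=1000, t_maxdelay=720, t_overhead=t_ov)
```

With these assumed numbers the overhead is 300 ps, so a 650 ps longest path leaves 50 ps of slack in a 1000 ps cycle, while a 720 ps path violates the requirement by 20 ps.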
As VLSI technology continues to shrink, the share of toverhead has grown
significantly, to as much as 30% of the total clock period. At the same
time, VLSI chips are becoming more and more complex to accommodate
sophisticated functionality. As a result, more and more logic cells need to be
packed between register boundaries. Moreover, the increasing demand for
performance requires IC chips to run faster, with a shorter tclk from one
generation to the next. A decreasing tclk, an increasing toverhead and the
resulting shrinking budget for tmaxdelay make design more challenging with
every generation.
2.2 Critical path wall
When a circuit is constrained with loose timing constraints, the path delay
distribution spreads out as shown by the dashed curve in Figure 2.2. Some
paths have zero slack just to meet the timing requirement, while other paths
have delay shorter than the target clock period, resulting in positive slack.
Unfortunately, most IC chips have very tight timing constraints because
of the decreasing cycle time tclk, as pointed out in Section 2.1. To meet these
tight timing constraints, CAD tools must optimize a large number of long
paths using a variety of synthesis techniques, such as logic balancing [7, 8],
bi-decomposition [9], delay driven decomposition [10], performance-driven
resynthesis [11], etc. These logic optimization techniques tend to break down
a long path and substitute its function with several carefully balanced shorter
paths. On one hand, the delay of a long path can be reduced; on the other
hand, more paths with delay close to tclk are created. As a result, a large
number of paths with similar delay close to the cycle time are created. This
concentration of delay distribution close to tclk builds up a steep critical path
wall, as illustrated by the solid curve in Figure 2.2.
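One simple way to quantify how steep the wall is would be to measure the fraction of paths whose slack falls within a small margin of the clock period. The sketch below is only an illustration of that idea; the metric, the function name and the path delays (in picoseconds) are all hypothetical.

```python
def near_critical_fraction(path_delays, t_clk, margin):
    """Fraction of paths whose slack (t_clk - delay) is at most margin."""
    if not path_delays:
        raise ValueError("path_delays must not be empty")
    near = sum(1 for d in path_delays if t_clk - d <= margin)
    return near / len(path_delays)


# Made-up path delays for a loosely constrained design (t_clk = 1000 ps):
# the delays are spread out and few paths sit close to the clock period.
relaxed_delays = [400, 550, 700, 820, 900, 1000]
relaxed = near_critical_fraction(relaxed_delays, t_clk=1000, margin=100)

# Made-up path delays for a tightly constrained design (t_clk = 800 ps):
# optimization has pushed every path close to the clock period.
tight_delays = [720, 740, 760, 770, 780, 790, 800, 800]
tight = near_critical_fraction(tight_delays, t_clk=800, margin=80)
```

In this toy data, one third of the paths in the relaxed design lie within 10% of the clock period, whereas all of them do in the tight design.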
At the same time, the need to design for low power exacerbates the critical
path wall problem. Design for low power requires the power consumption of
the design to be constrained within a given budget. A number of optimization
techniques ranging from logic synthesis to physical layout have been proposed
in the literature to reduce the power consumption of VLSI circuits, including
low-power technology mapping [12], low-power gate sizing [13], power-aware
gate decomposition [14], reducing glitch power using path balancing [15], and
reducing leakage power by dual-Vt assignment [16].
All these power reduction techniques share a common characteristic: they
reduce power at the cost of timing. They take advantage of the existence
of short paths and apply techniques such as mapping and
sizing to reduce the power consumption on these short paths. Inevitably,
the cost of lengthening the delay on the short paths must be paid.
Obviously, reducing power makes the critical path wall problem even worse.
On one hand, logic synthesis tools try to shorten the delay on the long paths,
while on the other hand, power optimization tools lengthen the short paths
for low power thus creating more long paths.
[Figure 2.2 plot: path delay distribution (vertical axis) versus clock period in ns (horizontal axis). With a relaxed timing constraint the path delay distribution spreads out; with an aggressive timing constraint a critical path wall forms.]
Figure 2.2: Critical path wall forms when a design has tight timing
constraints
The critical path wall effect brings serious timing closure challenges and
reliability issues. At design time, logic optimization faces an abrupt increase
of difficulty when the delay of the circuit is pushed close to critical path wall,
where any further timing optimization incurs significant inefficiency because
an exponentially increasing number of timing paths need to be shortened.
During operation, this critical path wall makes a design less reliable. Specif-
ically, when the delay of the circuit deteriorates due to dynamic variations
such as supply voltage fluctuation or environment temperature changes, a
large number of timing paths will be affected and thus may fail simultane-
ously.
To demonstrate how the critical path wall can affect the reliability of IC
chips, a well-known study [17] was carried out by Professor Janak Patel of
the University of Illinois at Urbana-Champaign. He stressed an old 233MHz
PowerPC 750 processor and a 2GHz Pentium M processor with a program
designed to exercise all major functional units. For each run of the program,
the supply voltage was decreased by a fixed step. His study showed that for
the old generation 233MHz PowerPC processor, the error rate increased grad-
ually with decreasing supply voltage, while the modern Pentium M processor
experienced massive simultaneous errors resulting in system crash when the
supply voltage was decreased to a critical point of 1.052V, suggesting the
existence of such a critical path wall in modern IC chips. To mitigate this
critical path wall effect, a new design methodology different from the tradi-
tional wisdom is required.
2.3 Foundation of BTWC design
Due to process and environmental variations, the traditional worst-case-oriented
design methodology reserves a conservative timing margin for safety at design
time to guarantee error-free operation under delay variations. This extra
timing margin is called a guardband.
STA is applied throughout the design cycle to check if there exists any
critical path that has a large tmaxdelay so that it may consume the reserved
timing margin in the guardband. STA reports these critical paths to synthesis
or place-and-route tools. Optimization will be applied on those critical paths
at the cost of area or power.
As its name suggests, STA checks setup time requirements only considering
the topological length of timing paths in a static manner. It ignores the
dynamic characteristic of timing paths, i.e., how frequently signals actually
transmit along those long paths. In fact, as pointed out in [1,18], some long
paths are rarely exercised for typical applications.
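As a minimal sketch of the static check described above, the following Python computes the topological longest-path delay of a small gate-level graph and compares it with the setup budget. The netlist, the gate delays, the function names, and the assumption that the setup requirement has the form tmaxdelay < tclk − toverhead are illustrative only; they are not taken from any particular STA tool.

```python
from functools import lru_cache

# Hypothetical netlist: node -> (gate delay in ns, list of fan-in nodes).
NETLIST = {
    "a": (0.0, []),
    "b": (0.0, []),
    "g1": (0.3, ["a", "b"]),
    "g2": (0.2, ["g1"]),
    "g3": (0.4, ["g1", "b"]),
    "out": (0.1, ["g2", "g3"]),
}


def longest_arrival(netlist, node):
    """Return the topological longest-path arrival time at `node`.

    Like STA, this looks only at the structure of the graph. It does not
    know how often each path is actually exercised.
    """

    @lru_cache(maxsize=None)
    def arrival(name):
        delay, fanin = netlist[name]
        # Latest fan-in arrival plus this gate's own delay.
        return delay + max((arrival(f) for f in fanin), default=0.0)

    return arrival(node)


def setup_slack(netlist, endpoint, t_clk, t_overhead):
    """Slack against the assumed budget t_clk - t_overhead (negative = violation)."""
    return (t_clk - t_overhead) - longest_arrival(netlist, endpoint)
```

A negative slack flags the endpoint as critical regardless of whether the offending path is ever sensitized, which is exactly the static view that the dynamic analysis in this section questions.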
In reality, the criticality of the critical paths in a circuit may differ dy-
namically. Some critical paths may be exercised frequently, while some other
critical paths may only be exercised rarely. An example of such cases is stud-
ied in [1]. The authors study an adder design where the longest critical path
is the carry chain from the adder’s carry-in bit to its carry-out bit. Their
study shows that for typical applications the activity on the carry chain
concentrates on the lower bits. The probability of exercising the whole carry
chain is very small. They also observed a similar effect for other functional
units. Therefore, spending effort in optimizing those rarely exercised long
paths may incur unnecessary design overhead.
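The carry-chain observation can be illustrated with a small experiment. The sketch below measures, for a pair of operands, the longest run of bit positions that a single carry travels through in a ripple-carry addition, and averages it over random operand pairs. Uniformly random operands are an assumption made here for simplicity; [1] studies typical applications, whose operand distributions differ, and the function names are hypothetical.

```python
import random


def longest_carry_chain(a, b, width=32):
    """Longest run of bit positions traversed by one carry when adding a and b."""
    longest = current = 0
    carry = 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        generate, propagate = ai & bi, ai ^ bi
        if generate:
            current = 1        # a new carry is generated at this bit
        elif propagate and carry:
            current += 1       # the incoming carry ripples through this bit
        else:
            current = 0        # no carry leaves this bit
        carry = generate | (propagate & carry)
        longest = max(longest, current)
    return longest


def mean_longest_chain(samples=2000, width=32, seed=0):
    """Average longest carry chain over random operand pairs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        a = rng.getrandbits(width)
        b = rng.getrandbits(width)
        total += longest_carry_chain(a, b, width)
    return total / samples
```

Only an operand pair such as 0b1111 + 0b0001 drives a carry through every bit position; for most pairs the longest chain is much shorter than the adder width, which is why the full carry path is rarely exercised.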
Based on the observation that a critical path’s criticality is affected by
dynamic factors, BTWC design proposes a new chip design methodology that
discards the static correctness guarantee for the worst-case design corners.
It proposes to avoid over-designing the parts that are not frequently utilized
and to relax the parts of the design that are not dynamically critical to the extent that the
delay on the longest path in such parts no longer satisfies the setup timing
requirement described in formula 2.4.
As a cost, removing the guardband no longer guarantees that a circuit
operates with 100% correctness. Some timing paths may become longer
than tclk. When exercised, they need more than a clock period to produce
the correct result at the outputs. If those outputs are sampled with a clock
period of tclk, erroneous data may be captured by flip-flops. Such errors are
called timing errors, because they manifest only if there is not enough time
to compute. If enough computation time is allowed, such timing errors will
eventually disappear.
BTWC design allows a circuit to be designed with a relaxed timing con-
straint. Long paths are allowed to exist as long as they are rarely exercised.
When these long paths are exercised and timing errors occur, these errors
will be detected by an integrated error detection mechanism and then be
rectified by an error correction mechanism.
To handle timing errors, the BTWC design methodology uses error
correction mechanisms to protect a circuit against rare-case timing errors.
As long as timing errors manifest infrequently, the error correction
mechanism rarely kicks in, resulting in a very small performance penalty.
Overall, a circuit’s performance is improved because the common cases are
operated more efficiently.
With the integrated error detection/correction mechanism, BTWC design
methodology abandons the use of guardband so that a circuit can be operated
more efficiently for common cases. It in effect allows a VLSI circuit to be
designed with a loose timing constraint for the rarely exercised part, hence
alleviating the critical path wall problem.
Depending on how timing errors are handled, BTWC design can be further
classified into three categories: microarchitecture-level BTWC
design, application-level BTWC design, and circuit-level BTWC design. In
the rest of this chapter, I will briefly introduce the first two and then focus
on the category of circuit-level BTWC design techniques to which the novel
CAD techniques proposed in this dissertation belong.
2.4 Microarchitecture-level BTWC design
At the microarchitecture level, timing speculation (TS) is a common application
[2, 3, 19–22] of the BTWC design methodology to harvest high processor
throughput (generally defined as the amount of computation performed in
a time unit). The idea of TS is to increase a processor’s clock frequency
to the point where timing errors begin to occur and to equip the processor
with micro-architectural techniques for detecting and correcting the resulting
errors.
A common approach of implementing TS is to use two paired cores in
a leader-checker organization, with both running the same (or very simi-
lar) code, as in paceline [2], slipstream [22], optimistic tandem [21], and
reunion [23]. The leader runs speculatively and can relax functional correct-
ness. The checker executes correctly and may be sped up by hints from the
leader as it checks the leader’s work.
For example, paceline [2] as shown in Figure 2.3 uses two processor cores (a
leader core and a checker core) to perform timing speculation. The leader is
clocked at a frequency higher than the safe error-free frequency f0, while the
checker is clocked at lower frequency fr. The leader sends branch results to
the checker and also prefetches data into a shared level-2 (L2) cache, allowing
the checker to keep up. These two cores periodically exchange checkpoints
of architectural state. If they disagree, the checker copies its register state
to the leader for error correction.
[Figure 2.3 image: page excerpt from the Paceline paper [2] (PACT 2007). Panel "Paceline CMP with 16 cores": cores P1 to P16 on a CMP die, grouped in pairs that share an L2 cache, with the added hardware modifications marked. Panel "The Paceline microarchitecture": leader P0 and checker P1, each with an L1 cache and hashed register checkpoints, connected through the BQ and VQ to the coherent L2 interconnect.]
Figure 2.3: Leader and checker in paceline microarchitecture [2]
Another type of leader-checker architecture pursues high frequency by
making the leader core functionally incorrect by design. Optimistic tan-
dem [21] achieves this by pruning infrequently-used functionality from the
leader. A dynamic implementation verification architecture (DIVA) [19] can
also be used in this manner by using a functionally incorrect main pipeline.
This approach requires the checker to be dedicated and always on.
One limitation of the above timing speculation work is that it is based
on traditional chip design methodology; therefore, the performance may be
limited by the nature of the traditional design flow. To address this, a new
design flow for TS processors is proposed in Chapter 3. It leverages
existing commercial tools, but the design flow is modified to take into
account the needs of the BTWC design methodology.
2.5 Application-specific BTWC design
BTWC design methodology is also explored for designing application-specific
circuits, especially in signal processing and communication systems. Some-
times the end user may not need the full digital resolution of the compu-
tational results from certain applications [24]. Representative applications
include sensor networks, speech recognition, image recognition, data mining,
and music synthesis. Taking image recognition as an example, to distinguish
a person from a car, the color of the car is irrelevant. Similarly, in speech
recognition, even if some words in a sentence are unclear, humans can still
understand the sentence based on the context. This algorithmic-level error
tolerance gives rise to another family of BTWC design styles, referred to as
algorithmic noise tolerance (ANT) in [25].
ANT utilizes the fact that sometimes low resolution approximation of a
full resolution computation result can still provide sufficient information for
certain signal processing applications. ANT designs use low resolution in-
termediate results when full resolution is unavailable or too inefficient to
acquire.
An example ANT circuit is shown in Figure 2.4. The main block is a computation
unit whose timing may be relaxed for power saving; its output ya is
the actual computation result, which may occasionally contain timing errors.
In parallel, a simplified estimator provides an approximate version ye of the
error-free result. The two results are compared by a comparator. As long as
their difference is smaller than a preset threshold, the main block output is
used. Otherwise, a timing error is assumed and the estimator output is used
instead. With the integration of a simple estimator,
the design constraint for the main block may be relaxed, thus providing
opportunities for power and energy savings.
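The selection step can be written down compactly. The sketch below follows the decision rule given in [25], in which the main block output ya is kept when it lies within a threshold of the estimator output ye and the estimate is substituted otherwise. The function name, the 8-bit example values, and the threshold are illustrative assumptions.

```python
def ant_select(y_a, y_e, tau):
    """ANT decision rule: keep the main block output unless it deviates
    from the estimator output by tau or more."""
    return y_a if abs(y_a - y_e) < tau else y_e


# Illustrative 8-bit example (all values are assumptions):
ERROR_FREE = 100                     # what the main block should produce
ESTIMATE = 96                        # coarse, low-resolution estimate
MSB_FLIPPED = ERROR_FREE ^ 0x80      # a timing error corrupts the top bit

no_error_output = ant_select(ERROR_FREE, ESTIMATE, tau=16)
corrected_output = ant_select(MSB_FLIPPED, ESTIMATE, tau=16)
```

In the error-free case the small estimation error stays below the threshold and the full-resolution result passes through. When the most significant bit is corrupted, the deviation is far above the threshold and the output falls back to the estimate.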
[Figure 2.4 image: page excerpt from [25]. Panel (a), ANT framework: a main block with output ya and an estimator with output ye feed a comparator that tests |ya − ye| against a threshold to select the corrected output ŷ. Panel (b): error distributions of the hardware error η and the estimation error e.]
Figure 2.4: Algorithmic noise-tolerance (ANT) [25]
The authors of [25] observed that most signal processing computations proceed
from the least significant bit upward, so timing errors tend to fall on the
most significant bits. Such errors are usually large in magnitude,
so they can be easily detected by the comparator.
ANT was employed in prototype IC designs including finite impulse re-
sponse (FIR) filters [26], an error-resilient low-power motion estimator [27]
and Viterbi decoders [28]. It was demonstrated that up to 3× energy savings
can be achieved using ANT.
2.6 Circuit-level BTWC design
At the circuit level, there is a parallel effort on VLSI circuit design targeting
high throughput. For asynchronous designs, speculative completion [29] is
proposed to detect the computation completion for asynchronous units. It
associates multiple speculative delay models for different (e.g., worst-case
versus best-case) speeds of early completion. Such an asynchronous unit
is allowed to speculatively complete operations when the associated trigger
conditions are detected.
For synchronous designs, a common way of detecting timing errors is
razor logic [3], which uses double flip-flops to
detect a timing error. Upon detecting a timing error, it notifies the
processor of the error in the next cycle.
Another way of detecting timing errors is the telescopic unit [30].
A telescopic unit transforms a single-cycle unit into a variable-cycle
implementation, which has data-dependent latency. The clock rate for such a modified
unit can be sped up to match the common case; i.e., the common-case com-
putation can still finish in one clock cycle. Complex computation will be
split over two (or more) cycles. Average throughput can be improved if the
probability of a long-latency computation is much smaller than that of a
short-latency one [30,31].
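A back-of-the-envelope model makes this trade-off concrete. The sketch assumes that a fraction p_long of operations needs long_cycles cycles while the rest finish in one cycle; the symbols, function names, and numbers are illustrative and are not taken from [30, 31].

```python
def telescopic_throughput(t_clk_fast, p_long, long_cycles=2):
    """Operations per time unit for a variable-latency unit."""
    avg_cycles = (1.0 - p_long) + p_long * long_cycles
    return 1.0 / (t_clk_fast * avg_cycles)


def speedup_over_fixed(t_clk_worst, t_clk_fast, p_long, long_cycles=2):
    """Throughput relative to a single-cycle unit clocked for the worst case."""
    fixed_throughput = 1.0 / t_clk_worst
    return telescopic_throughput(t_clk_fast, p_long, long_cycles) / fixed_throughput
```

For instance, with a worst-case period of 1.0, a common-case period of 0.6, and 10% long operations, the average cost per operation is 0.6 × 1.1 = 0.66 time units, so the variable-latency unit comes out ahead. If every operation were long, the cost would be 1.2 time units and the unit would be slower than the fixed design.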
Razor logic and telescopic unit are discussed in detail next because they
are compatible with the popular synchronous design style. Later they will
be used as circuit-level error detection/correction mechanisms for BTWC
optimization.
2.6.1 Razor logic
As a result of the relaxation of design constraints for rarely exercised parts of
a circuit, the relaxed long paths may be subject to timing errors when they
are actually exercised. Therefore, an error detection mechanism is required
to capture such timing errors and recover from them. Razor
logic (RZ) [3] implements such a timing error detector using
a shadow latch. An ordinary flip-flop captures the data speculatively at
tclk regardless of whether the preceding combinational logic has finished computing,
while a shadow latch, clocked after an additional delay of trelax, captures the
output of the combinational logic again.
With the extra time margin of trelax, the shadow latch can guarantee the
correctness of its captured value, provided the logic settles within tclk + trelax.
Figure 2.5 demonstrates the concept of razor logic. Under normal
operation, where timing errors do not happen, the main flip-flop latches the
correct data. When a timing error takes place, the comparator detects the
mismatch between the main flip-flop and the shadow latch. The advantage of
razor logic is that in most cases (with probability P) the main flip-flop
latches the correct value. When an error is detected, auxiliary logic at the
architecture level initiates the
recovery phase by stopping the incorrectly latched value from propagating
into trailing pipeline stages.
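The detection mechanism can be mimicked with a purely behavioral model. The sketch below is not a circuit description: it assumes the combinational output switches from old_value to new_value exactly `delay` after the launching clock edge, ignores setup and hold overheads, and uses hypothetical names throughout.

```python
from dataclasses import dataclass


@dataclass
class RazorSample:
    main: int      # value captured by the main flip-flop at t_clk
    shadow: int    # value captured by the shadow latch at t_clk + t_relax
    error: bool    # comparator output: main and shadow disagree


def razor_capture(new_value, old_value, delay, t_clk, t_relax):
    """Behavioral model of one razor capture for a single bit."""
    if delay > t_clk + t_relax:
        # Outside the relaxed budget even the shadow latch would be wrong.
        raise ValueError("delay exceeds t_clk + t_relax")
    # The main flip-flop sees the new value only if it arrived in time.
    main = new_value if delay <= t_clk else old_value
    # The shadow latch samples later, so it always sees the settled value here.
    shadow = new_value
    return RazorSample(main=main, shadow=shadow, error=(main != shadow))
```

Note that a late path raises the error signal only when the stale and the settled values differ; a late transition to the same value is harmless in this model.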
[Figure 2.5 image: page excerpt from the Razor paper [3], showing its Figure 1, "Pipeline augmented with Razor latches and control lines". Panel (a): a Razor flip-flop between logic stages L1 and L2, consisting of a main flip-flop clocked by clk, a shadow latch clocked by the delayed clock clk_del, and a comparator that raises the Error signal. Panel (b): timing diagram over cycles 1 to 4 showing clock, clock_d, D, Error, and Q for instructions 1 and 2.]
Figure 2.5: Capture timing error with razor logic [3]
In the traditional design methodology, the delay of the combinational logic
needs to satisfy the timing requirement described in formula 2.4. In
this relaxed design, by contrast, the timing requirement for the logic cone that feeds into
the input of razor logic flip-flop can be relaxed up to t′maxdelay, such that:
t′maxdelay < tclk − toverhead + trelax (2.5)
Compared with the tmaxdelay in formula 2.4, the timing constraint of the
combinational logic is relaxed by an amount of trelax. This prevents the
logic synthesis or physical optimization tool from spending excessive
optimization effort on speeding up rarely exercised logic cones and incurring
unnecessary design overhead, such as logic duplication, gate upsizing, or low-Vt
assignment. Such design overhead can lead to larger area, higher dynamic and
leakage power, and lower reliability.
When a timing error is captured by the razor logic, error recovery actions
must be taken so that the erroneous data captured by the main flip-flop of
the razor logic does not impair the overall functionality. To recover from
a timing error, the design either recomputes in place by repeating the same
computation for one more cycle, or flushes the affected pipeline stages in
a pipelined design. Extra cycles may be needed for recovery. For a shallow
pipelined processor, such as the Leon3 processor, this recovery penalty r
can be 5 cycles. The throughput of using RZ can then be estimated as
follows:
$$TP_{RZ} = \frac{1}{t_{clk}} \left( P + \frac{1 - P}{5} \right) \qquad (2.6)$$
where tclk is the clock period, and P is the probability the circuit can produce
the correct results within tclk.
Because razor logic is easy to implement, it has received wide attention
from both academia and industry. Recent work by Intel [32] uses a variant
of razor logic, called embedded error-detection sequential (EDS) circuits,
to improve microprocessor performance and energy efficiency. In the
following chapters, razor logic is used as one way to detect and correct
errors.
2.6.2 Telescopic unit
Another circuit-level technique for tolerating timing errors is the
telescopic unit (TU). A telescopic unit is an attractive alternative
because it may significantly reduce the penalty of error recovery by
predicting timing errors before the errors actually propagate through the
flip-flop boundary [30]. The idea is based on the fact that, among all
possible input vectors, some produce the correct results in one cycle
(class C1) and some in two cycles (class C2). As shown in Figure 2.6, a
piece of auxiliary logic fh is created and connected to the circuit inputs.
fh is asserted when the input vector falls in class C2, indicating that a
two-cycle computation is in progress and that the result will be captured
one cycle later. Therefore, no timing error can propagate through the
flip-flops. The worst-case operation finishes in two cycles, resulting in a
penalty factor r = 2. The throughput of TU can be estimated as follows:
$$TP_{TU} = \frac{1}{t_{clk}} \left( P + \frac{1 - P}{2} \right) \qquad (2.7)$$
Figure 2.6: Telescopic unit [30]
Equations 2.6 and 2.7 differ only in the penalty factor r: RZ needs to
flush the processor pipeline and re-execute the erroneous instruction,
giving r = 5 for shallow pipelined processors, while TU needs only one
extra cycle to allow input vectors in class C2 to stabilize, giving r = 2.
Regardless of the exact value of r, the throughput of a circuit-level BTWC
design can be unified as follows:
$$TP = \frac{1}{t_{clk}} \left( P + \frac{1 - P}{r} \right) \qquad (2.8)$$
where the penalty factor is represented as a variable r to accommodate both
the RZ and TU cases. In the following chapters, Equation 2.8 is used as a
measure of the throughput of a circuit-level BTWC design. High throughput
TP is also an important optimization goal in this dissertation.
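As a quick numerical illustration of Equation 2.8, the sketch below evaluates the estimated throughput for the two penalty factors discussed above (r = 5 for RZ, r = 2 for TU). The function name `throughput` and the sample values of tclk and P are assumptions chosen for illustration only; they are not measurements from this work.

```python
def throughput(t_clk, p_correct, r):
    """Estimated throughput per Equation 2.8: (1 / t_clk) * (P + (1 - P) / r)."""
    if t_clk <= 0:
        raise ValueError("t_clk must be positive")
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must lie in [0, 1]")
    if r < 1:
        raise ValueError("penalty factor r must be at least 1")
    return (p_correct + (1.0 - p_correct) / r) / t_clk


# Hypothetical operating point: 1 ns clock period, 95% of cycles error-free.
T_CLK_NS = 1.0
P = 0.95

tp_rz = throughput(T_CLK_NS, P, r=5)  # razor logic, Equation 2.6
tp_tu = throughput(T_CLK_NS, P, r=2)  # telescopic unit, Equation 2.7
print(f"TP_RZ = {tp_rz:.3f} results/ns, TP_TU = {tp_tu:.3f} results/ns")
```

At the same tclk and P, the smaller penalty factor of TU yields the higher estimate; in practice the two schemes may reach different P at a given tclk, so the comparison only isolates the effect of r.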
2.7 New optimization dimension for BTWC designs
The novel BTWC design methodology allows designers to use a variety of
ways to mitigate the design and operation inefficiency associated with the
critical path wall. Section 2.4 describes micro-architecture-level error
detection and correction mechanisms to improve processor throughput.
Section 2.5 describes techniques that allow timing errors to propagate in
the system for noise-tolerant applications. Section 2.6 describes
circuit-level error detection and correction mechanisms to tolerate timing
errors. Though they use different approaches, they all allow a portion of
the circuit to be designed with relaxed constraints. The part of a design
that is relaxed must be rarely exercised; otherwise, frequent timing errors
can significantly degrade the overall performance.
Unfortunately, existing circuit optimization tools and methodologies are
not driven by BTWC requirements, and they can hardly optimize circuits
effectively for BTWC usage. In fact, traditional optimization techniques
are by nature opposed to the idea of BTWC design, because they inevitably
create critical path walls. Suppose we want to operate a BTWC circuit for
high throughput using timing speculation, over-clocking it at a t′clk that
is smaller than tclk. Because of the critical path wall, a large number of
timing paths have delays close to tclk, so a large number of them are about
equally likely to fail at t′clk. As a result, the error rate can be so high
that no throughput is gained.
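The effect of the critical path wall under over-clocking can be seen with a small count of violating paths. The two delay lists below are invented for illustration (they are not taken from any design in this work): one mimics a wall, with all delays packed just under tclk, and the other has the same worst-case delay but a wider spread.

```python
def failing_paths(path_delays_ps, t_clk_ps):
    """Return the delays of the paths that cannot finish within t_clk_ps."""
    return [d for d in path_delays_ps if d > t_clk_ps]


T_CLK_PS = 1000             # original cycle time (hypothetical)
T_CLK_OVERCLOCKED_PS = 950  # speculative cycle time t'clk (hypothetical)

# Hypothetical path delays in picoseconds.
wall_design = [990, 992, 995, 997, 998, 999, 1000, 1000]
spread_design = [600, 700, 780, 850, 900, 940, 980, 1000]

for name, delays in (("wall", wall_design), ("spread", spread_design)):
    late = failing_paths(delays, T_CLK_OVERCLOCKED_PS)
    print(f"{name}: {len(late)} of {len(delays)} paths exceed "
          f"{T_CLK_OVERCLOCKED_PS} ps")
```

Counting late paths is only a static view. Whether a late path actually produces errors depends on how often it is exercised, which is the dynamic information the following chapters set out to analyze.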
To fully exploit the potential benefit of BTWC designs, modifications to
the existing traditional methodology and invention of new circuit optimiza-
tion techniques are required. Among them, analyzing the error probability
and then engineering the error probability are essential needs [24]. These
requirements give rise to a brand new perspective for circuit and system
optimization. In the following chapters, novel methods for analyzing and
engineering the error probability will be described in detail. In the next
chapter, a new BTWC-aware CAD flow leveraging existing commercial CAD
tools will be presented.
CHAPTER 3
USE COMMERCIAL CAD TOOLS FOR
BTWC DESIGN
In this chapter, the BTWC design methodology is applied to a processor
design by leveraging existing commercial CAD tools. To accommodate the need
to manipulate error probabilities, the design flow is tailored to optimize
the processor components according to their exercise rates.
3.1 Microarchitectural-level timing speculation
As mentioned in Chapter 2, one way to exploit BTWC design methodology
is to use timing speculation (TS) for high throughput. The idea is to increase
the processor’s clock frequency to the point where timing faults begin to occur
and to equip the design with micro-architectural techniques for detecting and
correcting the resulting errors. A large number of proposals exist for TS
architectures (e.g., [2, 3, 19–22]). These proposals add a variety of hardware
modifications to a processor, such as enhanced latches, additional back-ends,
a checker module, or an additional core that works in a cooperative manner.
A limitation of current proposals is that they assume traditional design
methodologies, which are tuned for worst-case conditions and deliver sub-
optimal performance under TS. Specifically, existing methodologies strive to
eliminate slack from all timing paths in order to minimize power consump-
tion at the target frequency. Unfortunately, this creates a critical path wall
that impedes overclocking. If the clock frequency increases slightly beyond
the target frequency, the many paths that make up the wall quickly fail. The
error recovery penalty then quickly overwhelms any performance gains from
higher frequency.
The BlueShift work [4] proposes a novel approach in which the processor
itself is designed from the ground up for TS. The idea is to identify the
most frequently-exercised critical paths in the design and speed them up
enough so that the error rate grows much more slowly as frequency
increases. The majority of the static critical paths, which are rarely
exercised, are left unoptimized or even deoptimized, relying on the TS
microarchitecture to detect and correct the infrequent errors in them. In
other words, common cases are optimized, possibly at the expense of the
uncommon ones.
I also introduce a technique called path constraint tuning (PCT) which,
applied under BlueShift, improves processor performance. PCT targets the
paths that would cause the most frequent timing violations under TS, and
adds slack to them by applying strong timing constraints.
PCT is applied to modules of the OpenSPARC T1 processor. Compared to a
conventional TS design, PCT speeds up applications by an average of 6%
with an average processor power overhead of 23%, providing a way to speed
up logic modules that is orthogonal to voltage scaling.
3.2 Delay trading for timing speculation
The main idea of PCT is to perform delay trading within the existing
commercial CAD tool flow. Delay trading (Figure 3.1) slows down
infrequently-exercised paths and uses the resources saved in this way to
speed up frequently-exercised paths for a given design budget. This leads
to a lower limit frequency f′0 than that of the base design, f0, in
exchange for a higher frequency under TS. The environment for
variation-afflicted logic (EVAL) framework [33] also pointed out that the
error rate versus frequency curve can be reshaped.
[Figure 3.1 panels, each plotting PE against frequency: (a) Delay Trading, (b) Pruning, (c) Delay Scaling, (d) Targeted Acceleration]
Figure 3.1: Delay trading enhances TS by reshaping the PE(f) curve
Conventional design methods use timing analysis to identify the static
critical paths in the design. Since these paths would determine the cycle time,
they are then optimized to reduce their latency. The result of this process
is that designs end up having a critical path wall, described in Section 2.2,
where many paths have a latency equal to or only slightly below the clock
period.
A different design method for TS processors is proposed, where it is fine
if some paths take longer than the period. When these paths are exercised
and induce an error, a recovery mechanism is invoked. We call the paths
that take longer than the period overshooting paths. They are not critical
because they do not determine the period. However, they hurt performance
in proportion to how often they are exercised and cause errors.
Consequently, a key principle when designing processors for TS is that,
rather than working with static distributions of path delays, dynamic
distributions of path delays need to be engineered. Moreover, optimization
effort needs to focus on reducing the latency of the paths that overshoot
most frequently during execution. Finally, at the end of the optimization
there can still exist many unoptimized overshooting paths that are
exercised only infrequently; therefore, an error correction mechanism is
needed.
BlueShift is a design methodology for TS processors that uses these princi-
ples. In the following, I describe how BlueShift identifies dynamic overshoot-
ing paths and its iterative approach to optimization.
3.3 Identifying dynamic overshooting paths
BlueShift begins with a gate-level implementation of the circuit from a tra-
ditional design flow. A representative set of benchmarks is then executed on
a simulator of the circuit. At each cycle of the simulation, BlueShift looks
for latch inputs that change after the cycle has elapsed. Such endpoints are
referred to as overshooting. As an example, Figure 3.2 shows a circuit with
a target period of 500ps. The numbers on the nets represent their switching
times on a given cycle. Note that a net may switch more than once per cycle.
Since endpoints X and Y both transition after 500ps, they are designated as
overshooting for this cycle. Endpoint Z has completed all of its transitions
before 500ps, so it is non-overshooting for this cycle.
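The endpoint check described above can be sketched in a few lines. The switching times below are hypothetical values chosen to mirror the X, Y, Z example (they are not read from Figure 3.2), and the function name `overshooting_endpoints` is an assumption for illustration.

```python
def overshooting_endpoints(switch_times_ps, period_ps):
    """Return endpoints whose latest transition in this cycle is after the period.

    switch_times_ps maps an endpoint name to the list of times (in ps) at
    which that latch input switched during one simulated cycle.
    """
    return sorted(
        name
        for name, times in switch_times_ps.items()
        if times and max(times) > period_ps
    )


# Hypothetical switching times for one cycle; a net may switch more than once.
cycle = {"X": [430, 520], "Y": [510], "Z": [300, 460]}
print(overshooting_endpoints(cycle, period_ps=500))
```

Only the latest transition matters for the check, since an endpoint that switches several times still overshoots if any of its transitions lands after the period.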
Figure 3.2: Circuit annotated with net transition times, showing two
overshooting paths for this cycle
Once the overshooting endpoints for a cycle are known, BlueShift
determines the path of gates that produced their transitions. These are the
overshooting paths for the cycle, and are the objects on which any
optimization will operate. To identify these paths, BlueShift annotates all
nets with their transition times. It then backtraces from each overshooting
endpoint. As it backtraces from a net with transition time tn, it locates
the driving gate and its input whose transition at time ti caused the
change at tn. For example, in Figure 3.2, the algorithm backtraces from X
and finds the path b → c → e. Therefore, path b → c → e is overshooting for
the cycle shown.
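The backtracing step can be sketched as follows, under a deliberately simplified timing model that is an assumption of this example and not necessarily what BlueShift implements: each gate has a single fixed delay, and the causing input is the one that switched exactly one gate delay before the output did. The netlist, delays, and transition times are hypothetical and are not read from Figure 3.2.

```python
def backtrace(endpoint, t_end, drivers, transitions):
    """Follow the causing transitions back from an overshooting endpoint.

    drivers[net] = (gate_delay_ps, [input nets]) for the gate driving `net`;
    primary inputs have no entry in `drivers`.
    transitions[net] = set of times (ps) at which `net` switched this cycle.
    Returns the nets on the path, ordered from the earliest net to the endpoint.
    """
    path = [endpoint]
    net, t_n = endpoint, t_end
    while net in drivers:
        delay, inputs = drivers[net]
        t_i = t_n - delay
        # Causing input: the one that switched exactly one gate delay earlier.
        cause = next((i for i in inputs if t_i in transitions.get(i, ())), None)
        if cause is None:
            break  # no input explains this transition under the simple model
        path.append(cause)
        net, t_n = cause, t_i
    return list(reversed(path))


# Hypothetical netlist fragment and transition times (ps) for one cycle.
drivers = {"X": (10, ["e"]), "e": (75, ["c", "d"]), "c": (90, ["a", "b"])}
transitions = {"a": {120}, "b": {350}, "c": {440}, "d": {200},
               "e": {515}, "X": {525}}
print(" -> ".join(backtrace("X", 525, drivers, transitions)))
```

A real gate-level simulator would report rise and fall delays per pin, so matching a cause would need a lookup per input pin and edge direction instead of the single subtraction used here.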
For each path p in the circuit, the analysis creates a set of cycles D(p)
in which that path overshoots. If Ncycles is the number of simulated cycles,
we define the frequency of overshooting of path p as d(p) = |D(p)|/Ncycles.
Then, the rate of errors per cycle in the circuit (PE) is upper-bounded by
$$\min\left(1, \sum_{p} d(p)\right)$$
To reduce PE, BlueShift focuses on the paths with the highest frequency
of overshooting first. Once enough of these paths have been accelerated and
PE drops below a preset target, optimization is complete; the remaining
overshooting paths are ignored.
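The definitions of d(p) and the PE bound, together with the highest-frequency-first selection, can be written down directly. The sketch below assumes, as a simplification that is not claimed by the source, that accelerating a path removes its overshoots entirely; the path names and the per-cycle data are hypothetical.

```python
from collections import Counter


def overshoot_frequencies(overshoots_per_cycle):
    """d(p) = |D(p)| / Ncycles, given one set of overshooting paths per cycle."""
    n_cycles = len(overshoots_per_cycle)
    if n_cycles == 0:
        raise ValueError("at least one simulated cycle is required")
    counts = Counter(p for cycle in overshoots_per_cycle for p in set(cycle))
    return {p: c / n_cycles for p, c in counts.items()}


def pe_upper_bound(d):
    """Upper bound on the per-cycle error rate: min(1, sum over p of d(p))."""
    return min(1.0, sum(d.values()))


def paths_to_accelerate(d, pe_target):
    """Pick paths in decreasing d(p) until the bound on the rest is below target."""
    remaining = dict(d)
    chosen = []
    for p in sorted(d, key=d.get, reverse=True):
        if pe_upper_bound(remaining) < pe_target:
            break
        chosen.append(p)
        del remaining[p]  # simplification: an accelerated path stops overshooting
    return chosen


# Hypothetical profile of 10 simulated cycles.
cycles = [{"b>c>e"}, set(), {"b>c>e", "a>d"}, set(), {"b>c>e"},
          set(), set(), set(), set(), {"f>g"}]
d = overshoot_frequencies(cycles)
print(pe_upper_bound(d), paths_to_accelerate(d, pe_target=0.15))
```

The bound adds the individual frequencies, so it over-counts cycles in which several paths overshoot together; that is why it is an upper bound and not the error rate itself.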
3.4 Iterative optimization flow
BlueShift makes iterative optimizations to the design, addressing the paths
with the highest frequency of overshooting first. As the design is transformed,
new dynamic overshooting paths are generated and addressed in subsequent
iterations. This iterative process stops when PE falls below target. Figure 3.3
illustrates the full process. It takes as inputs an initial gate-level design and
the designer’s target speculative frequency and PE.
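The control structure of this iterative process can be sketched as a loop. Everything tool-specific is passed in as a callable, and all three callables (`physical_optimize`, `profile_paths`, `propose_changes`) are hypothetical stand-ins introduced for this example, not interfaces of BlueShift or of any commercial tool. The sketch also uses the upper bound on PE as the PE estimate, which is an assumption.

```python
def blueshift_style_loop(design, pe_target, physical_optimize, profile_paths,
                         propose_changes, top_k=1, max_iters=20):
    """Iterate until the estimated PE falls below pe_target.

    Hypothetical stand-ins for tool invocations:
      physical_optimize(design, changes) -> new design
      profile_paths(design)              -> {path: d(p)}
      propose_changes(paths)             -> changes for the next iteration
    """
    changes = []
    for iteration in range(1, max_iters + 1):
        design = physical_optimize(design, changes)
        d = profile_paths(design)
        pe = min(1.0, sum(d.values()))  # upper bound used as the PE estimate
        if pe < pe_target:
            return design, pe, iteration
        worst = sorted(d, key=d.get, reverse=True)[:top_k]
        changes = propose_changes(worst)
    raise RuntimeError("PE target not reached within max_iters")


# Toy model: a design is just {path: d(p)}; speeding a path up zeroes its d(p).
def toy_optimize(design, changes):
    return {p: (0.0 if p in changes else dp) for p, dp in design.items()}


def toy_profile(design):
    return {p: dp for p, dp in design.items() if dp > 0}


toy_design = {"p1": 0.20, "p2": 0.05, "p3": 0.01}
final_design, pe, iterations = blueshift_style_loop(
    toy_design, 0.02, toy_optimize, toy_profile, set)
print(final_design, pe, iterations)
```

The toy model never creates new overshooting paths, whereas the text notes that transforming a real design does; re-profiling on every iteration is what lets the loop pick those up.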
[Figure 3.3 flow: (1) physical-aware optimization of the initial netlist (restructuring, placement, clock tree synthesis, routing, leakage minimization), yielding the physical design; (2) select training benchmarks; (3) gate-level simulation of benchmarks 0 to n-1, yielding the path profile; (4) compute training set error rate; if PE < target, emit the final design; if PE > target, (5) speed up the paths with the highest frequency of overshooting and feed the design changes back to step 1]
*#)#$2. 233$!2,&#% '! #)&2),# ;? :(%,4%%#: () ?#,'(!) K0J5
)28#.- ;2$*#'#: E,,#.#$2'(!) 2): O#.2- ;$2:()*5 $#%3#,'("#.-0
P# :! )!' ,!)%(:#$ '#,&)(@4#% +!$ '&# !'&#$ 233$!2,&#% () 1(*6
4$# J 9#,24%# 2 '#,&)(@4# +!$ Q$4)()* /2% 2.$#2:- 3$!3!%#:
() RIIS 2): O#.2- ?,2.()* (% 2 :#*#)#$2'#5 .#%% #)#$*-6#+!,(#)'
"2$(2)' !+ ;2$*#'#: E,,#.#$2'(!) '&2' .2,D% 32'& '2$*#'()*0
!"$"# ,1=>'.)1? @'3'6&*+' A*):*1B C,@AD
A)6:#82): ?#.#,'("# >(2%()* <A?>= 233.(#% +!$/2$: 9!:- 9(6
2%()* <1>>= RJJS '! !)# !$ 8!$# !+ '&# *2'#% !+ #2,& !+ '&# 32'&%
/('& '&# &(*&#%' +$#@4#),- !+ !"#$%&!!'()*0 L2,& *2'# '&2' $#6
,#("#% 1>> %3##:% 435 $#:4,()* '&# 32'&G% +$#@4#),- !+ !"#$6
%&!!'()*0 P('& A?>5 /# 34%& '&# PE "#$%4% f ,4$"# 2% () 1(*6
4$# J<:=5 82D()* '&# 3$!,#%%!$ +2%'#$ 4):#$ ;?0 T!/#"#$5 9- 236
3.-()* 1>>5 /# 2.%! (),$#2%# '&# .#2D2*# 3!/#$ ,!)%48#:0
1(*4$# N<2= %&!/% &!/ A?> (% 233.(#:5 /&(.# 1(*4$# N<9=
%&!/% 3%#4:! ,!:# +!$ '&# 2.*!$('&8 !+ ?'#3 N () 1(*4$# F +!$
A?>0 ;&# 2.*!$('&8 '2D#% 2% ()34' 2 ,!)%'2)' ?5 /&(,& (% '&# +$2,6
'(!) !+ 2.. '&# :-)28(, !"#$%&!!'()* () '&# :#%(*) '&2' /(.. $#82()
4)62::$#%%#: 2+'#$ '&# 2.*!$('&8 !+ 1(*4$# N<9= ,!83.#'#%0
;&# 2.*!$('&8 3$!,##:% 2% +!..!/%0 E' 2)- '(8#5 '&# 2.*!$('&8
82()'2()% 2 %#' !+ 32'&% '&2' 2$# #.(*(9.# +!$ %3##:43 <Pelig=0 7)(6
'(2..-5 2' #)'$- '! ?'#3 N () 1(*4$# F5 U()# I !+ '&# 3%#4:! ,!:#
() 1(*4$# N<9= %#'% 2.. '&# :-)28(, !"#$%&!!'()* 32'&% <Poversh=
'! 9# #.(*(9.# +!$ %3##:430 V#M'5 () U()# J !+ 1(*4$# N<9=5 2 .!!3
9#*()% () /&(,& !)# *2'# /(.. 9# %#.#,'#: () #2,& ('#$2'(!) '! $#6
,#("# 1>>0 7) #2,& ('#$2'(!)5 /# %'2$' 9- ,!)%(:#$()* 2.. 32'&% p ()
Pelig /#(*&'#: 9- '&#($ +$#@4#),- !+ !"#$%&!!'()* d(p)0 P# 2.%!
:#!)# '&# /#(*&' !+ 2 *2'# g 2% '&# %48 !+ '&# /#(*&'% !+ 2.. '&#
32'&% () /&(,& (' 32$'(,(32'#% <paths(g)=0 ;&#)5 U()# K !+ 1(*6
4$# N<9= *$##:(.- %#.#,'% '&# *2'# <gsel= /('& '&# &(*&#%' /#(*&'0
U()# F $#8!"#% +$!8 Pelig 2.. '&# 32'&% () /&(,& '&# %#.#,'#: *2'#
32$'(,(32'#%0 V#M'5 U()# N 2::% '&# %#.#,'#: *2'# '! '&# %#' !+ *2'#%
'&2' /(.. $#,#("# 1>> <GFBB=0 1()2..-5 () U()# W5 '&# .!!3 '#$8(6
)2'#% /&#) '&# +$2,'(!) !+ 2.. '&# !$(*()2. :-)28(, !"#$%&!!'()*
'&2' $#82()% 4)62::$#%%#: (% )! &(*&#$ '&2) k0
217
Figure 3.3: The BlueShift optimization flow
At the head of the loop (Step 1), a physical-aware optimization flow takes
a list of design changes from the previous iteration and applies them as it
performs aggressive logical and physical optimizations. The output of Step
1 is a fully placed and routed physical design suitable for fabrication. Step
2 begins the embarrassingly-parallel profiling phase by selecting n training
benchmarks. In Step 3, one gate-level timing simulation is initiated for each
benchmark. Each simulation runs as many instructions as is economical
and then computes the frequencies of overshooting for all paths exercised
during the execution. Before Step 4, a global barrier waits for all of the
individual simulations to finish. Then, the overall frequency of overshooting
for each path is computed by averaging the measure for that path over the
individual simulation instances. BlueShift also computes the average PE
across all simulation instances.
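The aggregation performed after the global barrier can be sketched as follows. This is a minimal illustration, not BlueShift's actual implementation: the function name, the dictionary-based profile format, and the choice to count a path that a run never exercised as frequency 0 for that run are all assumptions made for the example.

```python
from collections import defaultdict


def aggregate_profiles(profiles):
    """Average per-path overshoot frequencies and PE over simulation runs.

    `profiles` holds one (path_freq, pe) pair per benchmark simulation, where
    path_freq maps a path name to its frequency of overshooting in that run
    and pe is that run's error rate per cycle. (Hypothetical data layout.)
    """
    if not profiles:
        raise ValueError("at least one simulation profile is required")

    totals = defaultdict(float)
    pe_total = 0.0
    for path_freq, pe in profiles:
        pe_total += pe
        for path, freq in path_freq.items():
            totals[path] += freq

    runs = len(profiles)
    # Assumption: a path missing from a run contributes 0 for that run.
    avg_freq = {path: total / runs for path, total in totals.items()}
    return avg_freq, pe_total / runs
```

The caller is assumed to invoke this only once every simulation has finished, which is the role of the barrier described above.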
BlueShift then performs the exit test. If PE is less than the designer’s
target, then optimization is complete; the physical design after Step 1 of
the current iteration is ready for production. As a final validation, BlueShift
executes another set of timing simulations using a different set of benchmarks
(the evaluation set) to produce the final PE versus frequency f curve. This
is the curve that we use to evaluate the design.
If, on the other hand, PE exceeds the target, we collect the set of paths with
the highest frequency of overshooting, and use an optimization technique to
generate a list of design changes to speed up these paths (Step 5). The PCT
optimization technique can be used to generate these changes, as described
next.
3.5 Improve performance with PCT
Path constraint tuning (PCT) applies stronger timing constraints on the
paths with the highest frequency of overshooting, at the expense of the timing
constraints on the other paths. The result is that, compared to the period T0
of a processor without TS at the limit frequency f0, the paths that initially
had the highest frequency of overshooting now take less than T0, while the
remaining ones take longer than T0. PCT improves the performance of the
common-case paths at the expense of the uncommon ones. With PCT, the
PE versus f curve changes as in Figure 3.1, making the processor faster
under TS, although slower if it were to run without TS.
Existing design tools can optimize slack for selected paths in many ways,
using (1) resynthesis [34], (2) technology mapping [35], (3) gate sizing [13],
(4) logic decomposition [36], and (5) dualVt assignment [16].
In BlueShift, these optimization techniques are applied implicitly by the
existing commercial CAD tool. The implementation of PCT is simplified
by the fact that existing design tools already implement these transformations.
However, they do all of their optimizations based on static path information.
Fortunately, they provide a way of specifying “timing overrides” that increase
or decrease the allowable delay of a specific path. PCT uses these timing
overrides to specify timing constraints equal to the speculative clock period
for paths with high frequency of overshooting, and longer constraints for the
rest of the paths.
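As a rough illustration of how such overrides might be produced, the sketch below emits one SDC-style set_max_delay line per path. The thesis does not name the command that carries the timing overrides, so the use of set_max_delay, the startpoint/endpoint naming, and the function and parameter names are all assumptions.

```python
def timing_overrides(paths, hot_paths, t_spec_ns, t_relaxed_ns):
    """Return one SDC-style max-delay override line per path.

    `paths` is an iterable of (startpoint, endpoint) name pairs and
    `hot_paths` is the subset with a high frequency of overshooting.
    Hot paths get the speculative period; all others get the longer one.
    (Hypothetical helper; names and path format are illustrative.)
    """
    hot = set(hot_paths)
    lines = []
    for start, end in paths:
        limit = t_spec_ns if (start, end) in hot else t_relaxed_ns
        lines.append(f"set_max_delay {limit:.3f} -from {start} -to {end}")
    return lines
```

For example, with a speculative period of 0.8 ns and a relaxed constraint of 1.2 ns, a hot path from r1/CK to r2/D would yield the line "set_max_delay 0.800 -from r1/CK -to r2/D".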
While traditional timing optimization is applied to all paths, BlueShift
PCT speeds up only selected paths. Speeding up a path has a power cost, which
may need to be recovered by slowing down another path. Intrinsically, in PCT
the optimization is controlled by a dynamic factor, such as how frequently
paths are exercised.
The task of Step 5 in Figure 3.3 for PCT is simply to generate a list
of timing constraints for a subset of the paths. These constraints will be
processed in Step 1. To understand the PCT algorithm, assume that the
designer has a target period with TS equal to Tts. In the first iteration of the
BlueShift framework of Figure 3.3, Step 1 assigns a relaxed timing constraint
to all paths. This constraint sets the path delays to r × Tts (where r is a
relaxation factor), making them even longer than a period that would be
reasonable without TS. When we get to Step 5, the algorithm first sorts
all paths in order of descending frequency of overshooting at Tts. Then, it
greedily selects paths from this list, leaving those whose combined frequency
of overshooting is less than the target PE. To these selected paths, it assigns
a timing constraint equal to Tts. Later, when the next iteration of Step 1
processes these constraints, it will ensure that these paths all fit within Tts,
possibly at the expense of slowing down the other paths.
At each successive iteration of BlueShift, Step 5 assigns the Tts timing con-
straint to those paths that account for a combined frequency of overshooting
greater than the target PE at Tts. Note that once a path is constrained,
that constraint persists for all future BlueShift iterations. Eventually, after
several iterations, a sufficient number of paths are constrained to meet the
target PE.
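The selection in Step 5 can be sketched as follows, assuming the per-path frequencies of overshooting at Tts are available as a dictionary. The function name and data layout are hypothetical, and the stopping rule is one reading of the description above: paths are taken in descending order of frequency until the paths left over account for less than the target PE.

```python
def pct_step5(overshoot_freq, target_pe, constrained=frozenset()):
    """Return the set of paths that should carry the Tts constraint.

    `overshoot_freq` maps each path to its frequency of overshooting at Tts.
    `constrained` holds paths constrained in earlier iterations; they persist.
    (Hypothetical sketch of the greedy selection, not the thesis code.)
    """
    selected = set(constrained)
    ranked = sorted(overshoot_freq, key=overshoot_freq.get, reverse=True)
    remaining = sum(overshoot_freq.values())
    for path in ranked:
        if remaining < target_pe:
            break  # the paths left over account for less than the target PE
        selected.add(path)
        remaining -= overshoot_freq[path]
    return selected
```

Passing the result of one call as `constrained` to the next call models the rule that a constraint, once assigned, persists across BlueShift iterations.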
3.6 Experimental results
To make the level of effort manageable, the analysis focuses on only a few
modules of the OpenSPARC T1 processor core [37]. The chosen modules are
sampled from throughout the pipeline, and are shown in Table 3.1. Taken
together, these modules provide a representative profile of the various pipeline
stages. For each module, the Stage column of the table shows where in the
pipeline (Fetch/Decode, EXEcute, or MEMory) the module resides. The
next two columns show the size in number of standard cells and the shortest
worst-case delay attained by the traditional CAD flow without using any
lowVt cells (which consume more power).
Table 3.1: OpenSPARC modules used to evaluate BlueShift
Module Name     Stage   Num. Cells   Tr (ns)   PE      Description
sparc exu       EXE     21,896       1.50      10^-4   Integer FU, control, bypass
lsu stb ctl     MEM     765          1.11      10^-5   Store buffer control
lsu qctl1       MEM     2,336        1.50      10^-5   Load/Store queue control
lsu dct1        MEM     3,682        1.00      10^-5   L1 D-cache control
sparc ifu dec   F/D     765          0.75      10^-5   Instruction decoder
sparc ifu fdp   F/D     7,434        0.94      10^-5   Fetch datapath and PC
sparc ifu fcl   F/D     2,299        0.96      10^-5   L1-cache and PC control
The PE column shows the per-module error rate targets under PCT. This
is the error rate that BlueShift will try to ensure for each module. Each
target gives the module a fair share of the total processor PE, roughly
according to its size. With these PE targets, when the
full pipeline is assembled (including modules not in the sample set), the
total processor PE will be roughly 10^-3 errors/cycle for PCT. The average
error recovery overhead for the razor-based design is set to 5 cycles.
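To put these numbers in perspective, the following sketch estimates the cost of recovery under a deliberately simple model. The model is an assumption made here for illustration and is not part of the evaluation: every erroneous cycle adds the full recovery penalty, and recoveries never overlap.

```python
def recovery_overhead(pe, penalty_cycles):
    """Extra cycles spent on recovery per useful cycle (assumed model)."""
    return pe * penalty_cycles


def relative_throughput(pe, penalty_cycles):
    """Throughput relative to an error-free run at the same clock frequency."""
    return 1.0 / (1.0 + recovery_overhead(pe, penalty_cycles))
```

With PE = 10^-3 errors/cycle and a 5-cycle penalty, this model gives an overhead of 0.005, that is, about 0.5% extra cycles.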
The largest and most complex module is ‘sparc exu’. It contains the integer
register file, the integer arithmetic and logic datapaths along with the address
generation, bypass and control logic. It also performs other control duties
including exception detection, save/restore control for the SPARC register
windows, and error detection and correction using ECC. This module alone
is larger than many lightweight embedded processor cores.
Using Synopsys Design Compiler 2007.03 and Cadence Encounter 6.2, we
perform full physical (placed and routed) implementations of the modules in
Table 3.1. The standard cell library is based on UMC 130nm technology [38].
To make the results more accurate for a near-future (e.g., 32nm) technology,
the cell leakage is scaled so that it accounts for ≈30% of the total power
consumption. The process has a 10% guardband to tolerate environmental
and process variations. This means that f0 = 1.10 × fr, where fr and f0 are
the rated and limit frequencies, respectively. The process also contains lowVt
gates that have a 10x higher leakage and a 20% lower delay than normal
gates [39]. These gates are available for assignment in high-performance
environments such as those with razor logic.
Table 3.2 lists the BlueShift parameters. In the razor+PCT experiments,
we add hold-time delay constraints to the paths to accommodate shadow
latches. Moreover, shadow latches are inserted wherever worst-case delays
exceed the speculative clock period. Each profiling phase (Step 2 of Fig-
ure 3.3) comprises a parallel run of 200 benchmark samples, each one running
for 25K cycles.
Table 3.2: PCT parameters
# Benchmarks run per iteration    200 (PCT)
# Cycles per benchmark            25K
r: PCT relaxation factor          1.5
We use the unmodified RTL sources from OpenSPARC, but we simplify
the physical design by modeling the register file and the 64-bit adder as black
boxes. In a real implementation, these components would be designed in full-
custom logic. We use timing information supplied with OpenSPARC to
build a detailed 900MHz black box timing model for the register file; then,
we use CACTI [40] to obtain an area estimate and build a realistic physical
footprint. The 64-bit adder is modeled on [41], and has a worst-case delay of
500ps.
The experiments use SPECint2006 applications as the training set in the
BlueShift flow (Steps 1-5 of Figure 3.3). After BlueShift terminates, we mea-
sure the error rate for each module using SPECint2000 applications as the
evaluation set. From the latter measurements, we construct a PE versus f
curve for each SPECint2000 application on each module. All PE measure-
ments are recorded in terms of the fraction of cycles on which at least one
latch receives the wrong value. This is an accurate strategy for the razor-
based evaluation.
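The measurement can be sketched as follows, assuming the simulator reports, for every cycle, which latches captured a wrong value. The input format and the function name are hypothetical.

```python
def error_rate(latch_errors_per_cycle):
    """Fraction of cycles in which at least one latch captured a wrong value.

    `latch_errors_per_cycle` holds, for each simulated cycle, the collection
    of latches that received a wrong value in that cycle (empty if none did).
    """
    cycles = list(latch_errors_per_cycle)
    if not cycles:
        raise ValueError("no cycles were simulated")
    # A cycle counts once, no matter how many latches failed in it.
    faulty = sum(1 for latches in cycles if latches)
    return faulty / len(cycles)
```

Counting each faulty cycle once, rather than each faulty latch, matches the per-cycle definition of PE used here.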
Circuit-level power estimation for the sample modules is done using Ca-
dence Encounter. We perform detailed capacitance extraction and then use
the tool’s default leakage and switching analysis. Table 3.3 shows the power
and energy data collected for each module.
Table 3.3: Power consumption and switching energy per cycle for each
module implementation
                  Razor base           Razor+PCT
Module            P (mW)    E (pJ)     P (mW)    E (pJ)
sparc exu         175.1     217.2      130.3     257.8
lsu stb ctl       4.3       6.0        3.9       9.5
lsu qctl1         12.4      18.8       15.4      35.9
lsu dct1          20.7      35.1       21.3      54.4
sparc ifu dec     5.9       3.9        5.1       5.3
sparc ifu fdp     36.1      119.6      30.0      146.6
sparc ifu fcl     16.5      15.7       10.6      24.3
Total             271.0     416.1      216.6     533.7
In razor base and razor+PCT, the fraction of lowVt gates is 11% and
5%, respectively. Additionally, the razor-based implementations incur power
overhead from razor itself. This overhead is more severe in razor+PCT than
in razor base for two reasons. First, any latch endpoint that can
exceed the speculative clock period requires a shadow latch. After PCT-induced
path relaxation, the probability of an endpoint having such a long path
increases, so more razor latches are required. Second, razor+PCT requires
more hold-time fixing. This is because we diverge slightly from the original
razor proposal [3] and assume that the shadow latches are clocked a constant
delay after the main edge rather than at a constant phase difference. With
PCT-induced path relaxation, the difference between the long and short path
delays increases, so the delay between the shadow and main latch clocks must
be increased. This requires more buffers on the short paths to guarantee
sufficient shadow hold time.
Two razor-based architectures are compared. razor base uses the razor
base module implementation, while razor+PCT uses the razor+PCT one
obtained by applying BlueShift with PCT. As before, razor+PCT runs at
the frequency given by the PE curves of the BlueShiftable components; then,
we apply traditional voltage scaling to the non-BlueShiftable components so
that they can catch up.
Figure 3.4 shows the speedup of the razor base and razor+PCT architectures
over a baseline without TS. Since these razor-based architectures
target high performance, they deliver higher speedups.
On average, razor+PCT’s performance is 6% higher than that of razor base.
This is the impact of BlueShift with PCT in this design, which is not negli-
gible considering that razor base was already designed for high performance.
[Figure 3.4 plot: speedup (% over the baseline) of Razor Base and Razor+PCT for bzip2, crafty, gap, gcc, gzip, mcf, parser, twolf, vortex, vpr, and their mean]
Figure 3.4: Performance difference of razor base and razor+PCT
configurations
Figure 3.5 shows the power consumed by the two processor configurations.
The power is broken down into the contributions of the non-BlueShiftable
and the BlueShiftable modules. On average, razor+PCT consumes 23% more
power than razor base. This is because it runs at a higher frequency, uses
a higher supply voltage for the non-BlueShiftable modules, and needs more
shadow latches and hold-time buffers.
[Figure 3.5 plot: power (W) of Razor Base and Razor+PCT for bzip2, crafty, gap, gcc, gzip, mcf, parser, twolf, vortex, vpr, and their mean, broken down into NonBS and BS module contributions]
Figure 3.5: Power difference of razor base and razor+PCT configurations
Given razor+PCT’s delivered speedup and power cost, BlueShift with PCT
is not compelling from a low power perspective. Instead, it is better viewed
as a technique to further speed up a high-performance design (at a power
cost) when conventional techniques such as voltage scaling or body biasing do
not provide further performance. Specifically, for logic (i.e., BlueShiftable)
modules, BlueShift with PCT provides an orthogonal means of improving
performance when further voltage scaling or body biasing becomes infeasible.
In this case, however, for the pipeline as a whole, non-BlueShiftable stages
remain a bottleneck that must be addressed using some other technique.
Although most modules of Table 3.1 were fully optimized with BlueShift
in one day on our 100-core cluster, the optimization of sparc exu took about
one week. Such long turnaround times during the timing closure process
would be unacceptable in industry. Fortunately, this implementation is only
a prototype, and drastic improvements in runtime are possible through CAD
tool innovation.
CHAPTER 4
OPTIMIZE THROUGHPUT WITH
DYNAMIC BEHAVIOR
From the initial BlueShift effort, we see that timing speculation is a
promising way to improve performance, but it also demands CAD tool
innovations to produce BTWC-friendly circuits.
To meet this new CAD demand, DynaTune is developed for BTWC
optimization.
4.1 Overview of DynaTune optimization
In DynaTune [18], a circuit optimization procedure is developed
that uses dualVt (threshold voltage) cells to optimize a circuit in such a way
that the most dynamically critical gates of a circuit are detected, analyzed,
and optimized for higher frequency and throughput. The novelties of
DynaTune include:
• A new way to represent the dynamic properties of circuits,
with dynamic behavior derived from the primary inputs’ static
probabilities, as will be described in detail in Section 4.2.
• An optimization engine that improves circuit frequency and
throughput using the dynamic behavior information, as will be
described in Section 4.7.
Two timing speculation schemes, razor logic (RZ) and telescopic unit
(TU), are evaluated before and after DynaTune optimization. The major
difference between razor logic and telescopic unit is in how they handle rare
cases: razor logic detects errors, while telescopic unit predicts possible
errors (Section 2.6). When an error is detected or predicted, the two schemes
handle it in different ways with different performance penalties. On the
other hand, they both try to speed up common cases and allow the delay of
rare-case computation to exceed one cycle.
To effectively apply the BTWC design methodology, a new circuit optimization
technique for timing speculation is required. It should be capable
of optimizing a circuit based on its dynamic behavior. It should be able
to selectively optimize the most dynamically critical gates of a circuit and
improve the circuit’s throughput.
4.2 Generalized throughput model
As pointed out in Section 2.6, the throughput of circuit-level BTWC designs
can be derived as follows:
$$TP = \frac{1}{t_{clk}} \cdot P + \frac{1}{t_{clk}} \cdot \frac{1 - P}{r} \qquad (4.1)$$
The first term, $\frac{1}{t_{clk}} \cdot P$, can be viewed as a primary
throughput, and the second term, $\frac{1}{t_{clk}} \cdot \frac{1 - P}{r}$,
can be viewed as a secondary throughput with a timing
speculation penalty factor r. We can observe several facts from Equation 4.1:
1. When P is close to 1, the primary throughput dominates the overall
throughput.
2. For a given tclk, the larger P , the higher the overall throughput.
3. For a traditional non-timing-speculative design, P = 1; however, its
throughput can be much lower than that of timing-speculative designs
because its tclk, set to the worst-case delay, can be much larger.
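Equation 4.1 can be evaluated directly. The sketch below is a minimal illustration; the function name is hypothetical, and the numeric values in the note that follows are made up solely to illustrate the third observation.

```python
def throughput(t_clk, p, r):
    """Equation 4.1: TP = (1/t_clk) * P + (1/t_clk) * (1 - P) / r.

    t_clk is the operating cycle time, p the probability of producing the
    correct result within t_clk, and r the timing speculation penalty factor.
    """
    if t_clk <= 0 or r <= 0 or not 0.0 <= p <= 1.0:
        raise ValueError("need t_clk > 0, r > 0 and 0 <= p <= 1")
    primary = p / t_clk                  # results that complete within t_clk
    secondary = (1.0 - p) / (r * t_clk)  # results that pay the penalty factor
    return primary + secondary
```

For instance, a non-speculative design with tclk = 2.0 and P = 1 yields TP = 0.5 results per time unit, while a hypothetical speculative design with tclk = 1.5, P = 0.95, and r = 2 yields about 0.65, even though 5% of its computations pay the penalty.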
Motivated by these observations, our algorithm, DynaTune, will try to
accelerate the part in a circuit that contributes to the primary throughput
by reducing tclk and increasing P to achieve higher overall throughput. Such
optimization will also reduce the chance of recovery and increase secondary
throughput as well.
Note that DynaTune is different from previous logic optimization tech-
niques used for TU, such as in [30,31], in two ways. The first is that previous
techniques focus on the error prediction logic ‘fh’ in Figure 2.6. In contrast,
DynaTune is directly applied on the ‘combinational logic block’ in Figure 2.6
to make it timing speculation friendly. Secondly, we extend the use of TU to-
wards microarchitecture optimization using Leon3 as our architectural model,
while previous works have not touched on such topics. Similarly, for the RZ
scheme, DynaTune can be applied on the combinational logic between razor
flip-flops. Therefore, a DynaTune optimized circuit can work with TU or RZ
for timing speculation.
For convenience, the definitions, terminology, and abbreviations used
in this thesis are summarized in Table 4.1.
Table 4.1: DynaTune terminology and abbreviations
Optimization methods
  syn         Static optimization with Synopsys Design Compiler (DC)
  dyn         DynaTune optimization
Timing speculation modes
  TU          Telescopic unit timing speculation
  RZ          Razor logic timing speculation
Throughput configurations
  TRA         Synopsys DC optimized circuit working in non-timing-speculative mode. The longest path L determines the cycle time.
  TU+syn      Apply TU directly on DC-optimized circuit
  RZ+syn      Apply RZ directly on DC-optimized circuit
  TU+dyn      Apply TU on DynaTune optimized circuit
  RZ+dyn      Apply RZ on DynaTune optimized circuit
Other terminology
  P(signal)   The static probability of a signal: the probability of observing logic 1 over a unit of time
  L(ps)       Longest path delay in picoseconds
  TP          Throughput
  tclk        Operating cycle time
  P           The probability that the circuit can produce the correct results within tclk
  F           Operating frequency
  NL          Number of lowVt cells in the design
  Gain        Gain in percentage over the performance of the circuit optimized by TRA
A behavior curve is a curve with axes tclk and P, where tclk is the operating
clock period and P is the probability that the circuit (or a primary output
(PO)) can produce correct results within tclk. By varying tclk, one can plot
all (tclk, P) pairs to get the behavior curve representing the dynamic behavior
of a circuit.
A throughput curve can be derived from the behavior curve using
Equation 4.1. And by varying tclk along the x axis, we can determine
$t^*_{clk}$ and its associated $P^*$ for the peak throughput. The pair
$(t^*_{clk}, P^*)$ will be called the peak operating point. In Figure 4.1,
both the behavior curve and its associated throughput curve have been plotted.
Pictorially, we can plot the throughput in million operations per second
(MOPS) versus the clock period in nanoseconds (ns), as shown in Figure 4.1.
As the clock period decreases, the throughput increases until it
reaches TPpeak, and then starts to decrease because the error recovery effect
kicks in.
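Given a sampled behavior curve, the peak operating point can be located by evaluating Equation 4.1 at every sample. This is a sketch under the assumption that the curve is available as a list of (tclk, P) pairs; the function name is hypothetical.

```python
def peak_operating_point(behavior_curve, r):
    """Return (t_clk*, P*, TP_peak) for a sampled behavior curve.

    `behavior_curve` is an iterable of (t_clk, P) samples and r is the
    timing speculation penalty factor of Equation 4.1.
    """
    best = None
    for t_clk, p in behavior_curve:
        tp = (p + (1.0 - p) / r) / t_clk  # Equation 4.1
        if best is None or tp > best[2]:
            best = (t_clk, p, tp)
    if best is None:
        raise ValueError("behavior curve has no samples")
    return best
```

Because only sampled points are compared, the result is the best sample rather than the exact peak; a finer sweep of tclk narrows the gap.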
[Figure 4.1 plot: behavior curve (probability of no timing error) and throughput curve (throughput in MOPS) versus clock period for a 32-bit CLA adder, annotated with the maximum throughput of TRA at the longest critical path delay and with the frequency gain and throughput gain from timing speculation]
Figure 4.1: Throughput gain from timing speculation
4.3 Motivating example of DynaTune optimization
Multiple threshold voltage (Vt) cell assignment is a popular technique used
for timing and power optimization. Fast lowVt cells are usually deployed
along critical paths to reduce delay while slow highVt cells are deployed on
non-critical paths to reduce leakage power. However, a critical path wall can
be formed by balancing all paths. The critical path wall created by such
timing optimization technique can defeat the purpose of timing speculation.
Therefore a new dynamic behavior-aware circuit optimization technique is
needed for the purpose of timing speculation.
Consider a circuit in Figure 4.2. Its function is to implement a multiplexer
controlled by signal sel :
o = (sel == 0)?(a|b) : (a|b′) (4.2)
Figure 4.2: Example circuit to illustrate problem of static optimization for
timing speculation
The delays for the highVt and lowVt types are labeled on top of each cell,
respectively. In the following discussion, we append the postfix “L” to a
gate name if the gate is assigned lowVt. Assume that initially all gates are
highVt and that the leakage power budget allows at most three cells to be
assigned to lowVt. Traditional optimization tries to optimize all paths by
assigning U5, U6 and U7 to lowVt. As a result, the output o of this optimized
circuit can become stable at time 1.2ns when sel = 0 via
{a→ U1→ U5L→ U7L→ o}
or become stable at 1.3 ns when sel = 1 via
{b→ U2→ U3→ U6L→ U7L→ o}
We consider the following two modes next:
1. TRA: This is a mode without timing speculation. The critical path
(tclk = 1.3ns) determines the minimal cycle time. Then
$$TP = \frac{1}{t_{clk}} = \frac{1000}{1.3} = 769.2\ \text{MOPS}$$
2. Timing speculation mode (TU+syn / RZ+syn): since the output can
only be stable at 1.2ns or 1.3ns, we can at most over-clock this circuit
with tclk = 1.2ns. Assume we observe P(sel) = 0.1 (the probability of
sensitizing the top path is 0.9 = 1 − 0.1), then the TP can be calculated
with Equation 4.1.
$$\text{TU+syn}: \quad TP = \frac{1000}{1.2} \left(0.9 + \frac{0.1}{2}\right) = 791.7\ \text{MOPS}$$
$$\text{RZ+syn}: \quad TP = \frac{1000}{1.2} \left(0.9 + \frac{0.1}{5}\right) = 766.7\ \text{MOPS}$$
When timing optimization is done in a balanced manner, even in the best
case where TU+syn combination is used, the throughput can only improve
by 3% over that of TRA.
On the contrary, with the knowledge of P (sel) = 0.1 a priori, DynaTune
can optimize the circuit in a probabilistic way to achieve a better solution
for timing speculation by assigning U1, U5, U7 to lowVt. It produces two
unbalanced paths:
• 1.05ns: {a→ U1L→ U5L→ U7L→ o}
• 1.45ns: {b→ U2→ U3→ U6→ U7L→ o}
If the circuit is clocked at tclk = 1.05ns, o will produce the correct result
whenever the top path is activated. The throughputs for TU+dyn and RZ+dyn
are 18% and 14% higher than TRA:
$$\text{TU+dyn}: \quad TP = \frac{1000}{1.05} \left(0.9 + \frac{0.1}{2}\right) = 904.8\ \text{MOPS} \quad (18\%\ \text{gain})$$
$$\text{RZ+dyn}: \quad TP = \frac{1000}{1.05} \left(0.9 + \frac{0.1}{5}\right) = 876.2\ \text{MOPS} \quad (14\%\ \text{gain})$$
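The arithmetic of this example can be reproduced with a short script. It is a sketch under the same assumptions as the text: the top path is sensitized with probability 0.9, the penalty factor is r = 2 for TU and r = 5 for RZ, and clock periods are in nanoseconds. The variable names are illustrative.

```python
def throughput_mops(t_clk_ns, p_correct, penalty):
    """TP = (1/t_clk) * (P + (1 - P) / r), in MOPS for t_clk in ns."""
    return (1000.0 / t_clk_ns) * (p_correct + (1.0 - p_correct) / penalty)


P_TOP = 0.9                      # 1 - P(sel): the short top path is exercised
PENALTY = {"TU": 2, "RZ": 5}     # recovery penalty factor r per scheme

tra = 1000.0 / 1.3               # no speculation: longest path sets the clock
results = {"TRA": tra}
for opt, t_clk in (("syn", 1.2), ("dyn", 1.05)):
    for scheme, r in PENALTY.items():
        results[f"{scheme}+{opt}"] = throughput_mops(t_clk, P_TOP, r)

for name, tp in results.items():
    gain = 100.0 * (tp / tra - 1.0)
    print(f"{name:7s} {tp:6.1f} MOPS  {gain:+5.1f}%")
```

The script makes the contrast explicit: on the balanced (syn) circuit, RZ speculation falls slightly below TRA, while on the unbalanced (dyn) circuit both schemes gain.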
The dynamic behavior driven optimization problem can be stated as fol-
lows: given an initial circuit implemented with highVt cells, a leakage power
budget, and the signal probabilities of primary inputs, how to assign lowVt
cells within the leakage power budget with the objective to obtain the highest
peak throughput through timing speculation?
4.4 DynaTune optimization
The main idea of DynaTune is to change the dynamic behavior of a circuit
by reducing the delay of long paths that are commonly exercised and ju-
diciously allowing some rarely exercised critical paths to fail the one cycle
timing requirement. A behavior curve is used as the guidance throughout
the DynaTune optimization.
In Figure 4.3, behavior curves and throughput curves for an adder optimized
by Synopsys DC and DynaTune are plotted (see Table 4.1 for definitions).
From top to bottom, the three horizontal lines indicate the peak throughput
of the dyn configuration and of the syn configuration, both using TU as the
timing speculation scheme, and of the TRA configuration. The lowest two
curves are the behavior curves of syn (marked with ■) and dyn (marked with
▲). DynaTune optimization can create a separation in the middle
(0.40ns-0.46ns) of the behavior curves by sacrificing some uncommonly
exercised critical paths (the longest critical path for syn is 0.52ns while
the longest path of dyn is 0.54ns).
The lifted probability for the common cases in the middle range of the
behavior curve can be translated into throughput gain according to Equa-
tion 4.1. As can be seen from the upper two throughput curves, both the
peak (marked with •) and the corresponding F of dyn are higher than those
of syn (marked with ×). In fact, in this example, dyn circuit can produce the
highest throughput (2160 MOPS). Throughput curves for RZ scheme can be
derived in the same way using Equation 4.1 with a penalty factor r = 5.
One can get the behavior curve through timed simulation, but that would
be very time-consuming. To avoid this, I propose to derive the behavior
curve using timed characteristic functions (TCF).
[Figure 4.3 plot: "Behavior curve and throughput for 32bit CLA adder". Horizontal axis: clock period; vertical axes: probability of no timing error and throughput (MOPS). Series: Max TP (TRA), Max TP (synopsys), Max TP (dynatune), TP curve (synopsys), TP curve (dynatune), behavior curve (synopsys), behavior curve (dynatune). Annotations: DynaTune exploits dynamic information to change the shape of the behavior curve (BC), allowing timing overshooting and lifting the BC in the middle part; the new TP curve computed from the DynaTuned BC has an increased peak throughput, giving extra throughput gain and extra frequency gain.]
Figure 4.3: Behavior curve lifting can provide extra throughput gain
4.5 Timed characteristic function
The timed characteristic function (TCF) was originally used in automatic test
pattern generation (ATPG) to find a test pattern that can sensitize an output
at a given time. For convenience, we adopt the earlier-timed TCF as proposed
in [42]. Similar to most previous work [18,42], the TCF is defined in floating
mode, where input vectors are applied at time t = 0 and, before time 0, the
inputs are treated as uninitialized.
Definition 1. A characteristic function CF (n = val) is a Boolean func-
tion that characterizes the set of input vectors that evaluate an output n to
value val, where val can be logic 0 or 1.
“Characterizes” means that the CF (n = val) evaluates to true for input
vectors if and only if these input vectors evaluate n to the desired value val.
With an additional temporal term t as the required arrival time (RAT) of n,
we have:
Definition 2. A timed characteristic function T (n, t−) is a Boolean
function that characterizes the set of input vectors that stabilizes an output
n no later than time t.
The goal of deriving TCF is to find an input vector which can sensitize
a circuit and make a given output stable before (after) a specified timing
requirement t. The local TCF for a gate output n is denoted as: T (n, t−).
T (n, t−) contains all input vectors which make the output n stable no later
than t. Local TCF at a given point can be derived recursively: starting
from node n and back-tracing the circuit in n’s fan-in cone until the primary
inputs are reached. Instead of using the TCF formulas from [42], where the
stable value (logic 0 or 1) is explicitly used as one parameter of the TCF, we
are only interested in the stable time for the gate output n, and the actual
stable value of n does not matter to us. Thus, we can derive the TCF in the
following forms:
The local T(c, t−) for an AND gate (c = a ∧ b) can be written as:
$$
\begin{aligned}
T(c, t^-) ={} & CF(a') * CF(b) * T(a, (t-d)^-) \; + \\
& CF(a) * CF(b') * T(b, (t-d)^-) \; + \\
& CF(a') * CF(b') * [\,T(a, (t-d)^-) + T(b, (t-d)^-)\,] \; + \\
& CF(a) * CF(b) * T(a, (t-d)^-) * T(b, (t-d)^-)
\end{aligned}
\tag{4.3}
$$
where CF (x) is the characteristic function for the output x. From the defini-
tion, a characteristic function CF (x) contains all primary input vectors that
make x stable at logic 1, while CF (x′) contains all primary input vectors that
eventually make x stable at logic 0. CF can be computed using BDD [43]
for each gate. In Equation 4.3, each term in the summation actually encodes
one possible input combination for the gate. For example, the first term
CF (a′) ∗ CF (b) ∗ T (a, (t− d)−)
encodes all primary input vectors which make output c stable at t− under
the condition a = 0 and b = 1. Since a has the controlling value of AND
gate (logic 0), if and only if a is stable at (t − d)−, output c can be stable
after a gate delay d, and then T (c, t−) is guaranteed. Similarly, the other
three input combinations are encoded in the other three terms, respectively.
As with the AND gate, T(c, t−) for an OR gate (c = a ∨ b) can be written in
a similar way:
$$
\begin{aligned}
T(c, t^-) ={} & CF(a') * CF(b) * T(b, (t-d)^-) \; + \\
& CF(a) * CF(b') * T(a, (t-d)^-) \; + \\
& CF(a') * CF(b') * T(a, (t-d)^-) * T(b, (t-d)^-) \; + \\
& CF(a) * CF(b) * [\,T(a, (t-d)^-) + T(b, (t-d)^-)\,]
\end{aligned}
\tag{4.4}
$$
The TCFs of complex cells can be written in the same manner by first de-
composing complex cells into simple gates.
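Equations 4.3 and 4.4 can be illustrated by representing each Boolean function as the explicit set of primary input vectors on which it is true, so that `*` becomes set intersection and `+` becomes set union. This is only a sketch for small examples: the thesis uses BDDs rather than explicit sets, and the function names and the "a on time, b late" scenario below are assumptions made for illustration.

```python
from itertools import product


def tcf_and(cf_a, cf_b, t_a, t_b, universe):
    """Equation 4.3 on sets: vectors for which c = a AND b is stable by t.

    cf_x: vectors that set x to logic 1; t_x: vectors for which x is
    stable by (t - d).
    """
    na, nb = universe - cf_a, universe - cf_b
    return ((na & cf_b & t_a) |            # a = 0 controls the gate
            (cf_a & nb & t_b) |            # b = 0 controls the gate
            (na & nb & (t_a | t_b)) |      # either controlling input suffices
            (cf_a & cf_b & t_a & t_b))     # both inputs must have settled


def tcf_or(cf_a, cf_b, t_a, t_b, universe):
    """Equation 4.4 on sets: vectors for which c = a OR b is stable by t."""
    na, nb = universe - cf_a, universe - cf_b
    return ((na & cf_b & t_b) |            # b = 1 controls the gate
            (cf_a & nb & t_a) |            # a = 1 controls the gate
            (na & nb & t_a & t_b) |        # both inputs must have settled
            (cf_a & cf_b & (t_a | t_b)))   # either controlling input suffices


universe = frozenset(product((0, 1), repeat=2))       # vectors (a, b)
cf_a = frozenset(v for v in universe if v[0] == 1)
cf_b = frozenset(v for v in universe if v[1] == 1)
t_a, t_b = universe, frozenset()   # hypothetical: a always on time, b late

and_ok = tcf_and(cf_a, cf_b, t_a, t_b, universe)
or_ok = tcf_or(cf_a, cf_b, t_a, t_b, universe)
print(sorted(and_ok))
print(sorted(or_ok))
```

In this scenario the AND gate settles in time exactly when a = 0 and the OR gate exactly when a = 1, that is, whenever the on-time input carries the controlling value.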
4.6 Dynamic behavior curve
The global TCF function for a whole circuit is defined as T(C, t−), which
contains all primary input vectors that make all outputs stable no later than
time t for the circuit C. We use a lower-case letter for a local TCF and an
upper-case letter for the global TCF. We can derive the global T(C, t−) in
the following manner. First, the early arrival time (min_AT) and the late
arrival time (max_AT) are calculated for every gate in circuit C. Then the
local TCF of each PO, as shown in Figure 4.4, is derived using the recursive
relations given in Section 4.5. The recursion terminates early at a gate g if
the timing requirement t either can never be satisfied (t < min_AT(g)) or can
always be satisfied (t ≥ max_AT(g)).
[Figure 4.4 content: summary of the characteristic function (CF(n) contains all input vectors that make n '1'; CF(n') contains all input vectors that make n '0'; both are functions of the PIs) and of the timed characteristic function (PIs are assumed applied and stable at time 0; the local TCF(n, t−) contains all input vectors that make output n stable, at 0 or 1, no later than t, and is derived recursively from the node's inputs). Circuit diagram: a real circuit with inputs a, b, c, d and outputs n and m, whose local TCF(n, t−) and TCF(m, t−) feed a 0ns virtual gate C that ANDs all POs to form the global TCF(C, t−).]
Figure 4.4: Local TCF and zero-delay virtual gate
Finally, the local TCFs of all POs are ANDed together to produce the
global TCF for the circuit, as shown in Figure 4.4. Note that the global TCF
is only used for deriving the behavior curve to guide DynaTune optimization.
Neither the TCF functions nor the global AND operation become part of the
final optimized circuit. The procedure for deriving the TCF is shown in
Algorithm 1.
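The recursive construction, including the two early-termination checks and the final AND over all POs, can be sketched as follows. This is not the thesis implementation: Boolean functions are stored as explicit sets of PI vectors instead of BDDs, the two-gate netlist is hypothetical, and all names (`GATES`, `arrival`, `tcf`, `global_tcf`) are illustrative.

```python
from functools import lru_cache
from itertools import product

# Hypothetical two-gate netlist: name -> (gate type, fan-ins, delay in ns).
GATES = {
    "n": ("AND", ("a", "b"), 1.0),
    "o": ("OR", ("n", "c"), 1.0),
}
PIS = ("a", "b", "c")
ALL = frozenset(product((0, 1), repeat=len(PIS)))   # every PI vector


def value(node, vec):
    """Logic value of a node under one primary-input vector."""
    if node in PIS:
        return vec[PIS.index(node)]
    kind, (x, y), _ = GATES[node]
    vx, vy = value(x, vec), value(y, vec)
    return vx & vy if kind == "AND" else vx | vy


def arrival(node, pick):
    """Structural arrival time: pick=min gives min_AT, pick=max gives max_AT."""
    if node in PIS:
        return 0.0
    _, fanin, delay = GATES[node]
    return delay + pick(arrival(x, pick) for x in fanin)


@lru_cache(maxsize=None)             # reuse a TCF(g, t) that already exists
def tcf(node, t):
    """Set of PI vectors that make `node` stable no later than t."""
    if t < arrival(node, min):
        return frozenset()           # plays the role of bdd_zero
    if t >= arrival(node, max):
        return ALL                   # plays the role of bdd_one
    kind, (a, b), delay = GATES[node]
    ta, tb = tcf(a, t - delay), tcf(b, t - delay)
    ca = frozenset(v for v in ALL if value(a, v))
    cb = frozenset(v for v in ALL if value(b, v))
    na, nb = ALL - ca, ALL - cb
    if kind == "AND":                # Equation 4.3
        return ((na & cb & ta) | (ca & nb & tb) |
                (na & nb & (ta | tb)) | (ca & cb & ta & tb))
    return ((na & cb & tb) | (ca & nb & ta) |     # Equation 4.4
            (na & nb & ta & tb) | (ca & cb & (ta | tb)))


def global_tcf(outputs, t):
    """AND (intersect) the local TCFs of all primary outputs."""
    result = ALL
    for po in outputs:
        result &= tcf(po, t)
    return result


early = global_tcf(("o",), 1.0)
print(sorted(early))
```

For this netlist, the output o can only be stable after one gate delay when input c = 1 directly controls the OR gate, so `early` contains exactly the vectors with c = 1.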
The probability of T (C, t−) being evaluated to logic 1 (the static probabil-
ity of T (C, t−)) is equivalent to the probability that the circuit can produce
all stable outputs no later than t. T (C, t−) is represented in a BDD as an
example shown in Figure 4.5.
The algorithm proposed in [44] is used to calculate the static probability
of T (C, t−) represented as a BDD. By varying the operating point t, the
Algorithm 1 TCF(g, t): Construct TCF recursively
Input: g : gate
       t : operating point (timing requirement)
Output: the top BDD node of TCF(g, t)
if TCF(g, t) exists then
    return TCF(g, t)
end if
if t < min_AT(g) then
    return bdd_zero
end if
if t ≥ max_AT(g) then
    return bdd_one
end if
if gateType(g) = AND then
    recursively call TCF using Equation 4.3
    return the top BDD node for the above TCF
end if
if gateType(g) = OR then
    recursively call TCF using Equation 4.4
    return the top BDD node for the above TCF
end if
if other types then
    . . .
    return the top BDD node for the above TCF
end if
[Figure 4.5 content: "Probability of circuit producing results with no timing error at cycle time T". The TCF is a function of the primary inputs and the cycle time t, TCF(C, t−) = f(a, sel, b, t), and can be represented as an ROBDD; for the example circuit, TCF(C, t−) at t = 1.55 is sel′ + a·sel. With P(a) = 0.5, P(b) = 0.5 and P(sel) = 0.1, Pr(sel′ + a·sel) = 0.95. The diagram shows the circuit with all inputs arriving at AT = 0 and the BDD over sel and a annotated with edge probabilities.]
Figure 4.5: BDD produced from timed characteristic function
desired (t, P ) pairs can be computed for the behavior curve, where P =
Prob(T (C, t−)).
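As a small illustration of how (t, P) pairs follow from the global TCF, the sketch below computes the static probability of a Boolean function by enumerating all input assignments with independent input probabilities. The enumeration is an assumption made for brevity (it is exponential in the number of inputs), whereas the thesis evaluates the probability on the BDD using the algorithm of [44]. The three TCFs and the input probabilities are those of the Figure 4.2 example discussed in Section 4.7.

```python
from itertools import product


def static_probability(func, input_probs):
    """Probability that func(assignment) is true for independent inputs.

    input_probs maps each input name to its probability of being logic 1.
    """
    names = list(input_probs)
    total = 0.0
    for bits in product((0, 1), repeat=len(names)):
        assignment = dict(zip(names, bits))
        if func(assignment):
            weight = 1.0
            for name, bit in assignment.items():
                p_one = input_probs[name]
                weight *= p_one if bit else 1.0 - p_one
            total += weight
    return total


probs = {"a": 0.5, "b": 0.5, "sel": 0.1}

# Global TCFs of the example circuit at three operating points (in ns).
tcfs = {
    1.60: lambda v: True,                                    # T(C, 1.6-) = 1
    1.55: lambda v: (not v["sel"]) or (v["a"] and v["sel"]),  # sel' + a*sel
    1.35: lambda v: False,                                   # empty set
}
curve = [(t, static_probability(f, probs)) for t, f in sorted(tcfs.items())]
print(curve)
```

Each entry of `curve` is one (t, P) point of the behavior curve; a denser curve is obtained by evaluating the TCF at more operating points.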
Figure 4.6 shows the derived behavior curves and the simulation data for a
32-bit carry-look-ahead (CLA) adder. The NCSim simulator from Cadence
is used to do timed simulation for this adder in floating mode as suggested
in Devadas et al. [45], under which the internal gates and nets take arbitrary
value ‘X’ before a single input vector is applied to settle their values. We
set the static probability for PIs as 0.5 and 0.1 for this study. Intuitively,
when the static probability is 0.5, the carryout signal has more chance to
propagate for a longer distance on the critical path along the carry chain,
thus making the whole circuit stable late. Conversely, when it is 0.1, there
is more chance the carry bit will be terminated early, enabling the circuit to
stabilize early. This is verified by the data shown in Figure 4.6. Also, the
derived behavior curves match the simulation data very closely.
[Figure 4.6 content: "DynaTune analyzes circuit dynamic behavior with behavior curve (BC)". Plot "Behavior curve matches simulation data": computed BC vs. simulation BC for a 32bit CLA adder. Horizontal axis: clock period (ns); vertical axis: probability of no timing error (0% to 100%). Series: Sim (0.5), Behavior curve (0.5), Sim (0.1), Behavior curve (0.1).]
Figure 4.6: Dynamic behavior curve for a 32bit adder
4.7 Stepwise optimization
DynaTune optimizes a circuit by iterating three steps:
1. Scan the current behavior curve.
2. At the operating point t where the behavior curve shows a noticeable
probability drop, DynaTune investigates the cells’ individual contribu-
tions to such a probability drop at t. The motivation behind this is
that we will use lowVt for those cells that cause non-trivial probability
drops so these cells can be sped up to work against the probability
drops to improve the overall throughput and frequency of the design.
3. A min-cut max-flow algorithm is used to find the cells mentioned above.
Given the behavior curve for C, an outer optimization loop scans the behavior
curve with a fixed interval ts, starting from one end of the operating range
(thigh = max_delay) and moving to the other end (tlow = max_delay/2). tlow is
set to max_delay/2 because TU needs to guarantee that all signal propagations
finish within at most 2 cycles when fh is asserted. At each scan step tcur,
an inner optimization loop, described in Algorithm 2, is called to
investigate individual cells’ contributions and to find the node cut-set at a
proper optimization point t. To this end, a loop is used to find a t at which
the probability drop on the behavior curve is larger than a threshold min_Δ
(min_Δ = 0.2 in our experiment). To save runtime, Algorithm 2 filters out two
kinds of cells through the filter_candidate procedure:
• cells that have been assigned lowVt, and
• cells whose local TCFs do not change when the circuit is clocked at tcur
and t.
This is because cells with no local TCF change do not contribute to the
probability drop. After a flow network is constructed from the remaining
cells by the procedure create_network, a cell’s individual contribution in
this network is investigated by setting it to lowVt temporarily and updating
the local TCFs of the affected fan-in and fan-out cones. The probability
difference on the behavior curve between the two cases (i.e., before and
after setting lowVt) is then recorded as the cell’s probability contribution.
We perform this evaluation for each cell in the network. Next, the capacity
of each cell (i.e., a network node) is set using the inverse of the square
root of its contribution. Then a mincut procedure is called to find a minimal
cut.
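The capacity assignment and the cut selection can be illustrated on a tiny network. The sketch below is not the min-cut max-flow procedure used by DynaTune: it swaps in a brute-force search over node subsets, which is only practical for a handful of cells. The network topology, the contribution values for U3 and U6, and the names `s` and `t` for the source and sink are assumptions made for illustration; cells with zero contribution are assumed to have been filtered out beforehand.

```python
from itertools import combinations
from math import sqrt


def capacity(contribution, scale=100.0):
    """Node capacity: scaled inverse square root of a cell's contribution."""
    return scale / sqrt(contribution)


def reaches(edges, removed, src, dst):
    """Depth-first search from src to dst that skips removed cells."""
    stack, seen = [src], {src}
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for nxt in edges.get(node, ()):
            if nxt not in seen and nxt not in removed:
                seen.add(nxt)
                stack.append(nxt)
    return False


def min_node_cut(edges, caps, src, dst):
    """Cheapest set of cells whose removal disconnects src from dst.

    Brute force over all subsets (a stand-in for max-flow min-cut).
    """
    best, best_cost = None, float("inf")
    cells = list(caps)
    for k in range(1, len(cells) + 1):
        for subset in combinations(cells, k):
            cost = sum(caps[c] for c in subset)
            if cost < best_cost and not reaches(edges, set(subset), src, dst):
                best, best_cost = set(subset), cost
    return best, best_cost


# Hypothetical flow network: two paths that merge at U7.
edges = {"s": ["U1", "U3"], "U1": ["U5"], "U5": ["U7"],
         "U3": ["U6"], "U6": ["U7"], "U7": ["t"]}
# Illustrative probability contributions (not taken from measurements).
contri = {"U1": 0.9, "U5": 0.9, "U7": 0.95, "U3": 0.05, "U6": 0.05}
caps = {cell: capacity(c) for cell, c in contri.items()}

cut, cost = min_node_cut(edges, caps, "s", "t")
print(cut, round(cost, 1))
```

Because a larger contribution gives a smaller capacity, the cheapest cut gravitates toward the cells whose speed-up recovers the most probability, here the shared cell U7.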
If the mincut procedure returns a non-empty node cut-set (a set of
candidates that most likely caused the probability drop), the lowVt_assign
procedure assigns lowVt cells next. This is done by comparing a candidate
cell’s probability contribution to a filtering threshold α ∗ avg_contri. Only
cells with a contribution larger than this threshold are assigned lowVt.
Here, α is an empirically determined value (0.25 in this case), and
avg_contri is calculated as a moving average of the probability contributions
of all lowVt cells assigned by DynaTune so far. This filtering is important
because it picks only the cells that are on commonly exercised critical paths
for lowVt assignment and allows rarely exercised paths to go beyond the
one-cycle timing requirement. It is not worth spending the limited lowVt
budget on cells that contribute insignificantly even if they are in the
cut-set; filtering them out saves leakage power budget for cells that could
contribute more in later iterations.
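The filtering step can be sketched as a small class. Several details here are assumptions rather than facts from the thesis: the moving average is taken to be the plain mean over all cells assigned so far, the threshold is computed before the newly chosen cells update the average, and the contribution values used for U5 and U6 in the second cut are hypothetical. The class and method names are illustrative.

```python
class LowVtAssigner:
    """Keeps the contributions of assigned cells and filters cut-sets."""

    def __init__(self, alpha=0.25):
        self.alpha = alpha
        self.assigned = {}           # cell name -> probability contribution

    @property
    def avg_contri(self):
        """Mean contribution of all cells assigned so far (0 if none)."""
        if not self.assigned:
            return 0.0
        return sum(self.assigned.values()) / len(self.assigned)

    def assign(self, cut_set):
        """Assign lowVt to cut-set cells whose contribution exceeds
        alpha * avg_contri; return the chosen cells."""
        threshold = self.alpha * self.avg_contri
        chosen = [c for c, contri in cut_set.items() if contri > threshold]
        for cell in chosen:          # the average is updated after filtering
            self.assigned[cell] = cut_set[cell]
        return chosen


assigner = LowVtAssigner(alpha=0.25)
first = assigner.assign({"U7": 0.95})             # threshold is 0 at the start
second = assigner.assign({"U5": 0.9, "U6": 0.05})  # threshold 0.25 * 0.95
print(first, second, round(assigner.avg_contri, 3))
```

With α = 0 the threshold is always zero, so every cut-set cell with a positive contribution is accepted, which corresponds to optimizing all paths regardless of how often they are exercised.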
This procedure is illustrated in Figure 4.7, where the solid circles represent
cells assigned lowVt. On one cut, one cell is filtered out because of its small
contribution. Note that when α is set to zero (no filtering), DynaTune will
degenerate into a traditional timing optimization technique which tries to
reduce delays for all paths.
[Figure 4.7 content: "Each DynaTune optimization step". In each iteration: (1) estimate each gate's contribution to the change in the global no-error probability using the TCF; (2) find a set of candidates for lowVt assignment by max-flow min-cut; (3) filter out candidates with low contribution according to a moving-average threshold; (4) assign gates to lowVt; (5) update the threshold. The diagram shows a node cut-set being found on the flow network.]
Figure 4.7: Dynamic stepwise optimization using min-cut
After these steps, timing information as well as the cost of leakage power
are updated. The outer loop continues to move to a lower operating point
tcur = tcur − ts until it reaches tlow. The optimization loop also ends when
the leakage power cost exceeds the leakage power budget.
Algorithm 2 stepwise: DynaTune stepwise optimization
Input: C : a circuit netlist
       tcur : operating point (timing requirement)
Output: {lowVt cells}: a set of cells for lowVt assignment
p = probability(C, tcur);
t′ = tcur; Δ = 0;
while Δ < min_Δ do
    t′ = t′ − L/100;
    Δ = p − probability(C, t′);
end while
{candidates} = filter_candidate(C);
network = create_network({candidates});
for all n in {network} do
    contri = contribution(C, n, t′);
    capacity = 100/sqrt(contri);
    assign_capacity(n, capacity);
end for
{candidate cells} = mincut(network);
{lowVt cells} = lowVt_assign({candidate cells}, α ∗ avg_contri);
update(avg_contri);
return {lowVt cells}
The circuit in Figure 4.2 can be used as an example to explain the al-
gorithm. Initially, all cells are set as highVt cells. Two delay values are
shown above each cell for highVt / lowVt in unit of nanosecond. Assume
that independent inputs a, b and sel have static probabilities of P (a) = 0.5,
P (b) = 0.5 and P (sel) = 0.1. When all cells are highVt, the critical path
delay is 1.6ns, consisting of {b → U2 → U3 → U6 → U7 → o}. Let us
compute the behavior curve using all highVt cells.
Assume the first operating point of interest is t=1.6ns. Since this is the
critical path delay, Algorithm 1 terminates early for all POs with the return
value bdd_one (logic 1 in the BDD package). Then, the global T(C, 1.6−) = 1,
indicating that all input vectors stabilize the circuit by 1.6ns. Thus we get a
(T, P) pair (1.6, 1).
Assume later on we reach t=1.55ns. Starting from output o with timing
requirement t=1.55ns, one branch for tracing towards the PI in Algorithm 1
is
{o→ U7→ U5→ ....}
When we reach U5 along this path, we find that its late arrival time
max_AT(U5) = 1.0ns is smaller than its required arrival time of 1.05ns. This
branch of the Algorithm 1 recursion can then be terminated early with a
return value of bdd_one, meaning that any controlling value of U7 (logic 1 in
this case) produced by U5 can make o stable by t=1.55ns. This condition is
((a + b) × sel′). Similarly, after tracing back along the other branches and
operating on those local TCFs, the global T(C, 1.55−) produced by DynaTune
is (sel′ + a × sel), meaning that whenever (sel′ + a × sel) is true, o
becomes stable no later than 1.55ns. Since P(sel′ + a × sel) = 0.9 + 0.5 ×
0.1 = 0.95, the associated P for 1.55ns is 0.95. In this way, we get another
(T, P) pair (1.55, 0.95) on the behavior curve.
Assume the next operating point of interest is t=1.35ns. Algorithm 1
terminates early with bdd_zero when back-tracing reaches U1, because
min_AT(U1) (0.5ns) is greater than the RAT of U1 (0.35ns = 1.35 − 0.5 − 0.5).
Similarly, it also terminates early at U3. In fact the global
T (C, 1.35−) = sel × sel′ = ∅
meaning no input vector can make o stable by 1.35ns. Thus a (T, P ) pair
of (1.35, 0) is produced. By varying t, we can derive the complete behavior
curve.
Now, let us see how Algorithm 2 works. At t=1.35ns, we have a significant
probability drop (bigger than min_Δ), so we need to recover from this
probability drop by turning some critical cells into faster lowVt cells.
Because turning U2 or U4 into lowVt makes no difference to the overall
probability drop, the procedure filter_candidate filters them out first. Then
a flow network (dashed curves in Figure 4.2) is constructed. Initially, every
cell is in highVt.
Turning any gate on the top path (delay of 1.5ns):
{a→ U1→ U5→ U7→ o}
to lowVt can reduce this path delay to 1.35ns. In addition, turning U7 into
lowVt can also make the output o stable by 1.35ns whenever the path {a→
U3 → U6 → U7} is activated with the condition (a × sel = true) (with a
probability of 0.05). Combining with the probability 0.9 of activating the top
path, turning U7 to lowVt makes the largest probability contribution of 0.95.
Since the capacity in the network flow is the inverse of the square root of
the contribution, U7 will have the smallest capacity. Because none of the
cells has been assigned lowVt so far, avg_contri is 0. Then {U7} is returned
by Algorithm 2 and assigned to lowVt, and avg_contri is updated.
Later on, the flow network will exclude U7 because it has already been
assigned lowVt. A cut-set {U5, U6} will be returned by mincut, since
signal sel rarely activates the path with P(sel) = 0.1:
{b→ U2→ U3→ U6→ U7→ o}
U6 will be filtered out by lowVt_assign when its probability contribution
is compared with α × avg_contri, that is (0.25 × 0.95). At the end of the
DynaTune optimization, U1, U5 and U7 will eventually be assigned lowVt.
4.8 Experimental results
DynaTune was implemented in SIS [46]. Following the TSMC reference de-
sign flow [47], the baseline dualVt benchmark circuits are compiled by Syn-
opsys Design Compiler (DC) with a TSMC 65nm library (TRA). (Refer to
Table 4.1 for various definitions used in this subsection.) The same cir-
cuits were used as inputs to DynaTune, but with all lowVt cells converted to
highVt. DynaTune then optimizes the circuit and selectively turns cells to
lowVt within the same leakage power budget used by Synopsys DC. We ap-
ply DynaTune on Leon3 modules and MCNC benchmark circuits and report
various comparison results.
4.8.1 DynaTune on Leon3 processor
A benchmark suite is compiled, including a variety of applications:
• three sorting programs (quick, bubble, tree);
• two matrix multiplication programs (intmm, mm);
• a FFT application (oscar);
• a permutation program (perm); and
• several puzzle programs (tower, puzzle and queens).
One application is picked from each category to form the training set, consisting
of quick, intmm, perm, oscar, and queens. These five training programs
are run on a Leon3 processor at the RTL level with the Cadence simulation tool
NCSim to characterize the initial static probabilities of the PIs. Then DynaTune
optimizes these modules based on the extracted static probability informa-
tion. All ten benchmark programs are then used as the testing set on both
DC optimized circuit (syn) and DynaTune optimized circuit (dyn) to collect
performance results.
This study focuses on the logic surrounding the execution unit. Two
pipeline stages of the Leon3 - register access and execution - are shown in
Figure 4.8. The ALU performs addition/subtraction and a variety of logic
operations. In the register access stage, the most time consuming combi-
national logic parts are the two lanes of bypassing logic. These two lanes
of bypassing logic have identical physical design. Functionally, the bypassing
module chooses one signal to output from the following 7 possible sources:
wr write-back register
im immediate data
rfd register file output data
ed execution data
xd exception data
md memory data
zero all ‘0’
We evaluate two aspects of the experimental data. Firstly, we allow individual
modules to run at their individual peak throughput points to study the
full effect of DynaTune. As a result, each individual module may report a
different operating frequency F. Secondly, synchronous design methodology
requires that every pipeline stage run at the same frequency. As a
result, the frequency of the most critical module (the minimum frequency
value among all the modules) will determine the speed and the throughput
of the entire processor.
Figure 4.8: Register access and execution stages of Leon3 processor
A case study is shown in Figure 4.9 for the ‘bypass2’ lane. The static
probabilities are extracted from the training set. It is obvious that all appli-
cations tend to use im and rfd as the source for operand2. This application-
independent behavior bias of a circuit can be viewed as common-case prop-
erties, which provide the opportunity to optimize a module based on the
dynamic behavior. Each bar represents the longest path delay from a source
to the output.
The timing results of both syn and dyn circuits are reported in Figure 4.10
as bars to indicate the longest path delay from inputs to the common output.
The longest path of the syn circuit (optimized with DC) is 340ps (from
ed/wr to output), equivalent to a throughput of 2941 MOPS for a TRA
configuration. If TU is directly applied on this circuit (TU+syn), the peak
throughput can be achieved at tclk=315ps, and computation can finish in one
cycle whenever im, rfd, zero, md, and xd are selected with a total P=0.94.
The overall peak throughput is 3079 MOPS. This is merely 5% higher than
the peak throughput offered by the TRA configuration.
Figure 4.9: Delay distribution for operand2 bypassing lane. Bar labels give the average probability of each source channel being selected over all benchmark programs: im 0.4099, rfd 0.3828, zero 0.1102, ed 0.0341, wr 0.0271, md 0.0252, xd 0.0105.
In contrast, as shown in Figure 4.10, DynaTune (TU+dyn) can significantly
boost the throughput by selectively reducing the path delay from
the commonly used sources, i.e., im, rfd, and zero, while sacrificing the delay
reduction on rarely exercised paths. When clocked at 270ps, the output
becomes stable whenever these sources are selected, with a total P=0.903. As a
result, the overall peak throughput is 3524 MOPS, which is 20% higher than
the peak throughput of (TRA).
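The two stabilization probabilities quoted in this case study can be cross-checked against the per-source select probabilities labeled on the bars of Figure 4.9. The sketch below simply sums the probabilities of the sources that the text says finish within one cycle. The dictionary and helper names are illustrative, and the mapping of bar labels to channels is assumed to follow the order shown on the figure axis.

```python
# Average select probabilities per source channel, as labeled in Figure 4.9.
SELECT_PROB = {
    "im": 0.4099, "rfd": 0.3828, "zero": 0.1102,
    "ed": 0.0341, "wr": 0.0271, "md": 0.0252, "xd": 0.0105,
}


def one_cycle_probability(fast_sources):
    """Probability that the selected source is one whose path meets the clock period."""
    return sum(SELECT_PROB[s] for s in fast_sources)


# TU+syn at 315 ps: every source except ed and wr finishes in one cycle.
p_syn = one_cycle_probability(["im", "rfd", "zero", "md", "xd"])
# TU+dyn at 270 ps: only the three commonly used sources finish in one cycle.
p_dyn = one_cycle_probability(["im", "rfd", "zero"])
print(round(p_syn, 3), round(p_dyn, 3))
```

The two sums come out at roughly 0.939 and 0.903, in line with the P=0.94 and P=0.903 reported for TU+syn and TU+dyn.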
Tables 4.2-4.6 report the details of the experimental results for Leon3’s
sub-modules. Table 4.2 shows the traditional design methodology (TRA)
performance without timing speculation. Comparing Table 4.4 with Table 4.2
and Table 4.3, on average, DynaTune (RZ+dyn) offers 13% and 9% better
throughput over (TRA) and (RZ+syn), respectively. Comparing Table 4.6
with Table 4.2 and Table 4.5, on average DynaTune (TU+dyn) offers 20%
and 13% better throughput over (TRA) and (TU+syn), respectively. t∗clk, P∗,
and F∗ are the tclk, P, and F values under the peak throughput condition.
In general, using TU can produce higher peak throughput than RZ because
of its smaller penalty factor.
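The throughput entries in Tables 4.3-4.6 appear to be consistent with a simple model in which a cycle that stabilizes in time contributes one operation and a failing cycle contributes 1/k of an operation, where k is the penalty factor. The sketch below is an empirical fit inferred from the table numbers, not the definition used in Chapter 4; the function name and the penalty values k = 2 (TU) and k = 5 (RZ) are assumptions chosen because they reproduce the reported values to within rounding.

```python
def peak_throughput_mops(t_clk_ps, p_ok, penalty):
    """Assumed model: TP = F * (P + (1 - P) / penalty), with F in MHz."""
    f_mhz = 1e6 / t_clk_ps  # clock period in ps -> frequency in MHz
    return f_mhz * (p_ok + (1.0 - p_ok) / penalty)


TU_PENALTY = 2  # assumption inferred from the TU tables
RZ_PENALTY = 5  # assumption inferred from the RZ tables

# op2 bypassing after DynaTune: T* = 270 ps, P* = 0.903
print(round(peak_throughput_mops(270, 0.903, TU_PENALTY)))
print(round(peak_throughput_mops(270, 0.903, RZ_PENALTY)))
```

Under this assumed model, the smaller penalty of TU directly explains why TU+dyn reports a higher peak throughput than RZ+dyn at the same t∗clk and P∗.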
Figure 4.10: Delay distribution for operand2 bypassing lane after DynaTune optimization
The secondary effect of DynaTune is to enable the critical module to be
over-clocked and run faster. This can make certain critical modules no longer
the performance bottleneck of the whole design. As shown in Tables 4.3-4.6,
the most time-consuming module in the Leon3 execution stage is the adder.
Under the traditional design methodology, the delay of the adder determines the
cycle time and sets an upper limit of 1942 MHz for the whole design. After
applying DynaTune, this upper limit is raised by 36% to 2632 MHz, and
the throughput gains for TU+dyn and RZ+dyn are 25% and 19% for the
adder, respectively. In other words, we manage to transform 70% (53%)
of the frequency increase into real throughput gain for TU+dyn (RZ+dyn). Note that because the
processor is a synchronous design, with timing speculation the final frequency
of the processor can break the 1942 MHz barrier and be clocked up to 2632
MHz.
Table 4.2: Three function units in Leon3 synthesized with Synopsys Design
Compiler
Modules          TRA
                 T(ps)  F(MHz)  TP(MOPS)
adder32x32       515    1942    1942
op1 bypassing    340    2941    2941
op2 bypassing    340    2941    2941
Table 4.3: Three function units in Leon3 w/o DynaTune optimization
running in RZ mode
Modules          RZ+syn
                 L(ps)  T∗(ps)  F∗(MHz)  ΔF∗   P∗     TP∗   ΔTP
adder32x32       515    465     2151     11%   0.999  2149  11%
op1 bypassing    340    340     2941     0%    1.000  2941  0%
op2 bypassing    340    315     3175     8%    0.940  3022  3%
Average                                  6%                 4%
Table 4.4: Three function units in Leon3 w/ DynaTune optimization
running in RZ mode
Modules          RZ+dyn
                 L(ps)  T∗(ps)  F∗(MHz)  ΔF∗   P∗     TP∗   ΔTP
adder32x32       555    380     2632     36%   0.847  2309  19%
op1 bypassing    355    260     3846     31%   0.754  3089  5%
op2 bypassing    345    270     3704     26%   0.903  3416  16%
Average                                  31%                13%
Table 4.5: Three function units in Leon3 w/o DynaTune optimization
running in TU mode
Modules          TU+syn
                 L(ps)  T∗(ps)  F∗(MHz)  ΔF∗   P∗     TP∗   ΔTP
adder32x32       515    414     2415     24%   0.804  2179  12%
op1 bypassing    340    305     3279     11%   0.844  3023  3%
op2 bypassing    340    315     3175     8%    0.940  3079  5%
Average                                  15%                7%
Table 4.6: Three function units in Leon3 w/ DynaTune optimization
running in TU mode
Modules          TU+dyn
                 L(ps)  T∗(ps)  F∗(MHz)  ΔF∗   P∗     TP∗   ΔTP
adder32x32       555    380     2632     36%   0.847  2430  25%
op1 bypassing    355    260     3846     31%   0.754  3373  15%
op2 bypassing    345    270     3704     26%   0.903  3524  20%
Average                                  31%                20%
CHAPTER 5
SCALABLE DYNAMIC BEHAVIOR
ANALYSIS WITH TTDD
Dynamic behavior can be used to guide circuit optimization for common
cases. In Chapter 4, TCF is used to analyze circuit dynamic behavior. To
derive the dynamic behavior curve of a circuit, a global TCF needs to be
constructed and represented with a global BDD. One problem of using global
TCF and BDD is that it may run into scalability issues. In this chapter, I
present a novel way to analyze circuit dynamic behavior with timed ternary
decision diagrams (tTDD) that solves the scalability issue of using global
TCF.
5.1 Need for circuit dynamic behavior analysis
The idea behind timing speculation and BTWC design is based on the ob-
servation that even if two POs have the same static critical path delay, their
dynamic behaviors can be very different. For example, in Figure 5.1, two POs
have the same static critical path delay. But one PO (A) has a large probability
Ps = 99% of being stabilized by primary inputs (PI) as early as t. For the
other PO (B), the probability of stabilization at time t is only Ps = 53%.
The difference in stabilization probabilities makes B more dynamically critical
than A because when the circuit is over-clocked at t, B fails frequently
while A may still be able to produce the correct outputs 99% of the time.
Knowing such behavior is the key to optimizing BTWC circuits for timing
speculation.
Figure 5.1: Dynamic behavior curves of two primary outputs
Many previous works use Razor logic [3] or other error-correcting schemes
to enable timing speculation [4, 18, 32, 48]. The BlueShift work described
in Chapter 3 utilized a commercial design flow to optimize the dynamically
critical nodes to achieve higher throughput working with either Razor logic
or an error-checking processor. In BlueShift, the dynamic behavior is collected
through timed simulation, which is very time-consuming. In [48], power-aware
slack redistribution was proposed to shift the slack of frequently exercised
and near-critical timing paths in a power-efficient manner. It requires
knowledge of dynamic behavior not only of the whole circuit but also of
individual POs, which again was achieved through simulation. To improve
microprocessor performance and energy efficiency, Intel’s study [32] reduced
the timing guardband by using embedded error-detection sequential (EDS)
circuits to tolerate dynamic variations. Their work explored path-activation
probabilities across various workloads and chose operating points based on
the path-delay histogram and path-activation probabilities. DynaTune [18]
proposed an analytical approach to compute the dynamic behavior curve of a
circuit using a timed characteristic function and BDD. The dynamic behav-
ior is captured in the form of a behavior curve, which is similar to the error
rate versus clock frequency curve used in [32]. Figure 5.1 shows an example
of behavior curves. A behavior curve is a curve with axes tclk and P , where
tclk is the operating clock period and P is the probability that the circuit (or
a primary output - PO) can produce correct results within tclk. By varying
tclk, one can plot all (tclk, P ) pairs to get the behavior curve representing the
dynamic behavior of a circuit. Guided with this behavior curve, DynaTune
optimizes a circuit for higher throughput using dual threshold voltage (Vt)
assignment. Understanding the dynamic behavior of a circuit can help these
BTWC tools to optimize dynamically critical paths. It can also be used to
guide circuit optimization for resilience to environmental variations, such as
voltage droop [32], by speeding up dynamically critical POs.
All of the works mentioned above require a mechanism to characterize the
dynamic behavior of individual gates, POs, or the whole circuit. Unfortu-
nately, most of the works achieved this through netlist simulation, which is
very time-consuming. The behavior curves derived by DynaTune are accurate,
but DynaTune may not be able to handle large circuits because it uses a
global BDD to capture the behavior of the entire circuit.
To derive the dynamic behavior curve, a novel approach is proposed in this
chapter. It includes:
1. the use of a timed ternary decision diagram (tTDD) to represent sta-
bilization conditions;
2. two tTDD-associated rules to calculate the dynamic behavior curve of
a partitioned sub-circuit:
(a) false assignment pruning, and
(b) random variable compaction;
3. a novel partitioning heuristic to produce sub-circuits that are suitable
for tTDD calculation. To achieve high accuracy during probability
calculations, a signal’s temporal correlation and a circuit’s structural
correlation are accounted for during analysis.
This chapter is organized as follows. Section 5.2 summarizes related works.
Section 5.3 introduces preliminaries. Section 5.4 introduces encoding stabi-
lization conditions with tTDD. Section 5.5 shows the overall procedure of
computing stabilization probabilities. Section 5.6 presents two techniques
to deal with temporal correlation introduced by tTDD. Section 5.8 presents
a new partitioning heuristic to minimize structural correlation. Section 5.9
presents experimental results.
5.2 Works related to tTDD
tTDD proposed in [49] is a novel approach to evaluate a circuit’s dynamic
behavior using the concept of dynamic curve. Such a new angle is particu-
larly interesting for the applications of timing speculation and the analysis of
timing error. To the best of our knowledge, very few quantitative studies exist
in this field for us to compare with. Nevertheless, there are proposed ideas
with different levels of similarity for solving a variety of problems. To deal
with timing properties associated with logic cells, the timed Boolean function
(TBF) [50] was proposed as a systematic way to incorporate delay information
into the logic functionality. Hence, it can represent arbitrary digital logic
waveforms and Boolean functions with timing properties. The TCF used in [42]
is in fact a member of this family of techniques. In this work, besides
incorporating timing properties in functionality, we also extend the use of timing
properties into decision diagrams and use TDD to model the indeterminate
unstable status explicitly.
Besides the traditional BDD, different decision diagram forms have been
proposed in the literature to represent various forms of functions. As surveyed
in [51], the ternary decision diagram (TDD) was proposed to explicitly express
a function where each variable has three decision outcomes. As a result, a
TDD can have up to three decision terminals. If more decision terminals are
required, the algebraic decision diagram (ADD) was proposed in [52] to represent
such a decision process. As a tradeoff for this flexibility, the size of an ADD
grows fast as the number of terminals increases. The multiple-valued decision
diagram (MDD) [53] is the most generalized form to represent arbitrary decision
processes. To summarize, BDD is the most concise decision diagram form
for Boolean functions, and MDD is the most general form that can be used
as a container of any specialized decision diagram form. For example, TDD
and ADD are specialized decision diagram forms. To concisely express
dynamic behavior, we choose to use tTDD for behavior analysis.
Annotating decision edges with probability quantities was used in [44] when
calculating signal static probabilities to derive transition density. In that
work, the decision edges on a decision path were treated as independent. A
more general form of edge annotation was proposed in [54] as edge-valued de-
cision diagrams (EV-DD). In this chapter, besides annotating decision edges,
we also propose a complete set of rules to calculate probabilities of decision
paths considering correlations.
5.3 tTDD preliminaries and definitions
For convenience, let us denote AT as arrival time, RAT as required arrival
time, PI as primary input, PO as primary output, and tclk as the clock period.
In general, n is used to represent the output of a node in a circuit or a PO.
If we observe the switching activities on an output n over a large number
of clock cycles in the duration of its operation, we can find that in some clock
cycles n is stabilized very early, and in other clock cycles n is stabilized very
late. This is because some input vectors applied on the PIs are hard w.r.t.
n, in that n takes more time to compute. On the other hand, some input
vectors are relatively easy w.r.t. n. As shown in Figure 5.1, the cumulative
distribution of n’s stabilization time can be quantified by consolidating n’s
scattered activities over its execution life into a single probabilistic cycle,
which captures the probabilistic behavior of the output in a single clock
cycle. The probability of n being stabilized within a delay t is denoted as
Ps(n, t) and is reflected as a point on n’s behavior curve with an x-coordinate
equal to t.
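The consolidation of per-cycle stabilization times into a cumulative distribution can be illustrated with a small sketch. It assumes that a list of observed stabilization times for n is already available (for example, from timed simulation); the function name and the sample values are hypothetical, and this empirical view is exactly what the analytical approach of this chapter aims to avoid computing by simulation.

```python
from bisect import bisect_right


def behavior_curve(stab_times_ns, t_points_ns):
    """Empirical Ps(n, t): fraction of cycles in which n is stable no later than t."""
    ordered = sorted(stab_times_ns)
    total = len(ordered)
    return [(t, bisect_right(ordered, t) / total) for t in t_points_ns]


# Hypothetical stabilization times of n observed over ten clock cycles (ns).
observed = [0.4, 0.4, 0.7, 0.9, 0.9, 0.9, 1.2, 1.5, 1.5, 1.5]
curve = behavior_curve(observed, [0.5, 1.0, 1.5])
print(curve)
```

Each returned pair is one point (t, Ps(n, t)) on the behavior curve of n; the curve is non-decreasing and reaches 1 once t is no smaller than the latest observed stabilization time.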
Given that a dynamic behavior curve can be useful in circuit optimization
to achieve the goal of BTWC design or dynamic variation resilience, it is
useful to derive the dynamic behavior curve of a circuit quickly without
going through time-consuming simulation. The problem of interest can be
described as below.
Given a netlist, static probabilities of its PIs and a specified time point
t, what is the stabilization probability Ps(n, t) that a specified node n can be
stabilized within a delay of t after applying input vectors to PIs according to
the static probabilities?
5.3.1 Definitions
First of all, we can define the term stabilization in a circuit as follows:
Definition 3. Given an output n and a timing requirement t, we say n is
stabilized no later than t if n has taken either logic 0 or logic 1 after a delay
of t since applying the input vectors at the rising clock edge, and n will not
change its value thereafter within the same clock cycle. n is stabilized-to-1
at t if it is stabilized no later than t and takes value of logic 1. Similarly, n
is stabilized-to-0 at t if it is stabilized no later than t and takes value of
logic 0.
The phrases “n is stabilized no later than t” and “n is stabilized at t” are
used interchangeably.
5.3.2 Introducing behavior graph
A behavior curve captures a node n’s stabilization time over its whole ex-
ecution life into a single probabilistic cycle with a cumulative distribution
function Ps(n, t). To compute Ps(n, t) without simulation, it is desirable to
do probabilistic reasoning based on behaviors of n’s inputs. To achieve this,
we introduce an auxiliary random variable Xn(t) defined within the proba-
bilistic cycle w.r.t node n and a temporal term t.
Xn(t) : a random variable used to model the status of n when n is observed
after a delay t since applying the input vectors at the clock rising edge
at time 0.
Xn(t) ∈ {0, 1, U}: observe n after a delay t, Xn(t) will be in one of the
three disjoint statuses {0, 1, U}:
• Status 0: Xn(t) = 0 - n is stabilized-to-0 at t
• Status 1: Xn(t) = 1 - n is stabilized-to-1 at t
• Status U: Xn(t) = U - n is un-stable at t
Definition 4. The timing tag of Xn(t) is the time point t of interest at
which we want to know the dynamic behavior of node n.
Definition 5. The phase tag of Xn(t) is the status in the set {0, 1, U}
where node n can be at the time t. Moreover, probabilities can be associated
with these observed statuses.
• Pr[Xn(t) = 0]: the probability of n being stabilized-to-0 at t
• Pr[Xn(t) = 1]: the probability of n being stabilized-to-1 at t
• Pr[Xn(t) = U ]: the probability of n being un-stable at t
For example, Pr[Xn3(0.5ns) = 1] = 36.4% means that if we apply a large
number of input vectors at the clock rising edge and then observe n3 at a
delay value of 0.5ns, we will find that n3 is stabilized-to-1 at 0.5ns with a
probability of 36.4%.
From the random variable’s definition, we have:
Pr[Xn(t) = 0] + Pr[Xn(t) = 1] + Pr[Xn(t) = U ] = 100% (5.1)
And the stabilization probability Ps(n, t) of interest is:
Ps(n, t) = Pr[Xn(t) ∈ {0, 1}] (5.2)
meaning that the probability of n being stabilized at t is the probability that
n is either stabilized-to-0 or stabilized-to-1 at t. And given that the statuses
{0, 1, U} of Xn(t) are disjoint, Ps(n, t) can be calculated as:
Ps(n, t) = Pr[Xn(t) = 0] + Pr[Xn(t) = 1] (5.3)
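Equations 5.1 and 5.3 can be illustrated with a minimal sketch. It assumes a hypothetical table of outcomes for a node, where each entry gives the probability of a group of input vectors, the value the node finally takes, and the time at which it stabilizes. The table below is invented so that the stabilized-to-1 probability at 0.5ns matches the 36.4% of the earlier example; all other numbers and the function names are placeholders.

```python
def status_probabilities(outcomes, t):
    """Return Pr[Xn(t) = 0], Pr[Xn(t) = 1], and Pr[Xn(t) = U] as a dict."""
    probs = {0: 0.0, 1: 0.0, "U": 0.0}
    for weight, value, stab_time in outcomes:
        # The node counts as stabilized-to-value only if it settled no later than t.
        status = value if stab_time <= t else "U"
        probs[status] += weight
    return probs


def stabilization_probability(outcomes, t):
    """Equation 5.3: Ps(n, t) = Pr[Xn(t) = 0] + Pr[Xn(t) = 1]."""
    probs = status_probabilities(outcomes, t)
    return probs[0] + probs[1]


# Hypothetical outcomes: (probability, final value, stabilization time in ns).
outcomes_n3 = [(0.364, 1, 0.4), (0.200, 0, 0.5), (0.236, 1, 0.9), (0.200, 0, 1.2)]
print(status_probabilities(outcomes_n3, 0.5))
print(stabilization_probability(outcomes_n3, 0.5))
```

Because every outcome falls into exactly one of the three statuses, the three probabilities always sum to 100%, as Equation 5.1 requires.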
To understand the dynamic behavior of a circuit, we introduce a new con-
cept of behavior graph, which consists of stabilized-to-0 curve and stabilized-
to-1 curve.
As shown in Figure 5.2, we can plot behavior curves for function S0(t) :=
Pr[Xn(t) = 0] by varying t to get stabilized-to-0 curve (lower, curve going
up). Similarly, we can plot curves for function S1(t) := 1−Pr[Xn(t) = 1] by
varying t to get stabilized-to-1 curve (upper, curve going down). This pair
of behavior curves forms a behavior graph.
For example, if we observe n at time t=1.3ns in this behavior graph, the
probability of n being stabilized-to-1 is
Pr[Xn(1.3ns) = 1] = 29%
and the probability of n being stabilized-to-0 is
Pr[Xn(1.3ns) = 0] = 44%
With the behavior graph, n’s stabilization probability Ps(n, t) can be calculated
with Equation 5.3 as Ps(n, t) = 29% + 44% = 73%.
Figure 5.2: Behavior graph of a node n
In a behavior graph, the two behavior curves of n converge as time goes toward
n’s latest AT. The area between these two curves is the unstable range. This
range represents the probability that n has not been stabilized. For example,
in Figure 5.2,
Pr[Xn(1.3ns) = U ] = 27%
means that the probability of n switching after t is 27%. Note that if the
given time point t is less than n’s earliest AT, the computation of Ps(n, t) is
trivial in that
Ps(n, t) = 0%
and
Pr[Xn(t) = U ] = 100%
If t is larger than n’s latest AT, it is also trivial in that
Ps(n, t) = 100%
and
Pr[Xn(t) = U ] = 0%
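The relation between the two curves and the unstable range can be checked numerically at a single time point. The sketch below uses the definitions S0(t) = Pr[Xn(t) = 0] and S1(t) = 1 − Pr[Xn(t) = 1] given above; the function name is illustrative, and the 60%/40% split used for the late-time case is a hypothetical choice.

```python
def behavior_graph_point(p0, p1):
    """Return (S0, S1, gap) at one time point; gap is the unstable probability."""
    s0 = p0          # stabilized-to-0 curve (lower curve, going up)
    s1 = 1.0 - p1    # stabilized-to-1 curve (upper curve, going down)
    return s0, s1, s1 - s0


# Values read at t = 1.3ns in the example: Pr[X=0] = 44%, Pr[X=1] = 29%.
print(behavior_graph_point(0.44, 0.29))

# Before the earliest AT nothing is stable; after the latest AT the curves meet.
print(behavior_graph_point(0.0, 0.0))
print(behavior_graph_point(0.6, 0.4))  # hypothetical final split between 0 and 1
```

At t = 1.3ns the vertical gap between the curves is 0.71 − 0.44 = 0.27, which is the 27% unstable probability quoted above.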
Assuming we are given behavior graphs associated with a sub-circuit’s
inputs, it is possible to compute the stabilization probabilities of this sub-
62
circuit’s outputs. I will show next how to accomplish this with the help of
TCF and timed inputs.
5.4 Encode stabilization conditions with tTDD
In this section, I will first review the concept of TCF and introduce the
concepts of timed input and timed support set. Then, I describe how to use
these to encode the stabilization conditions.
5.4.1 Value-specified timed characteristic function
In Chapter 4, TCF is defined and used to derive the circuit dynamic behavior
curve. Because a dynamic curve does not distinguish stabilized-to-0 from
stabilized-to-1, the TCF used in Chapter 4 is not value-specified, in that
n’s stabilization value can be 0 or 1. In this chapter, to use TCF to derive
a behavior graph, which consists of a stabilized-to-0 curve and a stabilized-to-1
curve, we need to distinguish the stabilization value. Therefore, the TCF used
in this chapter is defined as a value-specified timed characteristic function, as below:
Definition 6. A value-specified timed characteristic function T (n =
val, t) of a logic cone rooted at n is a Boolean function that evaluates to true
for the input vectors if and only if they stabilize an output n to value val no
later than time t.
Note that a value-specified TCF has to stabilize n to a specified value
val, as opposed to the TCF used in Chapter 4, where no stabilization value
is specified. In fact, the argument val in T() inherently distinguishes a
value-specified TCF from the TCF of Chapter 4. Therefore, in this chapter a
value-specified TCF is simply referred to as a TCF without confusion.
Given the delay of a cell and the required time t, the TCF of the cell
output can be written recursively as TCF functions of its immediate inputs
using sensitization criteria.
Take an AND gate “n=AND(a,b)” with cell delay d for example. Given a
required time t and the required logic value {0,1} that n should be stabilized
to, the TCFs for n can be written as:
T (n = 1, t) = T (a = 1, t− d) ∧ T (b = 1, t− d) (5.4)
T (n = 0, t) = T (a = 0, t− d) ∨ T (b = 0, t− d) (5.5)
The first equation states that stabilizing n to 1 no later than t
requires both inputs a and b to be stabilized-to-1 at least d time units earlier,
that is, no later than t − d. The second equation states that n can be
stabilized-to-0 no later than t if either input a or b has been stabilized-to-0
no later than t − d.
Similarly, the TCFs for an OR gate “n=OR(a,b)” are:
T (n = 1, t) = T (a = 1, t− d) ∨ T (b = 1, t− d) (5.6)
T (n = 0, t) = T (a = 0, t− d) ∧ T (b = 0, t− d) (5.7)
The TCFs for an INV gate “n=INV(a)” are:
T (n = 1, t) = T (a = 0, t− d) (5.8)
T (n = 0, t) = T (a = 1, t− d) (5.9)
The TCFs of complex cells can be written in the same manner by first
decomposing complex cells into AND/OR/INV netlist.
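The rewriting rules in Equations 5.4 to 5.9 can be sketched as a small recursive procedure. The Python below is only an illustration under stated assumptions, not the implementation used in this work: the `Gate` record, the `tcf` function and the nested-tuple expression format are hypothetical names introduced here. It assumes one fixed delay per cell and a netlist already decomposed into AND/OR/INV cells. A node that does not appear in the netlist is treated as an input of the cone and is kept as a timed literal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    kind: str      # "AND", "OR" or "INV"
    inputs: tuple  # names of the immediate fan-in nodes
    delay: float   # fixed cell delay d (simplifying assumption)


def tcf(netlist, node, val, t):
    """Build T(node = val, t) as a nested tuple expression."""
    if node not in netlist:
        # Input of the cone: keep a timed literal (name, value, time).
        return ("lit", node, val, t)
    gate = netlist[node]
    t_in = t - gate.delay
    if gate.kind == "INV":
        # Equations 5.8 and 5.9: the input must stabilize to the opposite value.
        return tcf(netlist, gate.inputs[0], 1 - val, t_in)
    if gate.kind not in ("AND", "OR"):
        raise ValueError("unsupported cell: " + gate.kind)
    # A controlling value on any single input decides the output (OR of TCFs);
    # otherwise every input must carry the non-controlling value (AND of TCFs).
    controlling = 0 if gate.kind == "AND" else 1
    op = "or" if val == controlling else "and"
    return (op,) + tuple(tcf(netlist, i, val, t_in) for i in gate.inputs)


# Example: n = AND(a, b) with d = 1ns, required time t = 5ns.
netlist = {"n": Gate("AND", ("a", "b"), 1.0)}
t_n1 = tcf(netlist, "n", 1, 5.0)  # corresponds to Equation 5.4
t_n0 = tcf(netlist, "n", 0, 5.0)  # corresponds to Equation 5.5
```

Keeping the result as a plain expression tree keeps the sketch independent of any decision-diagram package.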
5.4.2 Encode stabilization conditions
To calculate the stabilization probability, a new data structure is needed
to represent the stabilization conditions. First, I introduce the concepts of
timed input and timed support set, which are used in the proposed new data
structure.
A timed input vi(t) of n is an ordinary input vi of n coupled with a temporal
term t reflecting the timing relation between the input vi and the output n.
The support set of n, denoted as Sup(n) is the set of the inputs to the circuit
rooted at n. Furthermore, n’s timed support set, denoted as tSup(n, t), is
the set of all timed inputs {vi(t)} that n’s status at time t depends on. The
tSup(n, t) is determined by back-tracing the fan-in cone of n.
Take the circuit in Figure 5.3 for example. Assume we are working on a
sub-circuit rooted at an output n after the whole circuit has been partitioned.
We have Sup(n) = {a, b}. Assume we are interested in n’s dynamic behavior
at time t. We know that n’s status at t is influenced by the following paths:
• a → p → n : with path delay 3ns,
• b → p → n : with path delay 3ns,
• b → n : with path delay 1ns.
Figure 5.3: A partitioned sub-circuit example
In other words, n’s status at time t is totally determined by a’s status at
t-3, b’s status at t-3 and b’s status at t-1. Therefore,
tSup(n, t) = {a(t− 3), b(t− 1), b(t− 3)}
Note that the internal structure of the circuit makes b in Sup(n) become
multiple correlated timed inputs b(t− 1), b(t− 3) in tSup(n, t).
For clarity, an unrealistic delay model is used in this example by assuming
each cell has a 1ns unit delay. However, in the experiments, a realistic pin-
to-pin delay model is used.
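The back-tracing that produces tSup(n, t) can be sketched as follows. This is a minimal illustration under assumptions: the function name `timed_support` and the dictionary layout are hypothetical, and the cell delays (1ns for the cell driving n, 2ns for the cell driving p) are chosen here only so that the accumulated path delays match the 3ns and 1ns quoted in the example. Each returned pair (v, d) stands for the timed input v(t − d).

```python
def timed_support(fanin, node, offset=0.0):
    """Return {(input name, accumulated path delay)} for the cone rooted at node."""
    if node not in fanin:
        # Reached an input of the sub-circuit: record its offset from t.
        return {(node, offset)}
    delay, inputs = fanin[node]
    support = set()
    for src in inputs:
        support |= timed_support(fanin, src, offset + delay)
    return support


# Hypothetical delays chosen to reproduce the path delays quoted above.
fanin = {
    "n": (1.0, ("p", "b")),  # n is driven by p and b through a 1ns cell
    "p": (2.0, ("a", "b")),  # p is driven by a and b through a 2ns cell
}
tsup = timed_support(fanin, "n")
sup = {name for name, _ in tsup}
```

Using a set merges identical timed inputs reached through different paths, while the same input with different offsets, such as b(t − 1) and b(t − 3), stays as separate entries.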
5.4.3 Stabilization conditions for n
TCF enables us to understand n’s stabilization conditions at time t by
enumerating all sensitization conditions. This enumeration is done by recursive
TCF rewriting [18,42]. For example, if we are interested in what makes n in
Figure 5.3 stabilized-to-1 at t, we can recursively write n’s TCF T (n = 1, t)
through back-tracing n’s fan-in cone:
T (n = 1, t) = T (p = 1, t− 1) ∨ T (b = 1, t− 1)
= (T (a = 1, t− 3) ∧ T (b = 0, t− 3)) ∨
(T (a = 0, t− 3) ∧ T (b = 1, t− 3)) ∨
T (b = 1, t− 1)
= (a(t− 3) ∧ b′(t− 3)) ∨
(a′(t− 3) ∧ b(t− 3)) ∨
b(t− 1) (5.10)
The first and the second derivation steps follow from the OR and XOR
sensitization criteria, respectively. The last derivation step is based on the
fact that the TCF of a circuit input is just the timed input itself.
Furthermore, we can evaluate the temporal term t in a TCF to any desired
value. For example, if we are interested in output n’s dynamic behavior at
5ns after the rising clock edge, then t can be substituted with 5ns to get an
evaluated TCF :
T (n = 1, 5ns) = (a(2ns) ∧ b′(2ns)) ∨
(a′(2ns) ∧ b(2ns)) ∨ b(4ns) (5.11)
The evaluated TCF compresses information of all feasible input conditions
that can stabilize n to logic 1 at 5ns into a single formula. For example,
the above formula states that n can be stabilized-to-1 by 5ns if any of the
following encoded stabilization conditions is satisfied:
1. a(2ns) ∧ b′(2ns): a is stabilized-to-1 by 2ns and b is stabilized-to-0 by
2ns;
2. a′(2ns) ∧ b(2ns): a is stabilized-to-0 by 2ns and b is stabilized-to-1 by
2ns;
3. b(4ns): b is stabilized-to-1 by 4ns.
Next, with the stabilization conditions we can build a decision diagram to
facilitate the computation of the probability that n can be stabilized-to-1 at
t.
5.4.4 tTDD for stabilization conditions
In Chapter 4, the original recursive TCF construction procedure is trans-
formed into a different form, in which the characteristic function (CF) for
each node in a circuit is first constructed as a BDD. Then the sensitization
criteria are used to refine the CF. This transformation has the benefit that
the timing information is decoupled from the variables in the BDD during
construction so that an ordinary BDD can be used for probability compu-
tation. However, this necessitates the construction of a global BDD using
circuit PIs as variables, which may cause a scalability problem.
Moreover, another fundamental problem of using BDD is its inability to
correctly model the ‘U ’ status. Each BDD node has exactly two outgoing
edges to explicitly model only two possible assignments {0, 1} for a random
variable Xv(t). This ignores the effect of the ‘U ’ value of Xv(t) in Equa-
tion 5.1. As a result, it cannot correctly handle the unstable ranges of the
inputs of a sub-circuit due to circuit partitioning. This problem will be dis-
cussed in detail later in Section 5.5.
To solve the above problems, a new data structure is needed to work
on partitioned sub-circuits and handle unstable ranges. In fact, with the
existence of the unstable range in the behavior graph, the stabilization status
of each timed input vi(t) needs to be modeled as a three-value system. Hence,
we introduce timed ternary decision diagram (tTDD) as a solution.
A ternary decision diagram (TDD) is similar to a BDD with the difference
that each node has three possible outgoing branches [51]. Thanks to the
TDD’s similarity to BDD, a TDD can be built in an almost identical way as
building a BDD. The difference between a tTDD and an ordinary TDD is
that each decision node in a tTDD has a temporal term associated with it.
The tTDD for the example circuit in Figure 5.3 is shown in Figure 5.4. In
this tTDD, the temporal term t is kept as a variable without being evaluated.
Each node in the tTDD is associated with a timed input vi(t) marked on the
left side. For example, V0 is associated with timed input a(t − 3). Given
that a tTDD is a three-value system, the stabilization condition of a certain
input being unstabilized can now be explicitly modeled with the introduction
of the ‘U ’ edge.
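To make the three-way branching concrete, the following Python sketch builds one possible tTDD for T(n = 1, t) of Equation 5.10 and walks it under a given assignment of stabilization statuses. The `Node` class, the `follow` function and, in particular, the edge layout are assumptions made for illustration; the layout is a plausible reconstruction and may differ from the diagram in Figure 5.4. The temporal term is stored as an offset, so the structure stays generic in t.

```python
from dataclasses import dataclass

STATUSES = ("0", "1", "U")


@dataclass(frozen=True)
class Node:
    name: str      # label such as "V0"
    var: str       # input name
    offset: float  # the node decides on the timed input var(t - offset)
    edges: tuple   # children for statuses "0", "1", "U", in that order


def follow(node, status):
    """Walk to a terminal; status maps (var, offset) to "0", "1" or "U"."""
    while not isinstance(node, bool):
        chosen = status[(node.var, node.offset)]
        node = node.edges[STATUSES.index(chosen)]
    return node


# Hypothetical layout: V3 decides on b(t-1), V1 and V2 on b(t-3), V0 on a(t-3).
V3 = Node("V3", "b", 1.0, (False, True, False))
V1 = Node("V1", "b", 3.0, (True, V3, V3))   # reached when a(t-3) is stabilized-to-1
V2 = Node("V2", "b", 3.0, (V3, True, V3))   # reached when a(t-3) is stabilized-to-0
V0 = Node("V0", "a", 3.0, (V2, V1, V3))     # root; its U-edge falls through to b(t-1)

# Path V0 -> V1 -> T: a(t-3) = 1 and b(t-3) = 0.
hit = follow(V0, {("a", 3.0): "1", ("b", 3.0): "0", ("b", 1.0): "U"})
# Path V0 -> V3 -> T: a(t-3) = U and b(t-1) = 1.
via_u = follow(V0, {("a", 3.0): "U", ("b", 3.0): "U", ("b", 1.0): "1"})
```

The U-edge out of V0 is what lets the condition b(t − 1) be reached even when a has not stabilized.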
With the enhanced tTDD model of the stabilization conditions, we can now
calculate the probability Pr[Xn(t) = 1] by collecting the probabilities of all
Figure 5.4: Encoding stabilization conditions with tTDD
stabilization conditions. Note that the probability of n being stabilized-to-0
at time t can be calculated in the same manner, starting with constructing
a TCF of n with required stabilization value 0, that is T (n = 0, t). A tTDD
can be further classified as tTDD-0 or tTDD-1 depending on the required
stabilization value val. Finally, according to Equation 5.3, the stabilization
probability Ps(n, t) is simply the sum of these two probabilities calculated
on tTDD-0 and tTDD-1. Therefore, the behavior graph of n can be plotted
by evaluating the variable t at different timing points. Notice that all rules
that I will discuss shortly apply to both tTDD-0 and tTDD-1 symmetrically,
so for conciseness I will refer to the tTDD-1 as tTDD for short if it will not
introduce confusion.
5.5 Stabilization probability calculation
To find a stabilization condition that makes n stabilized-to-1 at a given time t
is equivalent to solving a three-value satisfying assignment problem: we need
to find a feasible combination of stabilization statuses for the timed support
set that can evaluate the right-hand side (RHS) of the TCF formula to true.
The possible statuses a timed input can take are: {stabilized-to-0,
stabilized-to-1, unstable}. They correspond exactly to the three statuses of the
auxiliary random variable Xv(t), as defined in Section 5.3. The relation
between the tTDD and the auxiliary random variable Xv(t) can be viewed
as follows: each tTDD node is associated with a timed input v(t) and has
three outgoing edges {0, 1, U} to reflect the three possible stabilization
statuses of its associated input v at time t. Taking the 0-edge from this node is
equivalent to input v having status ‘Xv(t) = 0’, meaning v is stabilized-to-0
at t. Similarly, the 1-edge corresponds to ‘Xv(t) = 1’ and the U-edge corresponds
to ‘Xv(t) = U’.
From the TCF, we know that the stabilization status of a circuit output
depends on the stabilization statuses of the circuit’s inputs. Let us assume v
is an input of a circuit rooted at n. Assume that, due to the existence of multiple
timing paths from v to n, v becomes two (or more) timed inputs v(t1)
and v(t2) in n’s tTDD. As a result, n’s stabilization status in a clock cycle
c depends on the stabilization statuses of v at both times t1 and t2 in the same
clock cycle c. Note that in this clock cycle c, the stabilization statuses of v(t1)
and v(t2) are highly correlated. Their temporal correlations are stated as
follows:
Lemma 5.5.1. Pr[Xv(t1) = 1 ∩ Xv(t2) = Val2] = 0, if t2 ≥ t1 and Val2 ∈ {0, U}
Lemma 5.5.2. Pr[Xv(t1) = 0 ∩ Xv(t2) = Val2] = 0, if t2 ≥ t1 and Val2 ∈ {1, U}
Proof. This is in fact the direct result of the definitions of stabilization. If
we observe that v has been stabilized to a specific value at time t1, that is
‘Xv(t1) = 1/0’, then:
1. v will not switch to a different value in the same cycle c at a later time
t2, that is ‘Xv(t2) = 0/1’.
2. It is also impossible to observe that v is still unstable, that is ‘Xv(t2) =
U ’, at a later time t2 ≥ t1 in the same cycle. Note that this lemma
holds even for t1 = t2 because of the fact that {0, 1, U} are inherently
disjoint statuses of Xv(t).
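The two lemmas can be read as a simple feasibility rule on the statuses of one input at two time points of the same cycle. The Python sketch below encodes that rule; the function name `jointly_possible` and the string encoding of the statuses are assumptions introduced here. A return value of False means the lemmas force the joint probability to zero, while True only means the combination is not excluded by them.

```python
def jointly_possible(val1, t1, val2, t2):
    """Can Xv(t1) = val1 and Xv(t2) = val2 both hold in the same clock cycle?"""
    if t1 > t2:
        # Reorder so that (val1, t1) is the earlier observation.
        val1, t1, val2, t2 = val2, t2, val1, t1
    if t1 == t2:
        # The statuses {0, 1, U} are disjoint at a single time point.
        return val1 == val2
    if val1 in ("0", "1"):
        # Once stabilized, v can neither flip nor become unstable again.
        return val2 == val1
    # Unstable at the earlier time leaves every later status open.
    return True


excluded = not jointly_possible("1", 2.0, "U", 4.0)  # Lemma 5.5.1 with Val2 = U
allowed = jointly_possible("U", 2.0, "1", 4.0)       # b stabilizes between 2ns and 4ns
```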
Using a decision diagram to represent a TCF facilitates probability calcu-
lation, in that a satisfying assignment of the TCF is equivalent to a connected
path from the root node to the true terminal. In the context of stabilization
analysis, each stabilization condition simply corresponds to a path from the
tTDD’s root node to the true terminal.
Definition 7. A satisfying path Ψj in a tTDD is a path from the root
node to the true terminal, where j is the path ID. On a satisfying path Ψj, the
branch taken at each node is the satisfying assignment of the stabilization
status, i.e., one of {0, 1, U}, for the corresponding timed input.
For example, for the circuit in Figure 5.3, to stabilize n to 1 at t, one
stabilization condition is:
• ‘a is stabilized-to-1 at t − 3’ and
• ‘b is stabilized-to-0 at t − 3’
This condition is represented as a satisfying path “V0 → V1 → T” in
Figure 5.4. Equivalently, it can be represented with the random variables as
[Xa(t − 3) = 1] ∩ [Xb(t − 3) = 0]
where ∩ denotes the joint event of the two random variables. Because a 1-edge
and a 0-edge are taken, the associated satisfying assignment is:
[Xa(t − 3), Xb(t − 3)] = [1, 0]
Another satisfying path is “V0 → V3 → T”, corresponding to the
assignment:
[Xa(t − 3), Xb(t − 1)] = [U, 1]
Note that the satisfying paths only exist in a tTDD to encode stabilization
conditions and do not reflect any physical paths in the circuit.
The probability of one satisfying path Ψj is the joint probability of the
satisfying assignment associated with Ψj:
\[ \Pr[\Psi_j] = \Pr\Big[ \bigcap_{i=0}^{L} \big( X_{v_i}(t_i) = Val_i \big) \Big] \qquad (5.12) \]
where L is the length of the path Ψj and (Xvi(ti) = Vali) is the assignment
to each random variable along this satisfying path.
With all stabilization-to-1 conditions being encoded in the tTDD-1 data
structure, the probability (Pr[Xn(t) = 1]) of stabilizing n to 1 no later than
t is just the collection of probabilities of all possible stabilization conditions:
\[ \Pr[X_n(t) = 1] = \Pr\Big[ \bigcup_{j=0}^{S} \Psi_j \Big] \qquad (5.13) \]
where Ψj is a satisfying path in n’s tTDD-1, and S is the total number of
satisfying paths.
Next, I show that the above probability can simply be computed as the
sum of the probabilities of all distinct satisfying paths.
Theorem 5.5.3.
\[ \Pr\Big[ \bigcup_{j=0}^{S} \Psi_j \Big] = \sum_{j=0}^{S} \Pr[\Psi_j] \]
Proof. This can be proved by showing that any two satisfying paths in the
tTDD are probabilistically disjoint.
In a tTDD, any two different paths Ψi and Ψj, i ≠ j, must have
at least one node shared. Otherwise, if they do not share nodes at all, then
the tTDD must have at least two root nodes, which contradicts the fact that
a tTDD has only one root node. Secondly, among all their shared nodes
there must be at least one shared node from which Ψi and Ψj take different
branching edges. Otherwise, if Ψi and Ψj took the same branching edge at
every shared node, they would be identical, which contradicts our assumption
that Ψi and Ψj are different paths.
Let us denote a particular shared node in the tTDD as G, and denote the
timed input associated with G as g(t). From node G, Ψi and Ψj take two
different branching edges Val1 and Val2, respectively. For Ψi, the satisfying
assignment at node G is Xg(t) = Val1; while for Ψj the satisfying assignment
at node G is Xg(t) = Val2.
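The joint probability of these two satisfying paths is:
\[ \Pr[\Psi_i \cap \Psi_j] = \Pr\Big[ \Big[ \bigcap_{i=0}^{L_i} \big( X_{v_i}(t_i) = Val_i \big) \Big] \cap \Big[ \bigcap_{j=0}^{L_j} \big( X_{v_j}(t_j) = Val_j \big) \Big] \Big] \]
We can write it conditionally on the stabilization status of G:
\[ P = \Pr\big[ [X_g(t) = Val_1] \cap [X_g(t) = Val_2] \big] \cdot \Pr\Big[ \Big\{ \bigcap_{i=0;\, v_i \neq g}^{L_i} \big( X_{v_i}(t_i) = Val_i \big) \cap \bigcap_{j=0;\, v_j \neq g}^{L_j} \big( X_{v_j}(t_j) = Val_j \big) \Big\} \,\Big|\, \big\{ (X_g(t) = Val_1) \cap (X_g(t) = Val_2) \big\} \Big] \]
Since Val1 and Val2 are different statuses, at least one of them is not the U status.
Then from Lemma 5.5.1 and Lemma 5.5.2 (applied with t1 = t2), we know:
\[ \Pr\big[ [X_g(t) = Val_1] \cap [X_g(t) = Val_2] \big] = 0 \]
Hence the joint probability of two distinct satisfying paths is zero.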
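The structural argument of this proof can be checked mechanically on a small diagram. The Python sketch below enumerates all satisfying paths of a tTDD and verifies that every pair takes different edges out of some shared node. The dictionary encoding, the function names and the edge layout are assumptions made for illustration; the layout is a hypothetical tTDD for Equation 5.10 and need not match Figure 5.4. Computing the probability of each individual path is not attempted here.

```python
from itertools import combinations

# Hypothetical tTDD: node -> (timed input as (name, offset), {status: child}).
TTDD = {
    "V0": (("a", 3.0), {"0": "V2", "1": "V1", "U": "V3"}),
    "V1": (("b", 3.0), {"0": True, "1": "V3", "U": "V3"}),
    "V2": (("b", 3.0), {"0": "V3", "1": True, "U": "V3"}),
    "V3": (("b", 1.0), {"0": False, "1": True, "U": False}),
}


def satisfying_paths(dd, root):
    """Yield each root-to-True path as a tuple of (node, status) decisions."""
    stack = [(root, ())]
    while stack:
        node, taken = stack.pop()
        if node is True:
            yield taken
        elif node is not False:
            for status, child in dd[node][1].items():
                stack.append((child, taken + ((node, status),)))


def diverge_at_shared_node(p, q):
    """True if p and q leave some node they share through different edges."""
    dp, dq = dict(p), dict(q)
    return any(dp[n] != dq[n] for n in dp.keys() & dq.keys())


paths = list(satisfying_paths(TTDD, "V0"))
all_disjoint = all(diverge_at_shared_node(p, q) for p, q in combinations(paths, 2))
```

Some enumerated paths, for example one that takes the 0-edge for b(t − 3) and the 1-edge for b(t − 1), have zero probability by Lemma 5.5.2, so they are harmless in the sum.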
5.6 Probability of one satisfying path
In this section, I show how to compute the probability of an individual sat-
isfying path. This is done with two steps:
1. evaluate tTDD at a given t and annotate the evaluated tTDD with
probabilities,
2. calculate the probability honoring the temporal correlation due to the
use of timed inputs.
5.6.1 Annotate a tTDD with probabilities
Before we start to calculate the probability of a satisfying path, we need to
1. evaluate the tTDD at a specified t to get an evaluated tTDD, and then
2. annotate the probabilities onto the edges of this evaluated tTDD by
looking up the corresponding probabilities from the inputs’ behavior
graphs.
The tTDD constructed so far is a generic tTDD with a variable t. We can
substitute t with a specified timing term. For example, in the sub-circuit in
Figure 5.3, if we are interested in n’s dynamic behavior at t = 5ns, then we
evaluate the tTDD at t = 5ns by substituting all t terms with 5ns. As a result,
a(t − 3) becomes a(2ns), b(t − 3) becomes b(2ns) and b(t − 1) becomes b(4ns).
Thus, n’s status at t = 5ns depends exactly on a’s status at 2ns and
b’s statuses at 2ns and 4ns.
Moreover, because the stabilization status of each timed input v(t) is mod-
eled with a random variable Xv(t), and a random variable has a value and an
associated probability, we need to annotate probabilities onto the tTDD af-
ter it is evaluated. This is done by assigning the corresponding probabilities
onto each node’s outgoing edges as edge weights. An automatic procedure is
used to look up and annotate a tTDD with the corresponding probabilities
from the inputs’ behavior graphs that are already computed in a topological
order.
For instance, assume we have the behavior graphs for the sub-circuit in-
puts {a, b} specified as Figure 5.5(a) and 5.5(b), respectively. For a(2ns), we
extract the probabilities for a being stabilized-to-0/1 at time point 2ns from
a’s behavior graph. Similarly, for b(2ns) and b(4ns), we extract probabilities
for b at time points 2ns and 4ns from b’s behavior graph. Then these ex-
tracted probabilities can be annotated onto the evaluated tTDD as the edge
weights. The tTDDs before and after evaluation and annotation are shown
in Figure 5.6(a) and 5.6(b), respectively.
Another advantage of this approach is that the structure of the generic
tTDD is fixed and can be reused for evaluation and annotation for different
t. For example, Figure 5.7(b) shows another evaluation and annotation of
the same tTDD with t = 3ns with only the edge weights changed, while the
decision diagram structure is intact. In contrast, the approach used in [18]
requires a unique decision diagram constructed for each different time point
t.
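The evaluation and annotation steps can be sketched as a lookup into the inputs’ behavior graphs. The Python below is a minimal illustration under assumptions: each behavior curve is stored as a sorted list of (time, cumulative probability) samples and read with a step lookup, the names `curve_value` and `annotate` are hypothetical, and the sample values are invented for the example rather than taken from Figure 5.5. Before the first sample of a curve the probability is taken as 0, and the U weight is whatever remains after the 0 and 1 weights.

```python
from bisect import bisect_right


def curve_value(curve, t):
    """Step lookup: last sampled probability at or before t, 0 before the first sample."""
    times = [time for time, _ in curve]
    k = bisect_right(times, t)
    return curve[k - 1][1] if k else 0.0


def annotate(behavior, timed_inputs, t):
    """Return edge weights for statuses "0", "1", "U" of every timed input v(t - offset)."""
    weights = {}
    for var, offset in timed_inputs:
        p0 = curve_value(behavior[var]["0"], t - offset)
        p1 = curve_value(behavior[var]["1"], t - offset)
        weights[(var, offset)] = {"0": p0, "1": p1, "U": 1.0 - p0 - p1}
    return weights


# Hypothetical behavior graphs (not the values of Figure 5.5).
behavior = {
    "a": {"0": [(1.0, 0.25), (3.0, 0.5)], "1": [(1.0, 0.25), (3.0, 0.5)]},
    "b": {"0": [(1.0, 0.125), (4.0, 0.5)], "1": [(2.0, 0.25), (4.0, 0.5)]},
}
tsup = [("a", 3.0), ("b", 3.0), ("b", 1.0)]

w5 = annotate(behavior, tsup, 5.0)  # looks up a(2ns), b(2ns), b(4ns)
w3 = annotate(behavior, tsup, 3.0)  # same timed inputs, only the weights change
```

Only the weights depend on t, which mirrors the point that the generic tTDD structure is reused across time points.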
5.6.2 Why not tBDD
As mentioned before, BDD has trouble modeling correctly the ‘U ’ status
defined in Section 5.3. This problem is illustrated in Figure 5.7. On the
B. Why not tBDD 
As mentioned before, BDD has trouble to model correctly 
the ‘U’ status defined in Section III. This problem is illustrated 
in Figure 7. On the left side is a decision diagram modeled 
with tBDD, while on the right is the proposed decision 
diagram modeled with tTDD. Assume we want to know n’s 
stabilization-to-1 probability at t=3ns. Then we evaluate both 
tBDD and tTDD with t=3ns and annotate the edges with 
probabilities extracted from a(0ns), b(0ns) and b(2ns) from 
Figure 5.  
The resulting tBDD is shown in Figure 7(a), in which both 
zero and one-edge of the tBDD from V0 are annotated with 
0% because that a has not been stabilized at all at t=0ns. 
Though this annotation is correct, it produces a big problem 
when we attempt to calculate the probability out of the 
decision diagram. That is because V0 is the root of the 
decision diagram, all satisfying paths have to go through either 
its one edge or zero edge, resulting in an incorrect zero 
probability of n being stabilized-to-1 at t=3ns.  
 
Figure 6: Annotate a generic tTDD with probabilities 
 
 
Figure 7: tTDD models the stabilization conditions accurately with the “U” 
edges 
 
On the other hand, from the circuit structure in Figure 3, we 
know that even if a is not stabilized at all (the ‘U’ status), n 
can still be stabilized-to-1 as long as input b can be stabilized 
to logic 1. In fact, as long as it is possible to stabilize b to 1 at 
t=2ns, the output n is guaranteed to be stabilized to 1. Hence, 
the probability of stabilizing n at t=3ns is non-zero. This 
discrepancy points out the limitation of tBDD because the 
stabilization condition may be incorrectly modeled, due to its 
incapability of modeling the “U” status. In contrast, tTDD 
doesn’t have this drawback. For example, the resulting tTDD 
after annotation is shown in Figure 7(b), in which an auxiliary 
“U” edge exists between V0 and V3 with an annotated 
(a) Input a’s behavior graph
(b) Input b’s behavior graph
Figure 5.5: Behavior graphs of a partitioned sub-circuit’s inputs
(a) A generic tTDD representing the sensitization conditions for an output n
at time t (1: solid line; 0: dashed line; U: dotted line)
(b) A tTDD evaluated at t=5ns and with
stabilization probabilities annotated (1:
solid line; 0: dashed line; U: dotted line)
Figure 5.6: Annotate a generic tTDD with probabilities
left side is a decision diagram modeled with tBDD, while on the right is the
proposed decision diagram modeled with tTDD. Assume we want to know
n’s stabilization-to-1 probability at t = 3ns. We then evaluate both the tBDD
and the tTDD at t = 3ns and annotate the edges with the probabilities extracted
for a(0ns), b(0ns) and b(2ns) from Figure 5.5.
The resulting tBDD is shown in Figure 5.7(a), in which both the zero-edge and
the one-edge from V0 are annotated with 0% because a has
not been stabilized at all at t = 0ns. Though this annotation is correct, it
causes a serious problem when we attempt to calculate the probability from
the decision diagram. Because V0 is the root of the decision
diagram, all satisfying paths have to go through either its one-edge or its
zero-edge, resulting in an incorrect zero probability of n being stabilized-to-1 at
t = 3ns.
On the other hand, from the circuit structure in Figure 5.3, it is obvious that
even if a is not stabilized at all (the ‘U’ status), n can still be stabilized-to-1
as long as input b can be stabilized to logic 1. In fact, as long as it is possible
to stabilize b to 1 at t = 2ns, the output n is guaranteed to be stabilized to 1.
Hence, the probability of stabilizing n at t = 3ns is non-zero. This discrepancy
points out the limitation of tBDD because the stabilization condition may
(a) A tBDD evaluated at t=3ns and with
stabilization probabilities annotated (1:
solid line; 0: dashed line; U: dotted line)
(b) A tTDD evaluated at t=3ns and with
stabilization probabilities annotated (1:
solid line; 0: dashed line; U: dotted line)
Figure 5.7: Comparison of tBDD with tTDD: tTDD models the stabilization
conditions accurately with the ‘U’ edges
be incorrectly modeled, due to its inability to model the ‘U’ status.
In contrast, tTDD does not have this drawback. For example, the resulting
tTDD after annotation is shown in Figure 5.7(b), in which an auxiliary
‘U’ edge exists between V0 and V3 with an annotated probability of 100%.
Together with the one-edge from V3 to the true terminal, the stabilization
condition that tBDD fails to model is now modeled as “V0 → V3 → T” with a
‘U’ edge connecting V0 to V3. The probability of stabilizing n to logic
1 may be calculated as 100% × 5% = 5%, which is exactly the probability of
b itself being stabilized-to-1 at t=1ns. Hence, by introducing the ‘U’ edge,
tTDD can correctly model the stabilization conditions that implicitly contain
the ‘U’ status.
As shown in the above example, representing ‘U’ implicitly in a BDD can
cause the decision paths in the decision diagram to be probabilistically
correlated. This is because ‘1’ is no longer the complement of ‘0’ once the
‘U’ status is introduced in the dynamic behavior analysis. Therefore, calculating
the joint probability of those correlated decision paths is complicated
when ‘U’ is implicit. On the other hand, using the proposed tTDD and
Theorem 5.5.3 eliminates the correlation among decision paths. Therefore,
representing ‘U’ explicitly simplifies the probability calculation
significantly.
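The numerical effect of the missing ‘U’ edge can be reproduced with a small Python sketch. It assumes, for illustration only, that the probability of a satisfying path is the product of its edge weights and that the output probability is the sum over satisfying paths; temporal correlation (Section 5.6.3) is ignored, which is harmless for the single path V0 → V3 → T. The function name `stabilization_probability` and the path lists are hypothetical, and only the 0%, 100% and 5% weights are taken from the example above.

```python
from math import prod


def stabilization_probability(satisfying_paths):
    """Sum, over all satisfying paths, of the product of the edge weights on the path."""
    return sum(prod(edge_weights) for edge_weights in satisfying_paths)


# tBDD at t = 3ns: every satisfying path leaves the root V0 through its
# one-edge or zero-edge, both annotated with 0%. The second weight on each
# path is a hypothetical placeholder; it cannot change the result.
tbdd_paths = [
    [0.0, 0.05],  # V0 --1--> ... --> T
    [0.0, 0.05],  # V0 --0--> ... --> T
]

# tTDD at t = 3ns: same paths plus V0 --U--> V3 --1--> T (100% and 5%).
ttdd_paths = tbdd_paths + [[1.0, 0.05]]

print("tBDD:", stabilization_probability(tbdd_paths))  # 0.0
print("tTDD:", stabilization_probability(ttdd_paths))  # 0.05
```

Whatever weights sit further down the tBDD paths, the zero weights at the root force the sum to zero, whereas the explicit ‘U’ edge contributes the expected 5%.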
5.6.3 Solving the temporal correlation problem
With the probability-annotated tTDD, we can calculate the probability of
each satisfying path. To calculate it accurately, we have to take care of
the temporal correlations along the satisfying path. Temporal correlation is
introduced by those inputs that have multiple timing paths leading to the
output n. To understand why it must be treated carefully, let us first look
at the example circuit in Figure 5.3. I use its annotated tTDD-1 at
t = 5ns (Figure 5.6(b)) as an example to explain the theorems in this section.
For ease of reading, it is shown again in Figure 5.8.
One satisfying path in Figure 5.8 is along ‘V0 → V2 → V3 → T’, with
V2 to V3 through the 0-edge. The corresponding satisfying assignment is
[Xa(2ns), Xb(2ns), Xb(4ns)] = [0, 0, 1]
However, careful analysis shows that this is in fact an impossible assignment.
After we have ‘Xb(2ns) = 0’, meaning b has been stabilized-to-0 at 2ns,
b cannot flip again and be stabilized-to-1 at t = 4ns
in the same clock cycle, that is, ‘Xb(4ns) = 1’. Such a physically impossible
satisfying assignment is called a false assignment.
To address the issues related to such temporal correlation, we propose two
techniques in Sections 5.6.4 and 5.6.5, respectively:
1. False assignment pruning: a technique to take care of physically
unsatisfiable stabilization conditions.
2. Random variable compaction: a technique to reduce the probability
computation into a closed form.
Figure 5.8: Example of false assignment in a tTDD
5.6.4 False assignment pruning
Given a satisfying assignment along the satisfying path Ψj contained in the
tTDD, denoted as
[Xv1(t1), Xv2(t2), . . . Xvm(tm)] = [Val1, Val2, . . . Valm]
we want to decide whether it is a false assignment. Since the temporal
correlation is due to the existence of multiple timing paths to the output from
the same input, we first group the random variables according to their inputs,
which correspond to the subscripts of the Xvi(ti)’s. Collecting all Xvi(ti)’s
according to their subscripts (i.e., the same input) is called grouping of random
variables. The grouping operation places the random variables corresponding to
the same input together into an input status group. Such a group
reflects the input stabilization status required for the same input. Recall the
concepts of phase tag and timing tag in Section 5.3. False assignments are
pruned by analyzing the phase and timing tag relations among the Xvi(ti)’s in
each input status group, as described by the following theorem.
Theorem 5.6.1. Within an input status group, if any of the following
conflicting conditions is detected, the entire satisfying assignment is declared
a false assignment.
[Xvi(t1) = 1] ∩ [Xvi(t2) = 0] (5.14)
[Xvi(t1) = 1] ∩ [Xvi(t2) = U ], with t1 ≤ t2 (5.15)
[Xvi(t1) = 0] ∩ [Xvi(t2) = U ], with t1 ≤ t2 (5.16)
Proof. Without loss of generality, we assume t1 ≤ t2 in Equation 5.14; if
t1 > t2, we can simply swap t1 and t2. Then each of the three conflicting
conditions occurs with probability zero, as stated in Lemmas 5.5.1 and 5.5.2:
1. Pr[Xvi(t1) = 1 ∩ Xvi(t2) = 0] = 0, if t1 ≤ t2
2. Pr[Xvi(t1) = 1 ∩ Xvi(t2) = U] = 0, if t1 ≤ t2
3. Pr[Xvi(t1) = 0 ∩ Xvi(t2) = U] = 0, if t1 ≤ t2
Therefore, if any of the above conflicting conditions is observed in an input
status group, the corresponding assignment has zero probability.
Moreover, I show in Section 5.7 that the above conditions are sufficient to
cover all possible false assignments.
Now consider the false assignment example in Figure 5.8. The grouping
operation results in two input status groups: a group associated with a, that
is:
[Xa(2ns)] = [0]
and a group associated with b, that is:
[Xb(2ns), Xb(4ns)] = [0, 1]
Checking conflicting rules for each group, we can find that group [Xb(2ns),
Xb(4ns)] hits the first conflicting condition check. Then the whole assignment
is declared as a false assignment and is excluded.
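The grouping and conflict check described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this work: the representation of an assignment as (input, time, phase) triples and the function names group_by_input, is_false_group and is_false_assignment are assumptions made for the example.

```python
from collections import defaultdict

def group_by_input(assignment):
    """Group (input, time, phase) triples into input status groups."""
    groups = defaultdict(list)
    for name, t, phase in assignment:
        groups[name].append((t, phase))
    return dict(groups)

def is_false_group(group):
    """Return True if the group hits a conflicting condition of Theorem 5.6.1."""
    stable = [(t, p) for t, p in group if p in ("0", "1")]
    unstable_times = [t for t, p in group if p == "U"]
    # Eq. 5.14: stabilized-to-1 and stabilized-to-0 in the same clock cycle.
    if {p for _, p in stable} == {"0", "1"}:
        return True
    # Eqs. 5.15 and 5.16: stabilized at t1 but still unstable at some t2 >= t1.
    return any(t1 <= t2 for t1, _ in stable for t2 in unstable_times)

def is_false_assignment(assignment):
    """An assignment is false if any of its input status groups conflicts."""
    return any(is_false_group(g) for g in group_by_input(assignment).values())

# Figure 5.8 example: [Xa(2ns), Xb(2ns), Xb(4ns)] = [0, 0, 1].
example = [("a", 2, "0"), ("b", 2, "0"), ("b", 4, "1")]
print(is_false_assignment(example))  # True
```

The group for input b contains both phase 0 and phase 1, so the check for Equation 5.14 fires and the whole assignment is rejected.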
After pruning away false assignments, the probabilities of the remaining
true satisfying assignments will be calculated with random variable com-
paction, as described next.
5.6.5 Random variable compaction
After the false assignments are pruned, we need to calculate the probability
Pr[Ψj] of each remaining satisfying path and then sum them up. This section
provides a way to calculate an individual Pr[Ψj]. The overall procedure is as
follows. We first apply grouping. We then work internally in each input
status group and calculate its probability with a technique called random
variable compaction, which consists of two steps: unifying the timing tags
and unifying the phase tags. Finally, we work externally across all groups in
Ψj to calculate Pr[Ψj]. The details are introduced next.
Within each group, there can be multiple random variables with different
timing tags and phase tags. To calculate the probability of each input status
group, we apply the following two rules in order:
1. Unify the timing tags in a group. For each phase tag, if there are
multiple random variables with different timing tags in the group, we
compact them into a single random variable with a unified timing tag
according to Theorem 5.6.2.
2. Unify the phase tags in a group. If any group still has more than one
random variable, then we compact the random variables with different
phase tags into a closed form according to Theorem 5.6.3.
Theorem 5.6.2. The joint probability of random events can be calculated as
follows:
1. If random variables have common phase tag 1, their joint probability is
the probability of the random event with the smallest timing tag.
Pr[Xvi(t1) = 1 ∩Xvi(t2) = 1 · · · ∩Xvi(tm) = 1]
=Pr[Xvi(min(t1, t2 . . . tm)) = 1] (5.17)
2. If random variables have common phase tag 0, their joint probability is
the probability of the random event with the smallest timing tag.
Pr[Xvi(t1) = 0 ∩Xvi(t2) = 0 · · · ∩Xvi(tm) = 0]
=Pr[Xvi(min(t1, t2 . . . tm)) = 0] (5.18)
3. If random variables have common phase tag U , their joint probability
is the probability of the random event with the largest timing tag.
Pr[Xvi(t1) = U ∩Xvi(t2) = U · · · ∩Xvi(tm) = U ]
=Pr[Xvi(max(t1, t2 . . . tm)) = U ] (5.19)
Proof. I prove Theorem 5.6.2 for the first case in 5.17 and the other two cases
can be proven similarly. Assume the timing tags from t1 to tm are ordered
s.t.
t1 ≤ t2 ≤ · · · ≤ tm
then we can write it conditionally as:
Pr[Xvi(t1) = 1 ∩Xvi(t2) = 1 · · · ∩Xvi(tm) = 1]
=Pr[Xvi(t2) = 1 · · · ∩Xvi(tm) = 1|Xvi(t1) = 1] ∗ Pr[Xvi(t1) = 1]
=1 ∗ Pr[Xvi(t1) = 1]
The last step is by definition: if vi is stabilized-to-1 at t1, then it will not
switch and is still stabilized-to-1 later at t2, t3, . . . tm in the same clock cycle.
Note that after this step, it is guaranteed that for any phase tag, there is
at most one random variable associated with it in a group.
For example in Figure 5.9, one satisfying path is:
V 0→ V 1→ V 3→ T
with V1 going to V3 through the 1-edge. The corresponding satisfying
assignment is:
[Xa(2ns), Xb(2ns), Xb(4ns)] = [1, 1, 1]
After grouping, input b’s status is:
[Xb(2ns), Xb(4ns)] = [1, 1]
According to Theorem 5.6.2:
Pr[Xb(2ns) = 1 ∩Xb(4ns) = 1]
=Pr[Xb(min(2ns, 4ns)) = 1]
=Pr[Xb(2ns) = 1]
=5%
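As a minimal sketch of the timing-tag unification in Theorem 5.6.2, the following Python function keeps one timing tag per phase. The group representation as (time, phase) pairs and the name unify_timing_tags are assumptions made for illustration only.

```python
def unify_timing_tags(group):
    """Compact a group of (time, phase) pairs to one timing tag per phase.

    Phases "0" and "1" keep the smallest timing tag (Eqs. 5.17 and 5.18);
    phase "U" keeps the largest timing tag (Eq. 5.19).
    """
    unified = {}
    for t, phase in group:
        if phase not in unified:
            unified[phase] = t
        elif phase == "U":
            unified[phase] = max(unified[phase], t)
        else:
            unified[phase] = min(unified[phase], t)
    return unified

# Group of input b in the Figure 5.9 example: [Xb(2ns), Xb(4ns)] = [1, 1].
print(unify_timing_tags([(2, "1"), (4, "1")]))  # {'1': 2}
```

The result says that the group probability equals Pr[Xb(2ns) = 1], which is the 5% obtained above.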
Figure 5.9: An example of unifying timing tags in a tTDD (tTDD-1: T(n=1, 5ns), evaluated at t = 5ns with stabilization probabilities annotated, shown with the sub-circuit after partitioning)
After the timing tags in each group are unified using the technique of
‘unifying the timing tags’, if any group still has more than one random
variable, then we compact the random variables with different phase tags into
a closed form according to Theorem 5.6.3.
Theorem 5.6.3. The joint probability of a group of random variables that
have different phase tags can be calculated according to the following rules:
1. If one has phase tag U and the other has phase tag 1:
Pr[Xvi(t1) = U∩Xvi(t2) = 1] = Pr[Xvi(t2) = 1]−Pr[Xvi(t1) = 1], if t1 < t2
2. If one has phase tag U and the other has phase tag 0:
Pr[Xvi(t1) = U∩Xvi(t2) = 0] = Pr[Xvi(t2) = 0]−Pr[Xvi(t1) = 0], if t1 < t2
3. If one has phase tag 1 and the other has phase tag 0:
Pr[Xvi(t1) = 1 ∩Xvi(t2) = 0] = 0
Proof. I prove the first case; the second case can be proven in the same
way. For t1 < t2, since Xvi(t1) takes exactly one of the values 0, 1 and U,
we have:
Pr[Xvi(t2) = 1 ∩ Xvi(t1) = U]
= Pr[Xvi(t2) = 1] − Pr[Xvi(t2) = 1 ∩ Xvi(t1) = 0]
− Pr[Xvi(t2) = 1 ∩ Xvi(t1) = 1]
= Pr[Xvi(t2) = 1] − 0 − Pr[Xvi(t2) = 1 ∩ Xvi(t1) = 1]; by Lemma 5.5.1
= Pr[Xvi(t2) = 1] − Pr[Xvi(t1) = 1]; by Theorem 5.6.2
The last case follows from Lemma 5.5.1.
For example in Figure 5.10, one satisfying path is
V 0→ V 2→ V 3→ T
with V 2 to V 3 through the U-edge. The corresponding satisfying assignment
is:
[Xa(2ns), Xb(2ns), Xb(4ns)] = [0, U, 1]
After grouping, input b’s status group is
[Xb(2ns), Xb(4ns)] = [U, 1]
According to the first case in Theorem 5.6.3:
Pr[Xb(2ns) = U ∩Xb(4ns) = 1]
=Pr[Xb(4ns) = 1]− Pr[Xb(2ns) = 1]
=50%− 5% = 45%
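The phase-tag unification can be sketched as follows, assuming the timing tags of the group have already been unified so that each phase carries a single timing tag. The dictionary layout, the lookup function pr and the name unify_phase_tags are hypothetical and only serve this example; exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction

def unify_phase_tags(unified, pr):
    """Closed-form probability of one input status group.

    unified: dict mapping each phase ("0", "1", "U") to its single timing tag.
    pr:      hypothetical lookup, pr(t, phase) -> Pr[X(t) = phase].
    """
    if len(unified) == 1:
        (phase, t), = unified.items()
        return pr(t, phase)
    if "0" in unified and "1" in unified:
        return Fraction(0)                      # case 3 of Theorem 5.6.3
    stable = "1" if "1" in unified else "0"
    t_u, t_s = unified["U"], unified[stable]
    if t_s <= t_u:
        return Fraction(0)                      # conflicts of Eqs. 5.15 / 5.16
    return pr(t_s, stable) - pr(t_u, stable)    # cases 1 and 2

# Figure 5.10 example: Pr[Xb(2ns) = 1] = 5% and Pr[Xb(4ns) = 1] = 50%.
table = {(2, "1"): Fraction(5, 100), (4, "1"): Fraction(50, 100)}
pr = lambda t, phase: table[(t, phase)]
print(float(unify_phase_tags({"U": 2, "1": 4}, pr)))  # 0.45
```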
Figure 5.10: An example of unifying phase tags in a tTDD (tTDD-1: T(n=1, 5ns), evaluated at t = 5ns with stabilization probabilities annotated)
5.7 Putting it all together
We can verify that Theorems 5.6.1, 5.6.2 and 5.6.3 capture all possible
combinations of random variables that can exist along a satisfying path.
After grouping, Theorem 5.6.2 unifies the different timing tags for each phase
tag in each group. After this, a group has no more than three random
variables, each with a distinct phase tag in the set {0, 1, U}.
Then for each random variable group, there can be three cases:
1. If only one random variable exists in a group, no special processing is
needed.
2. If two random variables co-exist in a group, all their possible phase/timing
combinations are covered and processed according to rows 3-5 of
Table 5.1.
3. If all three phases still co-exist in a group, row 6 of Table 5.1 is applied
to prune away this false assignment (based on Theorem 5.6.1).
This verifies that computing probabilities according to Theorems 5.6.1, 5.6.2
and 5.6.3 is sufficient because they cover all possible timing/phase
combinations.
Table 5.1: Possible combinations of random variables in an input status
group

Phases     | Possible timing relations
0   1   U  | t1 < t2          | t1 = t2          | t1 > t2
-   •   •  | False (Eqn 5.15) | False (Eqn 5.15) | True (Thm 5.6.3)
•   -   •  | False (Eqn 5.16) | False (Eqn 5.16) | True (Thm 5.6.3)
•   •   -  | False (Eqn 5.14) | False (Eqn 5.14) | False (Eqn 5.14)
•   •   •  | False (Eqn 5.14) | False (Eqn 5.14) | False (Eqn 5.14)
Now I use Figure 5.6 to show a complete example of applying the above
theorems on a tTDD. In the tTDD, there are four decision diagram nodes,
denoted as V0, V1, V2 and V3. These nodes are listed in the first row
of Table 5.2. Each node has three possible outgoing edges, i.e., the
0-edge, the 1-edge and the U-edge. The choice among them is modeled as a
random variable Xv(t), and the three outgoing edges correspond to the random
variable’s three phases, i.e., 0, 1 and U. These random variables and their
phases are listed just below the nodes. To enumerate the satisfying
assignments, all paths that reach the true terminal are traced starting from
the root node. In this tTDD, there are seven different satisfying
assignments, labeled 1 to 7 in the table. Each row contains several dots
that mark the edges taken to form that satisfying assignment.
Table 5.2: A complete example of false assignment pruning

     | V0: Xa(2ns) | V1: Xb(2ns) | V2: Xb(2ns) | V3: Xb(4ns) |
No.  | 0   1   U   | 0   1   U   | 0   1   U   | 0   1   U   | Type
1    |     ◦       | ◦           |             |             | True
2    |     ◦       |     ◦       |             |     ◦       | True
3    |     ◦       |         ◦   |             |     ◦       | True
4    | ◦           |             |     ◦       |             | True
5    | ◦           |             | •           |     •       | False
6    |             |             |         ◦   |     ◦       | True
7    |         ◦   |             |             |     ◦       | True
Then the random variables Xv(t) are grouped according to their subscripts.
As a result, Xa(2ns) forms a group by itself, and Xb(2ns), Xb(2ns) and
Xb(4ns), associated with V1, V2 and V3, form another group.
Next, for each random variable group, the false assignment rules are checked
and those rows containing false assignments are pruned away. In this ex-
ample, assignment number 5 is detected as a false assignment using Theo-
rem 5.6.1 and is removed from probability calculation.
With all the remaining satisfying assignments, random variable compaction
based on Theorems 5.6.2 and 5.6.3 is applied to each random variable group
to reduce the joint probability of the group to an easily solvable form.
During random variable compaction, the affected terms are shown as solid
dots, and the theorems applied to them are checked on the right side, as
shown in Table 5.3.
After random variable compaction, the probability of each group is repre-
sented in a closed form and can be evaluated to a real number. The joint
probability of different groups can then be expressed as the product of the
probability of each individual group, similar to most previous work [44,55,56].
To maintain good accuracy, the inputs associated with each group need to
be kept as independent as possible. This is accomplished by a novel circuit
partitioning algorithm described next.
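The combination step can be sketched as below: the probability of one satisfying path is the product of its group probabilities, and the probabilities of the remaining satisfying paths are summed. The function names and the numeric values are hypothetical and chosen only for illustration; the product assumes the groups are independent, as stated above.

```python
from math import prod

def path_probability(group_probabilities):
    """Pr[Ψj] as the product of its input status group probabilities.

    Treats the groups as independent, which is the approximation stated above.
    """
    return prod(group_probabilities)

def satisfying_probability(paths):
    """Sum Pr[Ψj] over the true satisfying paths that survive pruning."""
    return sum(path_probability(groups) for groups in paths)

# Hypothetical numbers: one path with two groups, another with a single group.
paths = [[0.5, 0.45], [0.1]]
total = satisfying_probability(paths)
```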
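Table 5.3: A complete example of random variable compaction

     | V0: Xa(2ns) | V1: Xb(2ns) | V2: Xb(2ns) | V3: Xb(4ns) |
No.  | 0   1   U   | 0   1   U   | 0   1   U   | 0   1   U   | Timing | Phase
1    |     ◦       | ◦           |             |             |        |
2    |     ◦       |     •       |             |     •       |   √    |
3    |     ◦       |         •   |             |     •       |        |   √
4    | ◦           |             |     •       |             |        |
5    | ◦           |             | ◦           |     ◦       |        |
6    |             |             |         •   |     •       |        |   √
7    |         ◦   |             |             |     ◦       |        |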
5.8 Partitioning for tTDD
The goal of circuit partitioning is to deal with the circuit scalability issue
and enable the tTDD-based probability calculation to handle large circuits
in reasonable runtime. Given that the worst-case complexity of a TDD is
O(3^n/n) [51], we desire to limit the size of the TDD constructed from the
partitioned sub-circuit. Meanwhile, the partitioning algorithm should also
produce sub-circuits that are minimally affected by the structural correlation
introduced by re-convergent nets in the circuit.
A number of switching power estimation and testability analysis works
have attempted to address the structural correlation issue introduced by re-
convergent nets during signal probability calculation. Proposed methods to
preserve signal dependency during signal probability computation include the
use of Bayesian networks [57], conditional independence [58], and Boolean
approximation [59]. However, these are all limited to a zero-delay model.
Other proposed approaches to reduce the signal correlation include limited
depth re-convergent path analysis [60] and a method to find the independent
terms in the probability polynomial at super-gate by leveraging the concept of
graph dominator [56]. Unfortunately, these methods are only applicable when
the signal probability is represented by polynomials. When a BDD is used
to compute the signal probability, a partitioning heuristic was proposed [61]
to minimize the number of nodes shared between the partitioned sub-circuit
and the remaining circuit. I denote this method as Kapoor’s heuristic.
5.8.1 Partitioning preliminary
Kapoor’s heuristic minimizes the structural correlation of a partition’s
inputs and controls the size of a BDD by enforcing that each partition has
no more than K inputs. Since timed inputs are introduced here, Kapoor’s
heuristic also needs to be extended to control the number of timed inputs.
Definition 8. A K-feasible fan-in cone of a node n is a sub-circuit rooted
at n with a timed support set of cardinality no more than K.
In other words, a K-feasible fan-in cone of n has at most K unique timing
paths from the inputs to the output n. As an example, we know the fan-in
cone of n4 in Figure 5.11(a) before partitioning is 6-feasible but not 5-feasible,
because:
tSup(n4, t) = {a(t− 3), b(t− 4), c(t− 4), b(t− 3), c(t− 3), d(t− 2)}
with a cardinality of 6. For a non-K-feasible fan-in cone, cut points need to
be chosen to make it K-feasible.
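The K-feasibility test of Definition 8 amounts to counting the distinct timed inputs in the timed support set. A minimal sketch, assuming each timed input v(t − d) is written as a hypothetical (input, delay) pair:

```python
def is_k_feasible(timed_support, k):
    """A fan-in cone is K-feasible if its timed support set has at most K elements."""
    return len(set(timed_support)) <= k

# tSup(n4, t) from Figure 5.11(a), with v(t - d) written as (v, d).
tsup_n4 = [("a", 3), ("b", 4), ("c", 4), ("b", 3), ("c", 3), ("d", 2)]
print(is_k_feasible(tsup_n4, 6))  # True
print(is_k_feasible(tsup_n4, 5))  # False
```

Note that b(t − 4) and b(t − 3) count as two different timed inputs even though they come from the same input b.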
Definition 9. An internal node m in the fan-in cone rooted at n dominates
a timed input v(t) in n’s timed support set if all timing paths associated with
v(t) go through m to reach n.
Definition 10. The dominator factor D(m) of an internal node m w.r.t.
the output n is the total number of timed inputs that are dominated by m.
These definitions are directly extended from the concept of a graph domina-
tor. In Figure 5.11(a) w.r.t. the output g4, D(n1) = 4, because n1 dominates
the following timed inputs:
{b(t− 4), c(t− 4), b(t− 3), c(t− 3)}
And D(n2) = 3 because n2 dominates the following timed inputs:
{a(t− 3), b(t− 4), c(t− 4)}
Definition 11. A timed input v(t) leaks from m in the fan-in cone rooted
at n if m dominates some v(t′) with t′ ≠ t but does not dominate v(t).
In Figure 5.11(a), timed variable b(t− 4) is dominated by n2 because the
only timing path associated with it goes through n2, as shown with a solid
path from b to n4. Similarly, c(t− 4) is also dominated by n2. On the other
hand, b(t−3), c(t−3) leak from n2 because the timing paths associated with
them do not go through n2, shown with the dashed path.
Definition 12. The leaking factor L(m) of an internal node m w.r.t. the
output n is the total number of timed inputs that leak from m.
For example, no timed input leaks from n1 and two leak from n2. So
L(n1) = 0 and L(n2) = 2.
5.8.2 Calculate the dominator and leaking factors
The dominator factor and leaking factor can be calculated based on
depth-first search (DFS). First, let the node n under consideration inherit
its immediate predecessors’ K-feasible fan-in cones. By merging its predecessors’
K-feasible fan-in cones, a temporary fan-in cone is formed, called the
influential cone of n. To calculate the dominator and leaking factors of a
node m (a node in n’s influential cone) w.r.t. the cone root n, that is, D(m)
and L(m), we backtrace n’s fan-in cone with DFS, limiting the search to the
influential cone and stopping at its boundary.
Each time a boundary node is reached, we add one instance of it to a list R.
Note that if the boundary node is already in R, it is added again. Moreover,
if the boundary node is reached through a back-tracing path that does NOT
pass through node m, we also add it to a list Q. Because R and Q are lists
rather than sets, duplicate instances can exist in the same list. After the
DFS is finished, we set
D := R \ Q
by removing the instances appearing in Q from R. The dominator factor is then
D(m) = |D| = |R \ Q|
and the leaking factor is
L(m) = |Q ∩ D|

Figure 5.11: Partitioning examples of a small circuit: (a) partition at n2; (b) partition at n1
Next, I show this procedure using the example in Figure 5.11(a). The
influential cone of n4 is bounded by a, b, c, d. To calculate D(n2) and L(n2)
w.r.t. node n4, we backtrace from n4 until reaching the boundary of the
influential cone. We will have
R ={a, b, c, b, c, d}
Q ={b, c, d}
Letting
D =R \Q = {a, b, c}
L =Q ∩D = {b, c}
then D(n2) = |D| = 3 and L(n2) = |L| = |{b, c}| = 2.
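Since R and Q are lists with duplicates, the two factors can be sketched with multiset operations. The sketch below uses Python's collections.Counter; the function name dominator_and_leaking and its argument names are assumptions for illustration, and the DFS that produces R and Q is not shown.

```python
from collections import Counter

def dominator_and_leaking(reached, reached_not_via_m):
    """Compute (D(m), L(m)) from the DFS lists R and Q.

    reached:           list R of boundary nodes, one entry per back-tracing path.
    reached_not_via_m: list Q, the entries reached without passing through m.
    """
    r, q = Counter(reached), Counter(reached_not_via_m)
    dominated = r - q            # multiset difference R \ Q
    leaked = q & dominated       # multiset intersection Q ∩ D
    return sum(dominated.values()), sum(leaked.values())

# Node n2 in Figure 5.11(a): R = {a, b, c, b, c, d}, Q = {b, c, d}.
print(dominator_and_leaking(list("abcbcd"), list("bcd")))  # (3, 2)
```

Under the assumption that only a and d are reached without passing through n1, the same function returns (4, 0), which agrees with D(n1) = 4 and L(n1) = 0 given earlier.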
5.8.3 Partitioning cost function
Kapoor’s heuristic has a limitation that may affect the quality of the
partitioning. If a node is not K-feasible, Kapoor’s heuristic limits the
search for cut points to the node’s immediate fan-ins. For n4 in our example,
only n2 and n3 will be investigated. Node n2 is then cut to form a new
partition, as marked with the check sign in Figure 5.11(a), resulting in a
5-feasible fan-in cone of n4 with
tSup(n4, t) = {n2(t − 1), b(t − 3), c(t − 3), d(t − 2)}.
Note that b and c are still correlated with n2 because they are transitive
fan-ins of n2. This structural correlation may introduce inaccuracy in the
tTDD probability calculation.
The proposed partitioning algorithm extends the search for potential
cut points to all nodes in n’s current non-K-feasible fan-in cone and assigns
a cost to each potential cut point m according to:
Cost(m) = e^(α ln(L(m) + 0.001) + β ln(F(m)) − γ ln(D(m))) (5.20)
where D(m) and L(m) are the dominator and leaking factors of node m,
respectively, and the new term F(m) is the fan-out degree of m. Its effect
will be explained shortly. α, β and γ are positive coefficients used to
weight each term.
For example, using {α = 0.5, β = 2, γ = 1} for Figure 5.11(b) gives
cost(n1) =0.3
cost(n2) =0.72
cost(n3) =0.72
So n1 is chosen as the partitioning point, as shown with the check mark
in Figure 5.11(b). The new 5-feasible timed support set of n4 is
tSup(n4, t) = {a(t− 3), n1(t− 3), n1(t− 2), d(t− 2)}.
This is actually the optimal partition to minimize structural correlation due
to reconvergent net from n1. Note that the two temporally correlated terms,
n1(t− 3) and n1(t− 2) introduced by cutting at n1, will be taken care of by
theorems in Section 5.6.
The intuition behind this cost function includes:
1. The “α ln(L(m) + 0.001)” part of the exponent favors nodes that have
fewer timed inputs leaking from m, because we do not want those leaked
inputs to introduce new structural correlation for their descendant
nodes.
2. The “−γ ln(D(m))” part biases our choice toward nodes that
dominate more timed inputs, as suggested in [56].
3. The “β ln(F(m))” part of the exponent penalizes cutting at nodes that
have a large number of fan-outs. We observed a fact that is also
observed in [60]: the more fan-outs a node has, the more
structural correlation it may introduce for its descendant nodes.
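The three effects listed above can be seen in a direct transcription of Equation 5.20. The sketch below is illustrative only: the function name cut_cost and the factor values used in the comparisons are hypothetical, and the default coefficients simply reuse the example setting {α = 0.5, β = 2, γ = 1}. It assumes F(m) ≥ 1 and D(m) ≥ 1 so that the logarithms are defined.

```python
import math

def cut_cost(leaking, fanout, dominator, alpha=0.5, beta=2.0, gamma=1.0):
    """Cost(m) from Equation 5.20 for a candidate cut point m.

    The 0.001 offset keeps the logarithm defined when L(m) = 0.
    """
    exponent = (alpha * math.log(leaking + 0.001)
                + beta * math.log(fanout)
                - gamma * math.log(dominator))
    return math.exp(exponent)

# Hypothetical candidates with equal fan-out: fewer leaks and more dominated
# timed inputs give the lower cost.
low = cut_cost(leaking=0, fanout=2, dominator=4)
high = cut_cost(leaking=2, fanout=2, dominator=3)
assert low < high
```

Because the exponent is a weighted sum of logarithms, the cost is equivalent to (L(m) + 0.001)^α · F(m)^β / D(m)^γ, and the candidate with the smallest cost would be selected as the cut point.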
The effects of the first two terms, L(m) and D(m), on partitioning have
already been discussed in Section 5.8.1. The third term F(m) is also used to
reduce structural correlation. Take the sub-circuit in Figure 5.12 as an
example: suppose a high-fan-out node n1 is chosen by one of n8’s predecessors
as a cut point. Suppose the light area and the dark area are the influential
cones of n8 before and after partitioning, respectively. The newly formed
partition boundary consists of {n5, n2, n3, n4}. The unwanted effect of
choosing a high-fan-out node such as n1 as a partition boundary is that the
nodes on the newly formed partition boundary, i.e., {n2, n3, n4}, may be
highly correlated because they are all descendants of the high-fan-out node
n1. To remedy this problem, the proposed cost function prefers to keep
high-fan-out nodes within a partition rather than using them as the partition
boundary.
5.8.4 Overall partitioning algorithm
The overall partitioning in Algorithm 3 is performed as follows: first a PI’s
timed support set is initialized to itself. Then the nodes are processed from
PI to PO in a topological order. During processing, a node n will first inherit
all timed support sets from its immediate fan-in nodes. If the influential fan-
in cone of n is not K-feasible, partitioning points are chosen as described in
Figure 5.12: Problem of high-fan-out node on the partition boundary
Section 5.8.3 until the fan-in cone becomes K-feasible. Since the complexity
of computing the cost function is bounded by the specified constant K, the
overall partitioning algorithm runs very fast, as shown in the experimental
results.
Notice that this algorithm naturally allows nodes' influential cones to
overlap when necessary. For example, to partition the circuit in Figure 5.13,
the algorithm processes each internal node in topological order, as listed
in Table 5.4. The origins of timing paths before and after partitioning are
denoted as the sets O_before and O_after, respectively. This allows the
influential fan-in cones of g8 and g5 to overlap. With this capability, the
stabilization probabilities of both g5 and g8 can be calculated optimally:
because the reconvergent nets rooted at g3 can be totally contained in both
g5's and g8's influential cones, the structural correlation is minimized.
If overlapping were disallowed, either g5 or g8 could be calculated
optimally, but not both.
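The procedure can be sketched in a simplified, untimed form as follows. This
is only an illustration under stated assumptions, not the C implementation
used in the experiments: time offsets on the support sets are dropped, the
helper names and the coefficient defaults are hypothetical, the fallback of
cutting at all immediate fan-ins is an addition of the sketch, and the
fan-in lists of the example circuit are inferred from the Inherit column of
Table 5.4.

```python
import math

def cost(leak, fanout, dominated, alpha=1.0, beta=0.5, gamma=1.0):
    """Cost(m) = exp(alpha*ln(L+0.001) + beta*ln(F) - gamma*ln(D))."""
    return math.exp(alpha * math.log(leak + 0.001)
                    + beta * math.log(fanout) - gamma * math.log(dominated))

def split_inputs(node, stop, m, fanins, seen_m=False):
    """Boundary inputs below `node`: (reached through m, reached avoiding m)."""
    seen_m = seen_m or node == m
    if node in stop or node not in fanins:       # inherited boundary or PI
        return ({node}, set()) if seen_m else (set(), {node})
    via, other = set(), set()
    for f in fanins[node]:
        v, o = split_inputs(f, stop, m, fanins, seen_m)
        via |= v
        other |= o
    return via, other

def gates_in_cone(node, stop, fanins, acc):
    """Collect every gate visited below `node` as a candidate cut point."""
    if node in fanins:
        acc.add(node)
        if node not in stop:
            for f in fanins[node]:
                gates_in_cone(f, stop, fanins, acc)
    return acc

def partition(fanins, topo_order, K):
    """Return a K-feasible partition boundary for every gate."""
    fanout = {}
    for ins in fanins.values():
        for i in ins:
            fanout[i] = fanout.get(i, 0) + 1
    support = {}
    for n in topo_order:
        preds = fanins[n]
        # Each predecessor contributes its own boundary, so cones may overlap.
        stops = {p: set(support.get(p, {p})) for p in preds}

        def split(m):
            via, other = set(), set()
            for p in preds:
                v, o = split_inputs(p, stops[p], m, fanins)
                via |= v
                other |= o
            return via, other

        boundary = split(None)[1]
        while len(boundary) > K:
            cands = set()
            for p in preds:
                gates_in_cone(p, stops[p], fanins, cands)
            best = None
            for m in sorted(cands):
                via, other = split(m)
                new_boundary = other | {m}
                if len(new_boundary) >= len(boundary):
                    continue                     # this cut does not shrink the cone
                c = cost(len(via & other), fanout[m], len(via))
                if best is None or c < best[0]:
                    best = (c, m, new_boundary)
            if best is None:                     # fallback: cut at immediate fan-ins
                boundary = set(preds)
                break
            _, m, boundary = best
            for p in preds:
                stops[p].add(m)
        support[n] = boundary
    return support

FANINS = {"g1": ["n1", "n2"], "g2": ["n3", "n4"], "g3": ["g1", "g2"],
          "g4": ["n6", "n7"], "g5": ["g3", "g4"], "g6": ["g3"],
          "g7": ["n5", "g6"], "g8": ["g5", "g7"]}
ORDER = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
support = partition(FANINS, ORDER, K=5)
```

With these hypothetical coefficients the sketch cuts g5's cone at g3 and
leaves g6 with the uncut boundary {n1, n2, n3, n4}, which illustrates how
two cones can share g3 without forcing a single global partition. Because
the time offsets are ignored, the cut it selects for g8 need not match the
one listed in Table 5.4.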
5.9 Experimental results
The proposed algorithm (tTDD+partition) is implemented in C on top of
SIS [46] to analyze the dynamic behavior of digital circuits. The benchmark
circuits are first compiled into Verilog netlists using Synopsys Design Com-
piler (DC) ver. 2007-SP3 with a subset of TSMC 65nm cells. These netlists
are then fed into our tool to derive behavior curves for each node. The POs’
Algorithm 3 Circuit partitioning for tTDD
Input: K : size-controlling constant
       Netlist : the netlist of the circuit
Output: *partition_boundary : partition boundary array, one entry per node
topo_order = topology_sort(Netlist);
n = head(topo_order);
while n ≠ end(topo_order) do
    p = predecessor(n); {n's immediate predecessors}
    tSup = inherit_support(p); {inherit K-feasible supports}
    while |tSup| > K do
        for all m ∈ infcone do
            cost[m] = cost_func(m);
        end for
        cutpoint = argmin(cost[m]); {pick lowest-cost cut point}
        infcone = update_cone(infcone, cutpoint);
        tSup = update_support(tSup, cutpoint);
    end while
    partition_boundary[n] = tSup;
    n = next(topo_order);
end while
Figure 5.13: An example of partitioning procedure
Table 5.4: Partition procedure sequence (K=5)
Order  Node  Inherit  O_before                          Cut point  O_after
1      g1    n1, n2   {n1, n2}
2      g2    n3, n4   {n3, n4}
3      g3    g1, g2   {n1, n2, n3, n4}
4      g4    n6, n7   {n6, n7}
5      g5    g3, g4   {n1, n2, n3, n4, n6, n7}          g3         {g3, n6, n7}
6      g6    g3       {n1, n2, n3, n4}
7      g7    n5, g6   {n5, n1, n2, n3, n4}
8      g8    g5, g7   {g3, n6, n7, n5, n1, n2, n3, n4}  g3         {n5, g3, n6, n7}
behavior curves are compared point by point with the golden results
generated from simulation. The details of the timing model and the
collection of golden results are described as follows:
Timing model: To enable comparison with DynaTune, which only supports a
simple timing model, we first ran experiments for both DynaTune and
tTDD+partition with the same simple timing model, in which cells of the
same type have the same delay. The golden results for this test group also
come from simulation with the simple timing model.
We also ran experiments for tTDD+partition with a detailed timing model, in
which the timing information is extracted directly from Design Compiler's
timing engine. We use DC's “write_sdf” command to extract the timing in
Standard Delay Format (SDF). This SDF file is then converted into a delay
table to be used in our tool.
Timed simulation: Cadence NCSim 5.7 is used as the simulator. First, the
SDF file extracted from DC is back-annotated to the netlist. Input vectors
are then generated according to the specified probabilities. The timed
simulation is carried out in floating mode, as suggested in [45], under
which the internal gates and nets take the value “X” before a single input
vector is applied to settle their values.
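Input vector generation according to a specified static probability can be
sketched as follows. This is a minimal illustration, assuming that each PI
is drawn independently and that all PIs share the same probability of being
logic 1; the function name, the seed and the vector counts are hypothetical.

```python
import random

def random_vectors(num_inputs, num_vectors, p_one=0.5, seed=0):
    """Generate input vectors in which each PI is 1 with probability p_one."""
    rng = random.Random(seed)  # fixed seed keeps the stimulus reproducible
    return [[1 if rng.random() < p_one else 0 for _ in range(num_inputs)]
            for _ in range(num_vectors)]

# Example: 2000 vectors for an 8-input circuit, each PI being 1 with p = 0.3.
vectors = random_vectors(num_inputs=8, num_vectors=2000, p_one=0.3, seed=1)
ones_fraction = sum(map(sum, vectors)) / (8 * 2000)
```

Each vector would then be applied in its own simulation cycle, starting from
the floating state described above.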
According to NCSim's user manual, its logic simulation algorithm is event
driven: each logic value change on a net is scheduled as an event for a
future time point. By default, the simulator favors simulation speed, so
some logic value change events may not be scheduled precisely. To enable
logic simulation with precise timing, we need to use the “pathdelay
enhanced” declaration in our library cells' simulation model [62]. This
option makes the simulation run slower but produces the most accurate
dynamic timing results.
Golden result: To determine the POs' dynamic activities during the
simulation, a Verilog Procedural Interface (VPI) program is developed to
record the timestamp of each PO's last switching activity in every cycle.
After one million cycles of simulation, these timestamps are collected to
form each PO's dynamic behavior curve.
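The step from recorded timestamps to a behavior curve can be sketched as
follows. The sketch assumes that the dynamic behavior curve is the fraction
of simulated cycles in which the PO has already made its last transition by
time t; the function name and the sample timestamps are hypothetical.

```python
from bisect import bisect_right

def behavior_curve(last_switch_times, time_points):
    """Fraction of cycles whose last PO transition occurred at or before t."""
    ordered = sorted(last_switch_times)
    total = len(ordered)
    return [bisect_right(ordered, t) / total for t in time_points]

# Hypothetical last-switch timestamps (in ps) from four simulated cycles.
timestamps = [100, 200, 200, 450]
curve = behavior_curve(timestamps, [0, 100, 200, 450, 700])
# curve == [0.0, 0.25, 0.75, 1.0, 1.0]
```

Sorting once and using a binary search keeps the evaluation cheap even when
one million timestamps are collected per PO.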
Figure 5.14 shows an example of the output of our proposed algorithm for
the carry-out bit of a 32-bit carry-look-ahead (CLA) adder. The
intermediate stabilized-to-0 and stabilized-to-1 dynamic curves computed by
our proposed approach for this carry-out bit are drawn as dashed curves,
and its overall dynamic behavior curve is drawn as a solid curve. The
golden simulation results are shown as dots. The calculated dynamic
behavior curve closely matches the simulation results.
It is also interesting to notice that although this carry-out bit's true
critical path has a length of 700ps, its stabilized-to-1 and
stabilized-to-0 curves converge very early, at 450ps. In other words, this
carry-out bit stabilizes well before 700ps with a probability close to
100%. This result confirms similar observations made in [1].
5.9.1 Experiments on ISCAS’85 circuits
ISCAS’85 benchmarks are widely used to study the performance of BDD-based
applications because these benchmark circuits tend to have a large number
of reconvergent nets. In this section, I show experimental results
collected for this set of benchmark circuits.
(1) Accuracy comparison:
Dynamic behavior curves derived from the tTDD algorithm (K=6) and the
original DynaTune algorithm [18] are both compared with the golden behavior
curves generated from simulation. To quantify the accuracy, we use the mean
absolute error (MAE) and root mean square error (RMSE) measures. Only the
data points between the earliest AT and the latest AT on these dynamic
behavior curves are compared, because the data points beyond this range are
Figure 5.14: Dynamic behavior of a 32-bit CLA adder’s carryout bit
trivial.
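The two error measures can be sketched as follows, restricted to the window
between the earliest and latest AT as described above. The function name
and the sample curves are hypothetical, and the sketch assumes that both
curves are sampled at the same time points.

```python
import math

def curve_errors(times, predicted, golden, earliest_at, latest_at):
    """Return (MAE, RMSE) over the samples with earliest_at <= t <= latest_at."""
    diffs = [p - g for t, p, g in zip(times, predicted, golden)
             if earliest_at <= t <= latest_at]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse

# Hypothetical curves sampled every 100ps; only 100ps..300ps is compared.
times = [0, 100, 200, 300, 400]
predicted = [0.0, 0.2, 0.5, 1.0, 1.0]
golden = [0.0, 0.1, 0.6, 1.0, 1.0]
mae, rmse = curve_errors(times, predicted, golden, earliest_at=100, latest_at=300)
```

Because RMSE squares each deviation before averaging, it is never smaller
than MAE and weighs isolated large deviations more heavily, which is
consistent with the RMSE columns being larger than the MAE columns in
Table 5.5.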
Due to the limitation of DynaTune, the first two sets of data, DynaTune
(simple) and tTDD+partition (simple), are compared with simulations using a
simple timing model. The last set of data, tTDD+partition (detailed), is
compared with simulations using a more realistic timing model extracted
directly from the SDF files, denoted as “detailed”. From Table 5.5, we can
see that with the simple timing model, the accuracy of both DynaTune and
tTDD is very high. tTDD with the detailed timing model still maintains very
good accuracy, with an MAE of 2.2% and an RMSE of 4.8%. Also note that
three circuits cannot be finished by DynaTune due to its scalability issue.
(2) Runtime comparison
The experiments are run on a 2.8GHz Intel Xeon processor with 4GB of
memory. The runtimes (in seconds) for different configurations are shown in
Table 5.6. With the detailed timing model, NCSim takes thousands of seconds
to finish one million cycles of simulation. DynaTune may fail due to its
scalability issue. tTDD+partition finishes the computation in less than 2
minutes for every circuit, with the partitioning algorithm taking an
insignificant portion of the total runtime. Compared to DynaTune with a
simple timing
Table 5.5: Dynamic behavior curve accuracy comparison
Config         DynaTune        tTDD+partition
Timing model   Simple          Simple          Detailed
               MAE    RMSE     MAE    RMSE     MAE    RMSE
C17            0%     0%       0%     0%       0%     0%
C432           1.7%   3.6%     1.7%   3.7%     2.8%   5.8%
C499           1.4%   3.3%     1.4%   3.5%     1.3%   3.9%
C880           0.6%   2.3%     0.8%   2.4%     1.0%   2.0%
C1355          2.1%   4.9%     2.3%   6.0%     0.8%   2.1%
C1908          1.6%   3.1%     1.7%   3.3%     1.8%   3.5%
C2670          1.5%   4.7%     1.1%   4.0%     1.0%   3.3%
C3540          -      -        6.1%   11.1%    6.3%   11.2%
C5315          2.8%   4.9%     2.6%   4.7%     3.4%   5.9%
C6288          -      -        1.9%   5.2%     3.9%   8.9%
C7552          -      -        2.2%   5.3%     2.3%   5.7%
Ave.           1.5%   3.4%     2.0%   4.5%     2.2%   4.8%
model, tTDD+partition with the detailed timing model still achieves an
average speedup of 65X.
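The reported average can be cross-checked against Table 5.6. The short
sketch below assumes that the average is the arithmetic mean of the
per-circuit entries in the Speedup column for the eight circuits that
DynaTune completes, and that each entry is the DynaTune runtime divided by
the tTDD+partition total runtime; both are readings of the table rather
than statements made in the text.

```python
# Speedup column of Table 5.6 for the circuits that DynaTune completes.
speedups = {"C17": 10, "C432": 13, "C499": 74, "C880": 12,
            "C1355": 198, "C1908": 182, "C2670": 27, "C5315": 5}
average_speedup = sum(speedups.values()) / len(speedups)

# One row recomputed from its runtimes: DynaTune (simple) / tTDD+partition total.
c432_speedup = 34.3 / 2.6

print(round(average_speedup), round(c432_speedup))
```

Under these assumptions the mean of the column is 65.125, which rounds to
the reported 65X, and the C432 ratio rounds to the tabulated 13X.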
(3) Comparison with MFFC-based partitioning
The proposed partitioning algorithm significantly improves the accuracy of
dynamic behavior analysis. To demonstrate this, the results are compared
with those of a different partitioning algorithm based on the maximum
fan-out-free cone (MFFC). A fan-out-free cone (FFC) for a node n is a set
of nodes in its transitive fan-in cone whose fan-outs are completely
contained in the FFC. The MFFC of a node n is the FFC with the maximum
number of nodes. MFFC-based partitioning is interesting because it tends to
form partitions that contain reconvergent nets within the partition. As a
result, the signal correlation due to such reconvergent nets is mitigated.
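The MFFC definition above can be illustrated with a small sketch. It is a
straightforward fixed-point computation, not the labeling algorithm of
[63]; PIs are excluded from the cone by assumption, the function name is
hypothetical, and the example netlist is the one inferred from the Inherit
column of Table 5.4.

```python
def mffc(root, fanins):
    """Gates in the maximum fan-out-free cone of `root` (PIs excluded)."""
    fanouts = {}
    for gate, ins in fanins.items():
        for i in ins:
            fanouts.setdefault(i, []).append(gate)
    cone = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        for f in fanins.get(node, []):
            # A fan-in gate joins the cone once all of its fan-outs are inside.
            if f in fanins and f not in cone and all(o in cone for o in fanouts[f]):
                cone.add(f)
                stack.append(f)
    return cone

FANINS = {"g1": ["n1", "n2"], "g2": ["n3", "n4"], "g3": ["g1", "g2"],
          "g4": ["n6", "n7"], "g5": ["g3", "g4"], "g6": ["g3"],
          "g7": ["n5", "g6"], "g8": ["g5", "g7"]}

mffc_g5 = mffc("g5", FANINS)  # {"g5", "g4"}
mffc_g7 = mffc("g7", FANINS)  # {"g7", "g6"}
mffc_g8 = mffc("g8", FANINS)  # all eight gates
```

In this example g3 drives both g5 and g6, so it belongs to neither MFFC(g5)
nor MFFC(g7) and becomes a partition boundary for both, whereas the
overlapping cones of Section 5.8.4 can each keep the logic behind g3.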
We implemented an MFFC-based partitioning algorithm to compare with the
proposed tTDD partitioning. MFFCs are generated with a labeling algorithm
[63] and are decomposed when they are excessively large. When the proposed
tTDD partitioning algorithm is replaced with the MFFC-based algorithm, a
significant drop in dynamic behavior analysis accuracy is observed. As
shown in Table 5.7, the average error measured with MAE increases by 190%.
For some circuits, e.g., C1355, the error increases by up to 12X, because
forming MFFCs may prevent multi-fan-out nets from being reused in behavior
analysis across partitions, as tTDD partitioning allows. In Table 5.7,
Table 5.6: Runtime comparison for the benchmark circuits (in seconds)
Circuit   NCSim      DynaTune   tTDD+partition (Detailed)     Speedup
          Detailed   Simple     Partition   tTDD    Total
C17       49.7       0.1        0.0         0.0     0.0       10X
C432      614.0      34.3       0.1         2.5     2.6       13X
C499      1240.0     326.9      0.1         4.4     4.4       74X
C880      1137.4     41.9       0.1         3.4     3.4       12X
C1355     1093.0     643.1      0.0         3.2     3.3       198X
C1908     1110.0     977.6      0.1         5.3     5.4       182X
C2670     2735.0     184.1      0.2         6.7     6.8       27X
C3540     2980.0     -          1.1         75.0    76.1      N/A
C5315     4622.0     84.2       0.5         16.0    16.5      5X
C6288     7138.2     -          1.0         101.3   102.3     N/A
C7552     5651.6     -          0.6         20.1    20.7      N/A
Ave.                                                          65X
we use ‘k’ to denote the number of timing paths contained in a partition.
The average ‘k’ and max ‘k’ values are shown for both partitioning
algorithms. From the results, we notice that tTDD partitioning can maintain
accuracy with a much smaller max ‘k’ value, which helps to achieve a large
speed-up.
Table 5.7: Accuracy compared with MFFC-based partitioning
Circuit   k (tTDD)     k (MFFC)     MAE: tTDD vs. MFFC
          Ave   Max    Ave   Max    tTDD    MFFC     Diff (%)
C17       3.7   6      3.1   6      0.0%    2.7%     N/A
C432      4.5   6      3.4   10     2.8%    3.9%     39.2
C499      4.3   6      4.1   12     1.3%    12.3%    846.2
C880      4.0   6      4.4   15     1.0%    6.1%     510.0
C1355     3.8   6      4.6   10     0.8%    11.0%    1275.0
C1908     4.4   6      4.2   12     1.8%    3.3%     83.3
C2670     4.1   6      5.0   16     1.0%    5.2%     420.0
C3540     5.7   8      3.8   15     6.3%    7.8%     23.8
C5315     4.2   6      4.6   19     3.4%    5.5%     61.8
C6288     5.1   7      2.4   9      3.9%    8.0%     105.1
C7552     4.4   6      4.9   19     2.3%    5.6%     143.5
Ave.                                2.2%    6.5%     190.2
(4) K’s effect on accuracy and runtime
Partition parameter K controls the overall quality of the experimental
results. A small K value produces small partitions but limits the amount of
correlation that can be considered in a partition. A big K value, however,
may increase the runtime significantly. Figure 5.15 shows K’s effect on
speed and accuracy. The runtime is normalized to the time needed for
probability calculation on tTDDs for the case of K=6. The accuracy is also
normalized to the MSE with K=6. We can observe that the time for
partitioning is insignificant and linear in K, while the tTDD calculation
time shows an exponential trend. The accuracy improves slightly for larger
K. Therefore, K=6 is chosen for the majority of experiments to balance
accuracy and speed.
Figure 5.15: tTDD partition parameter K’s effect on accuracy and runtime
(5) Insensitive to input static probability
To understand how sensitive the proposed tTDD+partition approach is w.r.t.
the PIs’ initial static probabilities, we sweep the PIs’ static probability
from 10% to 90%. Table 5.8 shows the results of these experiments. We can
see that across all initial static probability settings, the proposed
approach maintains very good accuracy. Therefore, we can conclude that
tTDD+partition is not sensitive to the PIs’ initial static probabilities.
Table 5.8: Error under various input static probabilities
(MAE under different input static probabilities)
Circuit   Pr=10%   Pr=30%   Pr=50%   Pr=70%   Pr=90%
C17       0.6%     0.9%     0.0%     0.4%     0.5%
C432      2.8%     3.2%     2.8%     1.8%     4.0%
C499      1.0%     0.9%     1.3%     0.8%     0.9%
C880      0.4%     0.6%     1.0%     1.2%     0.9%
C1355     1.1%     0.7%     0.8%     0.4%     0.2%
C1908     1.4%     2.1%     1.8%     1.2%     0.5%
C2670     1.1%     1.1%     1.0%     1.1%     1.0%
C3540     2.8%     5.7%     6.3%     7.0%     6.6%
C5315     1.8%     3.3%     3.4%     2.5%     1.1%
C6288     3.6%     4.8%     3.9%     4.0%     5.0%
C7552     2.3%     2.3%     2.3%     2.3%     2.8%
Ave.      1.7%     2.3%     2.2%     2.0%     2.1%
5.9.2 Experiments on MCNC circuits
The proposed dynamic behavior analysis algorithm is also applied to the
largest MCNC benchmark circuits to evaluate how well the tTDD algorithm
applies to different types of circuits. For the MCNC benchmarks, the
runtime, speedup and accuracy under the detailed timing model are
listed in Table 5.9. For some MCNC benchmarks, constructing BDDs is
easier than for the ISCAS benchmarks, so the speed difference between
DynaTune and tTDD+partition is less drastic, and the average speedup
is smaller than that for the ISCAS benchmarks. However, for some large circuits,
DynaTune failed because it was not able to analyze stabilization conditions
with an efficient and small BDD. On the other hand, the proposed approach
can compute the dynamic behavior for all of them and still maintain very
good accuracy. For this set of benchmarks, the error is smaller than for the
ISCAS benchmarks, with an average MAE of 1.7% and an average RMSE of 3.9%.
The average speedup is 15x.
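The two error metrics quoted above can be illustrated with a minimal sketch. It assumes that MAE and RMSE are taken between estimated and reference stabilization probabilities sampled at the same clock periods; the sample values and the function names are hypothetical and are not taken from Table 5.9.

```python
import math

def mae(estimated, reference):
    """Mean absolute error between two equally long sequences."""
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(abs(e - r) for e, r in zip(estimated, reference)) / len(estimated)

def rmse(estimated, reference):
    """Root mean square error between two equally long sequences."""
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((e - r) ** 2 for e, r in zip(estimated, reference))
    return math.sqrt(squared / len(estimated))

# Hypothetical stabilization probabilities at four clock periods.
estimated_ps = [0.10, 0.50, 0.90, 1.00]
reference_ps = [0.12, 0.47, 0.90, 1.00]
print(f"MAE  = {mae(estimated_ps, reference_ps):.4f}")
print(f"RMSE = {rmse(estimated_ps, reference_ps):.4f}")
```

RMSE weights large deviations more heavily than MAE, so it is never smaller than MAE on the same data, which is consistent with the RMSE column exceeding the MAE column in every row of Table 5.9.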
Table 5.9: Accuracy and runtime comparison for MCNC benchmarks
Circuit     NCSim(s)   DynaTune(s)   tTDD(s)   Speedup   MAE    RMSE
too_large   1054.8     42.0          1.4       30x       1.4%   3.2%
x4          1192.5     7.4           0.8       9x        0.2%   0.8%
k2          2268.7     68.3          4.6       15x       2.4%   6.3%
misex3      3785.3     87.4          7.8       11x       2.8%   7.1%
i9          1474.4     23.9          1.9       12x       1.5%   3.1%
i8          2389.4     42.6          3.2       13x       2.0%   3.6%
pair        4738.0     257.4         7.8       33x       2.8%   4.7%
seq         6328.5     96.0          12.2      8x        1.2%   3.1%
apex2       5388.0     -             11.4      N/A       2.7%   5.0%
des         11078.4    -             17.5      N/A       1.9%   3.4%
ex1010      4485.0     81.8          12.6      6x        0.4%   1.5%
i10         7051.9     -             19.2      N/A       1.4%   4.5%
spla        17612.2    -             48.5      N/A       0.5%   2.9%
pdc         18994.0    -             54.2      N/A       0.9%   3.5%
Ave.                                           15x       1.7%   3.9%
CHAPTER 6
COMMON CASE PROMOTION
As technology shrinks, there is increasing interest in optimizing
circuits for timing error resilience using the BTWC methodology. To
accommodate this need, I propose common case promotion (CCP) to improve circuit
timing error resilience. CCP is a dynamic-behavior-driven logic synthesis
flow that manipulates a circuit's dynamic behavior, as introduced in
Chapter 4, to improve the circuit stabilization probability for common cases.
6.1 Improve timing error resilience by manipulating
dynamic behavior curve
As technology shrinks, the speed of integrated circuits becomes sensitive
to dynamic environmental factors such as supply voltage fluctuation and
changing operating temperature. The delay of a gate increases during a
voltage droop or a temperature rise. A traditional design methodology
takes such factors into account and reserves a conservative guardband by
operating the circuit at a frequency that ensures 100% correctness. On one
hand, reserving such a guardband becomes increasingly costly due to static
and dynamic variations. On the other hand, traditional logic optimization
tools tend to create a critical path wall that makes the circuit very vulnerable
to timing errors when the design-time guardband is consumed by dynamic
variations or aging effects.
Alternatively, better-than-worst-case (BTWC) design [1] was proposed to
alleviate this problem by removing the guardband and complementing a
circuit with error detection and correction mechanisms. A BTWC-designed
circuit operates in the traditional way as long as there is no timing error.
Over time, dynamic variations or aging effects can introduce timing errors.
These are detected by error detection techniques such as Razor logic [3] or
EDS [32] circuitry, which invoke error correction mechanisms to rectify
the error. Instead of frequency, throughput (TU) is used to evaluate
BTWC circuit performance. Besides the cycle-time form in Equation 2.8,
throughput can also be expressed as a function of the operating frequency f, as
shown in Equation 6.1, where Ps is the probability that all primary
outputs (POs) are stabilized by the cycle time 1/f, and r is the error
correction penalty.
\[ TU = P_s \times f + (1 - P_s) \times \frac{f}{r} \tag{6.1} \]
If using Razor-logic based error detection in a MIPS-like processor design,
the error correction penalty may involve flushing the pipeline and replaying
the instruction [3].
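As a rough numerical illustration of Equation 6.1, the sketch below evaluates the throughput for a few stabilization probabilities. The function name, the 2 GHz operating frequency and the penalty r = 10 are hypothetical values chosen only for illustration.

```python
def throughput(p_s: float, f: float, r: float) -> float:
    """Equation 6.1: TU = Ps * f + (1 - Ps) * f / r."""
    if not 0.0 <= p_s <= 1.0:
        raise ValueError("Ps must be a probability in [0, 1]")
    if f <= 0.0 or r <= 0.0:
        raise ValueError("f and r must be positive")
    return p_s * f + (1.0 - p_s) * f / r

# Hypothetical operating point: f = 2 GHz, error correction penalty r = 10.
f_ghz = 2.0
penalty = 10.0
for p_s in (1.0, 0.95, 0.5):
    tu = throughput(p_s, f_ghz, penalty)
    print(f"Ps = {p_s:.2f} -> TU = {tu:.2f} (in units of f)")
```

With these assumed numbers, a drop of Ps from 1.0 to 0.5 reduces the throughput from 2.0 to 1.1, which shows why keeping Ps high matters more as the penalty r grows.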
During normal operation, a circuit may be slowed down by dynamic variations
such as voltage droop, temperature fluctuation, or aging effects. As a result, Ps
may drop, thus deteriorating the throughput. To sustain the performance,
we need to maintain a high probability Ps even if the circuit slows down.
This requires us to optimize a circuit so that Ps decreases gracefully as the
delay of a circuit deteriorates. Furthermore, this can be translated to the
requirement of optimizing a circuit at the design time to have a flat plateau
of Ps when the circuit operates close to the cycle time, denoted as tclk.
The relation between Ps and tclk is called a circuit’s dynamic behavior. To
quantify it, we can plot Ps as a function of tclk as a (dynamic) behavior curve
as introduced in Chapter 4: the x axis shows the target operating points tclk,
while the y axis shows the stabilization probabilities. A point (t, Ps(t)) on
the behavior curve can tell us the probability Ps(t) of a circuit or a single
PO being stabilized if operating at tclk = t. Intuitively, the smaller tclk is,
the lower Ps becomes. The benefit of utilizing a behavior curve is that it
enables quantifying the Ps plateau that is required for BTWC. As illustrated
in Figure 6.1, behavior curves of two different implementations (A/B) of the
same logic function of a PO ‘o0’ of an MCNC benchmark circuit apex2 are
shown: in an ideal environment, both A and B can operate at 450ps without
timing error (Ps = 100%). But A has a higher Ps at tclk=300ps while B
has a lower Ps at tclk=300ps. If dynamic variations slow down the circuit by
150ps, their behavior curves will shift toward the right. B will experience
a drastic drop of Ps, resulting in a significant throughput drop. On the
other hand, A still keeps a high Ps because its particular implementation
[Figure 6.1 plot. x axis: cycle time from 90 to 450 (ps) in steps of 30; y axis: stabilization probability from 0.0% to 100.0%. Legend: original apex2 ‘o0’ stabilization probability; promoted apex2 ‘o0’ stabilization probability. Annotations: A: Ps = 92.5%; B: Ps = 0%; significant dynamic behavior difference.]
Figure 6.1: Dynamic behavior curves of two implementations of the same
Boolean function ‘o0’ in apex2
provides a considerably wider plateau of Ps in the range of 300ps-450ps at
design time. Therefore, error correction will kick in less often, resulting in
sustainable throughput and better timing error resilience. The proposed
CCP optimization can manipulate the behavior curve of the circuit that is
protected under razor logic or EDS by providing a wider plateau of Ps by
design, resulting in better throughput and timing error resilience.
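The behavior curve and the effect of a slowdown can be sketched numerically. The example below treats Ps(t) as the fraction of input vectors whose stabilization time is at most t, and models a dynamic slowdown as a uniform shift of every stabilization time; both the uniform shift and the two sets of stabilization times are assumptions made for illustration, not data from Figure 6.1.

```python
from bisect import bisect_right

def behavior_curve(stab_times, t_points):
    """Ps(t) for each t: fraction of vectors stabilized no later than t."""
    ordered = sorted(stab_times)
    total = len(ordered)
    return [bisect_right(ordered, t) / total for t in t_points]

def slowed(stab_times, extra_delay):
    """Uniformly shift all stabilization times (assumed slowdown model)."""
    return [ts + extra_delay for ts in stab_times]

# Hypothetical stabilization times in ps for two implementations.
a_times = [200, 250, 280, 300, 300, 300, 300, 300, 300, 450]
b_times = [400, 410, 420, 430, 440, 440, 450, 450, 450, 450]

t_clk = 450
for name, times in (("A", a_times), ("B", b_times)):
    nominal = behavior_curve(times, [t_clk])[0]
    degraded = behavior_curve(slowed(times, 150), [t_clk])[0]
    print(f"{name}: Ps({t_clk}ps) nominal = {nominal:.2f}, after 150ps slowdown = {degraded:.2f}")
```

Under these assumptions both implementations are error free at 450ps, but after a 150ps slowdown the implementation with the wide plateau keeps Ps = 0.9 while the other drops to 0.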
There are existing works aiming to improve a BTWC circuit’s error re-
silience from various aspects. In Chapter 3, PCT was proposed to utilize a
commercial design flow to optimize the dynamically critical nodes to achieve
higher throughput. In Chapter 4, DynaTune was proposed to analyze a
circuit and then improve timing for dynamic critical nodes, but these techniques
were limited by low-Vt assignment. Power-aware slack redistribution [48, 64]
was proposed to shift the slack of frequently exercised and near-critical tim-
ing paths in a power efficient manner using gate sizing. A static-probability-
based mux-decomposition method [65] was proposed to change the timing
distribution for multi-cycle design.
In contrast to existing works, CCP improves a circuit’s dynamic behavior
from a fresh re-synthesis perspective. I first show the existence of such an
optimization opportunity in Section 6.2, then I propose a set of logic syn-
thesis techniques (with an umbrella name CCP) in Section 6.3, including
probability-driven logic re-synthesis, TCF-based circuit dynamic behavior
analysis, and a SAT-based redundancy remover. The experimental results
are shown in Section 6.7. Since CCP improves a circuit’s dynamic behavior
from the logic re-synthesis perspective, it may be applied orthogonally to
other existing works.
6.2 Motivation and example
As mentioned in Section 2.2, modern synthesis tools implement Boolean
functions in a timing-balanced manner, thus forming a critical path wall. When
a circuit operates close to a critical point, multiple timing errors can
happen, resulting in a drastic drop in output correctness probability. The
proposed CCP mitigates the critical path wall effect by re-synthesizing a circuit
in a probabilistic manner that creates shorter logic paths for commonly exercised
functions (so-called common cases), while shifting the rarely exercised
functions toward longer logic paths. One important concept used in CCP is the
controlling value.
Definition 13. The controlling value v ∈ {0, 1} to a logic cell is the value
such that: if taken by an input, the cell output can be stabilized after a delay
td, the delay between the corresponding input to the cell output, regardless of
when or what value the other inputs of this cell are assigned.
As an example, assume we are implementing the logic function for an adder
carry bit. For simplicity, the subscripts can be omitted by writing the carry
function as follows, where a and b are two bits from two operands, c is the
carry-in bit, n is the carry-out bit:
n = (a ∧ b) ∨ (b ∧ c) ∨ (a ∧ c)
From the observation made in [1], the higher bits of an adder’s operands
take 0’s more often than 1’s for real-life applications. So we can assume a and
b to have such biased behavior that Pr(a = 0) = 80% and Pr(b = 0) = 80%,
where Pr() denotes the probability of a variable being a specified value (0/1).
Assume c is unbiased, that is, Pr(c = 0) = 50%.
Next, we can generate a set of common-case input vectors according to
these given probabilistic characteristics. Figure 6.2(C) shows five randomly
generated input vectors satisfying the given probabilities.
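One simple way to generate such vectors is to sample each input independently from its given static probability. The sketch below does this; independence between inputs, the function name and the fixed seed are assumptions of this illustration.

```python
import random

def sample_vectors(p_zero, count, seed=0):
    """Draw `count` input vectors; p_zero maps each input name to Pr(input = 0).

    Inputs are sampled independently, which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    names = list(p_zero)
    return [
        {name: 0 if rng.random() < p_zero[name] else 1 for name in names}
        for _ in range(count)
    ]

# Probabilities from the adder carry example: a and b are biased, c is not.
profile = {"a": 0.8, "b": 0.8, "c": 0.5}
vectors = sample_vectors(profile, 10000, seed=1)
for name in profile:
    zeros = sum(1 for v in vectors if v[name] == 0)
    print(f"Pr({name} = 0) is approximately {zeros / len(vectors):.3f}")
```

With only five vectors, as in the figure, the observed frequencies can deviate noticeably from the target probabilities; a larger sample tracks them more closely.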
In Figure 6.2, we have two implementations of this carry-out function.
They both consist of four gates. For simplicity in this example, we assume
each gate has a unit delay 1ns, though the experimental results are collected
for real delays. We know when an input vector is applied on the PI of a
circuit, it needs a certain amount of time ts to stabilize the PO. Factors that
affect ts include the internal structure of the circuit, the delay of each gate,
the sensitization condition of each gate, etc.
[Figure 6.2 panels. A. Original circuit: gates g1 to g4, inputs a, b, c, internal nets k, l, m, output n. B. Circuit after promotion: gates g1 to g4, inputs a, b, c, internal nets including q and k, output n. C. Input profile and stabilization time:]
          a   b   c   n   ts (A)   ts (B)
Input 1   0   0   1   0   3        2
Input 2   0   0   0   0   2        2
Input 3   0   0   1   0   3        2
Input 4   1   0   1   1   3        2
Input 5   0   1   0   0   2        3
Average                   2.6      2.2
Figure 6.2: Different dynamic behaviors of two implementations
In the implementation A shown in Figure 6.2(A), when {0, 0, 1} is applied
at a, b and c, an internal net k will be stabilized after a unit delay by
propagating the inputs through the AND gate g1. Similarly, the OR gate
g2 will be stabilized after 1ns. At this point the upper input to g4 has a
non-controlling value 0 from k, so it has to wait until m is evaluated (after
2ns) before it can be stabilized. Therefore, n will be stabilized after 3ns.
Now we consider implementation B shown in Figure 6.2(B). When {0, 0, 1}
is applied on a, b and c, q will be stabilized to logic 0 after 1ns through OR
gate g1. Because q = 0 is the controlling value of the trailing AND gate g4,
within another 1ns, the PO n will be stabilized to logic 0, resulting in only
2ns to stabilize n.
We can calculate the stabilization time ts for all five input vectors for
circuits A and B, as shown in the right-most two columns in Figure 6.2(C)
respectively. We notice that A needs on average ts = 2.6ns to be stabilized,
while B only needs ts = 2.2ns, and thus is more resilient to timing errors.
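The stabilization times discussed above can be reproduced with a small unit-delay sketch that uses controlling values for early evaluation. The gate-level structures are reconstructed from the description in the text and are assumptions: A is taken as n = (a AND b) OR ((a OR b) AND c), and B as n = (a OR b) AND ((a AND b) OR c). The five input vectors are also assumed, chosen to be consistent with the averages quoted above, and all primary inputs are assumed to arrive at time 0.

```python
CONTROLLING = {"AND": 0, "OR": 1}  # controlling input value per gate type

def eval_gate(kind, inputs, delay=1):
    """inputs: list of (value, stable_time). Returns (value, stable_time).

    A controlling input fixes the output as soon as the earliest one arrives;
    otherwise the gate must wait for its latest input.
    """
    cv = CONTROLLING[kind]
    ctrl_times = [t for v, t in inputs if v == cv]
    if ctrl_times:
        return cv, min(ctrl_times) + delay
    return 1 - cv, max(t for _, t in inputs) + delay

def impl_a(a, b, c):
    a, b, c = (a, 0), (b, 0), (c, 0)
    k = eval_gate("AND", [a, b])
    l = eval_gate("OR", [a, b])
    m = eval_gate("AND", [l, c])
    return eval_gate("OR", [k, m])

def impl_b(a, b, c):
    a, b, c = (a, 0), (b, 0), (c, 0)
    q = eval_gate("OR", [a, b])
    k = eval_gate("AND", [a, b])
    p = eval_gate("OR", [k, c])
    return eval_gate("AND", [q, p])

# Assumed common-case vectors (a, b, c).
VECTORS = [(0, 0, 1), (0, 0, 0), (0, 0, 1), (1, 0, 1), (0, 1, 0)]

def average_ts(impl):
    return sum(impl(*vec)[1] for vec in VECTORS) / len(VECTORS)

print("average ts of A (ns):", average_ts(impl_a))
print("average ts of B (ns):", average_ts(impl_b))
```

Both structures use two AND and two OR gates and have the same worst-case depth, so any difference in the averages comes only from how often a controlling value reaches the output gate early.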
From this example, we observe that both implementations have the same
area and worst-case delay. However, a slightly different implementation can
improve dynamic behavior in terms of timing error resilience. Motivated
by this fact, the problem we are interested in solving can be stated as
follows: find a logic implementation of a circuit that stabilizes faster
for common cases. To achieve this goal, the proposed logic synthesis
technique should be able to take common case probabilities as an optimization
factor, analyze the stabilization conditions for the circuit, and guarantee
implementation correctness. In the following sections, I introduce the proposed
techniques.
6.3 Common case promotion
To develop an efficient algorithm for optimizing circuit dynamic behavior,
we take a divide-and-conquer approach. Sub-circuits are generated as op-
timization candidates first. Then common case optimization is applied on
each sub-circuit. Figure 6.3 shows the overall flow.
Starting from a netlist, sub-circuits for timing critical nodes are first
generated for a given timing distance Tth as follows: choose a timing critical
node n, then back-trace through its fan-in nodes using a depth-first search
(DFS) algorithm. DFS branching stops when it reaches a node m whose
required arrival time (RAT) satisfies
RAT(n) − RAT(m) > Tth
All nodes visited during this DFS traversal are kept in the sub-circuit, thus
creating a timing window Cn rooted at n, or C for short.
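The window extraction just described can be sketched as an iterative DFS over a fan-in map. The data layout (a dictionary of fan-in lists and a dictionary of RAT values), the function name and the example netlist are hypothetical; the stopping rule follows the inequality above, and a node that triggers it is still kept in the window.

```python
def timing_window(root, fanins, rat, t_th):
    """Collect the timing window rooted at `root`.

    fanins: dict mapping a node to the list of its fan-in nodes.
    rat:    dict mapping a node to its required arrival time.
    Branching stops at a node m with rat[root] - rat[m] > t_th,
    but m itself remains part of the window because it was visited.
    """
    window = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in window:
            continue
        window.add(node)
        if rat[root] - rat[node] > t_th:
            continue  # boundary node: do not descend further
        stack.extend(fanins.get(node, ()))
    return window

# Hypothetical netlist fragment and RAT values.
fanins = {"n": ["m1", "m2"], "m1": ["x1", "x2"], "m2": ["x3"], "x1": ["p"]}
rat = {"n": 10, "m1": 9, "m2": 6, "x1": 8, "x2": 5, "x3": 4, "p": 7}
print(sorted(timing_window("n", fanins, rat, t_th=3)))
```

In this example m2 and x2 are kept as boundary nodes, while x3 is never visited because the traversal stops at m2.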
Figure 6.3: Overall flow of common case promotion
(Flowchart boxes: a netlist; typical input vectors; timing window; common case
identification; promote common cases; structure validation, with outcomes OK
and Skip; area overhead control; terminate?, with outcomes Yes and No;
CCP-optimized circuit.)
The sub-circuit C is then fed into the common-case identification procedure
(Section 6.3.1) to identify its probabilistic characteristics. It then goes
through common-case promotion, which changes the structure for common cases
using logic re-synthesis (Section 6.3.2), followed by dynamic behavior analysis
using the timed characteristic function (TCF) to decide whether the structure
is valid for dynamic behavior improvement (Section 6.4). If the structure
change is accepted, an area recovery phase is carried out to control area
overhead using a SAT-based approach (Section 6.5). Then the flow iterates for
the next timing window. A convergence test (Section 6.6) decides whether
any further promotion is needed for a logic cone.
6.3.1 Identify common-case partial function
In the common-case identification step, we utilize two-level expression forms
complemented with a simulation method to generate the sub-circuit’s proba-
bility profile, which will be used to drive the dynamic behavior optimization.
Each generated timing window C represents a Boolean function
$$f : B^m \to \{0, 1, *\}$$
where m is the number of inputs of C. We use the definitions commonly
used for two-level logic optimization [66]:
Definition 14. A literal is a variable that takes a value in {0, 1}.
Definition 15. An implicant of f is a row vector consisting of an input
vector in $\{0, 1, *\}^m$ and the associated output in {0, 1, ∗} specified by f.
Definition 16. The on/off/dc set is the subset of $B^m$ that f maps to
1/0/∗, respectively.
Definition 17. An on(off)-set minterm of f is an implicant whose input
vector is in $\{0, 1\}^m$ and that implies a true(false) at the output.
Definition 18. A cover $F^{on}$ ($F^{off}$) of f is a set of implicants that
covers all on(off)-set minterms of f.
When there is no confusion, an on-set minterm is denoted as a minterm and
$F^{on}$ is denoted as F for short.
Definition 19. A partial cover Fp is a set of implicants that cover some
minterms of f .
Definition 20. A prime is an implicant that is not contained by any other
implicant of f .
Definition 21. A prime cover is a cover consisting only of primes.
To get a probabilistic profile for each timing window, a global behavior
profile, similar to the table in Figure 6.2(C), needs to be prepared first for the
whole circuit only once. Then it can be reused for each individual C. A global
behavior profile is prepared using a zero-delay simulation-based method as
follows: a set of PI input vectors is generated according to given typical-case
characteristics [4,18]. Then we leverage the AIG-based simulation engine in
ABC [67] to propagate these typical input vectors for each cell in the netlist.
This simulation procedure is practical for the following reasons:
1. Both circuit structure and timing information are decoupled from logic
simulation because they are processed in a separate step called
structure validation (Section 6.4). Therefore, to generate the profile,
the simulation can be performed using a zero-delay model, and thus is
much faster than the timed simulation used in [4, 18].
2. Rather than applying timed-simulation once for each optimization it-
eration [4], the behavior profile used in CCP is prepared only once for
the circuit and then is reused across all sub-circuits.
3. Experiments show that simulating a small number of input vectors
(100) can still provide good guidance for the re-synthesis.
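As a rough illustration of the profiling step, the following Python sketch builds a behavior profile by zero-delay simulation. It is not the ABC simulation engine used in this work. The netlist encoding, the names `simulate` and `build_profile`, the signal probabilities and the random seed are all assumptions made for the example. The gate structure is one possible implementation of the majority function f = (a ∧ b) ∨ (b ∧ c) ∨ (a ∧ c) from the running example.

```python
import random

# Hypothetical netlist in topological order: (output, gate type, inputs).
# Assumed structure: f = OR(m, k), m = AND(c, l), l = OR(a, b), k = AND(a, b).
NETLIST = [
    ("k", "AND", ("a", "b")),
    ("l", "OR", ("a", "b")),
    ("m", "AND", ("c", "l")),
    ("f", "OR", ("m", "k")),
]
PRIMARY_INPUTS = ("a", "b", "c")


def simulate(vector):
    """Zero-delay evaluation: a single pass over the gates, no timing."""
    values = dict(vector)
    for out, kind, ins in NETLIST:
        bits = [values[i] for i in ins]
        values[out] = int(all(bits)) if kind == "AND" else int(any(bits))
    return values


def build_profile(num_vectors, p_one, seed=0):
    """Draw input vectors from assumed signal probabilities and simulate each."""
    rng = random.Random(seed)
    profile = []
    for _ in range(num_vectors):
        vector = {pi: int(rng.random() < p_one[pi]) for pi in PRIMARY_INPUTS}
        profile.append(simulate(vector))
    return profile


# 100 vectors, matching the vector count mentioned in the text; the
# probabilities below are illustrative only.
profile = build_profile(100, {"a": 0.2, "b": 0.2, "c": 0.5})
```

Each row of `profile` records the value of every net for one input vector, so the same table can be sliced for any timing window without simulating again.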
From the behavior profile, common case identification is carried out as
follows: in C, find a partial cover Fp, as a subset of F , such that the implicants
in Fp are highly likely to be exercised by the common-case input vectors.
This Fp, though it partially covers f , is more often exercised than F \ Fp,
where “\” is set subtraction operator. From a probabilistic point of view,
the partial function Fp encodes the common-case portion of f , while F \
Fp complements it by encoding the corner-case of f . Fp will be promoted
later using multi-level logic synthesis techniques to improve circuit dynamic
behavior (Section 6.3.2). In addition, from a logic synthesis point of view, it
is preferable to have Fp contain only primes so that unnecessary literals are
eliminated to save area and to improve timing. Therefore, Fp is identified
within a prime cover F as follows.
Depending on whether the output of C tends to be logic 1 or logic 0, phase-1
or phase-0 identification is performed on C. Here I describe only the
procedure for phase-1 for simplicity; phase-0 can be done similarly. For
phase-1 optimization, a cover for the on-set is generated by creating a
decision diagram of C using DFS back-tracing. To produce the prime cover
$F^{on}$ for the on-set, we use zero-suppressed decision diagram (ZDD)
encoding, which suppresses non-essential literals along evaluation paths (from
the root node to the true terminal in the ZDD) with a special two-phase
encoding for each variable [68]. Then we can evaluate the probability of each
prime in $F^{on}$ being activated using the previously prepared global behavior
profile. From the profile, each input vector and the associated outputs for C
are extracted as row entries. Each such entry is denoted as γ. We say γ can
activate a prime p in $F^{on}$ if γ ∈ p; that is, every literal lp in p agrees
with the associated literal lγ in γ. For each p, we count the number of input
vectors in the global behavior profile that can activate it, and then derive
an activation probability from that count.
In each round, the most frequently activated prime, denoted as pc, is
identified. Note that to identify the next most frequently activated prime in
the next round, we need to remove the effect of the previously identified pc
by starting the next round with $F^{on} \wedge \neg p_c$ rather than $F^{on}$.
These common-case primes are collected into a Boolean function fp, which is
then implemented as a partial implementation of f. Note that this common-case
identification is performed only within a timing window C; therefore the ZDD
does not run into capacity issues, which makes the approach practical.
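The round-by-round selection described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work. Primes are given as an explicit list rather than extracted from a ZDD. Dropping the vectors already claimed by a selected prime is one possible reading of restarting with the cover minus pc. The names `activates` and `select_common_primes`, the threshold value and the five-entry profile are hypothetical.

```python
def activates(vector, prime):
    """True if every literal of the prime agrees with the input vector."""
    return all(vector[var] == val for var, val in prime.items())


def select_common_primes(primes, profile, threshold):
    """Greedy selection of the most frequently activated primes."""
    remaining = list(profile)      # vectors not yet claimed by a chosen prime
    candidates = list(primes)
    selected = []
    while candidates and remaining:
        counts = [sum(activates(v, p) for v in remaining) for p in candidates]
        best = max(range(len(candidates)), key=counts.__getitem__)
        probability = counts[best] / len(profile)
        if probability < threshold:
            break
        prime = candidates.pop(best)
        selected.append((prime, probability))
        # Remove the chosen prime's effect before the next round.
        remaining = [v for v in remaining if not activates(v, prime)]
    return selected


# Hypothetical off-set primes of the majority function and a made-up profile.
off_primes = [{"a": 0, "b": 0}, {"b": 0, "c": 0}, {"a": 0, "c": 0}]
profile = [
    {"a": 0, "b": 0, "c": 0},
    {"a": 0, "b": 0, "c": 1},
    {"a": 0, "b": 0, "c": 1},
    {"a": 1, "b": 1, "c": 0},
    {"a": 0, "b": 1, "c": 0},
]
common = select_common_primes(off_primes, profile, threshold=0.5)
```

With this made-up profile, only the prime with a = 0 and b = 0 passes the threshold, since it is activated by three of the five vectors.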
In Figure 6.2(C), for example, the input and output columns can be viewed
as a global behavior profile of 5 entries. In this example, it is preferable to
identify common cases for the off-set $F^{off}$ (phase-0) because the output
tends to be logic 0. Using the above approach, we can identify that a partial
off-set cover $F_p^{off} = \bar{a} \wedge \bar{b}$ is exercised and produces an
output 0 frequently. We can invert the off-set partial cover using De Morgan's
law to get a regular on-set function fp that represents the common cases as
$$f_p = a \vee b$$
This common-case partial cover will be promoted next. Note that sometimes
we may promote multiple primes besides the most activated one, according to
a specified probability threshold.
6.3.2 Promote common cases
To improve dynamic behavior, the common-case primes contained in Fp are
chosen as the sub-function to be promoted. Promotion is the process of
implementing a sub-function with a redundant logic cone and then merging it
back into the original circuit.
In general, the sub-function fp associated with Fp has fewer literals than
f, so it can usually be implemented with a shallower logic depth.
To control area overhead, a TCF-based structure validation process is applied
to avoid unnecessary promotion (Section 6.4), and a SAT-based redundancy
remover is applied to simplify the logic and reduce area (Section 6.5).
This logic promotion process is mainly performed using multi-level logic
synthesis techniques: starting from the partial function fp, the
technology-independent multi-level synthesis script “resyn” is invoked in
ABC [67], followed by logic balancing and technology-dependent mapping. As a
result, a promoted circuit, denoted as Cp, is generated to implement the
partial function fp.
Next, Cp is merged with the original implementation C using a merger cell
M . If the common-case partial function fp is generated for phase-1, it will be
correlated with the original function f with disjunctive normal form (DNF)
as:
f = f ∨ fp
As a result, Cp is merged with C using an OR-type merger cell. Because
‘1’ is the controlling value of an OR gate, as long as the common-case fp is
stabilized to 1, the whole Boolean function will be stabilized regardless of
other side inputs. Since this common-case 1 propagates through a shallower
logic cone Cp, the output can be stabilized sooner than through the original
C.
Similarly, to promote for phase-0, fp and f are correlated with conjunctive
normal form (CNF). Then an AND-type gate is used as the merger cell to
combine the partial implementation and the original implementation.
Take the circuit of Figure 6.2(A) as an example. After we identify the
common-case partial function fp = a∨ b, we can implement fp as Cp with an
OR gate p in Figure 6.4 and merge Cp with the original implementation C
using an AND merger cell M .
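To make the phase-0 merge concrete, the short Python check below enumerates all input vectors of the example function and confirms that AND-ing f with fp = a ∨ b leaves f unchanged. It also lists the vectors for which fp alone already forces the merged output to 0. The function names are chosen for this illustration only.

```python
from itertools import product


def f(a, b, c):
    """Original function: majority of a, b, c."""
    return (a & b) | (b & c) | (a & c)


def f_p(a, b, c):
    """Common-case partial function obtained from the off-set cover."""
    return a | b


def merged(a, b, c):
    """AND-type merger cell M combining C and Cp (phase-0)."""
    return f(a, b, c) & f_p(a, b, c)


vectors = list(product((0, 1), repeat=3))

# The merge must not change the function.
equivalent = all(merged(*v) == f(*v) for v in vectors)

# Vectors where fp = 0: the controlling value of the AND merger settles the
# output through the shallow cone Cp without waiting for C.
early_zero = [v for v in vectors if f_p(*v) == 0]
```

Here `early_zero` contains exactly the vectors with a = 0 and b = 0, which are the common cases identified for the off-set.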
Figure 6.4: Promote the common-cases in f = (a ∧ b) ∨ (b ∧ c) ∨ (a ∧ c)
(Legend: Cp, dynamic behavior promoted partial function; C, original
implementation.)
6.4 Validate structure changes
We should note that the intentionally introduced redundancy of Cp may
evaluate the common-case input vectors faster than the original design F . It
is, however, possible that these identified common-case primes have already
been implemented in the original implementation with very short delay to
the output. Then it becomes a waste of resources to re-implement those
primes. To avoid introducing unnecessary redundancy, a TCF-based structure
validation technique is proposed to determine whether the structural change
indeed improves the circuit’s dynamic behavior.
Recall that TCF for a circuit can be constructed by back-tracing and rep-
resented as a decision diagram in Chapter 4 and Chapter 5. The evaluation
paths that lead to true terminal in the decision diagram encode the impli-
cants that stabilize n to a required logic value val within a specified delay
t.
We want to find a Cp that can be stabilized by common-case minterms faster
than C can. If C is already capable of being stabilized as quickly as Cp,
then implementing Cp does not improve dynamic behavior. To determine this,
the structure validation compares the TCFs of C and the partial
implementation Cp. If the common-case cover Fp is contained in the TCF of C,
then C is already as capable as Cp of stabilizing any minterm in Fp. Hence,
we can discard Cp to reduce area overhead. The procedure is carried out as
follows:
1. Timing analysis for Cp. In this step, the delays of Cp and M are
evaluated:
$$t_c = D(C_p) + D(M)$$
where D() is the delay function from the latest AT analysis. Cp is
guaranteed to be stabilized by all minterms contained in Fp within
delay tc.
2. Test whether the original circuit C can already be stabilized by those
minterms of Fp within delay tc. In this step, a TCF is constructed for C as
T (n = val, t = tc), which contains all minterms that can stabilize C
within delay tc.
3. Discard Cp if the following test is true:
$$F_p \subseteq T(n = val, t = t_c) \qquad (6.2)$$
The test being true means C can already be stabilized by Fp within tc, so
it is unnecessary to promote Fp into Cp. Next let us look at the example
in Figure 6.4. To validate the structure of Cp for the common-case off-set
cover $F_p^{off} = \bar{a} \wedge \bar{b}$, we first calculate tc as
2 ns = D(p) + D(M). Since we promote for phase-0, we set val = 0. Then we
construct T (f = 0, t = 2) for C by recursively rewriting and expanding down
to the inputs a, b and c:
$$
\begin{aligned}
T(f = 0, t = 2) &= T(m = 0, t = 1) \wedge T(k = 0, t = 1) \\
&= [T(c = 0, t = 0) \vee T(l = 0, t = 0)] \wedge [T(a = 0, t = 0) \vee T(b = 0, t = 0)] \\
&= T(c = 0, t = 0) \wedge [T(a = 0, t = 0) \vee T(b = 0, t = 0)] \\
&= \bar{c} \wedge (\bar{a} \vee \bar{b})
\end{aligned}
$$
Note that T (l = 0, t = 0) evaluates to false in the above derivation
because no input vector can stabilize l at t = 0 due to the delay of cell g2.
Now testing with formula 6.2, we have:
$$\bar{a} \wedge \bar{b} \not\subseteq \bar{c} \wedge (\bar{a} \vee \bar{b})$$
For instance, the input vector a = 0, b = 0, c = 1 is covered by
$F_p^{off}$ but not by the TCF. Now we know that the original implementation
C cannot be stabilized by all the common cases contained in Fp within tc.
Therefore the new structure Cp is valid for dynamic behavior improvement.
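The validation test on this example can be reproduced with a small Python sketch. It represents a TCF as an explicit set of input vectors instead of a decision diagram, which is only workable for tiny examples. The gate-level structure (f = OR(m, k), m = AND(c, l), l = OR(a, b), k = AND(a, b), unit delays) is an assumption read off the derivation above, and the function name `tcf` is chosen for this illustration.

```python
from itertools import product

# Assumed netlist: output -> (gate type, inputs, delay in ns).
GATES = {
    "k": ("AND", ("a", "b"), 1),
    "l": ("OR", ("a", "b"), 1),
    "m": ("AND", ("c", "l"), 1),
    "f": ("OR", ("m", "k"), 1),
}
INPUTS = ("a", "b", "c")
ALL_VECTORS = set(product((0, 1), repeat=len(INPUTS)))


def tcf(node, val, t):
    """Input vectors that stabilize `node` to `val` no later than time t."""
    if node in INPUTS:
        if t < 0:
            return set()  # floating mode: inputs are unknown before t = 0
        idx = INPUTS.index(node)
        return {v for v in ALL_VECTORS if v[idx] == val}
    kind, ins, delay = GATES[node]
    parts = [tcf(i, val, t - delay) for i in ins]
    controlling = 0 if kind == "AND" else 1
    if val == controlling:
        return set.union(*parts)        # one controlling input is enough
    return set.intersection(*parts)     # all inputs must be non-controlling


t_c = 2                                  # D(p) + D(M) in the example
tcf_c = tcf("f", 0, t_c)
fp_off = {v for v in ALL_VECTORS if v[0] == 0 and v[1] == 0}

discard_cp = fp_off <= tcf_c             # the containment test of formula 6.2
```

In this sketch `discard_cp` is False because the vector (0, 0, 1) is covered by the off-set prime but is missing from the TCF of C, which agrees with the conclusion above that Cp should be kept.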
6.5 Overhead control
Merging the original implementation C with the promoted partial
implementation Cp creates a new implementation C′ that can substitute for C.
Considering that Cp may introduce redundancy, there exists an opportunity to
simplify C′. Next, I show how area overhead control is done with the proposed
dynamic-behavior-aware redundancy remover.
Traditionally, automatic test pattern generation (ATPG) techniques were
used to identify redundancy. With the success of modern SAT techniques,
SAT is now widely applied in various areas including redundant wire addition
and removal [69]. In this work, we take advantage of the knowledge of the
inherent functional connection among C, Cp and C′. A SAT-based redundancy
remover is then applied to control area overhead in C′ in a
dynamic-behavior-aware manner.
Given that Cp is a purely redundant partial function that is implemented
for better dynamic behavior, applying a general redundancy removal technique
[69, 70] on it may totally remove this partial implementation, thus
defeating the goal of improving dynamic behavior. Therefore, we need to
restrict the effort of redundancy removal to the implementation of C
rather than Cp. Moreover, given that the support of Cp is a subset of the
support of C, it is unnecessary to apply SAT testing on all inputs to C.
The proposed redundancy removal is based on the following observation:
if a stuck-at-α fault, where α can be either 0 or 1, on one input is non-
observable at the output, we can propagate this constant α and thus simplify
the downstream logic. We also observe that both the controlling value and
the non-controlling value are potentially useful for redundancy removal. By
tying a gate’s input to the controlling value, a gate can be reduced to a
constant, while tying an input to the non-controlling value may reduce a
gate to a simpler gate or even a wire. For example, tying an input to AND2
gate to logic 1 can effectively reduce the AND2 gate to a wire.
To control area overhead, an XOR miter is first constructed from
1. a clone of the original implementation C as the reference, denoted as
R, and
2. the common-case promoted implementation C′.
The output of R and the output of C′ are fed into the two inputs of the XOR
gate to form a miter output φ. R and C′ share the same set of inputs, namely
the inputs of C, denoted as I = {i1, i2, . . . , in}. We say R and C′ are
Boolean equivalent if and only if the miter is non-satisfiable by any Boolean
assignment to I. That means no input vector can differentiate R from C′ by
evaluating them to different values.
After the miter is constructed, we can solve the associated SAT formula to
test the stuck-at-α observability for both α = 0 and α = 1 on each picked
im ∈ I. An input im is picked from the miter’s inputs I such that
$$i_m \in input(C_p)$$
Let FO(im) be the inputs of the cells that im directly connects to. Moreover,
it is necessary to prevent this redundancy removal from impairing the
improved dynamic behavior by preserving the promoted sub-function Cp. So
the stuck-at-α tests are restricted to pins jm in the non-promoted portion,
such that
$$(j_m \in FO(i_m)) \wedge (j_m \in C) \wedge (j_m \notin C_p)$$
Now we can disconnect jm from im and apply a stuck-at-α value as jm = α.
Then we solve the associated SAT problem stated as follows:
SAT problem: For the XOR miter with an output φ and an input vector
$V = \{i_1, i_2, \ldots, i_{m-1}, i_m, j_m = \alpha, i_{m+1}, \ldots, i_n\}$
of n + 1 variables, is there a Boolean assignment of $\{0, 1\}^n$ to
$\{i_1, \ldots, i_n\}$ that evaluates φ to true?
Obviously, if no input vector assignment to V exists that can satisfy the
above SAT formula, the stuck-at-α fault on jm is non-observable, so tying jm
to a constant α does not differentiate C ′ and R. Therefore we can simplify
jm to α and propagate this constant to further simplify logic downstream.
Because jm is carefully picked such that it only belongs to C, the simplifi-
cation applies only on C leaving Cp intact. Hence the promoted dynamic
behavior is preserved.
In the example of Figure 6.5, the upper part is the combined C′ consisting
of C and Cp, while the lower part is C’s clone R. Redundancy removal is
carried out on nets {a, b} for C. The SAT instance can prove that g2’s input
from a can be tied to a constant 0, which in effect removes the OR gate g2
shown in dashed line. Furthermore, the SAT instance also proves that g3’s
input from g2 can be tied to a constant 1, which further removes the AND
gate.
As a result, we end up with the circuit shown in Figure 6.2(B), which
has a better dynamic behavior while still maintaining the same size as the
original circuit. This illustrates that the proposed dynamic-behavior aware
SAT-based redundancy removal can effectively control area overhead from
intentionally introduced redundancy for common cases.
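The observability test behind this procedure can be illustrated with a small Python sketch in which exhaustive enumeration stands in for the SAT solver. This is only viable for a three-input example and is not how MiniSAT is used in this work. The netlist (f = OR(m, k), m = AND(c, l), l = OR(a, b), k = AND(a, b), merged with Cp = OR(a, b) through an AND cell) is an assumption read off the TCF derivation in Section 6.4. The pin naming scheme and the function names are hypothetical.

```python
from itertools import product


def original(a, b, c, tie=None):
    """Assumed netlist of C. `tie` optionally forces one gate input pin to a
    constant, written as (gate output, driving net, value)."""
    def pin(gate, net, value):
        if tie is not None and tie[:2] == (gate, net):
            return tie[2]
        return value

    l = pin("l", "a", a) | pin("l", "b", b)
    k = pin("k", "a", a) & pin("k", "b", b)
    m = pin("m", "c", c) & pin("m", "l", l)
    return m | k


def promoted(a, b, c, tie=None):
    """C': the (possibly simplified) C merged with Cp = a OR b by an AND cell.
    The tie is applied inside C only, so Cp stays intact."""
    return original(a, b, c, tie) & (a | b)


def fault_observable(tie):
    """XOR miter of C' (with the pin tied) against the untouched reference R.
    Returns True if some input vector makes the miter output 1."""
    return any(
        promoted(a, b, c, tie) ^ original(a, b, c)
        for a, b, c in product((0, 1), repeat=3)
    )


# Tying the pin of m that is driven by l to 1 never changes the output of C'.
removable = not fault_observable(("m", "l", 1))

# Tying the pin of k that is driven by a to 0 is observable, so it is rejected.
rejected = fault_observable(("k", "a", 0))
```

Under these assumptions the first tie is accepted, which reduces the AND gate driving m to a wire from c, while the second tie is rejected because the vector a = 1, b = 1, c = 0 exposes the difference.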
6.6 Correctness and convergence
Theorem 6.6.1. The proposed common case promotion technique is func-
tionally correct in Boolean domain.
Proof. I prove for phase-1, and phase-0 can be proven similarly. There are
two steps where we may change the logic structure of the original circuit
during CCP. I prove that neither of them alters functionality.
1. Promoting common cases does not alter functionality. Because the
common case primes encoded in Fp belong to set F , after merging with
an OR gate, we have Fp∨F == F . Because Fp, fp and Cp are different
forms of the same Boolean function, and F , f and C are also different
forms of the same Boolean function, we can have fp ∨ f == f and
118
Cp. Moreover, the support of Cp is a subset of the support of C. 
Therefore it’s unnecessary to apply SAT testing on all inputs to C.  
Our redundancy removal is based on the following observation: 
if a stuck-at-! error, where ! can be either 0 or 1, on one input is 
non-observable at the output, we can propagate this constant ! 
thus simplifying the downstream logic. We also observe that both 
the controlling value and the non-controlling value are potentially 
useful for redundancy removal. Because tying a gate’s input to the 
controlling value, a gate can be reduced to a constant. While tying 
an input to the non-controlling value, a gate may be reduced to a 
simpler gate or even a wire. For example, tying an input to AND2 
gate to logic 1 can effectively reduce the AND2 gate to a wire.  
 
Figure 5: An example of using SAT to remove circuit 
redundancy during CCP 
To control area overhead, we first construct a XOR miter from 1) 
a clone of the original implementation C as the reference, denoted 
as R, and 2) the common-case promoted implementation C’. The 
output of R and the output of C’ are fed into two inputs of the 
XOR2 gate to form a miter output ! . R and C’ share the same set 
of inputs -the inputs of C - denoted as I={i1, i2, … , in}. We say R 
and C’ are Boolean equivalent if and only if the miter is non-
satisfiable by any Boolean assignment to I. That means no input 
vector can differentiate R from C’ by evaluating R and C’ to 
different values.  
After the miter is constructed we solve the associated SAT 
problem to test the stuck-at-! observability for both ! =0 and ! =1 
on an im !I. We pick inputs im in the miter’s inputs I, such that 
im !input(Cp ) . Let FO(im) be the inputs of the cells that im 
directly connects to. Moreover, we prevent our redundancy 
removal from impairing the improved dynamic behavior by 
preserving the promoted sub-function Cp. So we restrict stuck-at-! 
tests at non-promoted portion jm, such that  ( jm !FO(im ))" ( jm !C)" ( jm #Cp )  
 We then disconnect jm from im by applying a stuck-at-! value as 
jm=!. Then we solve the associated SAT problem stated as 
follows:   
SAT problem: For the XOR miter with an output ! and an input 
vector V={i1, i2, … im-1, im, jm=!, im+1, … in} of n+1 variables, is it 
satisfiable that an Boolean assignment of {0,1}n to {i1…in} can 
evaluate! true. 
Obviously, if no input vector assignment to V exists that can 
satisfy the above SAT instance, then tying jm to a constant ! 
doesn’t differentiate C’ and R. Therefore we can simplify jm to ! 
and propagate this constant to further simplify logic downstream. 
Because jm is carefully picked such that it only belongs to C, the 
simplification applies only on C leaving Cp intact. Hence the 
promoted dynamic behavior is preserved. 
In the example of Figure 5, the upper part is the combined C’ 
consisting of C and Cp, while the lower part is the C’s clone R. 
We perform redundancy removal on nets {a, b} for C. The SAT 
instance can prove that g2’s input from a can be tied to zero, 
which in effect removes the OR gate g2 shown in dashed line.  
Furthermore, our SAT instance proves that g3’s input from g2 can 
be tied to one, which further removes the AND gate.  
As a result, we end up with the circuit shown in Figure 2.B, 
which has a better dynamic behavior but still maintains the same 
size as the original circuit. This illustrates that our SAT-based 
area recovery can effectively control area overhead from 
intentionally introduced redundancy for common cases. 
3.6 Correctness 
Theory: the proposed common case promotion technique is 
functionally correct in Boolean domain. 
Proof: we prove for phase-1, and phase-0 can be proven similarly. 
There are two steps that we may change logic structure of the 
original circuit during CCP. We prove that neither of them alters 
functionality: (1) promoting common cases doesn’t alter 
functionality. Because the common case primes encoded in Fp 
belong to set of F, after merging with an OR gate, we have Fp ! F 
== F. Because Fp, fp and Cp are different forms of the same 
Boolean function, and F, f and C are also different forms of the 
same Boolean function, we can have fp ! f == f and C’ = Cp ! C == 
C. That proves C’ is Boolean equivalent to C. Since R is a clone 
of C, C’ is also equivalent to R. (2) the SAT redundancy removal 
doesn’t change functionality either. Because we tie jm to ! only if 
a stuck-at-! fault is non-observable in a XOR miter of C’ and R, 
this will not differentiate C’ and R under all possible Boolean 
input assignments. Indeed, the SAT instance guarantees Boolean 
equivalence between the original C and the dynamic behavior 
optimized C’. Therefore, this combined common case promotion 
keeps Boolean equivalence between CCP optimized version and 
the original one.   
3.7 Convergence 
After common-case promotion finishes a round of optimization, 
we can test if further promotion is needed. This again can be done 
with the TCF test. For example, given the same common-case 
prime Fp= a b! covering off set, we test whether the redundancy-
removed CCP implementation in Figure 2.B needs further 
promotion by applying the same TCF test (§3.4). Going through 
the same TCF rewriting procedure, this time we have: 
T(f=0,t=2) = T(n=0,t=1) ! T(q=0,t=1) 
                =[T(c=0,t=0) ! T(k=0,t=0)] ! [T(a=0,t=0 ! T(b=0,t=0)] 
                = T(a=0,t=0) ! T(b=0,t=0) 
                = a ! b         
Testing with Equation 4, we have Fp ! T ( f = 0,t = 2)meaning 
that the identified common cases can already be stabilized in t=2 
in C’. Therefore, no further CCP step is needed.  
4. Experimental results 
The proposed algorithm is implemented in C on top of ABC 
[11]. MiniSAT [12] is used to perform the area overhead control 
task described in §3.5. Benchmark circuits are compiled into 
Verilog netlists using Synopsys Design Compiler (DC) ver. 2007-
SP3. We set tight delay and power constraints to allow DC to 
perform high-effort optimization. Then we apply the proposed 
CCP on these netlists to further improve dynamic behavior. We 
[Figure 6.5: An example of using SAT to remove circuit redundancy during
CCP. Labels in the figure: R (original implementation as the reference),
C' (implementation with redundancy), Cp (dynamic behavior promoted),
C (original implementation), merger M, and an XOR miter comparing C' with R.]
$C' = C_p \vee C \equiv C$. That proves $C'$ is Boolean equivalent to $C$. Since
$R$ is a clone of $C$, $C'$ is also equivalent to $R$.
2. The SAT redundancy removal does not change functionality either. Because
we tie $j_m$ to $\alpha$ only if a stuck-at-$\alpha$ fault is non-observable in an
XOR miter of $C'$ and $R$, tying it cannot differentiate $C'$ and $R$ under any
possible Boolean input assignment. Indeed, the SAT instance guarantees
Boolean equivalence between the original $C$ and the dynamic-behavior-
optimized $C'$.
Therefore, the combined common-case promotion keeps Boolean equivalence
between the CCP-optimized version and the original one.
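The non-observability condition in item 2 can be illustrated on a toy circuit. The following is a minimal Python sketch, not the thesis implementation: it replaces the SAT solver with exhaustive enumeration of the inputs, and the functions `reference` and `promoted`, the wire name `c_inner`, and the circuit itself are hypothetical and do not correspond to Figure 6.5.

```python
from itertools import product


def reference(a, b, c):
    # R: clone of a toy original circuit C, f = (a AND b) OR c
    return (a and b) or c


def promoted(a, b, c, tie=None):
    # C': a shallow promoted cone Cp = c, merged (OR) with the original cone.
    # `tie` optionally forces one wire of the original cone to a constant,
    # given as (wire_name, value); this models a stuck-at fault.
    wires = {"a": a, "b": b, "c_inner": c}
    if tie is not None:
        name, value = tie
        wires[name] = value
    original_cone = (wires["a"] and wires["b"]) or wires["c_inner"]
    return c or original_cone


def fault_is_unobservable(tie):
    # XOR miter by enumeration: the fault is unobservable if and only if
    # no input assignment makes C' and R produce different outputs.
    return all(
        bool(promoted(a, b, c, tie)) == bool(reference(a, b, c))
        for a, b, c in product([False, True], repeat=3)
    )
```

In this toy case, tying `c_inner` to 0 is safe because the promoted cone already covers `c`, whereas tying `a` to 0 changes the function and must be rejected.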
After common-case promotion finishes a round of optimization, we can test
whether further promotion is needed. This again can be done with a TCF test. For
example, given the same common-case prime $F_p^{off} = \bar{a} \wedge \bar{b}$ covering the
off-set, we test whether the redundancy-removed CCP implementation in
Figure 6.2(B) needs further promotion by applying the same TCF test
(Section 6.4) to the new promoted circuit structure. Going through the same
TCF rewriting procedure, this time we have:
\begin{align*}
T(f=0, t=2) &= T(n=0, t=1) \vee T(q=0, t=1) \\
            &= [T(c=0, t=0) \wedge T(k=0, t=0)] \vee [T(a=0, t=0) \wedge T(b=0, t=0)] \\
            &= T(a=0, t=0) \wedge T(b=0, t=0) \\
            &= \bar{a} \wedge \bar{b}
\end{align*}
Testing with formula 6.2, this time we have
\[ F_p^{off} \subseteq T(f=0, t=2) \]
meaning that the identified common cases can already stabilize $C'$ at $t = 2$.
Therefore, no further CCP step is needed.
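The containment test above can be checked mechanically by comparing sets of minterms. The sketch below is illustrative only: it assumes three primary inputs a, b and c, represents each Boolean function by the set of input assignments on which it is true, and uses a hypothetical helper `minterms`; the thesis itself manipulates TCFs symbolically.

```python
from itertools import product


def minterms(predicate, num_inputs=3):
    """Set of input assignments (tuples of 0/1) on which predicate is true."""
    return {
        bits for bits in product((0, 1), repeat=num_inputs) if predicate(*bits)
    }


# Common-case prime covering the off-set: (NOT a) AND (NOT b)
common_case = minterms(lambda a, b, c: (not a) and (not b))

# Result of the TCF rewriting for T(f=0, t=2): (NOT a) AND (NOT b)
stable_at_t2 = minterms(lambda a, b, c: (not a) and (not b))

# Further promotion is needed only if some common-case minterm is not
# yet stabilized at t = 2, i.e. if the containment does not hold.
needs_further_promotion = not (common_case <= stable_at_t2)
```

Because the two sets coincide here, the containment holds and `needs_further_promotion` is false, matching the conclusion that no further CCP step is required.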
6.7 Experimental results
The proposed algorithm is implemented in C on top of ABC [67]. Min-
iSAT [71] is used to perform the area overhead control task described in
Section 6.5. Benchmark circuits are compiled into Verilog netlists using Syn-
opsys Design Compiler (DC) version 2007-SP3. Tight delay and power con-
straints are used to allow DC to perform high-effort optimization. Then
the proposed CCP is applied on the generated netlists to further improve
dynamic behavior. A short profiling sequence of 100 typical-case vectors
is generated according to a given PI probability profile to drive the
optimization. In this study, 0.5 is used as the input static probability for
the generic MCNC benchmarks. The runtime of CCP ranges from seconds to several
minutes on a dual-core 2.4GHz Intel CPU.
To validate the performance gain produced by CCP, a large amount of
simulation data is collected with Cadence NCSim v5.7 for each circuit. A
Verilog Procedural Interface (VPI) program is developed as an NCSim plug-in
to record the timestamp of each PO's last switching activity in every
simulation cycle. Each circuit is simulated for one million cycles.
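Given the recorded timestamps, an empirical stabilization probability can be computed as the fraction of cycles whose last output transition occurs at or before a candidate cycle time t. The following Python sketch shows one way to do this; the function name `stabilization_curve` and the sample timestamps are hypothetical and are not taken from the thesis tool flow.

```python
from bisect import bisect_right


def stabilization_curve(last_switch_times_ps, time_points_ps):
    """Fraction of cycles whose last transition happened at or before each t."""
    if not last_switch_times_ps:
        raise ValueError("at least one simulated cycle is required")
    ordered = sorted(last_switch_times_ps)
    total = len(ordered)
    # bisect_right counts the samples that are <= t in the sorted list.
    return [bisect_right(ordered, t) / total for t in time_points_ps]


# Hypothetical last-switching timestamps (ps) of one PO over five cycles
samples = [120, 250, 250, 310, 480]
curve = stabilization_curve(samples, [100, 250, 300, 500])
```

For a whole circuit rather than a single PO, the per-cycle sample would be the latest last-switching timestamp over all POs in that cycle.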
As shown in the example of Figure 6.2, CCP can improve dynamic behavior
without sacrificing timing or area. In some cases, however, a penalty is
unavoidable; area is then traded for timing by sizing up the merger cell.
To avoid excessive area overhead, we only optimize those POs whose delays
are within 80%-100% of the longest path delay. CCP is iterated over a circuit
using a stack-based algorithm: critical POs are pushed onto a stack, and CCP
is applied to the node popped from the top of the stack. After applying CCP
to a logic cone, we push its critical fan-ins onto the stack for further
optimization.
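The stack-based iteration can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function `iterate_ccp`, the `critical_fanins` dictionary, and the `apply_ccp` callback are hypothetical names, and the `visited` set is an addition of this sketch (to avoid revisiting nodes reached through reconvergent fan-ins) that the text does not mention.

```python
def iterate_ccp(critical_outputs, critical_fanins, apply_ccp):
    """Apply CCP to logic cones in stack (LIFO) order, starting from critical POs."""
    stack = list(critical_outputs)
    visited = set()  # assumption of this sketch: optimize each node once
    order = []
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        apply_ccp(node)
        order.append(node)
        # Push the critical fan-ins of the cone that was just optimized.
        stack.extend(critical_fanins.get(node, []))
    return order


# Hypothetical netlist fragment: po1 has critical fan-ins n1 and n2, n1 has n3
fanins = {"po1": ["n1", "n2"], "n1": ["n3"]}
order = iterate_ccp(["po0", "po1"], fanins, apply_ccp=lambda node: None)
```

With a stack, the cone below the most recently pushed node is fully explored before the next critical PO is processed.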
The experimental results show that CCP can effectively improve a circuit’s
dynamic behavior. Figure 6.1 shows the effect of applying CCP on a single
PO for circuit apex2. Clearly, before applying CCP, Ps has decreased to 0
around t = 300ps. After CCP, Ps still maintains a high value of 92.5% at
the same time point.
However, optimizing Ps for a whole circuit is far more difficult than that for
individual POs because of the correlation among POs. But the experimental
results still show a significant boost of dynamic behavior as the average Ps
increases by 24%.
For the MCNC benchmarks, we collect Ps for the whole circuit rather than for
individual POs. Equation 6.1 is used to derive throughput performance
with a penalty r = 5, assuming those components are working in a shallow
pipeline processor [1, 4, 18]. Table 6.1 shows CCP's physical effects on
the benchmark circuits. The worst-case delay and area data are collected for
circuits without and with CCP. With CCP, the worst-case delay is unchanged
or nearly unchanged for most circuits (a 0.5% increase on average). This is
because, along with improving dynamic behavior, CCP also tries to preserve
the worst-case delay using the timing window as a constraint (Section 6.3).
This allows CCP-optimized circuits to be operated in the traditional
worst-case manner if the timing error correction mechanism needs to be
turned off.
The physical cost of applying CCP is very small. On average, it only
increases the area by about 3%. This is because of overhead control tech-
niques integrated in CCP, including: (1) structure validation (Section 6.4)
that skips unnecessary optimization, and (2) a SAT-based overhead control
(Section 6.5) that controls area overhead.
CCP-optimized circuits have better timing error resilience. Table 6.2 col-
lects the performance results associated with each circuit in Table 6.1 for
cases without and with CCP. Column 2 shows the amount of delay dete-
rioration. On average the delay deteriorates by 12.9% of tclk. Columns 4 and 5
show the associated probabilities Ps for circuits experiencing the delay
deterioration in column 2 for the cases without and with CCP, respectively.
With CCP optimization, a circuit becomes more resilient to timing errors, and
the probability of producing correct results increases by 24 percentage
points on average, as shown in column 6.
Columns 7 and 8 are the throughput results derived from Equation 6.1 for
cases without and with CCP optimization. The throughput performance of
CCP-optimized circuits is 22% higher when both versions experience the same
amount of delay deterioration.
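As a sanity check on the throughput columns, the sketch below evaluates one candidate throughput model in Python. Equation 6.1 is defined earlier in the thesis and is not reproduced in this section, so the formula used here is an assumption: a cycle that stabilizes in time contributes one operation per cycle, and a failed cycle contributes 1/r. With r = 5 and tclk taken as the CCP worst-case delay from Table 6.1, this assumed form happens to reproduce the des and alu4 rows of Table 6.2. The function name `throughput_mops` is hypothetical.

```python
def throughput_mops(t_clk_ps, p_stable, penalty_cycles=5):
    """Assumed model: full rate when stabilized, rate 1/penalty otherwise."""
    clock_mhz = 1e6 / t_clk_ps  # clock frequency in MHz for a period in ps
    return clock_mhz * (p_stable + (1.0 - p_stable) / penalty_cycles)


# des: t_clk = 654 ps (Table 6.1), Ps = 81.5% without CCP and 97.0% with CCP
des_without = throughput_mops(654, 0.815)
des_with = throughput_mops(654, 0.970)
```

Rounded to integers, the two values agree with the 1303 and 1492 MOPS entries for des; this agreement supports the assumed form but does not establish that it is the exact expression of Equation 6.1.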
Table 6.1: Circuit profiles before and after CCP

                        Worst-case delay (ps)      Total area
Circuit   PI    PO      noCCP   CCP    ∆%          noCCP   CCP     ∆%
des       256   245     654     654    0.0%        14814   15212   2.7%
alu4      14    8       585     585    0.0%        2216    2222    0.3%
apex2     39    3       474     498    5.1%        598     602     0.7%
apex4     9     18      585     588    0.5%        6456    6551    1.5%
dalu      75    16      528     537    1.7%        2937    3005    2.3%
ex1010    10    10      600     600    0.0%        7083    7396    4.4%
ex5p      8     63      321     321    0.0%        923     925     0.2%
misex2    14    14      585     585    0.0%        2448    2454    0.2%
pdc       16    40      501     495    -1.2%       1667    1895    13.7%
seq       41    35      555     558    0.5%        5376    5403    0.5%
spla      16    46      447     441    -1.3%       1576    1720    9.1%
Ave.                                   0.5%                        3.2%
Table 6.2: CCP optimization performance results

          Delay variation         Probability Ps            Throughput (MOPS)
Circuit   ∆d (ps)   % of tclk     noCCP    CCP      ∆%      noCCP   CCP     ∆%
des       24        3.7%          81.5%    97.0%    15.5%   1303    1492    12.7%
alu4      51        8.7%          89.2%    93.7%    4.5%    1562    1623    3.8%
apex2     96        19.3%         76.9%    94.2%    17.3%   1637    1915    14.5%
apex4     45        7.7%          79.3%    94.3%    15.0%   1419    1623    12.6%
dalu      90        16.8%         12.9%    74.0%    61.1%   565     1475    61.7%
ex1010    57        9.5%          38.1%    81.2%    43.1%   841     1416    40.6%
ex5p      24        7.5%          65.7%    86.2%    20.5%   2260    2771    18.4%
misex2    51        8.7%          84.4%    97.0%    12.6%   1496    1668    10.3%
pdc       120       24.2%         41.6%    93.1%    51.5%   1076    1909    43.6%
seq       114       20.4%         70.5%    80.2%    9.7%    1369    1508    9.2%
spla      66        15.0%         72.6%    90.4%    17.8%   1771    2093    15.4%
Ave.                12.9%                           24.4%                   22.1%
CHAPTER 7
CONCLUSION
Traditional VLSI design methodology is facing great challenges because tech-
nology scaling is approaching its limit. To alleviate these problems, better-
than-worst-case (BTWC) design methodology is proposed to relax the de-
sign constraints and to deliberately allow timing errors for rare cases. It is
a promising technique that can be used to improve throughput and reliabil-
ity of VLSI circuits. This thesis proposed several novel BTWC-aware CAD
techniques to support BTWC design methodology.
To pursue high performance, timing speculation is used as one of the
BTWC design methods for throughput optimization of processor modules
in the BlueShift project. I proposed a novel path constraint tuning (PCT)
technique (Chapter 3), which leverages commercial CAD tools to optimize
circuits for common cases. In each PCT iteration, a circuit's activity profile
is first collected by simulating a set of training programs. Its timing
constraints are then automatically updated to tightly constrain frequently
exercised timing paths and relax the infrequently exercised portion of the
design. The effect of tuning is observed in the next iteration by measuring
the error rate. PCT is applied iteratively to a design module until the
observed error rate becomes small enough. PCT was applied to modules of the
OpenSPARC T1 processor. Compared to a conventional way of BTWC design, PCT
speeds up applications by an average of 6% with an average processor power
overhead of 23%, providing a way to speed up logic modules that is orthogonal
to voltage scaling.
On the other hand, the lesson learned from the BlueShift work revealed the
limitation of applying traditional CAD tools to BTWC designs. Due to their
heuristic nature, conventional CAD tools tend to create a balanced design
in which long paths may still be heavily penalized even if their timing
constraints are relaxed. To solve this problem in conventional CAD tools
and to enable efficient BTWC design optimization, I proposed an analytical
approach called DynaTune (Chapter 4) to quantify timing error rate by
computing the error probabilities for digital circuits from input static prob-
abilities. A circuit’s error statistics are represented with a dynamic behavior
curve.
A dynamic behavior curve is a quantitative representation of the relation
between the cycle time t and the probability Ps of a circuit being stabilized
while operating at this cycle time. Ps can be plotted as a function of t: the
x axis shows the target operating points t, while the y axis shows the
probability Ps of a circuit or a single PO being stabilized at t. The dynamic
behavior curve is derived using timed characteristic functions (TCF) by
recursively rewriting a cell's TCF based on its immediate fan-ins.
Furthermore, from a dynamic behavior curve, a circuit’s BTWC through-
put curve can then be calculated accordingly. A point on the throughput
curve can tell us the actual throughput of a BTWC circuit when it is oper-
ating at t.
Utilizing this analytical timing-error-probability framework, I developed
a logic optimization algorithm that uses dual-Vt (threshold voltage) cells to
optimize a circuit in such a way that the most dynamically critical gates of a
circuit are detected, analyzed, and optimized for high throughput. This new
performance optimization technique exploits a min-cut algorithm and uses
the behavior curve derived from TCF to decide which part of the circuit is
the best candidate for dynamic timing optimization. DynaTune then assigns
fast low-Vt cells to the most commonly exercised critical paths. Experimental
results show that DynaTune can provide 8% and 13% throughput gains
over conventional Razor logic and telescopic unit based timing speculative
techniques, respectively. The experimental results also show that DynaTune
can effectively transform 50-70% of the frequency increase into real
throughput gain.
Though DynaTune can effectively increase BTWC circuit throughput, it
has a limitation due to the use of a global BDD to represent TCFs. Some-
times, DynaTune may run into scalability issues for complex circuits. To
make the circuit-level dynamic behavior analysis more scalable, I developed
the technique of timed ternary decision diagrams (tTDD) (Chapter 5) to
enable dynamic behavior analysis on partitioned sub-circuits. A random
variable Xn(t) ∈ {0, 1, U} is used to represent the random events of stabilize-
to-0, stabilize-to-1 and unstable. The stabilization conditions of a node are
represented in a tTDD. A tTDD, as with ordinary ternary decision diagrams
(TDD), has three outgoing edges from each decision node to explicitly encode
these three possible outcomes associated with Xn(t). Different from ordinary
TDD, a tTDD has a timing term associated with each decision node. This
timing term can be evaluated at a specified timing point t to produce an
evaluated tTDD for stabilization probability calculation.
To preserve stabilization probability calculation accuracy on an evaluated
tTDD, two novel techniques - false path pruning and random variable com-
paction - were proposed to handle the structural correlation and temporal
correlation induced by reconvergent nets in a circuit. A novel partitioning
algorithm was also proposed to produce sub-circuits that are suitable for
tTDD analysis.
Compared with DynaTune, tTDD can achieve on average over 60× speedup
and can handle complex circuits that may cause DynaTune to fail. tTDD
also preserves very good accuracy. Compared to the timed simulation
results, on average tTDD has a mean absolute error (MAE) of 1.7% and
a root mean square error (RMSE) of 3.9% for MCNC benchmarks, and an MAE of
2.2% and an RMSE of 4.8% for ISCAS benchmarks. The experimental results
also show that the tTDD approach is insensitive to the initial primary input
static probabilities.
As technology nodes shrink, there is increasing interest in
optimizing circuits for timing error resilience using BTWC methodology. To
accommodate this need, I proposed common case promotion (CCP) (Chap-
ter 6) to improve circuit timing error resilience. CCP is a dynamic behavior
driven optimization flow that manipulates the previously proposed dynamic
behavior to improve the circuit stabilization probability for common cases.
To sustain the BTWC throughput performance under delay deterioration,
we need to maintain a high probability Ps of outputs being stabilized even if
a circuit slows down due to dynamic variations, such as supply voltage droop
or temperature fluctuation. This requires a circuit to be optimized in such
a way that Ps decreases gracefully as the timing of a circuit deteriorates.
Furthermore, this is translated into the requirement of optimizing a circuit
at the design time to have a flat plateau of Ps on the dynamic behavior
curve so that the circuit can operate in a broad range of tclk to tolerate
delay deterioration. The proposed CCP consists of three major steps as
summarized below.
First, a probability-driven re-synthesis procedure that changes a digital
circuit's internal structure is applied to common cases. In this step, the
common primes in a partitioned sub-circuit are first identified and are then
re-implemented with a shallow redundant logic cone using a procedure called
“Promote”.
Secondly, to reduce area overhead and preserve the implementation for the
promoted common cases, a dynamic behavior aware SAT-based redundancy
remover is applied on the re-structured circuit. An XOR miter consisting of
the original circuit and the promoted circuit is constructed. Then a SAT-
solver is invoked on this miter to test if any stuck-at-α fault is non-observable.
Those non-observable faults are the points where redundancy removal can be
applied to reduce area. By carefully choosing test points, the SAT-based re-
dundancy remover can avoid removing logic that improves dynamic behavior.
Thirdly, a TCF based circuit dynamic behavior analysis is applied at the
end of each promotion step to avoid unnecessary promotion and also to pro-
vide optimization convergence. On one hand, the TCF test filters out unnec-
essary promotions of common cases which do not contribute to improvement
of overall dynamic behavior. On the other hand, this TCF test skips further
optimization on the already promoted part to ensure optimization flow con-
vergence. The experimental results show that CCP can effectively improve
a circuit’s timing error resilience by more than 20% at a very small increase
in area of 3%. CCP points out a new digital circuit optimization dimension
that considers a circuit’s timing error resilience.
To conclude, given the lack of appropriate CAD tools to support the needs
of BTWC design, this dissertation studied the requirements of BTWC design
from a CAD perspective and proposed to optimize for common cases as a way
to improve performance of BTWC designs. To quantify a circuit-level BTWC
design’s performance, throughput was proposed as a measurement. To de-
rive throughput, the novel concept of the dynamic behavior curve was used.
This study also pointed out methods of deriving dynamic behavior curves
analytically using TCF and tTDD. To optimize for common cases, derived
dynamic behavior curves were used as guidelines throughout the iterative
optimization procedures in PCT, DynaTune and CCP. These circuit-level
optimization techniques selectively reduce the delay on frequently exercised
paths and relax the part that is rarely exercised. The experimental results
show that the proposed methods of deriving the dynamic behavior curve
are very accurate and practical. It is also shown by the experiments that
CAD optimizations guided by the dynamic behavior curve can significantly
improve BTWC throughput performance and timing error resilience.
REFERENCES
[1] T. Austin, V. Bertacco, D. Blaauw, and T. Mudge, “Opportunities and
challenges for better than worst-case design,” in Proceedings of Asia
and South Pacific Design Automation Conference, vol. 1, Jan. 2005, pp.
I/2–I/7.
[2] B. Greskamp and J. Torrellas, “Paceline: Improving single-thread per-
formance in nanoscale CMPs through core overclocking,” in Proceedings
of International Conference on Parallel Architecture and Compilation
Techniques, Sept. 2007, pp. 213–224.
[3] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler,
D. Blaauw, T. Austin, K. Flautner, and T. Mudge, “Razor: A low-
power pipeline based on circuit-level timing speculation,” in Proceedings
of Annual IEEE/ACM International Symposium on Microarchitecture,
Dec. 2003, pp. 7–18.
[4] B. Greskamp, L. Wan, U. Karpuzcu, J. Cook, J. Torrellas, D. Chen,
and C. Zilles, “BlueShift: Designing processors for timing speculation
from the ground up,” in Proceedings of IEEE International Symposium
on High Performance Computer Architecture, Feb. 2009, pp. 213–224.
[5] K. Bowman, C. Tokunaga, J. Tschanz, A. Raychowdhury, M. Khellah,
B. Geuskens, S.-L. Lu, P. Aseron, T. Karnik, and V. De, “Dynamic
variation monitor for measuring the impact of voltage droops on micro-
processor clock frequency,” in Proceedings of IEEE Custom Integrated
Circuits Conference, Sept. 2010, pp. 1–4.
[6] J. Bhasker and R. Chadha, Static Timing Analysis for Nanometer De-
signs: A Practical Approach, 1st ed. New York, NY: Springer Publish-
ing Company, Incorporated, 2009.
[7] A. Mishchenko, S. Chatterjee, and R. Brayton, “DAG-aware AIG rewrit-
ing: A fresh look at combinational logic synthesis,” in Proceedings of
Design Automation Conference, Jul. 2006, pp. 532–535.
[8] A. Mishchenko, R. Brayton, S. Jang, and V. Kravets, “Delay optimiza-
tion using SOP balancing,” in Proceedings of IEEE/ACM International
Conference on Computer-Aided Design, Nov. 2011, pp. 375–382.
[9] A. Mishchenko, B. Steinbach, and M. Perkowski, “An algorithm for bi-
decomposition of logic functions,” in Proceedings of Design Automation
Conference, Jun. 2001, pp. 103–108.
[10] L. Cheng, D. Chen, and M. Wong, “DDBDD: Delay-driven BDD syn-
thesis for FPGAs,” IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, vol. 27, no. 7, pp. 1203–1213, Jul. 2008.
[11] P. Pan, “Performance-driven integration of retiming and resynthesis,” in
Proceedings of Design Automation Conference, Jun. 1999, pp. 243–246.
[12] V. Tiwari, P. Ashar, and S. Malik, “Technology mapping for low power,”
in Proceedings of Design Automation Conference, Jun. 1993, pp. 74–79.
[13] R. Bahar, G. Hachtel, E. Macii, and F. Somenzi, “A symbolic method
to reduce power consumption of circuits containing false paths,” in Pro-
ceedings of IEEE/ACM International Conference on Computer-Aided
Design, Nov. 1994, pp. 368–371.
[14] H. Zhou and D. Wong, “An exact gate decomposition algorithm for low-
power technology mapping,” in Proceedings of IEEE/ACM International
Conference on Computer-Aided Design, Nov. 1997, pp. 575–580.
[15] J. Leijten, J. van Meerbergen, and J. Jess, “Analysis and reduction of
glitches in synchronous networks,” in Proceedings of European Design
and Test Conference, Mar. 1995, pp. 398–403.
[16] S. Augsburger and B. Nikolic, “Combining dual-supply, dual-threshold
and transistor sizing for power reduction,” in Proceedings of IEEE In-
ternational Conference on Computer Design: VLSI in Computers and
Processors, Sept. 2002, pp. 316–321.
[17] J. H. Patel, “CMOS process variations: A critical op-
eration point hypothesis,” Apr. 2008. [Online]. Available:
http://www.stanford.edu/class/ee380/Abstracts/080402.html
[18] L. Wan and D. Chen, “DynaTune: Circuit-level optimization for tim-
ing speculation considering dynamic path behavior,” in Proceedings of
IEEE/ACM International Conference on Computer-Aided Design, Nov.
2009, pp. 172–179.
[19] T. Austin, “DIVA: A reliable substrate for deep submicron microarchi-
tecture design,” in Proceedings of International Symposium on Microar-
chitecture, Nov. 1999, pp. 196–207.
[20] T. Liu and S.-L. Lu, “Performance improvement with circuit-level spec-
ulation,” in Proceedings IEEE/ACM International Symposium on Mi-
croarchitecture, Dec. 2000, pp. 348–355.
[21] F. Mesa-Martinez and J. Renau, “Effective optimistic-checker tan-
dem core design through architectural pruning,” in Proceedings of
IEEE/ACM International Symposium on Microarchitecture, Dec. 2007,
pp. 236–248.
[22] K. Sundaramoorthy, Z. Purser, and E. Rotenberg, “Slipstream pro-
cessors: Improving both performance and fault tolerance,” SIGARCH
Computer Architecture News, vol. 28, pp. 257–268, Nov. 2000.
[23] J. C. Smolens, B. T. Gold, B. Falsafi, and J. C. Hoe, “Re-
union: Complexity-effective multicore redundancy,” in Proceedings of
IEEE/ACM International Symposium on Microarchitecture, Dec. 2006,
pp. 223–234.
[24] N. R. Shanbhag, R. A. Abdallah, R. Kumar, and D. L. Jones, “Stochas-
tic computation,” in Proceedings of the Design Automation Conference,
Jun. 2010, pp. 859–864.
[25] R. Hegde and N. Shanbhag, “Soft digital signal processing,” IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, vol. 9,
no. 6, pp. 813–823, Dec. 2001.
[26] R. Hegde and N. Shanbhag, “A voltage overscaled low-power digital filter
IC,” IEEE Journal of Solid-State Circuits, vol. 39, no. 2, pp. 388–391,
Feb. 2004.
[27] G. Varatkar and N. Shanbhag, “Error-resilient motion estimation archi-
tecture,” IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, vol. 16, no. 10, pp. 1399–1412, Oct. 2008.
[28] R. Abdallah and N. Shanbhag, “Error-resilient low-power Viterbi de-
coder architectures,” IEEE Transactions on Signal Processing, vol. 57,
no. 12, pp. 4906–4917, Dec. 2009.
[29] S. Nowick, “Design of a low-latency asynchronous adder using specula-
tive completion,” IEE Proceedings of Computers and Digital Techniques,
vol. 143, no. 5, pp. 301–307, Sept. 1996.
[30] L. Benini, E. Macii, M. Poncino, and G. De Micheli, “Telescopic units:
A new paradigm for performance optimization of VLSI designs,” IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Sys-
tems, vol. 17, no. 3, pp. 220–232, Mar. 1998.
[31] Y.-S. Su, D.-C. Wang, S.-C. Chang, and M. Marek-Sadowska, “An ef-
ficient mechanism for performance optimization of variable-latency de-
signs,” in Proceedings of Design Automation Conference, Jun. 2007, pp.
976–981.
[32] K. Bowman, J. Tschanz, C. Wilkerson, S.-L. Lu, T. Karnik, V. De,
and S. Borkar, “Circuit techniques for dynamic variation tolerance,” in
Proceedings of Design Automation Conference, Jul. 2009, pp. 4–7.
[33] S. Sarangi, B. Greskamp, A. Tiwari, and J. Torrellas, “EVAL: Utiliz-
ing processors with variation-induced timing errors,” in Proceedings of
IEEE/ACM International Symposium on Microarchitecture, Nov. 2008,
pp. 423–434.
[34] H.-P. Su, A. C.-H. Wu, and Y.-L. Lin, “A timing-driven soft-macro
resynthesis method in interaction with chip floorplanning,” in Proceed-
ings of Design Automation Conference, Jun. 1999, pp. 262–267.
[35] E. Lehman, Y. Watanabe, J. Grodstein, and H. Harkness, “Logic de-
composition during technology mapping,” in Proceedings of IEEE/ACM
International Conference on Computer-Aided Design, Nov. 1995, pp.
264–271.
[36] A. Mishchenko, R. Brayton, and S. Chatterjee, “Boolean factoring and
decomposition of logic networks,” in Proceedings of IEEE/ACM Inter-
national Conference on Computer-Aided Design, Nov. 2008, pp. 38–44.
[37] Sun Microsystems, “OpenSPARC T1 RTL release 1.5,” Jan. 2012. [On-
line]. Available: http://www.opensparc.net/opensparc-t1/index.html
[38] UMC 130nm Standard Cell Library, UMC Inc., Taiwan, 2003.
[39] G. Sery, S. Borkar, and V. De, “Life is CMOS: Why chase the life after?”
in Proceedings of Design Automation Conference, Jun. 2002, pp. 78–83.
[40] S. Thoziyoor, N. Muralimanohar, J. H. Ahn, and N. Jouppi, CACTI
5.1. Technical Report HPL-2008-20, Hewlett Packard Labs, April 2008.
[41] R. Zlatanovici and B. Nikolic, “Power-performance optimal 64-bit carry-
lookahead adders,” in Proceedings of European Solid-State Circuits Con-
ference, Sept. 2003, pp. 321–324.
[42] Y.-M. Kuo, Y.-L. Chang, and S.-C. Chang, “Efficient Boolean charac-
teristic function for fast timed ATPG,” in Proceedings of IEEE/ACM
International Conference on Computer-Aided Design, Nov. 2006, pp.
96–99.
[43] R. Bryant, “Graph-based algorithms for Boolean function manipula-
tion,” IEEE Transactions on Computers, vol. C-35, no. 8, pp. 677–691,
Aug. 1986.
[44] F. N. Najm, “Transition density, a stochastic measure of activity in
digital circuits,” in Proceedings of Design Automation Conference, Jun.
1991, pp. 644–649.
[45] S. Devadas, K. Keutzer, S. Malik, and A. Wang, “Computation of float-
ing mode delay in combinational circuits: Practice and implementation,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, vol. 12, no. 12, pp. 1924–1936, Dec. 1993.
[46] “SIS unofficial release 1.3,” Nov. 2010. [Online]. Available:
http://embedded.eecs.berkeley.edu/Alumni/pchong/sis.html
[47] TSMC Reference Flow 7.0, TSMC Inc., Taiwan, 2006.
[48] A. Kahng, S. Kang, R. Kumar, and J. Sartori, “Slack redistribution
for graceful degradation under voltage overscaling,” in Proceedings of
Asia and South Pacific Design Automation Conference, Jan. 2010, pp.
825–831.
[49] L. Wan and D. Chen, “Analysis of circuit dynamic behavior with timed
ternary decision diagram,” in Proceedings of IEEE/ACM International
Conference on Computer-Aided Design, Nov. 2010, pp. 516–523.
[50] W. K. C. Lam, R. K. Brayton, and A. L. Sangiovanni-Vincentelli, “Cir-
cuit delay models and their exact computation using timed Boolean
functions,” in Proceedings of Design Automation Conference, Jun. 1993,
pp. 128–134.
[51] T. Sasao, “Ternary decision diagrams - survey,” in Proceedings of Inter-
national Symposium on Multiple-Valued Logic, May 1997, pp. 241–250.
[52] R. I. Bahar, E. A. Frohm, C. M. Gaona, G. D. Hachtel, E. Macii,
A. Pardo, and F. Somenzi, “Algebraic decision diagrams and their ap-
plications,” in Proceedings of IEEE/ACM International Conference on
Computer-Aided Design, Nov. 1993, pp. 188–191.
[53] T. Kam, T. Villa, R. K. Brayton, and A. Sangiovanni-Vincentelli,
“Multi-valued decision diagrams: Theory and applications,” Interna-
tional Journal on Multiple-Valued Logic, vol. 4, no. 1, pp. 9–62, 1998.
[54] Y.-T. Lai and S. Sastry, “Edge-valued binary decision diagrams for
multi-level hierarchical verification,” in Proceedings of Design Automa-
tion Conference, Jun. 1992, pp. 608–613.
[55] C.-Y. Tsui, M. Pedram, and A. M. Despain, “Efficient estimation of
dynamic power consumption under a real delay model,” in Proceedings of
IEEE/ACM International Conference on Computer-Aided Design, Nov.
1993, pp. 224–228.
[56] D. Cheng, “Power estimation of digital CMOS circuits and the applica-
tion to logic synthesis for low power,” Ph.D. dissertation, University of
California, San Diego, 1995.
[57] S. Bhanja and N. Ranganathan, “Dependency preserving probabilistic
modeling of switching activity using Bayesian networks,” in Proceedings
of Design Automation Conference, Jun. 2001, pp. 209–214.
[58] R. Marculescu, D. Marculescu, and M. Pedram, “Probabilistic modeling
of dependencies during switching activity analysis,” IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems, vol. 17,
no. 2, pp. 73–83, Feb. 1998.
[59] T. Uchino, F. Minami, T. Mitsuhashi, and N. Goto, “Switching activ-
ity analysis using Boolean approximation method,” in Proceedings of
IEEE/ACM International Conference on Computer-Aided Design, Nov.
1995, pp. 20–25.
[60] J. Costa, J. Monteiro, and S. Devadas, “Switching activity estimation
using limited depth reconvergent path analysis,” in Proceedings of Inter-
national Symposium on Low Power Electronics and Design, Aug. 1997,
pp. 184–189.
[61] B. Kapoor, “Improving the accuracy of circuit activity measurement,” in
Proceedings of Design Automation Conference, Jun. 1994, pp. 734–739.
[62] NCVerilog Simulator Help, Cadence Inc., San Jose, CA, Nov. 2007.
[63] J. Cong, Z. Li, and R. Bagrodia, “Acyclic multi-way partitioning of
Boolean networks,” in Proceedings of Design Automation Conference,
Jun. 1994, pp. 670–675.
[64] A. Kahng, S. Kang, R. Kumar, and J. Sartori, “Recovery-driven design:
A power minimization methodology for error-tolerant processor mod-
ules,” in Proceedings of Design Automation Conference, Jun. 2010, pp.
825–830.
[65] S. Ghosh, S. Bhunia, and K. Roy, “A new paradigm for low-power,
variation-tolerant circuit synthesis using critical path isolation,” in Pro-
ceedings of IEEE/ACM International Conference on Computer-Aided
Design, Nov. 2006, pp. 619–624.
[66] G. De Micheli, Synthesis and Optimization of Digital Circuits, ser.
McGraw-Hill series in electrical and computer engineering: Electronics
and VLSI circuits. New York, NY: McGraw-Hill, 1994.
[67] Berkeley Logic Synthesis and Verification Group, “ABC: A system for
sequential synthesis and verification, release 20070911,” Sept. 2007.
[Online]. Available: http://www.eecs.berkeley.edu/~alanmi/abc/
[68] A. Mishchenko, “An introduction to zero-suppressed binary decision
diagrams,” Portland State University, Tech. Rep., Jun. 2001.
[69] C.-A. Wu, T.-H. Lin, S.-L. Huang, and C.-Y. Huang, “SAT-controlled
redundancy addition and removal: A novel circuit restructuring tech-
nique,” in Proceedings of Asia and South Pacific Design Automation
Conference, Jan. 2009, pp. 191–196.
[70] P. Menon and H. Ahuja, “Redundancy removal and simplification of
combinational circuits,” in Proceedings of IEEE VLSI Test Symposium
on Design, Test and Application: ASICs and Systems-on-a-Chip, Apr.
1992, pp. 268–273.
[71] N. Eén and N. Sörensson, “An extensible SAT-solver,” Theory and
Applications of Satisfiability Testing, vol. 2919, pp. 333–336, 2004.
